Building a Grocery Classification Model

Here's the problem: You write a shopping list with a few items, for example, "strawberries" and "chips". You also have the weekly ads, and you can get that data as JSON. Can you match each item on your shopping list to the corresponding item in the ads?

We want a model that can take a deal or coupon from the weekly ads and match it to one of the predefined categories or items on our shopping list. For me, these are broad categories like "chips" as opposed to specific brands like "boulder chips" or "lays chips". Sometimes I buy brand-specific items like "Fritos", but not very often. So I want the model to match broad categories, specific types of foods, and sometimes brands.

This post will document some of the strategies and thought processes I used to pursue that goal.

Is this a deterministic problem?

This problem is partly deterministic. For example, you could just search for the words "strawberries" or "chips" in the ads, and the results will include strawberries and chips - but they would also include products like "Chips Ahoy" or "strawberries and cream oatmeal", and they would exclude "Fritos", which I would classify as a corn chip. So the text on its own is not a reliable indicator, but it can help narrow things down. For example, the search text "strawberries" in combination with a category of "produce" may be a very accurate way to find strawberries during most of the year. But around Valentine's Day, that might not apply if the ads have "Chocolate Covered Strawberries" categorized as produce.

Here's another example of ambiguity in matching: in the ads you'll see "Brand Name Angus Burgers". On my shopping list, I would write down "beef", but "beef" isn't anywhere in the product name.

It's also common for the ads to group deals together. So you might see "Large Ripe Organic Avocados or Organic Red Mangos" in a single deal. This listing should match both avocados and mangos on your shopping list. Another example: "Tillamook Snack Cheese 10 ct or Noosa Yoghurt, 4-Pack."

So this feels like a problem that could be solved if you uncover enough patterns and then play around with the classification rules. But if it's possible to do that, couldn't a machine learning model also learn to do it? I think so. A random forest or decision tree is basically a big if/then/else statement that we train instead of program.

If this works, why does it even matter?

Let's say we've solved the problem and we have a model that can do a fairly good job of matching deals to shopping list items. Here's how this model will fit into my shopping process.

First, I'll download the ad data from the 3 grocery stores that I go to. I built a custom Chrome extension to make this easy. Second, I'll run a script that adds zero or more labels to the items in the ad data. More details on that later. After the data has been labeled, I'll run a script that converts the labeled JSON into an HTML page where I can filter, sort, and browse all the items from the stores. The items will be sorted with some good defaults - think of this as a personalized recommendation engine that isn't store-specific and that I fully control. This is the end goal, and I'm partly there because I already have a way to view, filter, and sort the raw data with simple string-matching filters. Using the model will just make the filtering more relevant to me by automatically showing items I'm interested in and hiding items I'm not interested in.

The Classification Script

Ok, so I did build a model, and here's how it works. The classification script takes the raw data and applies one or more labels like "cereal", "meat", "household", "grapes", etc. to the items. The script is written in Python. Data analysis and data probing were done in Jupyter notebooks. There are 5 general steps, and the first required a lot of time and preparation:

  1. feature engineering / acquiring training data
  2. process/transform the data
  3. split the data into training/validation data
  4. train the model
  5. validate the model to determine accuracy

Feature Engineering

(side note: please read further for considerations of "The Bitter Lesson")

The raw data comes in various forms. At one store, I have a schema for coupons, deals, and regular products. Each has title and description fields. Deals and products have a field with one or more store departments. Coupons have a field for product categories. The data from other stores only has a title.

But regardless, it's ok that each store has different data because I take whatever is available and put it all into a single text field I call "tDescription". It has any available information as text: product descriptions, brand information, price information, department/category label, or marketing text. It all goes into tDescription.
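
In code, the consolidation is just concatenation. Here's a simplified sketch (the input field names are stand-ins, since each store's schema differs):

import pandas as pd

def build_tdescription(item: dict) -> str:
    """Mash every available text field for an ad item into one searchable blob."""
    parts = [
        item.get("title", ""),
        item.get("description", ""),
        " ".join(item.get("departments", [])),  # deals/products: store departments
        item.get("category", ""),               # coupons: product category
    ]
    return " ".join(p for p in parts if p).lower()

items = [{"title": "Chocolate Covered Strawberries", "departments": ["PRODUCE"]}]
df = pd.DataFrame({"tDescription": [build_tdescription(i) for i in items]})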

In the training data, the "tItem" field contains the labels. The raw data, along with the tItem and tDescription fields, lives in a pandas DataFrame.

I then take the corpus of text (tDescription) and run it through 3 CountVectorizers from the scikit-learn library. I set them up with a predefined vocabulary which I've curated: 1, 2, and 3 word phrases (1-grams, 2-grams, and 3-grams). I'll talk more about that curation process later. I then run the tDescription field through the vectorizers to get the vocabulary counts for each row of data. These features get added to the DataFrame. So for example, I might have a count of 1 for the "strawberries" feature on actual strawberries, but a 1 for both "strawberries" and "chocolate" on chocolate covered strawberries. "Strawberries" and "chocolate" are feature columns on the DataFrame.
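
Wiring the vectorizers up looks roughly like this (a sketch: the vocabularies here are toy stand-ins for the curated allow-list files described later):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({"tDescription": [
    "fresh strawberries 1 lb",
    "chocolate covered strawberries",
]})

# One vectorizer per phrase length, each restricted to a curated vocabulary
vocabularies = {
    1: ["strawberries", "chocolate"],       # stand-in for 1-gram-allowed.txt
    2: ["chocolate covered"],               # stand-in for 2-gram-allowed.txt
    3: ["chocolate covered strawberries"],  # stand-in for 3-gram-allowed.txt
}

for n, vocab in vocabularies.items():
    vec = CountVectorizer(ngram_range=(n, n), vocabulary=vocab)
    counts = vec.transform(df["tDescription"])
    df = df.join(pd.DataFrame(counts.toarray(), columns=vec.get_feature_names_out(), index=df.index))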

Splitting Data + Training Model

After transforming the data, I split it into training and validation data before training the model. After the model is trained, I check the accuracy of the model by using it to predict the labels on the validation data. I compare the predicted labels to the actual labels to determine the accuracy.

My dataset was from 5 months of grocery store ads, so there were some seasonal items that only showed up a few times. I added labels to about 700 items in my training data. I curated the vocabulary from the 5 months of data and also from my own shopping lists. I used a RandomForestClassifier as the base classifier and a OneVsRestClassifier on top of that to support multi-label classification. After all of that, my final model accuracy was 61%. So 61% of the time, the model correctly predicted the labels in the "tItem" column.
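
A sketch of the split/train/validate steps, using a toy DataFrame standing in for the real one (and treating "accuracy" as scikit-learn's exact-match score, which requires every label on an item to be predicted correctly):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in: vocabulary-count columns plus the tItem label lists
df = pd.DataFrame({
    "tDescription": ["fresh strawberries", "chocolate covered strawberries",
                     "strawberry yogurt", "dark chocolate bar"],
    "tItem": [["strawberries"], ["strawberries", "chocolate"], ["yogurt"], ["chocolate"]],
    "strawberries": [1, 1, 0, 0],
    "chocolate": [0, 1, 0, 1],
})

# Turn the label lists into a binary indicator matrix, one column per label
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(df["tItem"])
X = df.drop(columns=["tDescription", "tItem"])

X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.25, random_state=42)

model = OneVsRestClassifier(RandomForestClassifier(random_state=42))
model.fit(X_train, Y_train)

# Exact-match accuracy on the held-out validation rows
print(accuracy_score(Y_val, model.predict(X_val)))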

Each time I trained the model, I used it to do the initial labeling on my next set of training data, then went back, checked the results, and corrected errors. I then put that corrected data back into the training set to further improve the model.

Improving the model

For this type of model, I've found there are two key things which improve its accuracy and they go hand in hand. The first is the existence of important vocabulary items (the ngrams). It doesn't really matter how much training data you have if the model doesn't know the words or phrases necessary to map the data to the label. Think of the model as a big if/then/else branch - it needs data points in order to create those branches, and those data points are the vocabulary counts.

As I was labeling, I found myself thinking about primary vs secondary vocabulary. "Wine" is primary vocabulary: it just says what the item is. But most of the deals didn't actually say "wine"; they said the name of the wine. So a very powerful secondary indicator was all the ngram variations like "750", "750ml", "750 ml bottle", etc. There's also the "category_adult_beverage" keyword that I add when the category is "Adult Beverage" or the department is "LIQUOR". Those vocabulary items do a lot of heavy lifting, and the model learned fairly quickly how to accurately identify wine and beer.
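
Mechanically, that keyword is just a synthetic token appended to tDescription before vectorizing. A sketch (the "category" and "department" field names are stand-ins for my store schemas):

def add_category_keywords(item: dict, tdescription: str) -> str:
    """Append a synthetic token so structured fields become countable vocabulary."""
    if item.get("category") == "Adult Beverage" or item.get("department") == "LIQUOR":
        tdescription += " category_adult_beverage"
    return tdescription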

The second thing that improves the model is more training data. The more coupons, deals, and products it sees, the better it can distinguish between them. As I was labeling data, I would often notice the important "secondary vocabulary" for certain types of items. So labeling and adding to the vocabulary was a hand-in-hand process.

ngram curation

To curate the ngrams, I created a script that ran the vectorizer for 1, 2, and 3 word phrases found in my data and spit out the results. I then go through them and keep the ones I think are "high value."

The ones I keep go into the "1-gram-allowed.txt", "2-gram-allowed.txt", and "3-gram-allowed.txt" files. The ones I throw out go into "1-gram-disallowed.txt", "2-gram-disallowed.txt", and "3-gram-disallowed.txt" files. When I process new data that has new n-grams, I put new n-grams into "n-gram-tosort.txt" file which I sort through manually and pick and choose high value features.

With this process, I'm able to ignore what I've already seen and only see new words and phrases as I get more data over time.
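
A sketch of that triage script (I'm assuming per-n tosort files here; the exact file naming may differ from my project):

from pathlib import Path
from sklearn.feature_extraction.text import CountVectorizer

def read_set(path: str) -> set[str]:
    p = Path(path)
    return set(p.read_text().splitlines()) if p.exists() else set()

def triage_ngrams(texts: list[str], n: int) -> None:
    """Queue up n-grams that haven't been allowed or disallowed yet."""
    allowed = read_set(f"{n}-gram-allowed.txt")
    disallowed = read_set(f"{n}-gram-disallowed.txt")

    # No vocabulary restriction here: we want to discover every n-gram in the data
    vec = CountVectorizer(ngram_range=(n, n))
    vec.fit(texts)
    seen = set(vec.get_feature_names_out())

    new_ngrams = sorted(seen - allowed - disallowed)
    Path(f"{n}-gram-tosort.txt").write_text("\n".join(new_ngrams))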

Data Labeling

The data labeling was very tedious. I did it manually with a text editor and kept track of the list of labels that I wanted my model to use. I can see why this labeling process is usually done with some kind of specialized software. Maybe in the future I'll build something to make this easier or integrate it into the web view where I browse products, so I can quickly classify items.

Brand-specific "bloat"

There are a lot of ads that are what I'd call "fully branded" and don't give any indication of what they are at a categorical level. They are expected to be self-evident. For example: Colgate, Gatorade, Pepsi, Smucker's Uncrustables, Cheetos. Over time, brands and companies come and go, but having the brand names in the vocabulary is essential for correct labeling. I can see that being problematic over time, because you'll accumulate a lot of brand-specific bloat. You'll have a huge number of brands in the vocabulary, and many of those brands may no longer be around, or maybe they get renamed or moved into new business models.

This problem of identifying specific brands is called "named entity recognition." You can partly solve it with a library called spacy, which uses natural language processing to identify the parts of speech in text and also identify entities. However, that model is only as good as its training data, and if I wanted to use it to accurately identify named entities, I'd need to label my own dataset and train a custom spacy model (which I don't want to do). If I did want to do that, though, a tool named doccano looks like a good open source way to label the training data for the spacy model.

Product Counts

At least in my dataset, nothing is ever sold in counts of 7, 11, 13, 15, 17, 19, 21, or 23. These are the most popular counts:

8ct      25
10ct     21
12ct     18
4ct      18
5ct      12
6ct       9
18ct      9
2ct       9
20ct      8
30ct      7
90ct      5
3ct       4
96ct      4
42ct      4

There are also ranges of numbers. The fact that something is counted is valuable information in itself, but the specific count is not important for my model. I will remove the specifics and add a "feature_countable" flag.
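
A minimal sketch of that normalization (the regex here only handles the simple "8ct" form, not ranges):

import re

COUNT_RE = re.compile(r"\b\d+\s*ct\b")

def add_count_feature(text: str) -> str:
    """Drop the specific count (8ct, 10ct, ...) and keep a generic countable flag."""
    if COUNT_RE.search(text):
        text = COUNT_RE.sub("", text).strip() + " feature_countable"
    return text

print(add_count_feature("tillamook snack cheese 10ct"))  # tillamook snack cheese feature_countable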

Adding this feature didn't seem to change my model's performance all that much.

Product Weight

In my dataset, weight is expressed in ounces or pounds. You'll see pounds either in the expression of a price (like "$11.99 lb") or in the expression of the weight of the product, either as a standalone weight or along with a container. I'll leave out the specifics of the weight, but I will capture the unit as "feature_by_pound" or "feature_by_ounce".

There are also fluid ounces ("fl oz"), which I will extract as "feature_by_fluid_ounce".

Adding this feature didn't seem to change my model's performance all that much.

Product Containers

Both weight and count are sometimes followed by a container name like bag, tub, box, can, tray, or roll. Example: "2 lb bag", "12 oz box", "12 fl oz"

The type of container is important information that I'll label as "feature_container_bag", "feature_container_tub", etc...
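
Weight units and containers follow the same pattern as the count flag. A combined sketch (the container list and regexes are illustrative):

import re

CONTAINERS = ["bag", "tub", "box", "can", "tray", "roll"]

def add_unit_features(text: str) -> str:
    """Append generic unit and container flags; specific numbers are left alone."""
    flags = []
    if re.search(r"\bfl\.?\s*oz\b", text):
        flags.append("feature_by_fluid_ounce")
        text = re.sub(r"\bfl\.?\s*oz\b", " ", text)  # so plain "oz" below doesn't also fire
    if re.search(r"\boz\b", text):
        flags.append("feature_by_ounce")
    if re.search(r"\blbs?\b", text):
        flags.append("feature_by_pound")
    flags += [f"feature_container_{c}" for c in CONTAINERS if re.search(rf"\b{c}\b", text)]
    return " ".join([text.strip()] + flags)

print(add_unit_features("2 lb bag"))  # 2 lb bag feature_by_pound feature_container_bag
print(add_unit_features("12 fl oz"))  # 12 feature_by_fluid_ounce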

Adding this feature didn't seem to change my model's performance all that much.

Using spacy for Natural Language Processing

I did experiment with using spacy to see if I could pull out any other important features. It was interesting to investigate, but I didn't think it would help with the model I was building.

Using uv for my project, this is how I added spacy:

uv add spacy
uv add pip
uv run -- spacy download en_core_web_sm

That last command downloads the English pipeline for spacy (https://huggingface.co/spacy/en_core_web_sm).
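
Once the pipeline is installed, the probing looks something like this (a sketch of the kind of exploration I did, not code from my actual notebooks):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tillamook Snack Cheese 10 ct or Noosa Yoghurt, 4-Pack")

# Part of speech and dependency role for each token
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Named entities - brand names often show up as ORG or PRODUCT
for ent in doc.ents:
    print(ent.text, ent.label_)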

The Fundamental Problem

For me this is the fundamental problem...

  1. There is "food in general" - types of food, quality of food, quantity of food, price of food. These are mostly commodities.

  2. Then there is "branded food" - this is food that has a new name added to it. It's advertised and expected to become a part of people's food vocabulary and become recognizable and desirable in its own right. Oftentimes it's exactly the same as the "food in general", but it's presented differently. Sometimes it includes the name of what it fundamentally is, but other times not.

The challenge is looking at a data point and determining which is which. If it's #1, you know what it is. If it's #2, you need more information and context to determine what it is. Sometimes branded foods are their own thing that's hard to categorize. What is a Cheeto? Is it rice? Is it a chip? It's neither. It's its own thing.

Speeding up Labeling

After reaching 61% accuracy with my model, I thought I would try to reduce the burden of labeling data. There are a lot of alcohol and junk food coupons, which I don't buy, so labeling those items feels like a waste of time. So I built a list of words and phrases that I want the ad items to contain. If an item doesn't contain one of those words or phrases, I get rid of it. Here are the criteria for the words and phrases I put into that filter list:

  1. I do buy this. Contains general keywords and brands I'm positively interested in.
  2. I want to see these things.
  3. Contains things I buy very often.
  4. I might buy this - Contains general keywords. Nothing brand-specific unless I'm highly likely to engage with the brand. Contains only things I buy rarely. The most important aspect of this list is that it will filter out fully branded products that don't say what they are. For each item in this file, I should know what it is, be able to imagine the product, and be able to say, "I might buy that".
  5. I don't necessarily want to see these things, but I wouldn't care if I do see them, and in some rare situations I do want to see them.

Everything outside of that I will throw away so I don't have to label it.
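
The filter itself is just substring matching against tDescription. A minimal sketch (the phrase list here is illustrative):

def keep_item(tdescription: str, phrases: list[str]) -> bool:
    """Keep an ad item only if it mentions a word or phrase from my filter list."""
    text = tdescription.lower()
    return any(p in text for p in phrases)

items = [
    {"tDescription": "chocolate covered strawberries"},
    {"tDescription": "bud light 12 pack"},
]
filter_phrases = ["strawberries", "chips", "beef"]  # illustrative

items_to_label = [i for i in items if keep_item(i["tDescription"], filter_phrases)]
print(items_to_label)  # only the strawberries item survives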

After making this change, the accuracy of my model went slightly down to 58%, but I suppose that's to be expected since I have less training data and the "easy targets" like alcohol are no longer bumping up my numbers. I'll continue to add training data and see how that affects the model in the future.

The "Bitter Lesson"

I should note that I only became aware of the Bitter Lesson after doing all that feature engineering.

"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin... Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation"

"We have to learn the bitter lesson that building in how we think we think does not work in the long run. 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning."

How did my model perform when I just used the huge uncurated list of all 1-, 2-, and 3-grams as the features? Only 40% accurate. From this, I am guessing that there is a lot of noise interspersed among the important features. The signal-to-noise ratio is too low, and overcoming that would require more labeled data. Will this model perform better than the model with curated features if I have more data? I don't know, maybe. This entire process has made me realize that a lot of machine learning isn't exactly science - it's partly art. It's partly try something and see. Over time you may develop an intuition for what will improve it. This talk from Creikey - Deep Learning and Computer Vision for Game Developers – BSC 2025 - is somewhat relevant (especially at timestamp 25:00). There's also the X comment about YOLO hyperparameter training at OpenAI.

Future Changes

I'm going to continue to classify data and improve the model, but here are some other things I might investigate in the future.

  1. If the RandomForestClassifier is basically a big if/then/else, is there a way to interrogate the model and use it to help uncover the classification logic? Can we gain insight from it as opposed to treating it as a black box? Can we ask the model what ngrams it uses to classify things with label X? (See the sketch after this list for one starting point.)

  2. What happens to the model accuracy if we only use the most popular ngrams? How about the least popular ngrams? What happens if we graph the ngram-filters against the performance? Where is the maximum?

  3. What happens if we run our data through spacy and only use the prepositional objects and nouns as the ngrams?
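
On that first question, scikit-learn exposes at least a partial answer: OneVsRestClassifier keeps one fitted classifier per label in its estimators_ attribute, and each RandomForestClassifier has a feature_importances_ array. A minimal sketch, reusing the model, mlb, and X names from the training sketch above:

import pandas as pd

# One forest per label; its feature importances hint at which ngrams drive that label
for label, forest in zip(mlb.classes_, model.estimators_):
    importances = pd.Series(forest.feature_importances_, index=X.columns)
    print(label, importances.nlargest(5).to_dict())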