A price comparison between the three major Australian Supermarkets, Aldi, Coles and Woolworths to identify the best strategy for saving money on groceries.

Problem Statement

With the rising cost of living affecting many Australians, many people are turning to budgeting to make their money go further. One of the key areas that people aim to save money is through reducing unnecessary spending on groceries. This report aims to show which major Australian supermarket chains are best to shop both overall and for various food categories. This in turn will help shoppers make the best financial decisions when it comes to their grocery needs.

Objectives & Success Metrics

The objective of this report is to give the best supermarket chain for each food category that maximises savings. In this report, this will be measured by the supermarket that performs the best over of the most food categories.

Data Overview

  • Source: External scraper (grocery-scraper GitHub project) provides raw product data
  • Storage: cleanProductInfo.json - array of product objects with nested price history
  • Schema (each product):
    {
      "name": "Product Name",
      "url": "https://supermarket.com.au/product/...",
      "img": "https://cdn.../image.jpg",
      "quantity": <number>,
      "history": [
        {"daySinceEpoch": <days>, "price": <AUD>}
      ]
    }
    
  • Size: 14705 unique products with historical price observations, broken down as follows:
    • Aldi: 213
    • Coles: 8614
    • Woolworths: 5878

A key limitation of the data is that it firstly does not include IGA, and it contains only a limited number of Aldi products. This data also does not include IGA data. The other important limitation is that the dataset is over two years old. The reasons for this are discussed in Challenges & Lessons Learned. Manual checking over some of the products revealed that prices do not seem to differ substantially at the time of writing this.

Data Cleaning & Preparation

Most of the data cleaning took place in Python. The first issue was to convert any special symbols in the text to a more standard ASCII format (this will be important for feeding this information to a clustering algorithm). The next important issue is to take the JSON file, and normalise it, and explode the history array. This converts the data into a more workable format like so:

name url img quantity daySinceEpoch price

There is no need for the image section of the data so it is discarded. The daySinceEpoch is converted into a more usable date format. Finally, since there is no company category, the url will be used to create a new company section for the data. This renders the following result:

name company quantity daySinceEpoch price url

The final issue to solve is categorising the food items, and to do so consistently across all of the companies. The url and name columns are considered here. Ultimately it is decided only name is needed to complete this process. Since all of the data is combined into one data panel, this later concern is mostly sorted, however Aldi will prove to be somewhat problematic due to the low sample size. A linguistic clustering approach is used to categorise the data as follows:

  • Bakery: 249
  • Beverages: 674
  • Cereals: 1605
  • Condiments: 1487
  • Dairy: 4725
  • Frozen: 1035
  • Grains: 1042
  • Meat: 955
  • Other: 608
  • Produce: 1321
  • Seafood: 343
  • Snacks: 661

The last piece of data necessary for this report is how much products cost for their given weight. This is a trivial calculation involving the price of the product divided by the quantity/weight of the product in kilograms. The rest of the report will focus on the categories:

  • company (text)
  • category (text)
  • quantity (Kg)
  • price (AUD)
  • pricePerKg (AUD)

Assumptions

One of the assumptions of the data that is used for the final visuals is that certain food items make more sense to analyse by weight, and others make more sense to analyse per item. The three items that were chosen for price per kilogram are, meat, produce and seafood. These items typically are charged to the customer by weight while at the check-out counter, and since this report aims to show Australian shoppers how to save money on food items, it makes more sense to display the prices that will appear on the shopping receipt.

Another assumption is that the category “quantity” in the raw data refers to the product weight in kilograms only. This means that it is even more important to limit the food categories where this price type is given, since categories like beverages are usually more relevantly measured in litres or milliliters.

Exploratory Data Analysis (EDA)

Final visualisations: PDF Image screenshot Fullscale PDF link

The first visual to point out is the centre top visual which shows the number of products by company. Clearly, the results for Aldi are likely to be biased by the low sample size. This is confirmed by the lack of seafood items for Aldi as seen in the boxplot and the vertical bar chart. Because of this, any results around Aldi should be taken with a grain of salt.

The next visual in the top left corner shows the mean food price by company and food category. The three food areas that have the highest prices are the price per kilogram category prices, so these will be grouped together. Looking at these three, and excluding Aldi, it seems that Coles and Woolworths are relatively balanced with each other except for the seafood section, which seems to be around two dollars higher. Returning to the other sections, Once again Coles and Woolworths seem to be relatively balanced between each other, with price differences roughly being within a dollar of each other. Aldi, the outlier company varies more wildly. The category with one of the biggest relative differenes is beverages, with Aldi being more than half the price of Woolworths, the second lowest. This could be due to alchoholic beverages not appearing in the skewed Aldi dataset. The next biggest category difference is cereals, where Aldi seems to be considerably more expensive than the other companies, which could once again be due to low sample sizes.

Finally, the last three boxplots at the bottom of the page will be analysed together. Since Aldi has such a low sample size, there are few outliers that could bias the averages in the previous chart. However, for Coles and Woolworths, it is clear that the mean is likely to be significantly biased by the high number of expensive food items for the unit priced items. Due to this, an automatically lower average will not be what determines where is the best place to shop, with additional factors like the range and median (both displayed in the boxplots) being considered also alongside the aforementioned outliers. It is also worth noting how small the outliers are for Aldi’s beverage prices compared to the other supermarkets, which helps give evidence to the alcohol hypothesis mentioned earlier. One last thing to note is that these plots are not scaled relative to each other.

Methodology

This section will go over more techincal techniques and tools and justify why they were used.

Data Cleaning

The tool used for cleanind the data was Python, using the json and pandas packages. The json package was needed to convert the text in ASCII. The pandas library is the standard choice for working with panel data in Python, and has features such as JSON normalisation. Python will also be used for the clustering algorithm, so it is important to have it all in one program.

Categorising Food

In order to categorise the food groups using the food names as discussed in Challenges & Lessons Learned, there are three key approachs that can be used:

  1. Traditional text recognition system that
  2. Using an LLM model to categorise the data based on the most important tokens
  3. Use TF-IDF for cluser labelling

The third option was used along with using Microsoft Copilot in order to generate a based list of category seed examples for the linguistic clustering algorithm. How the algorithm works will not be explained here, but rather the results, and advantages to this approach over the other methods. The traditional approach is highly limiting especially with the size of the data that is being worked with. This type of process involves techniques such as frequency matching user specific words such as “milk”, “meat”, “eggs”, etc. for the various name strings in the data set. Not only is this a time poor approach on the side of the analyst, it also introduces substantial human error through things such as not accounting for various synonyms in the dataset. The second approach that utilises LLMs requires too much overhead, especially for a large dataset, and thus does not scale well beyond small to medium datasets. Thus, a linguistic clustering approach is best suited here. Cosline similarity is then used to select the best categories on the trained model. Category examples are used to help sort the food items (the specific examples can be found in the source file linked in the [GitHub]{#github} section).

The sklearn Python package provides easily accessible methods for training the data and predicting various categories.

Initial Data Analysis & Presentation

Excel was used since features like pivot tables and visualisations make it easy to quickly test various visuals and numbers in a more rough format. While it is possible to present the final results in the final Excel spreadsheet, it is often more difficult to create a more visually engaging visual report in Excel on a dashboard or even as a static printout. Thus Power BI was used for the presentation.

Power BI

To see the finer workings, please see the template provided in the [GitHub]{#github} link. One key area to touch on here is the use of R for the box plots. Since Power BI does not have an in-built boxplot function and installing add-ons requires a work or school account, a simple R script was used to create boxplots.

Results

Based on the [EDA]{#eda} section and considering all of the data (not just the biased mean), along with the consideration of Aldi’s limited sample size, the best place to shop is,

  • Aldi: Beverages (non-alcoholic)
  • Coles: Bakery, Cerals, Condiments, Frozen, Snacks
  • Woolworths: Beverages (including alcohol), Dairy, Seafood
  • Tied Coles & Woolworths: Grains, Meat, Other, Produce

Overall, while Aldi seems to be cheaper, the low sample size makes this conclusion uncertain. Thus, the best store to visit based on number of cheapest food categories is Coles.

Challenges & Lessons Learned

The first major challenge that this project faced was data scraping. Woolworths has recently added harsher anti-scraping protections to their website. Beyond any ethical considerations of scraping a website that is trying to discourage such a practice, it also requires a significant time investment in order to create a new tool or update any existing tools in order to obtain this information. If this tool were to be used and run successfully, it would allow for the usage of the full catelogue of Aldi, Coles, IGA and Woolworths. However, the various technical, legal and ethical set backs to this idea rendered such an approach unfeasible for the current scope of the project.

A huge issue with the project is the limited catelogue of food items that could be analysed for Aldi. While Aldi does have a smaller range, it is more likely that this issue was due to the data source itself.

Future Improvements

The best next step beyond making a more interactive dashboard (only possible with a school or work account) would be to give more metrics, and allow toggling price per kilogram real time. Some other metrics could be a trimmed mean to remove the mainly price outliers seen in the box plot.

Tools & Tech Stack

  • Python
    • pandas
    • numpy
    • sklearn
    • json
    • urllib
    • unidecode
  • Microsoft Excel
  • Microsoft Power BI
  • R (used with Power BI)
    • tidyverse
  • Git
  • Microsoft Copilot
  • ChatGPT

GitHub

portfolio-src/Aus-Supermarket-Comparison/ Note, all data has been stripped from here. The raw Python code and Power BI template can be found here.

Data Source

Data source Raw JSON data source