- Gathering the data, Notebook
- Make an exploratory data analysis, Notebook
- Cleanse and prepare the data, Notebook
- Cluster the data, Notebook
- Interpret the data, Notebook
The goal of this project is to go through an end-to-end clustering project, from getting the data to putting the generated insights to use. An imaginary use case would be to build a recommender system for chocolate on top of the clusters: "If you like this chocolate, you might like those as well." I chose the chocolate context because I have much more contact with chocolates than with customers or end users.
The project is built upon data from the public API of the U.S. Department of Agriculture; the API documentation can be found at https://fdc.nal.usda.gov/api-guide.html. The data is accessed via the REST API and stored in a local PostgreSQL database. You can also see the raw data from the API here.
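As a minimal sketch of the data-gathering step, the snippet below builds a search request against the FoodData Central API using only the standard library. The query parameter names follow the public API guide; the search term and page handling are illustrative assumptions, and you would need your own API key.

```python
import json
import urllib.parse
import urllib.request

# Search endpoint of the FoodData Central API (see the API guide).
API_URL = "https://api.nal.usda.gov/fdc/v1/foods/search"

def build_search_url(api_key: str, query: str,
                     page_number: int = 1, page_size: int = 200) -> str:
    """Assemble the URL for one page of search results."""
    params = {
        "api_key": api_key,
        "query": query,
        "pageNumber": page_number,
        "pageSize": page_size,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)

def fetch_page(api_key: str, query: str, page_number: int = 1) -> dict:
    """Request one page of results and return the parsed JSON payload."""
    with urllib.request.urlopen(build_search_url(api_key, query, page_number)) as resp:
        return json.load(resp)
```

From there, each JSON payload can be written into the local PostgreSQL database with any client library before cleansing begins.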
The next steps are performing an exploratory data analysis as a foundation for the data cleansing step, and the data cleansing itself, which includes a wide range of adjustments, e.g. extracting data stored in lists. The cleansed and fully prepared data is stored in the database again. A cleansed but non-encoded version of the data, suitable for visualization or other projects, can be found here as CSV.
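One cleansing step mentioned above, extracting data stored in lists, can be sketched as follows: the API nests nutrient records per food, which end up as lists of dicts inside a single DataFrame column. The column and nutrient names below are assumptions for illustration, not the project's actual schema.

```python
import pandas as pd

# Toy stand-in for the raw API data: one list-valued column per food.
raw = pd.DataFrame({
    "description": ["Dark chocolate", "Milk chocolate"],
    "foodNutrients": [
        [{"nutrientName": "Sugars", "value": 24.0},
         {"nutrientName": "Protein", "value": 7.8}],
        [{"nutrientName": "Sugars", "value": 51.5},
         {"nutrientName": "Protein", "value": 6.3}],
    ],
})

def extract_nutrient(nutrients: list, name: str):
    """Pull a single nutrient value out of the list of nutrient dicts."""
    for entry in nutrients:
        if entry["nutrientName"] == name:
            return entry["value"]
    return None

# Flatten each wanted nutrient into its own numeric column.
for nutrient in ["Sugars", "Protein"]:
    raw[nutrient] = raw["foodNutrients"].apply(extract_nutrient, args=(nutrient,))

clean = raw.drop(columns="foodNutrients")
```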
The last part is the clustering itself. Two algorithms, KMeans and DBSCAN, were applied to different subsets of the dataset based on the first clustering results. For both, comprehensive hyperparameter tuning was implemented in order to obtain an optimal result.
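The kind of hyperparameter search described above can be sketched like this: scan the number of clusters for KMeans and keep the setting with the best silhouette score. The synthetic data and the scanned range are illustrative, not the project's actual setup.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a known cluster structure.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)

# Scan k and keep the best silhouette score (higher is better, max 1.0).
best_k, best_score = None, -1.0
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

print(best_k, round(best_score, 3))
```

The same scan-and-score pattern carries over to DBSCAN, only with `eps` and `min_samples` as the tuned parameters.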
Side note: I made a Python script that includes all steps for requesting and cleaning the data in a straightforward way here.
Unfortunately, neither clustering algorithm was able to find meaningful clusters. In summary, there is one big cluster, i.e. no cluster at all. For more information, please see the clustering notebook.
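The "one big cluster" outcome is easy to detect programmatically: count the distinct labels a clusterer returns (for DBSCAN, label -1 marks noise and is excluded). The data and `eps` value below are illustrative assumptions.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# A single dense blob, mimicking data without real cluster structure.
X, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.5, random_state=0)

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
# Number of clusters, ignoring the DBSCAN noise label -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)
```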
Nevertheless, I still believe one can find interesting information through visualization.