This repository contains the solution and code for the 1st Homework Assignment of the "Clustering Algorithms" graduate course, taught by Professor Konstantinos Koutroumbas in the academic year 2023-2024. The course is part of the MSc in Data Science & Information Technologies (Bioinformatics - Biomedical Data Science Specialization) of the Department of Informatics and Telecommunications, National and Kapodistrian University of Athens (NKUA).
The assignment explores theoretical and computational approaches to clustering using algorithms like k-means and k-medians, applied to synthetic datasets. The tasks include mathematical proofs, algorithmic derivations, and implementation of clustering methods using MATLAB.
- Theoretical Exercises: Analysis and derivations of clustering cost functions and algorithms, such as hard and possibilistic k-medians.
- Computational Tasks: MATLAB scripts implementing and evaluating clustering algorithms on synthetic datasets.
- Generated Results: Visualizations and quantitative analyses comparing clustering performance metrics.
- `exercise4a.m` and `exercise4b.m`: Scripts for generating datasets and running k-means/k-medians on basic 2D distributions.
- `exercise5a.m` and `exercise5b.m`: Extensions of clustering with noise and outliers added to the dataset.
- Dataset Generation:
  - Synthetic datasets of 2D points are generated from multivariate normal distributions with predefined means and covariances.
- Clustering Algorithms:
  - k-means: Minimizes the sum of squared Euclidean distances between points and their cluster representatives.
  - k-medians: Minimizes the sum of Manhattan (L1) distances, which makes it more robust to outliers than k-means.
- Visual and Quantitative Analysis:
  - Points and cluster centers are plotted for each dataset.
  - The estimated cluster representatives are compared with the true distribution means.
- Handling Noise and Outliers:
  - Additional noisy points are introduced to evaluate algorithmic robustness.
- Visualization of cluster assignments and centers.
- Quantitative comparison of representative means to true cluster centers.
- Evaluation of robustness to outliers and initialization strategies.
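
For reference, the two objectives named above can be written out as follows. The notation is an assumption made here for illustration ($x_i$ for data points, $C_j$ for the set of points assigned to cluster $j$, $\theta_j$ for its representative) and may differ from the symbols used in the report:

$$
J_{\text{k-means}} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \theta_j \rVert_2^2,
\qquad
J_{\text{k-medians}} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \theta_j \rVert_1
$$

For a fixed assignment of points to clusters, the first objective is minimized by taking each $\theta_j$ as the mean of $C_j$, and the second by taking each $\theta_j$ as the coordinate-wise median of $C_j$. Because a median is far less sensitive to a few extreme values than a mean, this is where the difference in outlier robustness comes from.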
To clone this repository, run the following command in your terminal:
git clone https://github.com/GiatrasKon/Clustering_Algorithms_Analytical_and_Computational
- MATLAB installed on your system.
- Open MATLAB.
- Navigate to the cloned repository folder.
- Run the desired MATLAB scripts (`exercise4a.m`, `exercise4b.m`, etc.) to generate datasets, run clustering algorithms, and visualize results.
- Modify dataset parameters (e.g., number of points, distributions) within the MATLAB scripts for custom experiments.
- Execute scripts in MATLAB for automated clustering and visualization.
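
As a rough orientation for custom experiments, the pipeline described above (generate Gaussian clusters, add outliers, cluster, compare representatives with the true means) could be sketched as below. This is not the code from the repository's exercise scripts: every parameter value and variable name is a hypothetical choice made for illustration. The sketch uses only base MATLAB, with `randn` and `chol` in place of toolbox samplers, and assumes R2016b or later for implicit expansion.

```matlab
% Illustrative sketch only: all values and names below are assumptions,
% not the settings used in exercise4a.m / exercise5a.m.
rng(0);                                    % fixed seed for reproducibility

% --- Dataset: three 2D Gaussian clusters ---
trueMeans = [0 0; 6 0; 3 6];               % one row per cluster (assumed)
Sigma     = [1 0.3; 0.3 1];                % shared covariance (assumed)
nPerClass = 100;
R = chol(Sigma);                           % Sigma = R' * R
X = [];
for j = 1:size(trueMeans, 1)
    % rows of randn(...) * R have covariance R' * R = Sigma
    X = [X; randn(nPerClass, 2) * R + trueMeans(j, :)]; %#ok<AGROW>
end

% --- Noise: uniform outliers over [-20, 20] x [-20, 20] (assumed range) ---
nOut = 20;
X = [X; 40 * rand(nOut, 2) - 20];

% --- Alternating assignment/update loop for both algorithms ---
k       = size(trueMeans, 1);
names   = {'k-means', 'k-medians'};
dists   = {@(P, c) sum((P - c).^2, 2), ... % squared Euclidean distance
           @(P, c) sum(abs(P - c), 2)};    % Manhattan (L1) distance
updates = {@(P) mean(P, 1), ...            % mean minimizes squared L2 cost
           @(P) median(P, 1)};             % coordinate-wise median minimizes L1 cost
init = X(randperm(size(X, 1), k), :);      % same random start for both runs

for a = 1:2
    theta = init;
    for it = 1:100
        % Assignment step: nearest representative under the chosen distance
        D = zeros(size(X, 1), k);
        for j = 1:k
            D(:, j) = dists{a}(X, theta(j, :));
        end
        [~, labels] = min(D, [], 2);

        % Update step: recompute each non-empty cluster's representative
        newTheta = theta;
        for j = 1:k
            if any(labels == j)
                newTheta(j, :) = updates{a}(X(labels == j, :));
            end
        end
        if isequal(newTheta, theta), break; end   % converged
        theta = newTheta;
    end

    % Distance from each true mean to its nearest estimated representative
    err = zeros(1, k);
    for j = 1:k
        err(j) = min(sqrt(sum((theta - trueMeans(j, :)).^2, 2)));
    end
    fprintf('%s: distance from each true mean to nearest representative: %s\n', ...
            names{a}, mat2str(err, 3));
end
```

Both runs start from the same randomly drawn initial representatives, so any difference in the printed distances reflects the update rule rather than the initialization. Changing `nOut`, `trueMeans`, or `Sigma` is one way to probe how each algorithm reacts to heavier noise or overlapping clusters.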
Refer to the `documents` directory for the assignment description and report.