
A scholarly Python endeavor examining the impact of PCA, t-SNE, and UMAP on PubMed data clustering 📈, with BBC News/Web Content as optional datasets. It scrutinizes how dimensionality reduction influences K-means cluster fidelity, aiming for robust analytical insights.


AbirOumghar/Dimensionality-Reduction-Analysis


Advanced Clustering and Dimensionality Reduction Analysis

This Python project examines how well three dimensionality reduction techniques (PCA, t-SNE, and UMAP) combine with K-means clustering on the PubMed and Web Content datasets. It evaluates how much of the intrinsic data structure each method preserves after reduction, and what that implies for their use in complex data analysis.

Overview

  • Data Exploration: Detailed examination and preprocessing of datasets to ensure quality and relevance.
  • Methodological Application: Implementation of PCA, t-SNE, and UMAP to reduce the data to fewer dimensions and visualize it.
  • Clustering Analysis: Use of K-means to cluster the reduced datasets, evaluating cluster coherence and separation.
  • Performance Evaluation: Analysis of dimensionality reduction and clustering results using metrics such as silhouette scores.
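
The steps above can be sketched as a reduce, cluster, and score loop. This is a minimal illustration, not the project's actual code: it assumes scikit-learn is available, uses synthetic `make_blobs` data as a stand-in for vectorized text, and the helper name `reduce_and_cluster` and its default parameters are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def reduce_and_cluster(X, n_components=2, n_clusters=4, seed=0):
    """Project X with PCA, cluster the projection with K-means, and score it."""
    # Step 1: reduce the feature space to n_components dimensions.
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(X)
    # Step 2: cluster the reduced points.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(reduced)
    # Step 3: silhouette score in [-1, 1]; higher means tighter, better separated clusters.
    score = silhouette_score(reduced, labels)
    return reduced, labels, score


# Synthetic stand-in data (assumption): 300 samples, 20 features, 4 groups.
X, _ = make_blobs(n_samples=300, n_features=20, centers=4, random_state=0)
reduced, labels, score = reduce_and_cluster(X)
print(reduced.shape, round(float(score), 3))
```

Swapping `PCA` for another reducer that exposes `fit_transform` would keep the rest of the function unchanged, which makes it straightforward to compare methods on equal terms.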

Datasets

  1. PubMed 20k RCT: Clinical research article abstracts categorized into stages of clinical study.
  2. Web Content: Varied web text data spanning 16 categories, from education to e-commerce.

Objectives

  • Dimensionality Reduction: Application and critical analysis of PCA, t-SNE, and UMAP as data compression methods.
  • Clustering Quality: Integration with K-means clustering to assess how reduction changes cluster fidelity.
  • Hyperparameter Optimization: Examination of how parameters affect each method's performance, with emphasis on perplexity (t-SNE) and neighbor count (UMAP).
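
A hyperparameter study of this kind might look like the sweep below, which varies t-SNE perplexity and records the silhouette score of K-means on each embedding. It is a sketch under stated assumptions: scikit-learn is assumed to be installed, the data is synthetic, and the function name `sweep_perplexity` and the perplexity values are illustrative rather than the ones used in the project. The same loop shape would apply to UMAP's `n_neighbors` parameter from the umap-learn package, which is not exercised here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score


def sweep_perplexity(X, perplexities, n_clusters=3, seed=0):
    """Return {perplexity: silhouette score of K-means on the t-SNE embedding}."""
    results = {}
    for p in perplexities:
        # Perplexity must be smaller than the number of samples.
        embedding = TSNE(
            n_components=2, perplexity=p, init="pca", random_state=seed
        ).fit_transform(X)
        labels = KMeans(
            n_clusters=n_clusters, n_init=10, random_state=seed
        ).fit_predict(embedding)
        results[p] = float(silhouette_score(embedding, labels))
    return results


# Synthetic stand-in data (assumption): 150 samples, 10 features, 3 groups.
X, _ = make_blobs(n_samples=150, n_features=10, centers=3, random_state=0)
scores = sweep_perplexity(X, perplexities=(5, 30, 50))
for p, s in scores.items():
    print(f"perplexity={p}: silhouette={s:.3f}")
```

Note that a silhouette score computed in the embedded space rewards visually compact clusters, so it is best read alongside a measure that compares against the original data or known labels.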

Key Findings

  • Method Efficacy: Each method showed distinct strengths: PCA for linear relationships, t-SNE and UMAP for complex local structure.
  • Optimal Parameterization: Hyperparameter fine-tuning proved important for the quality of the lower-dimensional representations.
  • Quantitative Metrics: Silhouette scores and agreement coefficients enabled a comparative analysis, in which UMAP showed the best balance in structure preservation.
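
The README does not say which agreement coefficient was used. As one plausible choice, the sketch below computes the adjusted Rand index (ARI) between two labelings, for example K-means labels before and after reduction. The function name `adjusted_rand_index` is hypothetical, and the implementation uses only the standard library.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)
    # Pair counts within each contingency cell, each row, and each column.
    pairs = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: partitions trivially agree
    expected = sum_a * sum_b / total_pairs
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # both partitions are trivial (all-in-one or all singletons)
    return (sum_cells - expected) / (max_index - expected)


# Same grouping with swapped label names: perfect agreement.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Groupings that cut across each other: worse than chance.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)  # 1.0 -0.5
```

Because ARI ignores the label names themselves, it can compare cluster assignments from different runs or different embeddings without first aligning cluster IDs.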
