Commit
This commit aims to reduce label bias in a dataset using statistical analysis and natural language processing. The team first imports the dataset and plots the distribution of the bias labels, which shows a strong skew towards one label (82.2% to 17.8%). To mitigate it, they compute mean ratings and positive feedback counts for each department category, then update the dataset by reassigning labels according to each record's deviation from its department-specific mean values. This relabeling yields a more balanced distribution (62% to 38%). Further preprocessing applies text normalization, stemming, and lemmatization, which shrink the feature space by collapsing variant word forms. TF-IDF vectorization then converts the text into term frequency-inverse document frequency weights. The resulting dataset is more balanced and better suited to subsequent analysis and modeling tasks.
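The relabeling and TF-IDF steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit. The field names (`department`, `rating`, `positive_feedback`, `text`), the sample records, and the helper functions are hypothetical. The relabeling rule (label 1 when a record meets or exceeds both department means) is one plausible reading of "deviation from department-specific mean values". The sketch uses only the standard library, and it lowercases and splits on whitespace in place of the stemming and lemmatization steps.

```python
import math
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical records; the field names are assumptions, not taken from the commit.
reviews = [
    {"department": "Tops", "rating": 5, "positive_feedback": 4, "text": "Lovely soft fabric"},
    {"department": "Tops", "rating": 3, "positive_feedback": 0, "text": "Fabric felt thin"},
    {"department": "Dresses", "rating": 4, "positive_feedback": 2, "text": "Dress fits well"},
    {"department": "Dresses", "rating": 2, "positive_feedback": 6, "text": "Dress runs small"},
]


def department_means(rows):
    """Return {department: (mean rating, mean positive feedback count)}."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["department"]].append(row)
    return {
        dept: (
            mean(r["rating"] for r in group),
            mean(r["positive_feedback"] for r in group),
        )
        for dept, group in grouped.items()
    }


def relabel(rows):
    """Assign label 1 when a record meets or exceeds both department means.

    Assumed rule: the commit message does not spell out the exact criterion.
    """
    means = department_means(rows)
    relabeled = []
    for row in rows:
        mean_rating, mean_feedback = means[row["department"]]
        label = int(
            row["rating"] >= mean_rating
            and row["positive_feedback"] >= mean_feedback
        )
        relabeled.append({**row, "label": label})
    return relabeled


def label_shares(rows):
    """Return the fraction of records carrying each label."""
    counts = Counter(row["label"] for row in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def tf_idf(docs):
    """Unsmoothed TF-IDF: (term count / doc length) * log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Count each term once per document to get its document frequency.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


labeled = relabel(reviews)
print(label_shares(labeled))
weights = tf_idf([row["text"] for row in labeled])
print(weights[0])
```

Comparing `label_shares` before and after relabeling is how a shift such as 82.2%/17.8% to 62%/38% would be measured. Library vectorizers such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their weights will differ numerically from this unsmoothed variant.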