From the course: AWS Certified Data Analytics – Specialty (DAS-C01) Cert Prep: 4 Analysis and Visualization

Cleaning data

One of the more common tasks you'll run into with dirty data is needing to preprocess it before you can do anything useful with it. Let's take this first scenario, where you have some data and you want to do some exploratory data analysis on it. Before you even get to that step, one of the things you could do is look for missing values. For example, in pandas you could check for null values with something like df.isnull(). If you did find them, it would be up to you as the data scientist to decide what to do. Do you want to drop the row, for example? Maybe you have so much data that it's better to just drop those rows, or you have a small amount of data and you'd rather impute the value, for example by taking the median value for that column and filling it in.

In a second scenario, natural language processing, it's really common to do preprocessing like tokenization and removing stopwords. One case a data scientist would encounter here is preprocessing before you do sentiment analysis. So for example, you'd remove words like "the," "and," and "or," because they'll reduce the effectiveness of what you're doing for NLP.
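Here's a rough sketch of both scenarios in Python, just to make them concrete. The column names, the toy data, the tiny STOPWORDS set, and the preprocess helper are all made up for illustration and aren't from the course. In a real project you'd likely pull a fuller stopword list from an NLP library.

```python
import re

import pandas as pd

# Scenario 1: missing values (hypothetical toy data).
df = pd.DataFrame(
    {
        "age": [25.0, None, 40.0, 31.0],
        "city": ["Austin", "Boston", None, "Denver"],
    }
)

# Count the missing values in each column.
missing_per_column = df.isnull().sum()

# Option A: drop every row that has at least one missing value.
dropped = df.dropna()

# Option B: impute a numeric column with its median instead of dropping.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())

# Scenario 2: simple NLP preprocessing before sentiment analysis.
# Illustrative stopword set only; an assumption, not a standard list.
STOPWORDS = {"the", "and", "or", "a", "an"}


def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


tokens = preprocess("The food was great and the service was friendly")
```

Whether you drop or impute is a judgment call that depends on how much data you have, so the sketch keeps both results side by side instead of picking one.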
