From the course: Machine Learning with Python: Foundations

How to normalize data in Python - Python Tutorial

- Part of the objective of data preparation is to transform our data in order to make it more suitable for machine learning. During this step, we often have to restructure some of our data so that it conforms to a particular characteristic. This is known as normalization. There are several ways to normalize data in Python. To illustrate how to normalize data, let's import and preview a sample vehicle emissions dataset into a data frame called vehicles.

Our goal is to normalize the CO2 emissions column. So let's get descriptive statistics for that column: vehicles, specify the column that we want, which is CO2 emissions, and we call its describe method. To augment our understanding of the summary statistics, let's also create a histogram that shows the distribution of values for the CO2 emissions column. The histogram visualizes what we already see in the summary statistics. The carbon emissions values in the dataset have minimum and maximum values of 29 and 1269.57, respectively. They also have mean and median values of 476.55 and 467.74, respectively.

The scikit-learn package provides several functions for transforming data in Python. For min-max normalization, we first import the MinMaxScaler object from the sklearn.preprocessing subpackage. So from sklearn.preprocessing, we import the MinMaxScaler object. Next, we use the fit_transform method of the object to normalize our data. So we're going to call our new data co2emissions_mm, and it's going to be the MinMaxScaler object, the fit_transform method. Within the method, we pass the data we want to transform, which is vehicles, and we want the CO2 emissions column. And then we output our results. Notice that our result is a NumPy array. We can convert it back to a data frame by using the pandas DataFrame constructor function. So let's write back to co2emissions_mm and call the pd.DataFrame constructor function.
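The steps described so far can be sketched roughly as follows. This is only an illustration under stated assumptions: the course dataset is not reproduced here, so the small vehicles data frame, its values, and the column name co2emissions are made-up stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the course's vehicles data; the values are made up.
vehicles = pd.DataFrame({"co2emissions": [29.0, 350.5, 467.74, 520.0, 1269.57]})

# Summary statistics for the column we want to normalize.
print(vehicles[["co2emissions"]].describe())

# fit_transform expects 2-D input, hence the double brackets around the column name.
co2emissions_mm = MinMaxScaler().fit_transform(vehicles[["co2emissions"]])

# The result is a NumPy array rather than a data frame.
print(type(co2emissions_mm))

# Min-max normalization rescales each value as (x - min) / (max - min).
x = vehicles["co2emissions"].to_numpy()
manual = (x - x.min()) / (x.max() - x.min())
print(np.allclose(co2emissions_mm.ravel(), manual))
```

With matplotlib installed, a call such as vehicles[["co2emissions"]].plot(kind="hist") would likely be one way to draw the histogram mentioned in the video; the plotting call actually used in the course is not shown in the transcript.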
To the pd.DataFrame constructor, we pass two things: our data, co2emissions_mm, and the value for the columns argument, which this time around will be just a list with the column name that we want, which is CO2 emissions. And we output our data once more. Now we can get summary statistics for our normalized data frame. To do so, we call the data frame co2emissions_mm, and we call the describe method of the data frame. We can also visualize it, which is what we have here. Based on the summary statistics and the visualization, we see that the minimum value is now zero, while the maximum value is one. That is what we expect for min-max normalization. However, notice that compared to the original data, even though the scale of the x-axis changed, the basic structure or shape of the histogram remains the same. That is also expected.

For z-score normalization, we import the StandardScaler object from the sklearn.preprocessing subpackage. So from sklearn.preprocessing, we import StandardScaler. Next, we normalize our data, convert it to a data frame, and compute summary statistics like we did in the previous example. Finally, we visualize the data as well. As expected, the basic structure of the histogram remained intact, even with the change to the scale of the x-axis. This time, we also notice that our minimum and maximum values are negative 3.8 and 6.7, respectively. Also note that the standard deviation is one, and the mean is effectively zero.
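A minimal sketch of the z-score steps, including the conversion back to a data frame with the columns argument, might look like this. As before, the vehicles data frame, its values, the column name co2emissions, and the variable name co2emissions_zm are assumptions made for illustration, not the course's own data or code.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the course's vehicles data; the values are made up.
vehicles = pd.DataFrame({"co2emissions": [29.0, 350.5, 467.74, 520.0, 1269.57]})

# Z-score normalization rescales each value as (x - mean) / standard deviation.
co2emissions_zm = StandardScaler().fit_transform(vehicles[["co2emissions"]])

# Convert the NumPy array back to a data frame, naming the single column.
co2emissions_zm = pd.DataFrame(co2emissions_zm, columns=["co2emissions"])

# Summary statistics for the normalized column.
print(co2emissions_zm.describe())

# StandardScaler divides by the population standard deviation (ddof=0),
# so that is the statistic that comes out as 1.
print(co2emissions_zm["co2emissions"].mean())
print(co2emissions_zm["co2emissions"].std(ddof=0))
```

One detail worth knowing: describe reports the sample standard deviation (ddof=1), so on a tiny sample like this one it prints a value somewhat above one. On a dataset with many rows the two versions are nearly identical, which is consistent with the standard deviation of one reported in the video.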
