From the course: Machine Learning with Python: Foundations

Normalizing your data - Python Tutorial

- [Narrator] An ideal dataset is one that has no missing values and no values that deviate from the expected. Such a dataset hardly exists, if at all. In reality, most datasets have to be transformed, or have data quality issues that need to be dealt with, prior to being used for machine learning. This is what the third stage in the machine learning process is all about: data preparation. Data preparation is the process of making sure that our data is suitable for the machine learning approach we choose to use. Specifically, data preparation involves modifying or transforming the structure of our data in order to make it easier to work with. One of the most common ways to transform the structure of data is known as normalization, or standardization. The goal of normalization is to ensure that the values of a feature share a particular property; this often involves scaling the data to fall within a small or specified range. Normalization is required by certain machine learning algorithms, it reduces the complexity of our models, and it can make our results easier to interpret.

There are several ways to normalize data. The first is known as z-score normalization. Z-score normalization, which is also known as zero-mean normalization, gets its name from the fact that the approach results in normalized values that have a mean of zero and a standard deviation of one. Given an original value V of feature F, the normalized value V' is computed as V minus the feature mean, denoted F bar, divided by the standard deviation of the feature, sigma F. To illustrate how z-score normalization works, let's consider the feature with five values shown here. The mean of these values is 40,800, and the standard deviation is 33,544. To normalize the fourth value, we subtract 40,800 from 40,000 and divide the result by 33,544. This yields negative 0.024. Using the same approach for the other values of the feature, the normalized values will now be -0.859, -0.5, -0.322, -0.024, and 1.705. Note that the mean of the normalized values is zero and the standard deviation is one.

In most instances, z-score normalization works well. However, some problems and certain machine learning algorithms require that our data have a lower and an upper bound, such as zero and one. For that, we need min-max normalization. With min-max normalization, we transform the original data to a new scale defined by user-defined lower and upper bounds. Most often, the new boundary values are zero and one. Mathematically, this transformation is represented as shown here, where V' is the normalized value, V is the original value, minF is the minimum value for the feature, maxF is the maximum value for the feature, upperF is the user-defined upper bound, and lowerF is the user-defined lower bound. To illustrate how min-max normalization works, let's consider the same set of five values. Assuming that we set the upper bound to one and the lower bound to zero, to normalize the third value, 30,000, we apply the min-max normalization formula, which yields 0.209. Using the same approach for the remaining four values, the min-max normalized values for the feature will now be 0, 0.14, 0.209, 0.326, and 1. Z-score and min-max normalization are usually suitable when there are no significant outliers in our data. If there are outliers in our data, a more suitable approach is log transformation.
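Before moving on to log transformation, here is a minimal sketch of z-score and min-max normalization in Python. The five feature values are not listed explicitly in the narration, so the series below (12,000, 24,000, 30,000, 40,000, and 98,000) is an assumption reconstructed from the mean, standard deviation, and normalized results quoted above; the pandas calls are standard, but the variable names are illustrative only.

import pandas as pd

# Example feature values (an assumption reconstructed from the numbers quoted
# in the narration: mean 40,800 and sample standard deviation 33,544).
values = pd.Series([12000, 24000, 30000, 40000, 98000])

# Z-score normalization: subtract the feature mean and divide by the standard
# deviation. pandas' std() uses the sample standard deviation, which matches
# the 33,544 mentioned in the narration.
z_scores = (values - values.mean()) / values.std()
print(z_scores.round(3).tolist())   # [-0.859, -0.501, -0.322, -0.024, 1.705]

# Min-max normalization with a user-defined lower bound of 0 and upper bound of 1.
lower, upper = 0, 1
min_max = (values - values.min()) / (values.max() - values.min()) * (upper - lower) + lower
print(min_max.round(3).tolist())    # [0.0, 0.14, 0.209, 0.326, 1.0]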
With log transformation, we replace the values of the original data with their logarithm, as shown here, where V is the original value for the feature and V' is the normalized value. The logarithm used for the log transform can be the natural logarithm, log base 10, or log base 2. The choice is generally not critical; however, it is important to note that log transformation works only for values that are positive. Applying log transformation to the fifth value in our example data, we get 4.991. Applied to the rest of the values, we now have 4.079, 4.380, 4.477, 4.602, and 4.991. Notice that this approach reduces the distance between the original outlier values, 12,000 and 98,000, and the rest of the data.
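A similar sketch for log transformation, again assuming the same five example values as above. Base 10 is used here via np.log10, but np.log (natural logarithm) or np.log2 would work the same way, as long as every value is positive.

import numpy as np
import pandas as pd

# Same example values as before (assumed, reconstructed from the narration).
values = pd.Series([12000, 24000, 30000, 40000, 98000])

# Log transformation: replace each value with its base-10 logarithm.
log_values = np.log10(values)
print(log_values.round(3).tolist())  # [4.079, 4.38, 4.477, 4.602, 4.991]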
