From the course: Machine Learning with Python: Foundations

How to summarize data in Python - Python Tutorial

- [Instructor] During data exploration, one of the best ways to understand the nature of the data at hand is to summarize it by computing aggregations, such as the mean, median, maximum, and minimum. These aggregations, or statistical measures as they're commonly called, help describe both the general and the specific characteristics of the data. The pandas data frame provides several easy-to-use methods that help us describe and summarize data.
One of these methods is the info method. Given a data frame called washers, we can get a concise summary of its rows and columns by calling its info method. From the output, we can tell that the washers data frame has 261 rows and 18 columns. Five of the columns hold decimal values, three hold integer values, and 10 hold textual data. If we want a sneak peek at the data stored in the data frame, we can call its head method. As we can see from the output, the head method returns just the first five rows of the data frame. This gives us a high-level view of the data we're working with.
Now that we know what the data looks like, we can start to dive a little deeper into the nature of the values. The describe method of a data frame is useful for this. It returns a statistical summary of the columns in a data frame (by default, only the numeric ones). It's important to note that the descriptive statistics returned by the describe method depend on the data type of the column. For example, let's get the descriptive statistics for the non-numeric brand name column in the washers data frame. To do this, we select the column that we want, which is brand name, and then call the describe method on it. The output tells us that there are 261 non-missing values in the brand name column. It also tells us that there are 22 unique washer brands in the data. Of these, LG occurs most often, with 50 washers listed under the LG brand.
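The info, head, and describe calls narrated so far might look roughly like the sketch below. The transcript does not show the actual code or data, so the small washers data frame here is a made-up stand-in, and the column names BrandName, Volume, and Loads are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the course's washers data. The column names
# and values are invented; the real data has 261 rows and 18 columns.
washers = pd.DataFrame(
    {
        "BrandName": ["LG", "LG", "Samsung", "Beko", "Midea", "LG"],
        "Volume": [4.5, 5.0, 4.8, 2.0, 6.0, 4.3],
        "Loads": [6, 7, 6, 4, 8, 6],
    }
)

# Concise summary: number of rows, column names, non-null counts, dtypes.
washers.info()

# head() returns the first five rows by default.
first_rows = washers.head()
print(first_rows)

# For a text column, describe() reports count, unique, top, and freq
# rather than numeric statistics.
brand_summary = washers["BrandName"].describe()
print(brand_summary)
```

In this toy data, the top entry of the brand summary is the most frequent brand and freq is how many times it appears, which mirrors the "LG, 50 washers" reading described for the real data.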
To illustrate how the describe method works for numeric columns, let's get the descriptive statistics for the volume column in the washers data frame. We start by selecting the column that we want, volume, and then call the describe method. From the statistics, we can tell that the average, minimum, and maximum volumes of the washers in the data are 4.4, 1.9, and 6.2 cubic feet, respectively.
Instead of getting a prepackaged list of statistical measures, we can also compute specific aggregations for certain columns in a data frame. The pandas package provides several methods to do this. For example, we can get a count of each unique washer brand in the data frame. We start by selecting the column that we want, which is again brand name. We then call the value_counts method, which gives us the number of washers for each brand. Sometimes it's more useful to get a percentage rather than a count. To do so, we modify the code we ran in the previous cell: within the value_counts method, we set the normalize argument to True. Now we get the share of washers that belongs to each brand, expressed as a proportion. The output tells us that 18% of the washers in the data are Samsung washers.
For numeric columns, we can get specific aggregations as well. For example, we can compute the average volume of washers in the dataset. We select the volume column of the washers data frame and compute its mean, which is 4.37.
We can also get specific aggregations at the group level. For example, we can compute the average volume of washers by brand. To do this, we take the washers data frame, call the groupby method, and pass it the column we want to group by, which is brand name. Next, we select the column we want to aggregate, which is volume, and call the mean method. This result is sorted by brand. To help us better compare the average washer volumes across brands, let's sort the data in ascending order of average volume. To do this, we make a slight modification to our code.
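As a rough sketch of the steps described up to this point (numeric describe, brand counts, normalized counts, an overall mean, and a grouped mean), the code might look like the following. The data and the column names BrandName and Volume are hypothetical, since the transcript does not show the real ones.

```python
import pandas as pd

# Hypothetical stand-in for the washers data (names and values invented).
washers = pd.DataFrame(
    {
        "BrandName": ["LG", "LG", "Samsung", "Beko", "Midea", "LG"],
        "Volume": [4.5, 5.0, 4.8, 2.0, 6.0, 4.3],
    }
)

# For a numeric column, describe() reports count, mean, std, min,
# quartiles, and max.
volume_summary = washers["Volume"].describe()

# Number of rows for each unique brand, most frequent first.
brand_counts = washers["BrandName"].value_counts()

# normalize=True returns proportions between 0 and 1 instead of counts;
# multiply by 100 to read them as percentages.
brand_shares = washers["BrandName"].value_counts(normalize=True)

# A single aggregation on one column.
mean_volume = washers["Volume"].mean()

# Group-level aggregation: average volume per brand. Selecting with
# double brackets keeps the result as a data frame, ordered by brand.
mean_by_brand = washers.groupby("BrandName")[["Volume"]].mean()

print(volume_summary, brand_counts, brand_shares, mean_volume, mean_by_brand, sep="\n\n")
```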
We add a call to the sort_values method and tell it to sort by volume. Now the mean washer volumes by brand are sorted by the mean volume for each brand. We can clearly see that, on average, Beko washers have the smallest volume, while Midea washers have the largest.
We can also compute more than one aggregation at once. For example, let's compute the average, median, minimum, and maximum washer volume for each brand. To help us along the way, part of the code has already been written. What we need to do now is call the agg method and, within it, specify the aggregations that we want: mean, median, min, and max.
The methods introduced here are just the tip of the iceberg. To explore other methods that are useful for summarizing data, visit the pandas documentation site.
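The sorting and multi-aggregation steps described above could be sketched as follows. As before, the data and the column names BrandName and Volume are assumptions made for illustration, and the partially written code mentioned in the narration is not available, so this is one plausible way to write it rather than the course's exact cell.

```python
import pandas as pd

# Hypothetical stand-in for the washers data (names and values invented).
washers = pd.DataFrame(
    {
        "BrandName": ["LG", "LG", "Samsung", "Beko", "Midea", "LG"],
        "Volume": [4.5, 5.0, 4.8, 2.0, 6.0, 4.3],
    }
)

# Average volume per brand, then sorted in ascending order of that average.
mean_by_brand_sorted = (
    washers.groupby("BrandName")[["Volume"]]
    .mean()
    .sort_values(by="Volume")
)
print(mean_by_brand_sorted)

# Several aggregations at once: agg() takes a list of aggregation names
# and returns one column per aggregation.
volume_stats = washers.groupby("BrandName")["Volume"].agg(
    ["mean", "median", "min", "max"]
)
print(volume_stats)
```

With this toy data, the sorted result runs from the brand with the smallest average volume to the one with the largest, which is the same reading the narration applies to Beko and Midea in the real data.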
