From the course: Machine Learning with Python: Foundations

Visualize your data

- [Instructor] Data exploration is the second of the six stages in the machine learning process. Data exploration is a process of describing, visualizing, and analyzing data in order to better understand it. By exploring our data, we can answer questions such as: How many rows and columns are in the data? What type of data do we have? Are there missing, inconsistent, or duplicate values in the data?

During data exploration, even after using sophisticated statistical techniques to analyze data, certain patterns are best understood when represented with a visualization. As the popular saying goes, "A picture is worth a thousand words." Visualizations serve as a great tool for asking and answering questions about data. Depending on the type of question we are trying to answer, there are four major types of visualizations we could use.

The first is a comparison visualization. Comparison visualizations are used to illustrate the difference between two or more items at a given point in time or over a period of time. One of the most commonly used comparison visualizations is a box plot. Using a box plot, we can compare the distribution of values for a continuous feature across the values of a categorical feature. For example, this box plot compares carbon dioxide emissions across vehicle classes. Based on the visualization, we can tell that, on average, pickups and vans have higher carbon emissions than compact and midsize cars. Comparison visualizations provide insights such as the significance of a feature, the variation in the median or mean value of a feature across subgroups, and the existence of outliers in the values of a feature.

The second type is a relationship visualization. Relationship visualizations are used to illustrate the correlation between two or more continuous variables. Scatter plots and line charts are two of the most commonly used relationship visualizations. They show how one variable changes in response to a change in another.
For example, this scatter plot highlights the negative relationship between vehicle emissions levels and city mileage. Specifically, vehicles with higher city mileage ratings emit less carbon. Besides illustrating how two features interact with each other, relationship visualizations also provide insight into the significance of a feature and the existence of outliers within the values of a feature.

The third type of visualization is a distribution visualization. As the name suggests, these types of visualizations illustrate the statistical distribution of the values of a feature. One of the most commonly used distribution visualizations is the histogram. With a histogram, we can figure out the most common values of a feature. For example, this histogram shows that most vehicles in the dataset have carbon emissions values between 300 and 700 grams per mile. Histograms visualize the spread or skewness in the values of a feature. They also highlight the presence of outliers in the data.

The fourth type is a composition visualization, which shows the component makeup of our data. Stacked bar charts, grouped bar charts, and pie charts are three of the most commonly used composition visualizations. Stacked bar charts show how much a subgroup contributes to the whole. For example, based on this stacked bar chart, we can figure out the proportion of vehicles each year that are front-wheel drive, all-wheel drive, and rear-wheel drive within the dataset. Besides illustrating how much a subgroup contributes to the total, composition visualizations can also illustrate the relative or absolute change in a subgroup composition over time.
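
To make the ideas above concrete, here is a minimal sketch of how the exploration questions and the four visualization types might look in code. It is not taken from the course: it assumes pandas and matplotlib are installed, and the `vehicles` table, its column names, its values, and the helper functions `summarize` and `plot_overview` are all hypothetical, invented only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; remove in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data; the column names and values are invented for illustration.
vehicles = pd.DataFrame({
    "year": [2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020],
    "vehicle_class": ["Compact", "Midsize", "Pickup", "Van",
                      "Compact", "Midsize", "Pickup", "Van"],
    "drive": ["Front", "All", "Rear", "Rear", "Front", "Front", "All", "Rear"],
    "city_mpg": [30, 26, 17, 15, 31, 27, 18, 14],
    "co2": [300, 340, 520, 590, 290, 330, 500, 630],
})


def summarize(df):
    """Answer the basic exploration questions about a DataFrame."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


def plot_overview(df):
    """Draw one chart per visualization type on a 2x2 grid."""
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))

    # Comparison: distribution of a continuous feature for each category.
    df.boxplot(column="co2", by="vehicle_class", ax=axes[0, 0])
    axes[0, 0].set_title("Comparison: box plot")

    # Relationship: how one continuous feature varies with another.
    df.plot.scatter(x="city_mpg", y="co2", ax=axes[0, 1])
    axes[0, 1].set_title("Relationship: scatter plot")

    # Distribution: common values, spread, and skew of a single feature.
    df["co2"].plot.hist(bins=5, ax=axes[1, 0])
    axes[1, 0].set_title("Distribution: histogram")

    # Composition: share of each drive type within each year.
    counts = pd.crosstab(df["year"], df["drive"])
    shares = counts.div(counts.sum(axis=1), axis=0)
    shares.plot.bar(stacked=True, ax=axes[1, 1])
    axes[1, 1].set_title("Composition: stacked bar chart")

    fig.suptitle("")  # clear the automatic title that boxplot(by=...) sets
    fig.tight_layout()
    return fig


print(summarize(vehicles))
fig = plot_overview(vehicles)
```

Calling `fig.savefig("overview.png")` afterward would write the grid to disk. With a real dataset, only the column names passed to each plotting call would need to change; the pairing of question type to chart type stays the same.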
