diff --git a/EDA.Rmd b/EDA.Rmd
index 3f6d1f698..322d6a92c 100644
--- a/EDA.Rmd
+++ b/EDA.Rmd
@@ -59,12 +59,12 @@ The rest of this chapter will look at these two questions. I'll explain what var
 "cell", each variable in its own column, and each observation in its own row.
 
-So far, all the data you've seen so far has been tidy. In real-life, most data isn't tidy, so we'll come back to these ideas again in [tidy data].
+So far, all of the data that you've seen has been tidy. In real-life, most data isn't tidy, so we'll come back to these ideas again in [tidy data].
 
 ## Variation
 
 **Variation** is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments).
 
-Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualise the distribution of variable's values.
+Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualise the distribution of the variable's values.
 
 ### Visualising distributions
 
@@ -96,7 +96,7 @@ diamonds %>%
   count(cut_width(carat, 0.5))
 ```
 
-A histogram divides the x-axis into equally spaced bins and then uses the height of bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that almost 30,000 observations have a `carat` value between 0.25 and 0.75, which are the left and right edges of the bar.
+A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that almost 30,000 observations have a `carat` value between 0.25 and 0.75, which are the left and right edges of the bar.
 
 You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the `x` variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.
 
@@ -153,7 +153,7 @@ Clusters of similar values suggest that subgroups exist in your data. To underst
 * Why might the appearance of clusters be misleading?
 
-The histogram shows the length (in minutes) of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park. Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but little in between.
+The histogram below shows the length (in minutes) of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park. Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but little in between.
 
 ```{r}
 ggplot(data = faithful, mapping = aes(x = eruptions)) +