Commit

Make table3 match tidyr version, adjust ex accordingly
mine-cetinkaya-rundel committed Apr 13, 2023
1 parent d2b27da commit b6277d0
Showing 1 changed file with 2 additions and 12 deletions.
14 changes: 2 additions & 12 deletions data-tidy.qmd
@@ -44,16 +44,6 @@ You can represent the same underlying data in multiple ways.
The example below shows the same data organized in three different ways.
Each dataset shows the same values of four variables: *country*, *year*, *population*, and number of documented *cases* of TB (tuberculosis), but each dataset organizes the values in a different way.

```{r}
#| echo: false
table2 <- table1 |>
  pivot_longer(cases:population, names_to = "type", values_to = "count")
table3 <- table2 |>
  pivot_wider(names_from = year, values_from = count)
```

```{r}
table1
@@ -136,7 +126,7 @@ ggplot(table1, aes(x = year, y = cases))

1. For each of the sample tables, describe what each observation and each column represents.

2. Sketch out the process you'd use to calculate the `rate` for `table2` and `table3`.
2. Sketch out the process you'd use to calculate the `rate` from `table2`.
You will need to perform four operations:

a. Extract the number of TB cases per country per year.
@@ -360,7 +350,7 @@ There are two columns that are already variables and are easy to interpret: `cou
They are followed by 56 columns like `sp_m_014`, `ep_m_4554`, and `rel_m_3544`.
If you stare at these columns for long enough, you'll notice there's a pattern.
Each column name is made up of three pieces separated by `_`.
The first piece, `sp`/`rel`/`ep`, describes the method used for the diagnosis, the second piece, `m`/`f` is the `gender` (coded as a binary variable in this dataset), and the third piece, `014`/`1524`/`2534`/`3544`/`4554`/`5564/``65` is the `age` range (`014` represents 0-14, for example).
The first piece, `sp`/`rel`/`ep`, describes the method used for the diagnosis, the second piece, `m`/`f` is the `gender` (coded as a binary variable in this dataset), and the third piece, `014`/`1524`/`2534`/`3544`/`4554`/`5564`/`65` is the `age` range (`014` represents 0-14, for example).

So in this case we have six pieces of information recorded in `who2`: the country and the year (already columns); the method of diagnosis, the gender category, and the age range category (contained in the other column names); and the count of patients in that category (cell values).
To organize these six pieces of information in six separate columns, we use `pivot_longer()` with a vector of column names for `names_to` and instructions for splitting the original variable names into pieces for `names_sep` as well as a column name for `values_to`:
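
As a minimal sketch of the call described above, the chunk below pivots a small stand-in for `who2`. The data frame `who2_toy`, its counts, and the output column names `diagnosis`, `gender`, `age`, and `count` are assumptions made for illustration, chosen to mirror the six pieces of information listed in the text; they are not taken from this commit.

```{r}
library(tidyr)

# Hypothetical stand-in for who2: two identifier columns plus three
# measurement columns named <diagnosis>_<gender>_<age>. Counts are made up.
who2_toy <- data.frame(
  country   = c("A", "B"),
  year      = c(2000, 2000),
  sp_m_014  = c(1, 2),
  ep_f_1524 = c(3, 4),
  rel_m_65  = c(5, 6)
)

who2_long <- who2_toy |>
  pivot_longer(
    cols = !(country:year),                      # every column except the identifiers
    names_to = c("diagnosis", "gender", "age"),  # one new column per name piece
    names_sep = "_",                             # split the old column names on "_"
    values_to = "count"                          # cell values go into this column
  )

who2_long
```

Each of the two input rows contributes one output row per measurement column, so the result has six rows and six columns. The pieces extracted from the names, including `age`, are stored as character strings.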