From the course: Machine Learning with Python: Foundations

Sampling your data

- [Instructor] As we prepare our data for machine learning, we sometimes have to reduce the number of rows in our data or split the data into two or more partitions. We do this because the data we have is too large or too complex to use in its current form, or because we need to hold on to some of our data for later use. In supervised machine learning, our goal is to create a model that maps a given input, which we call the independent variables, to a given output, which we call the dependent variable. In order to properly evaluate whether our model is learning, we have to get an unbiased estimate of its performance using data that it has not previously seen. To do this, we must first split our previously labeled historical data into training and test datasets. We hold out the test data and use the training data to build, or train, our model. Then we evaluate our model's performance using the test data. There are several ways to split data for this purpose. The most common approach is known as sampling. Sampling is the process of selecting a subset of the instances in a dataset as a proxy for the whole. In statistical terms, the original dataset is known as the population, while the subset is known as a sample. Sampling comes in several flavors. To illustrate some of them, let's use a fictional population of 20 students: 12 women and eight men. From this population, we intend to create a sample of five students. The first sampling approach we illustrate is random sampling without replacement. In this approach, we begin by randomly selecting a student from the population. For example, we select student number 11. Then we select student nine. Notice that as we select students from the population, they are no longer part of the pool of students from which we can select. Next, we select student 15, student 19, and finally student three. This is random sampling without replacement. The next type of sampling is known as random sampling with replacement. 
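Before moving on, the without-replacement approach just described can be sketched with Python's standard library. This is a minimal illustration, not the course's own code; the particular students drawn depend on the random seed.

```python
import random

random.seed(42)  # fix the seed so the draw is reproducible

# Fictional population of 20 students, identified by number
population = list(range(1, 21))

# Randomly select 5 students WITHOUT replacement:
# once a student is chosen, they leave the selection pool,
# so no student can appear in the sample twice
sample = random.sample(population, k=5)

print(sample)            # five distinct student numbers
print(len(set(sample)))  # 5 -- no duplicates possible
```

The same idea scales to real datasets: `pandas.DataFrame.sample(n=...)` also draws without replacement by default.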
In this approach, we also randomly select students from the population. However, there is one key difference. Before I tell you what it is, let's see if you notice on your own. Let's begin by randomly selecting the first four students. The first student we select is student 11, then student nine, student five, and then student 19. Have you noticed a difference yet? I'm sure you have. As we select students from the population, they remain part of the pool of students from which we can select in subsequent trials. This is the replacement part of random sampling with replacement. As you can imagine, this means that we could potentially select the same student more than once for our sample. This is exactly what happens here. Student nine is selected twice for the sample. This may seem like an odd way to sample data, but it actually is a very important technique in machine learning known as bootstrapping. Bootstrapping is often used to evaluate and estimate the future performance of a supervised machine learning model when we have very little data. The next sampling approach is known as stratified random sampling. Stratified random sampling is a modification of the simple random sampling approach that ensures the distribution of values for a particular feature within the sample matches the distribution of values for the same feature in the overall population. To do this, the instances in the original data, the population, are first divided into homogeneous subgroups known as strata. In our example, let's assume that we intend to stratify based on gender. This means that we first need to group our population by gender. Recall that our goal is to create a sample of five students out of the 20 students in our population. In other words, our sample should be a fourth of the students in the population. This also means that our sample should be a fourth of the students from each stratum, or group. 
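Sampling with replacement is just as easy to sketch. The code below is an illustrative example, assuming the same fictional population of 20 students; note that, unlike the previous approach, the same student can be drawn more than once.

```python
import random

random.seed(42)

population = list(range(1, 21))

# Randomly select 5 students WITH replacement:
# each draw is made from the full population, so a
# student can appear in the sample more than once
bootstrap_sample = random.choices(population, k=5)

print(bootstrap_sample)  # five draws, possibly with repeats
```

This draw-with-replacement step is the core of a bootstrap: repeating it many times yields many resampled datasets from the same original data.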
Since we have 12 women in the population, we randomly select three women for our sample. And since we have eight men in the population, we randomly select two men for our sample. Notice that the sample has the same three-to-two gender distribution of women to men as the overall population. That is the benefit of this sampling approach.
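The stratified approach above can be sketched with pandas. This is a minimal example under the course's fictional population; the column names are illustrative.

```python
import pandas as pd

# Fictional population: 12 women and 8 men
students = pd.DataFrame({
    "student": range(1, 21),
    "gender": ["F"] * 12 + ["M"] * 8,
})

# Stratified random sample: draw the same fraction (one fourth)
# from each gender stratum, preserving the population's ratio
sample = students.groupby("gender").sample(frac=0.25, random_state=42)

print(sample["gender"].value_counts())  # F: 3, M: 2 -- same 3:2 ratio as the population
```

In practice, scikit-learn's `train_test_split` offers a `stratify` parameter that applies the same idea when holding out test data.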