From the course: Machine Learning with Python: Foundations

Things to consider when collecting data - Python Tutorial

- [Instructor] Data collection is the first of the six stages or steps in the machine learning process. During the data collection stage, our primary objective is to identify and gather the data we intend to use for machine learning. As we collect this data, there are five key considerations to keep in mind.

The first is accuracy. For supervised machine learning problems, we use historical data that has outcome labels or response values to train the model. Ensuring that this data is accurate is critically important to the success of this approach. Supervised learning algorithms use this data as a baseline for the learning process. It serves as a source of truth upon which patterns are learned in order to make future predictions. If this data is inaccurate, then the algorithm's future predictions cannot be trusted. This is why this data is often referred to as ground truth data. Ground truth data can either come with an existing label based on a prior event, such as whether a bank customer defaulted on a loan or not, or it can require that a label be initially assigned to it by domain experts, such as whether an email is spam or not. Regardless of whether labels already exist or need to be assigned, we should always have a plan to validate ground truth data after it has been acquired.

The next key consideration is relevance. The type of data we collect to describe an observation should be relevant in explaining the label or the response associated with the observation. For example, collecting data on the shoe size of bank customers has no relevance in explaining whether a particular borrower will or will not default on a loan. Conversely, excluding information about a customer's income could have an adverse impact on the effectiveness of a model that attempts to predict loan outcomes.

The amount of data needed to successfully train a model depends on the type of machine learning approach chosen. This is the third consideration, quantity.
Some machine learning algorithms work well with little data while others require a large amount of data to provide meaningful results. Understanding the characteristics of the machine learning algorithm we intend to use can provide us with guidance on how much data we need to collect.

Besides quantity, variability in the data collected is also important. This is the fourth consideration. For example, if we intend to consider the income of a borrower as a predictor of loan outcome, then our ground truth data should include customers of sufficiently different income levels. By doing this, we allow our model to gain a broader understanding of how income level impacts loan outcomes.

The fifth consideration is one that is often overlooked, ethics. There are several ethical issues to consider during the data collection process. They include privacy, security, informed consent, and bias. It is important that processes and mitigating steps be put in place to address these issues as part of the process of acquiring ground truth data. If bias exists in the data used to train a model, then the model will also replicate the bias in its predictions. As one can imagine, biased predictions could prove quite harmful, especially in situations where unfavorable decisions are being made based on a machine learning model. Bias in ground truth data is often unintentional. It sometimes stems from implicit human bias in the data collection process or from the absence of existing data on certain subpopulations.

Let's recap. The data we collect for machine learning should be accurate and relevant. We must ensure that we have enough data and that the data we capture is varied and covers different use cases. Finally, we must be ethical in how we collect, manage, and use data.
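
As a rough illustration of a few of these considerations, here is a minimal Python sketch of checks one might run on freshly collected loan data. It is not part of the course material. The records, the field names (`income`, `region`, `defaulted`), and the helper functions are all hypothetical, and a real validation plan would be broader and driven by domain experts.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical loan records; field names and values are made up for illustration.
records = [
    {"income": 32000, "region": "north", "defaulted": "no"},
    {"income": 58000, "region": "north", "defaulted": "no"},
    {"income": 41000, "region": "south", "defaulted": "yes"},
    {"income": 97000, "region": "north", "defaulted": "no"},
    {"income": None, "region": "north", "defaulted": "maybe"},
]

# Assumption: the only acceptable outcome labels are "yes" and "no".
VALID_LABELS = {"yes", "no"}


def find_invalid_rows(rows):
    """Accuracy: indices of rows with a missing income or an unexpected label."""
    return [
        i
        for i, row in enumerate(rows)
        if row["income"] is None or row["defaulted"] not in VALID_LABELS
    ]


def income_spread(rows):
    """Variability: coefficient of variation (std / mean) of the known incomes."""
    incomes = [row["income"] for row in rows if row["income"] is not None]
    return pstdev(incomes) / mean(incomes)


def group_shares(rows, key):
    """Ethics: share of rows per group, to spot underrepresented subpopulations."""
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


invalid = find_invalid_rows(records)
clean = [row for i, row in enumerate(records) if i not in invalid]

print("rows needing review:", invalid)
print("income spread:", round(income_spread(clean), 2))
print("region shares:", group_shares(clean, "region"))
```

A low income spread would suggest the sample covers too narrow a range of borrowers, and a very small share for one region would be a prompt to collect more data on that group before training. What counts as "too low" is a judgment call that this sketch does not try to make.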
