Training data definition
Training data is the dataset used to fit a machine learning model. It supplies the examples from which the model learns to make predictions or decisions without being explicitly programmed for that task.
See also: machine learning
Common characteristics of training data
- Labeled vs. unlabeled.
  - Labeled data includes both the input and the corresponding desired output. It is used in supervised learning, where the goal is for the model to learn a mapping from inputs to outputs. For instance, in a spam detection model, labeled training data would consist of many emails, each with a label indicating whether it is “spam” or “not spam.”
  - Unlabeled data consists of the input without any corresponding desired output. It is used in unsupervised learning, where the goal might be to find patterns or structure in the data, for example through clustering or dimensionality reduction.
- Quality and relevance. A machine learning model's accuracy and effectiveness rely on the training data's quality. Good training data is:
  - Representative of the real-world scenario where the model will be deployed.
  - Free from biases that might skew the model's decisions.
  - As clean as possible: free from errors, inconsistencies, and irrelevant or noisy points.
- Size. Deep learning models often require vast amounts of data to train effectively. However, adding more data does not necessarily help if that data is low quality or irrelevant to the task.
- Data augmentation. In scenarios where collecting more data might be challenging, existing training data can be modified or augmented to create new data points. For example, an image can be rotated, zoomed, or flipped to produce variations.
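
The following is a minimal sketch of how labeled training data and augmentation fit together. It assumes a toy dataset in which each "image" is a small grid of integers paired with a label. The function names (`flip_horizontal`, `flip_vertical`, `rotate_90_clockwise`, `augment`), the pixel values, and the "cat"/"dog" labels are all hypothetical and chosen only for illustration. Real projects typically rely on an image library rather than nested lists.

```python
from typing import List, Tuple

Image = List[List[int]]           # a tiny grayscale image as a list of rows
Example = Tuple[Image, str]       # labeled example: (input, desired output)


def flip_horizontal(image: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image: Image) -> Image:
    """Mirror the image top to bottom."""
    return [row[:] for row in image[::-1]]


def rotate_90_clockwise(image: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(dataset: List[Example]) -> List[Example]:
    """Return the original examples plus transformed copies.

    Each transformed copy keeps the label of the image it came from,
    which assumes the transformations do not change what the image shows.
    """
    augmented: List[Example] = []
    for image, label in dataset:
        augmented.append((image, label))
        for transform in (flip_horizontal, flip_vertical, rotate_90_clockwise):
            augmented.append((transform(image), label))
    return augmented


# Hypothetical labeled data: two 2x2 "images" with made-up labels.
labeled_data: List[Example] = [
    ([[1, 2], [3, 4]], "cat"),
    ([[5, 6], [7, 8]], "dog"),
]

augmented_data = augment(labeled_data)
print(len(labeled_data), "examples before,", len(augmented_data), "after")
```

Augmentation of this kind only helps when the transformation preserves the label. A flipped photo of a cat is still a cat, but a handwritten "6" rotated by 180 degrees looks like a "9", so the set of transformations has to be chosen per task.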