Training data

Training data definition

Training data is a dataset used to train a machine learning model, that is, to teach it to make predictions or decisions without being explicitly programmed for the task at hand. The training data provides the examples from which the model learns.

See also: machine learning

Common characteristics of training data

  • Labeled vs. unlabeled.
    • Labeled data includes both the input and the corresponding desired output. It’s used in supervised learning where the goal is for the model to learn a mapping from inputs to outputs. For instance, in a spam detection model, labeled training data would consist of many emails and labels indicating whether each email is “spam” or “not spam.”
    • Unlabeled data consists of the input without any corresponding desired output. It’s used in unsupervised learning where the goal might be to find patterns or structures in the data, like clustering or dimensionality reduction.
  • Quality and relevance. A machine learning model’s accuracy and effectiveness rely on the training data’s quality. Good training data is:
    • Representative of the real-world scenario where the model will be deployed.
    • Free from biases that might skew the model’s decisions.
    • As clean as possible: free from errors, inconsistencies, and irrelevant or noisy points.
  • Size. Deep learning models often require vast amounts of data to train effectively. However, more data isn’t always better if it isn’t high quality or relevant to the task.
  • Data augmentation. In scenarios where collecting more data might be challenging, existing training data can be modified or augmented to create new data points. For example, an image can be rotated, zoomed, or flipped to produce variations.
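
To make the labeled vs. unlabeled distinction concrete, here is a minimal Python sketch built around the spam example. The four emails, the train and predict helpers, and the word-counting approach are all illustrative assumptions, not a description of how a real spam filter works.

```python
from collections import Counter

# Labeled data: each input (email text) is paired with its desired output.
# These four emails are made-up examples.
labeled_emails = [
    ("win a free prize now", "spam"),
    ("free money claim now", "spam"),
    ("meeting moved to monday", "not spam"),
    ("lunch on monday?", "not spam"),
]

# Unlabeled data: the same inputs with the desired output stripped away.
unlabeled_emails = [text for text, _ in labeled_emails]


def train(examples):
    """Count how often each word appears under each label (hypothetical helper)."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.split())
    return counts


def predict(counts, text):
    """Pick the label whose training emails share the most words with text."""
    scores = {
        label: sum(words[w] for w in text.split())
        for label, words in counts.items()
    }
    return max(scores, key=scores.get)


model = train(labeled_emails)
prediction = predict(model, "claim your free prize")
```

The labels are what let train learn a mapping from inputs to outputs. With only unlabeled_emails, there would be nothing to map to, and the data could only be explored for patterns such as clusters.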
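
Data augmentation can be sketched in a few lines of plain Python. The example below assumes a tiny image stored as rows of pixel values; the flip_horizontal and rotate_90_clockwise helpers are hypothetical names written for illustration, and real projects would more likely rely on an image library.

```python
# A tiny 2x3 "image" stored as rows of pixel values (made-up data).
image = [
    [1, 2, 3],
    [4, 5, 6],
]


def flip_horizontal(img):
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in img]


def rotate_90_clockwise(img):
    """Rotate the image 90 degrees clockwise: reverse the rows, then transpose."""
    return [list(col) for col in zip(*img[::-1])]


# One original image becomes three training examples.
augmented = [image, flip_horizontal(image), rotate_90_clockwise(image)]
```

Each variation shows the same content from a different angle, so for labeled data the label of the original image would normally carry over to the augmented copies.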
