Data perturbation

Data perturbation definition

Data perturbation is the intentional modification of sensitive information in a dataset to protect the privacy of individuals while preserving as much of the data's analytical value as possible. It typically adds noise to the original data in a controlled manner, making it harder for unauthorized parties to extract sensitive information.

Data perturbation is often used in the context of differential privacy, a framework that limits how much the result of an analysis can reveal about any single individual in a dataset.
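
As a rough illustration of how perturbation supports differential privacy, the sketch below adds Laplace noise to a simple counting query. It assumes a query with sensitivity 1 (adding or removing one person changes the true count by at most 1), so the noise scale is 1 / epsilon. The names laplace_noise and private_count and the sample data are hypothetical, chosen only for this example.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that satisfy `predicate`.

    Assumption: a counting query has sensitivity 1, so the Laplace
    noise scale is 1 / epsilon. Smaller epsilon means more noise.
    """
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Made-up sample data: ages of seven people.
ages = [23, 35, 41, 58, 62, 29, 47]

# Noisy answer to "how many people are 40 or older?" (true answer: 4).
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=random.Random(0))
```

A smaller epsilon gives stronger privacy but a noisier answer. Averaged over many runs the noisy count centers on the true count, which is why aggregate analysis remains useful.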

Common data perturbation methods

  • Adding random values to numerical data to introduce noise. The degree of perturbation can be controlled to balance privacy and utility.
  • Adding noise drawn from the Laplace distribution to numerical data, a technique commonly used to achieve differential privacy.
  • Shuffling or permuting the values of categorical data to obscure the association between individuals and their categorical attributes.
  • Modifying temporal data (such as timestamps), for example by introducing random time shifts or adding noise to the time values.
  • Swapping values between different records in the dataset.
  • Introducing synonyms or changing the word order to modify text data. This protects the privacy of textual information while maintaining its general meaning.
  • Rounding or binning numerical values. This hides exact values but also reduces the precision of the data.
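
Several of the methods listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production-grade anonymization tool. The function names, parameters, and sample data are hypothetical, and the noise ranges are arbitrary values picked for the example.

```python
import random
from datetime import datetime, timedelta


def add_noise(values, max_shift, rng):
    """Add uniform random noise in [-max_shift, +max_shift] to each number."""
    return [v + rng.uniform(-max_shift, max_shift) for v in values]


def shuffle_column(values, rng):
    """Return a permuted copy of a categorical column.

    The overall distribution of values is unchanged, but the link
    between each record and its value is broken.
    """
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def shift_timestamps(timestamps, max_minutes, rng):
    """Shift each timestamp by a random offset of up to max_minutes."""
    return [
        t + timedelta(minutes=rng.uniform(-max_minutes, max_minutes))
        for t in timestamps
    ]


def bin_values(values, width):
    """Replace each number with the lower edge of its bin."""
    return [(v // width) * width for v in values]


# Made-up sample data.
rng = random.Random(42)
salaries = [52000, 61000, 58500, 73000]
cities = ["Oslo", "Lima", "Oslo", "Kyiv"]
logins = [datetime(2024, 1, 5, 9, 30), datetime(2024, 1, 5, 14, 10)]

noisy_salaries = add_noise(salaries, 1000, rng)
shuffled_cities = shuffle_column(cities, rng)
shifted_logins = shift_timestamps(logins, 30, rng)
binned_salaries = bin_values(salaries, 10000)  # [50000, 60000, 50000, 70000]
```

In each function, a single parameter (max_shift, max_minutes, width) controls the degree of perturbation: larger values give more privacy and less precision.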