Data perturbation definition
Data perturbation is the intentional modification of sensitive information in a dataset to protect the privacy of individuals while preserving as much of the data's analytical value as possible. It typically works by adding noise to the original data in a controlled manner, which makes it harder for unauthorized parties to extract sensitive information.
Data perturbation is often used in the context of differential privacy, a privacy-preserving framework that aims to provide strong privacy guarantees for individuals in datasets.
See also: differential privacy, sensitive information, data analytics
Common data perturbation methods
- Adding random values to numerical data to introduce noise. The degree of perturbation can be controlled to balance privacy and utility.
- Drawing random noise from a Laplace distribution and adding it to numerical data, a common choice in differential privacy.
- Shuffling or permuting the values of categorical data to obscure the association between individuals and their categorical attributes.
- Modifying temporal data (such as timestamps), for example by introducing random time shifts or adding noise to the time values.
- Swapping values between different records in the dataset.
- Introducing synonyms or changing the word order to modify text data. This protects the privacy of textual information while maintaining its general meaning.
- Rounding or binning numerical values. This hides exact values but also reduces the precision of the data.
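
Several of the methods above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production-grade or formally private implementation. The function names, the sample data, and the parameters (`scale`, `max_shift_seconds`, `width`) are all assumptions made for the example. Note that Laplace noise gives a formal differential privacy guarantee only when its scale is calibrated to the query's sensitivity and the chosen privacy budget, which this sketch does not attempt.

```python
import random
from datetime import datetime, timedelta


def add_laplace_noise(values, scale, rng):
    """Add zero-mean Laplace noise with the given scale to each value."""
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return [
        v + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        for v in values
    ]


def shuffle_column(values, rng):
    """Return a permuted copy, breaking the link between rows and values."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def shift_timestamps(timestamps, max_shift_seconds, rng):
    """Shift each timestamp by a random offset within +/- max_shift_seconds."""
    return [
        t + timedelta(seconds=rng.uniform(-max_shift_seconds, max_shift_seconds))
        for t in timestamps
    ]


def bin_values(values, width):
    """Round each value down to the lower edge of its bin."""
    return [(v // width) * width for v in values]


# Hypothetical sample data; the fixed seed only makes the sketch repeatable.
rng = random.Random(0)
ages = [23, 37, 41, 58]
cities = ["Oslo", "Lima", "Pune", "Kyiv"]
visits = [datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 2, 9, 30)]

noisy_ages = add_laplace_noise(ages, scale=2.0, rng=rng)
shuffled_cities = shuffle_column(cities, rng)
shifted_visits = shift_timestamps(visits, max_shift_seconds=3600, rng=rng)
binned_ages = bin_values(ages, width=10)  # [20, 30, 40, 50]
```

A larger `scale`, `max_shift_seconds`, or `width` hides more about each individual but distorts the data more, which is the privacy and utility trade-off described above.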