March, 1998.
The proliferation of information on the Internet and access to fast computers with large
storage capacities has increased the volume of information collected and disseminated about
individuals. The existence of these data sources makes it much easier to re-identify
individuals whose private information is released in data believed to be anonymous. At the
same time, increasing demands are made on organizations to release individualized data
rather than aggregate statistical information. Even when explicit identifiers, such as name
and phone number, are removed or encrypted when releasing individualized data, other
characteristic data, which we term quasi-identifiers, can exist which allow the data recipient
to re-identify individuals to whom the data refer.
In this paper, we provide a computational disclosure technique for releasing information from
a private table such that the identity of any individual to whom the released data refer cannot
be definitively recognized. Our approach protects against linking to other data. It is based on
the concepts of generalization, by which stored values can be replaced with semantically
consistent but less precise alternatives, and of k-anonymity. A table is said to provide
k-anonymity when the contained data do not allow the recipient to associate the released
information with a set of individuals smaller than k. We introduce the notions of generalized
table and of minimal generalization of a table with respect to a k-anonymity requirement. As
an optimization problem, the objective is to minimally distort the data while providing
adequate protection. We describe an algorithm that, given a table, efficiently computes a
preferred minimal generalization to provide anonymity.