Privacy and Anonymization

Justified privacy concerns exist for all research data whose generation involves the collection of personal data. On the one hand, scientific research should be fostered by storing and interconnecting data; on the other hand, legal regulations prescribe the deletion of personal data once the purpose of a research project has been achieved (at least in Germany). To further complicate this situation, data protection laws differ across countries, which burdens data management, especially in international research projects. We will focus on the German legislation in this section and use the term personal data in accordance with EU legislation. However, it is important to note that correctly anonymized data are no longer personal data in a legal sense and are, therefore, not subject to data protection laws. In other words, these data may be shared without the participants’ consent (if not stated otherwise).

Legal Framework

The most important policies and legal guidelines that psychological researchers should consider under German jurisdiction are:

  • The right of informational self-determination entitles participants to determine freely how their personal data are used. In practice, researchers respect the right of informational self-determination by obtaining informed consent.
  • The Bundesdatenschutzgesetz (BDSG) clarifies the framework conditions for obtaining informed consent. It provides a definition of personal data and specifies the circumstances under which informed consent has to be obtained when collecting and analyzing/processing personal data. Furthermore, it states under which circumstances the right of informational self-determination can be overridden by research interests.
  • The APA ethical principles and the DGPs ethical guidelines provide guidance for problems specifically related to psychology, such as obtaining informed consent in research based on deception (see the knowledge base’s section on ethics).
  • The EU’s General Data Protection Regulation will unify European data protection laws.

Personal Data and Anonymization

Important terms concerning privacy are listed below; please note that these legal conditions will likely have to be adapted if your research project does not fall under German law.

  • Direct identifier: Direct identifiers are variables like names, initials, facial photographs, dates of birth or e-mail addresses which can be used relatively easily to re-identify subjects. Direct identifiers should never be published without the explicit consent of the subjects concerned, and in most psychological research projects they have no substantial scientific value of their own. For example, a research team performing a longitudinal study will be interested in a participant’s exact address for administrative reasons only.
  • Indirect identifier: Indirect identifiers are variables like place of treatment or health professional responsible for care, sex, or the presence of a rare disease or treatment, which do not identify persons on their own but may constitute a risk of re-identification when combined. Psychological studies often include indirect identifiers such as DSM diagnosis, age, educational degree, or race. In some cases, psychometric test data (e.g. intelligence scores) could serve as indirect identifiers. Hence, you should consider carefully which data could be used to re-identify the participants of your study, and apply appropriate techniques for treating these indirect identifiers when publishing your data.
  • Personal data: Following Directive 95/46/EC, “personal data” includes all information about a specific or determinable natural person (“person concerned”). Following this directive, “a person is seen as identifiable when direct or indirect identification is possible, especially through assignment of a code number or one or more specific elements which are expressions of physical, physiological, psychological, economical, cultural or social identity”. Essentially, this means that all data which contain direct identifiers, or a set of indirect identifiers that allows a re-identification of subjects, are personal data. Therefore, all psychological research data concerning the physiological or psychological identity of participants are subject to data protection laws, and these data may only be published or shared (1) with explicit consent, (2) if the data were appropriately anonymized/de-identified or (3) if special circumstances apply (e.g. if the public interest outweighs personal interests). As the BDSG outlines, personal data should never be processed or published unless explicit informed consent for this purpose was obtained (if no special circumstances apply). Moreover, if personal data are collected, they have to be deleted as soon as possible (if not otherwise stated in the informed consent).
  • Anonymization or De-Identification: “Anonymizing” is defined by the BDSG as transforming data in a way that makes it impossible to relate the data to a person, or makes this possible only with disproportionate effort (BDSG § 3 (6)). In their guidance on this issue, Hrynaszkiewicz and colleagues propose removing all direct identifiers and leaving fewer than three indirect identifiers in a dataset in order to de-identify data. Nonetheless, it is best practice to inform subjects about the intended further usage of their anonymized data. If privacy issues cannot be resolved for your data, you should contact an ethics committee.
  • Special kinds of personal data: German data protection laws demand a higher level of privacy for special kinds of personal data (BDSG § 3 (9)). An extensive list with examples of special kinds of personal data (in German only) was compiled by the Datenschutz-Wiki. To collect special kinds of personal data, researchers need to obtain informed consent that names the specific special kinds of personal data to be collected. Following the Datenschutz-Wiki list, psychological data, such as character traits or education, are not special kinds of personal data, while information about diseases/disorders, diagnoses and disease/disorder severity are special kinds of personal data.
  • Pseudonymization: Pseudonymized data are data that have been altered by using a distinct assignment pattern. Re-identification is only possible through knowledge of the pattern. German data protection laws require that “characteristics, which can be assigned to individual details of personal or factual relation of a certain or determinable person have to be stored separately. They should only be merged with individual details if it is required by the research purpose.” (BDSG § 40 (2)). Pseudonymization can be used to comply with this requirement, as it ensures that the direct identifiers needed to contact subjects are kept separate from the data used for statistical analyses. Thus, files can be merged without knowledge of personal data (e.g. when new data are added) if the assignment pattern is known. Moreover, it is still possible to contact individual subjects (e.g. in order to inform them about abnormal values or about a new assessment wave that is planned).
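
The separation described under pseudonymization can be sketched in a few lines of Python. This is a minimal illustration only, not a procedure prescribed by the BDSG: the function name `pseudonymize`, the record layout and the field names are all hypothetical, and random pseudonyms are just one possible assignment pattern.

```python
import secrets


def pseudonymize(records, direct_identifiers):
    """Split records into an assignment table and an analysis dataset.

    records: list of dicts, one per participant (hypothetical layout).
    direct_identifiers: names of the fields that identify a person directly.
    """
    assignment = {}  # pseudonym -> direct identifiers; to be stored separately
    analysis = []    # pseudonym plus research variables only
    for record in records:
        pseudonym = secrets.token_hex(8)  # random, not derived from the person
        while pseudonym in assignment:    # guard against (unlikely) collisions
            pseudonym = secrets.token_hex(8)
        assignment[pseudonym] = {k: record[k] for k in direct_identifiers}
        research = {k: v for k, v in record.items() if k not in direct_identifiers}
        analysis.append({"pseudonym": pseudonym, **research})
    return assignment, analysis


# Made-up example participants.
participants = [
    {"name": "Anna Beispiel", "email": "anna@example.org", "age": 34, "score": 27},
    {"name": "Max Muster", "email": "max@example.org", "age": 51, "score": 19},
]
assignment, analysis = pseudonymize(participants, ["name", "email"])
```

Only `analysis` would be used for statistical work. The `assignment` table is the part that allows re-identification, so it has to be kept apart from the research data and protected accordingly.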


Anonymization techniques

Quantitative data

The ICPSR (n.d.) lists the following techniques for treating indirect identifiers:

  • Removal — entirely eliminating the variable from the dataset.
  • Top-coding — restricting the upper range of a variable.
  • Collapsing and/or combining variables — combining values of a single variable or merging data recorded in two or more variables into a new aggregated variable.
  • Sampling — rather than providing all of the original data, a random sample of sufficient size to yield reasonable inferences is released.
  • Swapping — unique cases are matched based on indirect identifiers, then values of key variables are exchanged between these cases. This retains the analytic utility of the dataset while protecting participants’ privacy.
  • Disturbing — adding random variation or stochastic errors to variables. This procedure retains statistical relationships of the variable and its covariates, while preventing the linking of records.
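
Three of the techniques listed above (top-coding, collapsing and disturbing) can be illustrated with a short Python sketch. The function names, the cut-off values and the example data are assumptions made for illustration; suitable thresholds and noise levels depend on the dataset and are not specified by the ICPSR list.

```python
import random


def top_code(values, ceiling):
    """Top-coding: report every value above `ceiling` as `ceiling`."""
    return [min(v, ceiling) for v in values]


def collapse_age(age, width=10):
    """Collapsing: replace an exact age with a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def disturb(values, sd, seed=None):
    """Disturbing: add zero-mean Gaussian noise to each value."""
    rng = random.Random(seed)
    return [v + rng.gauss(0, sd) for v in values]


# Made-up example data.
incomes = [1800, 2400, 3100, 9500]
ages = [23, 37, 38, 64]

print(top_code(incomes, 5000))          # [1800, 2400, 3100, 5000]
print([collapse_age(a) for a in ages])  # ['20-29', '30-39', '30-39', '60-69']
noisy = disturb(incomes, sd=100, seed=1)
```

A fixed `seed` makes the disturbed values reproducible, which is convenient for documentation. The seed should then be treated as confidential, because anyone who knows it can regenerate the noise and subtract it.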

Qualitative data

Qualitative data need their own anonymization techniques, which are usually much more complex than those employed for quantitative data. Faces and voices have to be disguised in video and audio recordings since they introduce a high risk of re-identification. In transcripts, names have to be replaced with pseudonyms or generic terms (e.g. father, son).
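
For transcripts, the replacement of names can be partly automated. The following Python sketch assumes that a list of names and their substitutes has been compiled by hand; the function name `replace_names` and the example sentence are hypothetical. Simple replacement of this kind does not catch nicknames, misspellings or identifying details in the surrounding text, so a manual review remains necessary.

```python
import re


def replace_names(transcript, replacements):
    """Replace each listed name with its pseudonym or generic term.

    replacements: dict mapping real names to substitutes (compiled by hand).
    """
    if not replacements:
        return transcript
    # Longest names first, so that "Anna Maria" is handled before "Anna".
    names = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: replacements[m.group(1)], transcript)


text = "Thomas said that his son Jonas rarely visits. Jonas disagreed."
print(replace_names(text, {"Thomas": "[father]", "Jonas": "[son]"}))
# [father] said that his son [son] rarely visits. [son] disagreed.
```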

If anonymization would strongly diminish the analytic value of the data, there are some other possibilities of assuring privacy when data are archived or reused. However, these measures have to be mentioned in the informed consent.

  • Pseudonymization of data and storing the assignment pattern at an independent institution.
  • Confidentiality agreements that the data user needs to accept in order to obtain data access.
  • Restricting data access to safe rooms that do not allow data to be removed, or to a virtual platform that only allows aggregated data to be downloaded.

Sharing Sensitive Data

If you collect sensitive data that cannot be anonymized, you might opt to create separate scientific use files to protect confidentiality while still sharing at least some information with fellow researchers. An example of this kind of onion-shaped access model with different levels of anonymization is the NEPS (National Educational Panel Study). Further aspects that have to be considered before sharing sensitive data can be found in the knowledge base’s section on privacy issues in data storing and publication.

Guidance on Privacy Prior to Data Collection

In its guidelines on Data Management in Psychological Science, which draw on the proposal form of the ethics committee of the Department of Psychology and Education at LMU Munich, the German Psychological Society (DGPs) recommends considering the following facets of data protection when planning your research project:

  • Document the types of personal data that will be collected, all video or audio recordings that will take place and planned anonymization or pseudonymization measures.
    • Will subjects be allowed to request the deletion of their data?
  • Provide information on planned deletion procedures for personal data and how the deletion of personal data will be documented.
  • Inform subjects about planned data publication procedures in the information sheet of your informed consent form (even if published data will be fully anonymized).
  • Document who is responsible for procedures described above.

Tip: In general, you should collect only as much personal data as you need to answer your research question (for example, you may not need the date of birth but only the year of birth or the age of the subject).
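
If a date of birth is entered at some point anyway (for instance to check eligibility), it can be reduced to the age before anything is stored. The Python sketch below shows one way to do this; the function name `age_in_years` and the example dates are made up for illustration.

```python
from datetime import date


def age_in_years(birth_date, reference_date):
    """Completed years of age on `reference_date`."""
    had_birthday = (reference_date.month, reference_date.day) >= (
        birth_date.month,
        birth_date.day,
    )
    return reference_date.year - birth_date.year - (0 if had_birthday else 1)


birth = date(1990, 8, 15)
assessment = date(2024, 3, 1)
record = {"age": age_in_years(birth, assessment)}  # store the age, not the date
print(record)  # {'age': 33}
```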


Further Resources