Licentiate seminar: Thomas Vakili

Seminar

Date: Monday 15 May 2023

Time: 10.00 – 12.00

Location: Room M20, DSV, 3rd floor, Borgarfjordsgatan 12, Kista

Welcome to a licentiate seminar at DSV! Thomas Vakili presents his licentiate thesis on how sensitive data can be protected.

Thomas Vakili. Photo: Åse Karlén.

On May 15, 2023, Thomas Vakili will present his licentiate thesis at the Department of Computer and Systems Sciences (DSV), Stockholm University. The title of the thesis is “Attacking and Defending the Privacy of Clinical Language Models”.

The seminar takes place at DSV in Kista and on Zoom, starting at 10:00 am. Send an e-mail to Hercules Dalianis if you would like the Zoom link.

Respondent: Thomas Vakili, DSV
Opponent: Aurélie Névéol, LISN, CNRS, Université Paris-Saclay, France
Examiner: Panagiotis Papapetrou, DSV
Chair: Tony Lindgren, DSV
Main supervisor: Hercules Dalianis, DSV
Supervisor: Aron Henriksson, DSV

The thesis can be downloaded from Diva

Contact Thomas Vakili

Interview with Thomas Vakili (in Swedish)


Abstract

The state-of-the-art methods in natural language processing (NLP) increasingly rely on large pre-trained transformer models. The strength of the models stems from their large number of parameters and the enormous amounts of data used to train them. The datasets are of a scale that makes it difficult, if not impossible, to audit them manually. When unwieldy amounts of potentially sensitive data are used to train large machine learning models, a difficult problem arises: the unintended memorization of the training data.

All datasets—including those based on publicly available data—can contain sensitive information about individuals. When models unintentionally memorize these sensitive data, they become vulnerable to different types of privacy attacks. Very few datasets for NLP can be guaranteed to be free from sensitive data. Thus, to varying degrees, most NLP models are susceptible to privacy leakage. This susceptibility is especially concerning in clinical NLP, where the data typically consist of electronic health records. Unintentionally leaking publicly available data can be problematic, but leaking data from electronic health records is never acceptable from a privacy perspective. At the same time, clinical NLP has great potential to improve the quality and efficiency of healthcare.

This licentiate thesis investigates how these privacy risks can be mitigated using automatic de-identification. This is done by exploring the privacy risks of pre-training on clinical data and then evaluating how decreasing these risks affects model accuracy. A BERT model pre-trained using clinical data is subjected to a training data extraction attack. The same model is also used to evaluate a membership inference attack that has been proposed to quantify the privacy risks associated with masked language models. Then, the impact of automatic de-identification on the performance of BERT models is evaluated for both pre-training and fine-tuning data.
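To make the idea of automatic de-identification concrete, here is a minimal, purely illustrative sketch: identifiers in a clinical note are detected and replaced with placeholder tags. The patterns, the sample note, and the placeholder labels are all invented for this example; the de-identification systems discussed in the thesis use trained named-entity recognition models, not hand-written regexes.

```python
import re

# Illustrative patterns only -- real de-identifiers use trained NER models.
PATTERNS = {
    "NAME": re.compile(r"\bAnna Svensson\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3} \d{2} \d{2}\b"),
}

def pseudonymize(text: str) -> str:
    """Replace each detected identifier with a placeholder tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

# Fabricated sample note (not real patient data).
note = "Patient Anna Svensson admitted 2023-05-15, contact 070-123 45 67."
print(pseudonymize(note))
# -> Patient [NAME] admitted [DATE], contact [PHONE].
```

The de-identified text keeps its clinical content, which is why, as the results below note, it can still be useful as training data.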

The results show that extracting training data from BERT models is non-trivial and suggest that the risks can be further decreased by automatically de-identifying the training data. Automatic de-identification is found to preserve the utility of the data used for pre-training and fine-tuning BERT models, resulting in no reduction in performance compared to models trained using unaltered data. However, we also find that the current state-of-the-art membership inference attacks are unable to quantify the privacy benefits of automatic de-identification. The results show that automatic de-identification reduces the privacy risks of using sensitive data for NLP without harming the utility of the data, but that these privacy benefits may be difficult to quantify.
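The membership inference attacks mentioned above can be illustrated with a toy loss-threshold variant: an example is predicted to be a training-set member if the model's loss on it is unusually low. All numbers below are fabricated for the illustration, and the threshold is an assumed calibration value; the attack evaluated in the thesis is a more sophisticated method for masked language models, but the underlying intuition is the same.

```python
# Toy loss-threshold membership inference attack (all values fabricated).
THRESHOLD = 2.0  # assumed calibration value, not taken from the thesis

def infer_membership(loss: float, threshold: float = THRESHOLD) -> bool:
    """Predict training-set membership when the model's loss on an
    example falls below the threshold (memorized data tends to have
    lower loss)."""
    return loss < threshold

# Pairs of (model loss on example, whether it really was in the training set).
examples = [(0.8, True), (1.5, True), (2.9, False), (3.4, False), (1.9, False)]

correct = sum(infer_membership(loss) == member for loss, member in examples)
print(f"attack accuracy: {correct / len(examples):.2f}")
# -> attack accuracy: 0.80
```

An attack accuracy near 0.5 would indicate that membership cannot be inferred; the thesis finds that even state-of-the-art attacks of this kind struggle to measure how much de-identification actually reduces the risk.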