Stockholms universitet

Thomas Vakili
Doctoral student

About me

I hold a Master of Science in Engineering (civilingenjör) and am a doctoral student at the Department of Computer and Systems Sciences (DSV), where I am supervised by Professor Hercules Dalianis and Associate Professor Aron Henriksson. My research concerns the intersection of natural language processing and privacy protection.

In recent years, the field of natural language processing has undergone a revolution with the advent of pre-trained language models such as BERT. At DSV, we have successfully used this type of model to build medical language technology from electronic health record data.

An important reason these models are so successful is that they are very large and trained on enormous amounts of data. This has, however, created a serious problem: the models leak information about their training data. My research is about building models that protect the privacy of the people mentioned in the training data.

I presented my licentiate thesis in May 2023 and plan to defend my doctoral thesis at the end of 2025. You can read more about my research on my academic website.

Teaching

I teach on several courses and also supervise bachelor's and master's theses. I teach, or have taught, the following courses:

Research projects

Publications

A selection from the Stockholm University publication database

  • When Is a Name Sensitive? Eponyms in Clinical Text and Implications for De-Identification

    2024. Thomas Vakili (et al.). Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024), 76-80

    Conference

    Clinical data, in the form of electronic health records, are rich resources that can be tapped using natural language processing. At the same time, they contain very sensitive information that must be protected. One strategy is to remove or obscure data using automatic de-identification. However, the detection of sensitive data can yield false positives. This is especially true for tokens that are similar in form to sensitive entities, such as eponyms. These names tend to refer to medical procedures or diagnoses rather than specific persons. Previous research has shown that automatic de-identification systems often misclassify eponyms as names, leading to a loss of valuable medical information. In this study, we estimate the prevalence of eponyms in a real Swedish clinical corpus. Furthermore, we demonstrate that modern transformer-based de-identification systems are more accurate in distinguishing between names and eponyms than previous approaches.

  • Attacking and Defending the Privacy of Clinical Language Models

    2023. Thomas Vakili.

    Licentiate thesis

    The state-of-the-art methods in natural language processing (NLP) increasingly rely on large pre-trained transformer models. The strength of the models stems from their large number of parameters and the enormous amounts of data used to train them. The datasets are of a scale that makes it difficult, if not impossible, to audit them manually. When unwieldy amounts of potentially sensitive data are used to train large machine learning models, a difficult problem arises: the unintended memorization of the training data.

    All datasets—including those based on publicly available data—can contain sensitive information about individuals. When models unintentionally memorize these sensitive data, they become vulnerable to different types of privacy attacks. Very few datasets for NLP can be guaranteed to be free from sensitive data. Thus, to varying degrees, most NLP models are susceptible to privacy leakage. This susceptibility is especially concerning in clinical NLP, where the data typically consist of electronic health records. Unintentionally leaking publicly available data can be problematic, but leaking data from electronic health records is never acceptable from a privacy perspective. At the same time, clinical NLP has great potential to improve the quality and efficiency of healthcare.

    This licentiate thesis investigates how these privacy risks can be mitigated using automatic de-identification. This is done by exploring the privacy risks of pre-training using clinical data and then evaluating the impact on the model accuracy of decreasing these risks. A BERT model pre-trained using clinical data is subjected to a training data extraction attack. The same model is also used to evaluate a membership inference attack that has been proposed to quantify the privacy risks associated with masked language models. Then, the impact of automatic de-identification on the performance of BERT models is evaluated for both pre-training and fine-tuning data.

    The results show that extracting training data from BERT models is non-trivial and suggest that the risks can be further decreased by automatically de-identifying the training data. Automatic de-identification is found to preserve the utility of the data used for pre-training and fine-tuning BERT models, resulting in no reduction in performance compared to models trained using unaltered data. However, we also find that the current state-of-the-art membership inference attacks are unable to quantify the privacy benefits of automatic de-identification. The results show that automatic de-identification reduces the privacy risks of using sensitive data for NLP without harming the utility of the data, but that these privacy benefits may be difficult to quantify.

  • Using Membership Inference Attacks to Evaluate Privacy-Preserving Language Modeling Fails for Pseudonymizing Data

    2023. Thomas Vakili, Hercules Dalianis. 24th Nordic Conference on Computational Linguistics (NoDaLiDa), 318-323

    Conference

    Large pre-trained language models dominate the current state-of-the-art for many natural language processing applications, including the field of clinical NLP. Several studies have found that these can be susceptible to privacy attacks that are unacceptable in the clinical domain where personally identifiable information (PII) must not be exposed.

    However, there is no consensus regarding how to quantify the privacy risks of different models. One prominent suggestion is to quantify these risks using membership inference attacks. In this study, we show that a state-of-the-art membership inference attack on a clinical BERT model fails to detect the privacy benefits from pseudonymizing data. This suggests that such attacks may be inadequate for evaluating token-level privacy preservation of PIIs.

  • Downstream Task Performance of BERT Models Pre-Trained Using Automatically De-Identified Clinical Data

    2022. Thomas Vakili (et al.). Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 4245-4252

    Conference

    Automatic de-identification is a cost-effective and straightforward way of removing large amounts of personally identifiable information from large and sensitive corpora. However, these systems also introduce errors into datasets due to their imperfect precision. These corruptions of the data may negatively impact the utility of the de-identified dataset. This paper de-identifies a very large clinical corpus in Swedish either by removing entire sentences containing sensitive data or by replacing sensitive words with realistic surrogates. These two datasets are used to perform domain adaptation of a general Swedish BERT model. The impact of the de-identification techniques is assessed by training and evaluating the models using six clinical downstream tasks. The results are then compared to a similar BERT model domain-adapted using an unaltered version of the clinical corpus. The results show that using an automatically de-identified corpus for domain adaptation does not negatively impact downstream performance. We argue that automatic de-identification is an efficient way of reducing the privacy risks of domain-adapted models and that the models created in this paper should be safe to distribute to other academic researchers.

  • Utility Preservation of Clinical Text After De-Identification

    2022. Thomas Vakili, Hercules Dalianis. Proceedings of the 21st Workshop on Biomedical Language Processing, 383-388

    Conference

    Electronic health records contain valuable information about symptoms, diagnosis, treatment and outcomes of the treatments of individual patients. However, the records may also contain information that can reveal the identity of the patients. Removing these identifiers - the Protected Health Information (PHI) - can protect the identity of the patient. Automatic de-identification is a process which employs machine learning techniques to detect and remove PHI. However, automatic techniques are imperfect in their precision and introduce noise into the data. This study examines the impact of this noise on the utility of Swedish de-identified clinical data by using human evaluators and by training and testing BERT models. Our results indicate that de-identification does not harm the utility for clinical NLP and that human evaluators are less sensitive to noise from de-identification than expected.

  • Cross-Clinic De-Identification of Swedish Electronic Health Records: Nuances and Caveats

    2022. Olle Bridal, Thomas Vakili, Marina Santini. Proceedings of the Language Resources and Evaluation Conference, 49-52

    Conference

    Privacy preservation of sensitive information is one of the main concerns in clinical text mining. Due to the inherent privacy issues that arise when handling clinical data, the clinical corpora used to create the clinical Named Entity Recognition (NER) models underlying clinical de-identification systems cannot be shared. This implies that clinical NER models are trained and tested on data coming from the same institution, because it is rarely possible to evaluate them on data belonging to a different institution. Given these sharing restrictions, it is very hard to assess whether a clinical NER model has overfitted the data or whether it is driven by undetected biases. In this paper, we present the results of the first-ever cross-institution evaluation of a Swedish de-identification system on Swedish clinical data. Alongside the encouraging results, we present a discussion about differences and similarities across EHR naming conventions and NER tagsets.

  • Are Clinical BERT Models Privacy Preserving? The Difficulty of Extracting Patient-Condition Associations

    2021. Thomas Vakili, Hercules Dalianis. Proceedings of the AAAI 2021 Fall Symposium on Human Partnership with Medical AI

    Conference

    Language models may be trained on data that contain personal information, such as clinical data. Such sensitive data must not leak for privacy reasons. This article explores whether BERT models trained on clinical data are susceptible to training data extraction attacks. Multiple large sets of sentences generated from the model with top-k sampling and nucleus sampling are studied. The sentences are examined to determine the degree to which they contain information associating patients with their conditions. The sentence sets are then compared to determine if there is a correlation between the degree of privacy leaked and the linguistic quality attained by each generation technique. We find that the relationship between linguistic quality and privacy leakage is weak and that the risk of a successful training data extraction attack on a BERT-based model is small.

  • Evaluation of LIME and SHAP in Explaining Automatic ICD-10 Classifications of Swedish Gastrointestinal Discharge Summaries

    2022. Alexander Dolk (et al.). Proceedings of the 18th Scandinavian Conference on Health Informatics, 166-173

    Conference

    A computer-assisted coding tool could alleviate the burden on medical staff to assign ICD diagnosis codes to discharge summaries by utilising deep learning models to generate recommendations. However, the opaque nature of deep learning models makes it hard for humans to trust them. In this study, the explainable AI models LIME and SHAP have been applied to the clinical language model SweDeClin-BERT to explain ICD-10 codes assigned to Swedish gastrointestinal discharge summaries. The explanations have been evaluated by eight medical experts, showing a statistically significant difference in explanation quality in favour of SHAP compared to LIME.

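Several of the publications above revolve around automatic de-identification: detecting personally identifiable information with a named entity recognition (NER) model and either removing it or replacing it with realistic surrogates (pseudonymization). As a rough illustration of the idea only, the sketch below uses a generic Hugging Face token-classification pipeline; the model name and surrogate values are placeholders and are not the systems evaluated in these papers.

```python
# Minimal pseudonymization sketch: detect entities with NER, replace with surrogates.
# The NER model and surrogate values are illustrative placeholders, not the
# Swedish clinical de-identification systems used in the publications above.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",   # placeholder English NER model
    aggregation_strategy="simple",
)

# Hypothetical surrogate values per entity type.
SURROGATES = {"PER": "Alex Andersson", "LOC": "Springfield", "ORG": "General Hospital"}


def pseudonymize(text: str) -> str:
    """Replace detected named entities with fixed surrogate values."""
    # Replace from the end of the string so earlier character offsets stay valid.
    entities = sorted(ner(text), key=lambda e: e["start"], reverse=True)
    for ent in entities:
        surrogate = SURROGATES.get(ent["entity_group"])
        if surrogate is not None:
            text = text[: ent["start"]] + surrogate + text[ent["end"]:]
    return text


print(pseudonymize("John Smith was treated by Dr. Jones at St. Mary Hospital in Boston."))
```

A real de-identification system would also handle dates, phone numbers, and other protected health information, and would choose surrogates that preserve formatting; this sketch only shows the detect-and-replace pattern.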
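The membership inference attacks discussed in the licentiate thesis and the NoDaLiDa 2023 paper quantify privacy risk by checking whether a model scores candidate sentences suspiciously well. Below is a minimal sketch of one such loss-based signal for a masked language model, assuming a Hugging Face model; the model name is a placeholder and the scoring is a simplification, not the attack implementation used in the studies.

```python
# Minimal sketch of a loss-based membership inference signal for a masked LM.
# Sentences that the model scores unusually well are candidate training members.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "bert-base-uncased"  # placeholder; the papers study a clinical Swedish BERT
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()


def pseudo_log_likelihood(sentence: str) -> float:
    """Average pseudo-log-likelihood: mask each token in turn and score it."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total, count = 0.0, 0
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        total += log_probs[ids[i]].item()
        count += 1
    return total / max(count, 1)


# Hypothetical usage: compare the score of a candidate sentence against scores
# of reference sentences; a markedly higher score is (weak) evidence of membership.
print(pseudo_log_likelihood("The patient was admitted with chest pain."))
```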
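The AAAI 2021 symposium paper studies training data extraction by generating large sets of sentences from a clinical BERT model with top-k and nucleus sampling. The sketch below shows one simple way to sample text from a masked language model with top-k sampling; the model name, sequence length, and k are placeholders, and the procedure is a simplification of the generation strategies examined in the paper.

```python
# Minimal sketch of sampling text from a masked LM with top-k sampling, the kind
# of generation used in training data extraction attacks. Parameters are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "bert-base-uncased"  # placeholder; the paper studies a clinical BERT
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()


def sample_sentence(length: int = 12, k: int = 40) -> str:
    """Fill an all-[MASK] sequence left to right using top-k sampling."""
    ids = torch.full((1, length), tokenizer.mask_token_id, dtype=torch.long)
    for pos in range(length):
        with torch.no_grad():
            logits = model(ids).logits[0, pos]
        topk = torch.topk(logits, k)
        probs = torch.softmax(topk.values, dim=-1)
        ids[0, pos] = topk.indices[torch.multinomial(probs, 1)].item()
    return tokenizer.decode(ids[0], skip_special_tokens=True)


print(sample_sentence())
```

An attacker would generate many such samples and then search them for memorized, person-identifying content; the papers above report that this is non-trivial for BERT-style models.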

View all publications by Thomas Vakili at Stockholm University