Stockholm University

Beata Megyesi, Professor

About me

I am a Professor of Computational Linguistics. My main research areas are natural language processing and digital philology, and my work centers on cross-disciplinary research that uses AI to facilitate quantitative studies in the humanities and social sciences. Currently, I am working on historical cryptology, analyzing and breaking historical ciphers and codes.

Throughout the years, I have actively taken on a range of academic roles:

  • Chair of the Linguistics Review Panel at the Swedish Research Council, 2024, and member since 2021;
  • Member of the board of the National Research School in Digital Philology (DigPhil), Sweden, 2023-;
  • Member of the board at the Center for Digital Humanities, Uppsala University, Sweden, 2020-2023;
  • President of the Northern European Association for Language Technology (NEALT), 2020-2021, and Vice President, 2018-2019;
  • Head of the Department of Linguistics and Philology, Uppsala University, 2009-2018;
  • Director of the English Park Campus, Uppsala University, 2017-2018.

More details about my research and teaching are provided below.

Teaching

I teach regularly at the undergraduate and advanced levels, primarily in computational linguistics. I am responsible for the international master's program in AI and Language. I am the main supervisor of two PhD students and co-supervisor of one PhD student.

Throughout the years, I have taught courses at three universities: the Dept. of Linguistics at Stockholm University (SU), the Dept. of Linguistics and Philology at Uppsala University (UU), and the Dept. of Speech, Music and Hearing at KTH. I have given various courses in computational linguistics (CL) and general linguistics (GL) from basic to advanced levels, as well as some PhD courses.

Basic level courses:
  • Corpus linguistics, 7.5 ECTS: 2023 (SU)
  • Computational grammar II, 7.5 ECTS: 2004 (UU)
  • Corpus linguistics, 7.5 ECTS: 2005, 2006, 2007 (UU)
  • Introduction to Language Technology: 2015 (UU)
  • Languages, computers, and text processing (in Swedish): 2016 (UU)
  • Techniques for large scale parsing (parts): 2009 (UU)
  • Advisor for Language Technology Project, 7.5 ECTS: 2011-2016 (UU)
  • BA thesis supervision (SU, KTH, UU)
Advanced level courses:
  • Corpus-based methods, 7.5 ECTS: 2023 (SU)
  • Research and development, 15 ECTS: 2021 (UU)
  • Digital philology, 7.5 ECTS: 2018-2019 (UU)
  • Computer-based tools for research in humanities, 7.5 ECTS: 2007-2013 (UU)
  • Thesis work in language technology, 30 ECTS: 2005, 2006, 2007 (UU)
  • Advanced course in corpus linguistics, 7.5 ECTS: 2005 (UU)
  • Advisor for Language Technology Project, 7.5 ECTS: 2011-2016 (UU)
  • Master thesis supervision (UU)
PhD education:
  • I am the main supervisor of Micaella Bruton (SU) and Crina Tudor (SU), and co-supervisor of Oreen Yousuf (UU)
  • I was co-supervisor of Eva Petterson and Mojgan Seraji
  • Natural Language Processing, GSLT, 2008
  • Infrastructural tools for the study of linguistic variation: PhD course at Oslo University, June 2009

Research

I have always been interested in how language is processed by humans, and how it can be processed by machines. My research focuses on the automatic analysis of historical handwritten documents on the one hand, and on large-scale text analysis for research within the humanities and social sciences on the other. I collaborate with researchers nationally and internationally, in Sweden, Germany, Hungary, Norway, Spain, and the USA. Over the past 10 years, my research has received external funding exceeding 4 million euros, and my scientific work has resulted in over 100 scientific articles published in international fora.

Some projects that I have led or contributed to:

  • DECRYPT: Decryption of Historical Manuscripts: PI, Swedish Research Council, 2018-2024 
  • DECODE: Automatic Decoding of Historical Manuscript: PI, Swedish Research Council, 2015-2017
  • HistoCrypt: A scientific forum for historical cryptology 2018-
  • HistCorp: A collection of historical texts for 17 European languages 2015-
  • SWEGRAM: Automatic Annotation and Analysis of Swedish texts, PI; part of the Swe-CLARIN project, Swedish Research Council, 2014-2024
  • SWeLL: Research Infrastructure for Swedish as a second language: co-applicant, RJ, 2017-2019
  • Multilingual Parallel Corpora, Swedish Research Council: member, 2006-2010
  • Methods and Tools for Automatic Grammar Extraction: Swedish Research Council: member, 2005-2007
  • An Infrastructure for Swedish language technology: member, Swedish Research Council, 2007-2008

My work has also been featured in the media.

You can find details about my research in my publications. 

I have also served on numerous committees for doctoral theses and mid-term evaluations, regularly act as a reviewer for conferences and workshops, and have undertaken numerous expert assignments for appointments in both Sweden and abroad. Additionally, I have served as an assessor for projects funded by the Swedish Research Council and the Wallenberg Foundation.

Publications

Beáta Megyesi's publications per year and per type.

A selection from the Stockholm University publication database

  • Historical Cryptology

    2024. Beáta Megyesi (et al.). Learning and Experiencing Cryptography with CrypTool and SageMath

    Chapter

    Historical cryptology studies (original) encrypted manuscripts, often handwritten sources, produced in our history. These historical sources can be found in archives, often hidden without any indexing and therefore hard to locate. Once found they need to be digitized and turned into a machine-readable text format before they can be deciphered with computational methods. The focus of historical cryptology is not primarily the development of sophisticated algorithms for decipherment, but rather the entire process of analysis of the encrypted source from collection and digitization to transcription and decryption. The process also includes the interpretation and contextualization of the message set in its historical context. There are many challenges on the way, such as mistakes made by the scribe, errors made by the transcriber, damaged pages, handwriting styles that are difficult to interpret, historical languages from various time periods, and hidden underlying language of the message. Ciphertexts vary greatly in terms of their code system and symbol sets used with more or less distinguishable symbols. Ciphertexts can be embedded in clearly written text, or shorter or longer sequences of cleartext can be embedded in the ciphertext. The ciphers used mostly in historical times are substitutions (simple, homophonic, or polyphonic), with or without nomenclatures, encoded as digits or symbol sequences, with or without spaces. So the circumstances are different from those in modern cryptography which focuses on methods (algorithms) and their strengths and assumes that the algorithm is applied correctly. For both historical and modern cryptology, attack vectors outside the algorithm are applied like implementation flaws and side-channel attacks. In this chapter, we give an introduction to the field of historical cryptology and present an overview of how researchers today process historical encrypted sources.

  • An STS analysis of a digital humanities collaboration: trading zones, boundary objects, and interactional expertise in the DECRYPT project

    2024. Benedek Láng, Beáta Megyesi. Humanities and Social Sciences Communications 11 (1)

    Article

    A widely shared recognition over the past decade is that the methodology and the basic concepts of science and technology studies (STS) can be used to analyze collaborations in the cross-disciplinary field of digital humanities (DH). The concepts of trading zones (Galison, 2010), boundary objects (Star and Griesemer, 1989), and interactional expertise (Collins and Evans, 2007) are particularly fruitful for describing projects in which researchers from massively different epistemic cultures (Knorr Cetina, 1999) are trying to develop a common language. The literature, however, primarily concentrates on examples where only two parties, historians and IT experts, work together. More exciting perspectives open up for analysis when more than two, more nuanced and different epistemic cultures seek a common language and common research goals. In the DECRYPT project funded by the Swedish Research Council, computational linguists, historians, computer scientists and AI experts, cryptologists, computer vision specialists, historical linguists, archivists, and philologists collaborate with strikingly different methodologies, publication patterns, and approaches. They develop and use common resources (including a database and a large collection of European historical texts) and tools (among others a code-breaking software, a hand-written text recognition tool for transcription), researching partly overlapping topics (handwritten historical ciphers and keys) to reach common goals. In this article, we aim to show how the STS concepts are illuminating when describing the mechanisms of the DECRYPT collaboration and shed some light on the best practices and challenges of a truly cross-disciplinary DH project.

  • Keys with nomenclatures in the early modern Europe

    2024. Beáta Megyesi (et al.). Cryptologia 48 (2), 97-139

    Article

    We give an overview of the development of European historical cipher keys originating from early Modern times. We describe the nature and the structure of the keys with a special focus on the nomenclatures. We analyze what was encoded and how and take into account chronological and regional differences. The study is based on the analysis of over 1,600 cipher keys, collected from archives and libraries in 10 European countries. We show that historical cipher keys evolved over time and became more secure, shown by the symbol set used for encoding, the code length and the code types presented in the key, the size of the nomenclature, as well as the diversity and complexity of linguistic entities that are chosen to be encoded.

  • Supporting Historical Cryptology: The Decrypt Pipeline

    2024. Mihály Héder (et al.). Proceedings of the 7th International Conference on Historical Cryptology (HistoCrypt 2024)

    Conference

    We present a set of resources and tools to support research and development in the field of historical cryptology. The tools aim to support transcription and decipherment of ciphertexts, developed to work together in a pipeline. It encompasses cataloging these documents into the Decode database, which houses ciphers dating from the 14th century to 1965, transcription using both manual and AI-assisted methods, cryptanalysis, and subsequent historical and linguistic analysis to contextualize decrypted content. The project encounters challenges with the accuracy of automated transcription technologies and the necessity for significant user involvement in the transcription and analysis processes. These insights highlight the critical balance between technological innovation and the indispensable input of domain expertise in advancing the field of historical cryptology.

  • A Typology for Cipher Key Instructions in Early Modern Times

    2024. Beáta Megyesi (et al.). Proceedings of the 7th International Conference on Historical Cryptology (HistoCrypt 2024)

    Conference

    We present an empirical study on instructions found in historical cipher keys dating back to early modern times in Europe. The study reveals that instructions in historical cipher keys are prevalent, covering a wide range of themes related to the practical application of ciphers. These include general information about the structure or usage of the cipher key, as well as specific instructions on their application. Being a hitherto neglected genre, these texts provide insight into the practice of cryptographic operations.

  • Exploring the Alignment of Transcriptions to Images of Encrypted Manuscripts

    2024. García Goio (et al.). Proceedings of the 7th International Conference on Historical Cryptology (HistoCrypt 2024)

    Conference

    The automatic transcription of encrypted manuscripts is a challenge due to the different handwriting styles and the often invented symbol alphabets. Many transcription methods require annotated sources, including symbol locations. However, most existing transcriptions are provided at line or page level, making it necessary to find the bounding boxes of the transcribed symbols in the image, a process referred to as alignment. So, in this work, we develop several alignment methods, and discuss their performance on encrypted documents with various symbol sets.

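
The Historical Cryptology abstract above names simple and homophonic substitution, encoded as digit sequences, as the ciphers most common in historical sources. As a rough illustration only, here is a minimal sketch of a homophonic substitution cipher. The key, the two-digit codes, and the function names are invented for this example; they are not taken from the chapter, from any historical key, or from the DECRYPT tools. A simple substitution is the special case where every letter has exactly one code.

```python
import random

# Hypothetical toy key (an assumption for illustration, not a historical key):
# each plaintext letter maps to one or more two-digit codes. Frequent letters
# get several codes (homophones) so that letter frequencies are flattened.
KEY = {
    "a": ["11", "27", "53"],
    "e": ["14", "38", "62", "75"],
    "n": ["19", "44"],
    "t": ["23", "57"],
    "s": ["31"],
}

# Invert the key for decryption: every code points back to exactly one letter.
DECODE = {code: letter for letter, codes in KEY.items() for code in codes}


def encrypt(plaintext: str, rng: random.Random) -> str:
    """Replace each letter with one of its codes, chosen at random."""
    return " ".join(rng.choice(KEY[letter]) for letter in plaintext)


def decrypt(ciphertext: str) -> str:
    """Map each space-separated code back to its plaintext letter."""
    return "".join(DECODE[code] for code in ciphertext.split())


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same ciphertext
    ciphertext = encrypt("senate", rng)
    print(ciphertext)
    print(decrypt(ciphertext))
```

With the key in hand, decryption is a plain lookup. The difficulty described in the abstract arises when the key is unknown, the codes are written without spaces, or the transcription contains errors.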
