Computational linguistics – corpora and resources

On this page, you will find resources and corpora developed by our researchers in computational linguistics. Most of them are freely available to the public.

To see our Natural Language Processing tools, please turn to this page:

Computational linguistics – tools

 

Corpora and resources

Here you will find our various corpora, including the Swedish Blog Sentences (2.7 billion tokens), the Stockholm–Umeå Corpus (1 million words), and the Stockholm University Strindberg Corpus (400,000 tokens).

LONG-MINGLE is a longitudinal corpus of child-directed speech. The corpus consists of orthographic transcripts of audio and video recordings of naturalistic free play sessions. The participating parents, all native speakers of Swedish, were instructed to play with these toys as they normally would at home.

The corpus consists of 57 transcripts from longitudinal dyads with 13 children between 2 and 33 months of age. A subset of this corpus, called MINGLE-3, has been multimodally annotated with eye gaze, gestures, and object-related actions (Nilsson Björkenstam & Wirén, 2014).

LONG-MINGLE is available for research and is distributed as text files.

References

Nilsson Björkenstam, K. & Wirén, M. (2014). Multimodal Annotation of Synchrony in Longitudinal Parent–Child Interaction. In: MMC 2014 Multimodal Corpora: Combining applied and basic research targets: Workshop at LREC 2014. Paper presented at The 9th edition of the Language Resources and Evaluation Conference, 26-31 May, Reykjavik, Iceland. European Language Resources Association.

The SIC project aims to create a freely available corpus of Swedish Internet texts, manually annotated with Part of Speech (PoS) and Named Entity tags. So far, a small corpus (13,562 tokens) of blog texts has been created. The tagset and data format are adapted from the Stockholm–Umeå Corpus (SUC) version 3, on which the corpus is modelled.

SIC is primarily intended for researchers developing and testing Natural Language Processing (NLP) tools working with Internet texts. 

SIC uses a permissive license (CC BY-SA 3.0), allowing researchers to modify and redistribute the corpus. 

Download SIC (zip) (173 Kb)

SMULTRON is a parallel treebank that contains around 1000 sentences in English, German and Swedish. The sentences have been PoS-tagged and annotated with phrase structure trees. The trees have been aligned on sentence, phrase and word level. Additionally, the German and Swedish monolingual treebanks contain lemma information.

SMULTRON is being further developed by the Department of Computational Linguistics at the University of Zurich.

To SMULTRON (cl.uzh.ch)

The Stockholm–Umeå Corpus (SUC) is a collection of Swedish texts, totalling one million words. SUC has been released in three versions: SUC 1.0 (1997), SUC 2.0 (2006) and SUC 3.0 (2012).

Each word in SUC has been annotated with information about part-of-speech, morphological features and citation form. SUC is a balanced corpus, which means that it consists of texts from a wide variety of genres in carefully selected proportions. The texts in SUC were written in the 1990s.

Licensing of SUC has been delegated to Språkbanken at the University of Gothenburg.

To SUC in Språkbanken

SUC-CORE is a 20,000-word subset of the Stockholm–Umeå Corpus (SUC 2.0) annotated with coreference relations between noun phrases. The corpus covers a wide range of genres and domains, and is available for research.

The annotation was done manually by two annotators using BRAT, a web-based tool for collaborative annotation. The annotation task was restricted to three types of referring expressions:

  • Name mentions (NAM): proper names and other named entities.
  • Nominal mentions (NOM): NPs with a lexical noun, or a nominalized adjective or a participle as head.
  • Pronominal mentions (PRO): personal pronouns, demonstrative pronouns, and reflexive pronouns. We also include possessives and genitives in this category.
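
As a rough illustration of working with these three mention types, the sketch below counts NAM, NOM and PRO mentions in a BRAT standoff (.ann) file. It assumes, without confirmation from this page, that the corpus is delivered in BRAT's standoff format and that the type labels are exactly NAM, NOM and PRO. The function name and the sample annotations are invented for the example.

```python
from collections import Counter

# Assumed labels, taken from the three mention types listed above.
MENTION_TYPES = {"NAM", "NOM", "PRO"}


def count_mentions(ann_text):
    """Count text-bound annotations per mention type in BRAT standoff text."""
    counts = Counter()
    for line in ann_text.splitlines():
        # Text-bound annotations have IDs starting with "T";
        # relations, equivalences and notes use other prefixes.
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) < 2:
            continue
        # Second field is "<type> <start> <end>"; keep only the type.
        label = parts[1].split(" ", 1)[0]
        if label in MENTION_TYPES:
            counts[label] += 1
    return counts


# Invented sample for the text "Anna sa att hon ..."
example = (
    "T1\tNAM 0 4\tAnna\n"
    "T2\tPRO 12 15\thon\n"
    "*\tEquiv T1 T2\n"
)
counts = count_mentions(example)
print(counts["NAM"], counts["NOM"], counts["PRO"])
```

How the coreference links themselves are encoded may differ from the `Equiv` line used in this sample, so check the distributed files before relying on it.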

If you have any inquiries regarding SUC-CORE, please contact Robert Östling, section representative for computational linguistics: robert@ling.su.se

The Stockholm University Strindberg Corpus (SUSC) consists of seven novels by August Strindberg:

  • Tjänstekvinnans son (The son of a servant, 1886–87)
  • Han och hon (He and she, 1919)
  • Inferno (Inferno, 1897)
  • Legender and Jakob brottas (Legends and Jacob wrestles, 1898)
  • Fagervik och Skamsund (Fair haven and Foulstrand, 1902)
  • Ensam (Alone, 1903)

The novels are annotated for parts-of-speech with morphological analysis and lemmas. The corpus is freely available.

More about SUSC (DiVA)

The Stockholm University Strindberg Corpus: FIL.tar.gz (45811 Kb)

SWE-AOA is a freely available resource for research on age-of-acquisition in Swedish, that is, the age at which the average child learns a given word. 

The resource is distributed as a zip file consisting of:

  • An Excel spreadsheet with AoA ratings
  • The linguistic data in text (CSV) format

SWE-AoA-means-206words-ADS-CS (590 Kb)

Source to cite

Please cite the following paper if you use this resource:
Wikse Barrow, C., Nilsson Björkenstam, K., & Strömbergsson, S. (2018). Subjective ratings of age-of-acquisition: exploring issues of validity and rater reliability. Journal of Child Language, 45(7). doi.org/10.1017/S0305000918000363

The Swedish Blog Sentences corpus is a collection of sentences from Swedish blog posts published from November 2010 until December 2014. For copyright reasons, the order of the sentences has been randomized so that the original texts cannot be recreated. The data is still useful in many applications, and we publish this resource hoping that it will be of use to the field of Swedish Natural Language Processing.
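
Because only the order of sentences is changed, statistics computed inside each sentence are unaffected. The sketch below illustrates this with bigram counts. It assumes one sentence per line and whitespace tokenization, neither of which is specified here, and the three sentences are invented for the example.

```python
import random
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent token pairs within each sentence (never across sentences)."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Invented example sentences, not taken from the corpus.
sentences = ["jag gillar kaffe", "jag gillar te", "det regnar idag"]

# Shuffle a copy with a fixed seed to mimic the randomized sentence order.
shuffled = sentences[:]
random.Random(0).shuffle(shuffled)

original_counts = bigram_counts(sentences)
shuffled_counts = bigram_counts(shuffled)
print(original_counts == shuffled_counts)
```

Anything that depends on context beyond a single sentence, such as document-level topic modelling, cannot be recovered from the shuffled data.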

License and citation

This data is free to use as long as proper credit is given and any modifications are shared under the same conditions.

Please cite the following paper if you use the Swedish Blog Sentences:

Östling, R. & Wirén, M. (2013). Compounding in a Swedish Blog Corpus. In: Laura Álvarez López, Charlotta Seiler Brylla & Philip Shaw (Eds.), Computer mediated discourse across languages (pp. 45–63). Stockholm: Acta Universitatis Stockholmiensis.
Fulltext in DiVA

Download here

The SNEC corpus contains the National Edition of August Strindberg's Collected Works, provided in a plain text version and a linguistically annotated CoNLL-U version.

The National Edition of August Strindberg's Collected Works was published in 72 printed volumes between 1981 and 2013 and contains Strindberg's complete works along with critical commentaries. SNEC includes a plain text version and a linguistically annotated CoNLL-U version of the works, as well as a plain text version of the critical commentaries. The plain text versions use minimal formatting: one blank line marks a paragraph break or a chapter break. In the CoNLL-U version, one blank line marks a sentence break and two blank lines mark a paragraph or chapter break.
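
The blank-line convention of the annotated version can be read with a few lines of Python. This is a minimal sketch, assuming the blank lines are truly empty (no stray whitespace) and that fields are tab-separated as in standard CoNLL-U. The function name is hypothetical, and the sample is invented and truncated to four columns for brevity; real CoNLL-U rows have ten.

```python
import re


def split_paragraphs(text):
    """Return paragraphs, each a list of sentences, each a list of token rows.

    One blank line ends a sentence; two or more blank lines end a
    paragraph or chapter.
    """
    text = text.replace("\r\n", "\n").strip("\n")
    paragraphs = []
    # Two or more blank lines show up as three or more newlines in a row.
    for para_block in re.split(r"\n{3,}", text):
        sentences = []
        # Within a paragraph, a single blank line separates sentences.
        for sent_block in para_block.split("\n\n"):
            rows = [
                line.split("\t")
                for line in sent_block.split("\n")
                if line and not line.startswith("#")  # skip comment lines
            ]
            if rows:
                sentences.append(rows)
        if sentences:
            paragraphs.append(sentences)
    return paragraphs


# Invented sample: two sentences in the first paragraph, one in the second.
sample = (
    "1\tHan\than\tPRON\n"
    "2\tkom\tkomma\tVERB\n"
    "\n"
    "1\tHon\thon\tPRON\n"
    "2\tgick\tgå\tVERB\n"
    "\n"
    "\n"
    "1\tSlut\tslut\tNOUN\n"
)
paragraphs = split_paragraphs(sample)
print(len(paragraphs), [len(p) for p in paragraphs])
```

Note that the sketch cannot tell a paragraph break from a chapter break, since both are marked the same way.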

Here you can find quantitative word order data for 1295 languages.

Parallel text typology dataset (zenodo.org)
