Representations of Language Varieties Are Reliable Given Corpus Similarity Measures

Type of content
Conference Contributions - Published
Publisher
Association for Computational Linguistics
Date
2021
Authors
Dunn, Jonathan
Abstract

This paper measures similarity both within and between 84 language varieties across nine languages. The corpora are drawn from digital sources (the web and tweets), allowing us to evaluate whether such geo-referenced corpora are reliable for modelling linguistic variation. The basic idea is that, if each source adequately represents a single underlying language variety, then the similarity between these sources should be stable across all languages and countries. Using frequency-based corpus similarity measures, the paper shows consistent agreement between these sources. This provides further evidence that digital geo-referenced corpora consistently represent local language varieties.
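
As an illustration of what a frequency-based corpus similarity measure can look like, the sketch below compares two texts by the Spearman rank correlation of their character trigram frequencies. This is an assumption made for illustration only: the abstract does not state which features or which statistic the paper uses, and the function names (`char_ngrams`, `average_ranks`, `corpus_similarity`) are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def corpus_similarity(text_a, text_b, n=3):
    """Spearman correlation of n-gram frequencies over the shared feature set.

    Illustrative choice of measure, not necessarily the one used in the paper.
    """
    freq_a, freq_b = char_ngrams(text_a, n), char_ngrams(text_b, n)
    features = sorted(set(freq_a) | set(freq_b))
    if len(features) < 2:
        return 0.0
    # Counter returns 0 for n-grams that are absent from one of the texts.
    ranks_a = average_ranks([freq_a[f] for f in features])
    ranks_b = average_ranks([freq_b[f] for f in features])
    return pearson(ranks_a, ranks_b)
```

Under the abstract's reasoning, a score of this kind computed between web and tweet samples of the same variety would be expected to stay at a similar level across languages and countries; the sketch only shows the mechanics of one such score on raw strings.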

Citation
Dunn J (2021). Representations of Language Varieties Are Reliable Given Corpus Similarity Measures. Proceedings of the EACL 2021 Eighth Workshop on NLP for Similar Languages, Varieties and Dialects.
ANZSRC fields of research
Field of Research::20 - Language, Communication and Culture::2004 - Linguistics::200402 - Computational Linguistics
Field of Research::20 - Language, Communication and Culture::2004 - Linguistics::200406 - Language in Time and Space (incl. Historical Linguistics, Dialectology)
Rights
All rights reserved unless otherwise stated