Representations of Language Varieties Are Reliable Given Corpus Similarity Measures (2021)
This paper measures similarity both within and between 84 language varieties across nine languages. These corpora are drawn from digital sources (the web and tweets), allowing us to evaluate whether such geo-referenced corpora are reliable for modelling linguistic variation. The basic idea is that, if each source adequately represents a single underlying language variety, then the similarity between these sources should be stable across all languages and countries. The paper shows that these sources agree consistently when compared using frequency-based corpus similarity measures. This provides further evidence that digital geo-referenced corpora consistently represent local language varieties.
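The abstract does not say which frequency-based measure is used, so the following is only a minimal sketch of one common option: Spearman rank correlation over the frequencies of the most common words in two corpora. The whitespace tokenisation, the `top_n` cutoff, and all function names here are assumptions made for illustration, not the paper's implementation.

```python
from collections import Counter


def frequencies(text):
    """Count lowercased, whitespace-separated tokens (assumed tokenisation)."""
    return Counter(text.lower().split())


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def corpus_similarity(text_a, text_b, top_n=1000):
    """Spearman correlation over the top_n most frequent words of both corpora."""
    freq_a, freq_b = frequencies(text_a), frequencies(text_b)
    features = [word for word, _ in (freq_a + freq_b).most_common(top_n)]
    vector_a = [freq_a[word] for word in features]
    vector_b = [freq_b[word] for word in features]
    # Spearman's rho is the Pearson correlation of the rank vectors.
    return pearson(ranks(vector_a), ranks(vector_b))
```

Under this sketch, two samples of the same variety taken from different sources (for example web and tweets) would be passed to `corpus_similarity`, and the resulting scores compared across varieties to see whether agreement between sources is stable.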
Citation: Dunn J (2021). Representations of Language Varieties Are Reliable Given Corpus Similarity Measures. Proceedings of the EACL 2021 Eighth Workshop on NLP for Similar Languages, Varieties and Dialects.
This citation is automatically generated and may be unreliable. Use as a guide only.
ANZSRC Fields of Research: 20 - Language, Communication and Culture::2004 - Linguistics::200402 - Computational Linguistics
20 - Language, Communication and Culture::2004 - Linguistics::200406 - Language in Time and Space (incl. Historical Linguistics, Dialectology)
Rights: All rights reserved unless otherwise stated
Showing items related by title, author, creator and subject.
Dunn, Jonathan; Adams, Ben (2019) This paper evaluates large georeferenced corpora, taken from both web-crawled and social media sources, against ground-truth population and language-census datasets. The goal is to determine (i) which dataset best ...
Dunn J (2018) This paper formulates and evaluates a series of multi-unit measures of directional association, building on the pairwise ΔP measure, that are able to quantify association in sequences of varying length and type ...
Tayyar Madabushi H; Dunn, Jonathan (Association for Computational Linguistics, 2021) This paper measures the impact of increased exposure on whether learned construction grammars converge onto shared representations when trained on data from different registers. Register influences the frequency of ...