Representations of Language Varieties Are Reliable Given Corpus Similarity Measures (2021)
This paper measures similarity both within and between 84 language varieties across nine languages. These corpora are drawn from digital sources (the web and tweets), allowing us to evaluate whether such geo-referenced corpora are reliable for modelling linguistic variation. The basic idea is that, if each source adequately represents a single underlying language variety, then the similarity between these sources should be stable across all languages and countries. The paper shows that there is a consistent agreement between these sources using frequency-based corpus similarity measures. This provides further evidence that digital geo-referenced corpora consistently represent local language varieties.
Citation: Dunn J (2021). Representations of Language Varieties Are Reliable Given Corpus Similarity Measures. Proceedings of the EACL 2021 Eighth Workshop on NLP for Similar Languages, Varieties and Dialects.
This citation is automatically generated and may be unreliable. Use as a guide only.
ANZSRC Fields of Research: 20 - Language, Communication and Culture::2004 - Linguistics::200402 - Computational Linguistics
20 - Language, Communication and Culture::2004 - Linguistics::200406 - Language in Time and Space (incl. Historical Linguistics, Dialectology)
Rights: All rights reserved unless otherwise stated