Mapping Languages and Demographics with Georeferenced Corpora

Dunn, Jonathan; Adams, Ben

Files

GeoComputation_19.pdf (922.92 KB)

Type of content

Conference Contributions - Published

UC permalink

http://hdl.handle.net/10092/17132

Date

2019

Authors

Dunn, Jonathan

Adams, Ben

Abstract

This paper evaluates large georeferenced corpora, taken from both web-crawled and social media sources, against ground-truth population and language-census datasets. The goal is to determine (i) which dataset best represents population demographics; (ii) in what parts of the world the datasets are most representative of actual populations; and (iii) how to weight the datasets to provide more accurate representations of underlying populations. The paper finds that the two datasets represent very different populations and that they correlate with actual populations with values of r = 0.60 (social media) and r = 0.49 (web-crawled). Further, Twitter data makes better predictions about the inventory of languages used in each country.

Citation

Dunn J, Adams B (2019). Mapping Languages and Demographics with Georeferenced Corpora. Proceedings of Geocomputation 2019.

Keywords

user-generated content, crowdsourcing, language, demographics, population

ANZSRC fields of research

Fields of Research::47 - Language, communication and culture::4704 - Linguistics::470406 - Historical, comparative and typological linguistics
Field of Research::20 - Language, Communication and Culture::2004 - Linguistics::200402 - Computational Linguistics
Field of Research::16 - Studies in Human Society::1603 - Demography::160399 - Demography not elsewhere classified
Field of Research::16 - Studies in Human Society::1604 - Human Geography::160403 - Social and Cultural Geography

Collections

Arts: Journal Articles
