Title: Yr Amliadur: Frequency Lists for Contemporary Welsh (Version 1.0.0)

Citation
Knight D, Morris S, Tovey-Walsh B, et al. (2020). Yr Amliadur: Frequency Lists for Contemporary Welsh (Version 1.0.0). Cardiff University. http://doi.org/10.17035/d.2020.0120164107


Access Rights: Data can be made freely available subject to attribution
Access Method: Email a request for this data to opendata@cardiff.ac.uk

Dataset Details
Publisher: Cardiff University
Date (year) of data becoming publicly available: 2020
Data format: .xls, .pdf
Estimated total storage size of dataset: Less than 100 megabytes
Number of Files In Dataset: 4
DOI: 10.17035/d.2020.0120164107

Description

Yr Amliadur contains the following sample frequency lists of contemporary Welsh language usage:

  • All frequency data, sorted alphabetically (Excel file)
  • All frequency data, in frequency order (Excel file)
  • The 5000 most frequent words, with a separate sheet for each 500-word frequency band (Excel file)
  • PDF file containing the following lists:
    • Top 100 words in CorCenCC (rank ordered list) 
    • Top 1000 words in CorCenCC (ordered alphabetically) 
    • Top 100 lemmas in CorCenCC (rank ordered list) 
    • Top 1000 lemmas in CorCenCC (ordered alphabetically) 
    • Top 100 lemmas in CorCenCC (open-class words only)
    • Top 1000 words in CorCenCC (open-class words only; ordered alphabetically)  
    • Top 500 nouns in CorCenCC (rank ordered list) 
    • Top 500 verbs in CorCenCC (rank ordered list) 
    • Top 500 adjectives in CorCenCC (rank ordered list) 
    • Top 50 adverbs in CorCenCC (rank ordered list)
    • Top 50 interjections in CorCenCC (rank ordered list) 
    • Top 100 open-class words in the written component of CorCenCC (rank ordered list) 
    • Top 100 open-class words in the spoken component of CorCenCC (rank ordered list) 
    • Top 100 open-class words in the e-language component of CorCenCC (rank ordered list) 
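The 500-word frequency bands mentioned above can be illustrated with a minimal Python sketch. It is not the procedure used to build Yr Amliadur. The function name `frequency_bands`, its parameters, and the alphabetical tie-breaking rule are assumptions made purely for illustration. Only the band width of 500 and the cut-off of 5000 come from the description.

```python
from collections import Counter

BAND_SIZE = 500  # band width taken from the dataset description


def frequency_bands(tokens, top_n=5000, band_size=BAND_SIZE):
    """Rank tokens by frequency and group the top_n into fixed-size bands.

    Hypothetical helper for illustration only; ties are broken
    alphabetically, which is an assumption, not the dataset's rule.
    """
    counts = Counter(tokens)
    # Descending frequency first, then alphabetical order for ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:top_n]
    bands = {}
    for rank, (word, freq) in enumerate(ranked, start=1):
        band = (rank - 1) // band_size + 1  # ranks 1-500 -> band 1, 501-1000 -> band 2, ...
        bands.setdefault(band, []).append((rank, word, freq))
    return bands


# Toy usage with a band size of 2 so the grouping is visible.
toy_tokens = ["y"] * 4 + ["yn"] * 3 + ["a"] * 2 + ["i"]
toy_bands = frequency_bands(toy_tokens, top_n=3, band_size=2)
```

With the default settings, a word at rank 1234 falls into band 3 (ranks 1001 to 1500), which would correspond to the third sheet under this reading of the banding.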

The sample frequency lists are based on CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes, the National Corpus of Contemporary Welsh; Knight et al., 2020), which includes 14,338,149 tokens (circa 11.2 million words). The data in CorCenCC represents a wide range of contexts, genres and topics. It has been anonymised, as far as possible, using a combination of manual and automated techniques, and is fully tagged for part of speech (POS) and semantic category.

The research on which this frequency list dataset is based was funded by the UK Economic and Social Research Council (ESRC) and the Arts and Humanities Research Council (AHRC) as the project "Corpws Cenedlaethol Cymraeg Cyfoes (The National Corpus of Contemporary Welsh): A community driven approach to linguistic corpus construction" (Grant Number ES/M011348/1).

All outputs from the CorCenCC project are licensed under Creative Commons CC-BY-SA v4 and thus are freely available for use by professional communities and individuals with an interest in language. Bespoke applications and instructions are provided for each tool. When reporting information derived by using the CorCenCC corpus data and/or tools, CorCenCC should be appropriately acknowledged.


Last updated on 2020-10-11 at 12:30
