Yr Amliadur: Frequency Lists for Contemporary Welsh (Version 1.0.0)
Subjects: Computational/Corpus Linguistics; Linguistics (General); Lexicon; Linguistics

Title: Yr Amliadur: Frequency Lists for Contemporary Welsh (Version 1.0.0)

Citation
Knight D, Morris S, Tovey-Walsh B, et al. (2020). Yr Amliadur: Frequency Lists for Contemporary Welsh (Version 1.0.0). Cardiff University. https://doi.org/10.17035/d.2020.0120164107

Access Rights: Creative Commons Attribution Share Alike 4.0 International

Access Method: To request this data, email opendata@caerdydd.ac.uk

Dataset Creators from Cardiff University

Knight, Dawn

Dataset Details

Publisher: Cardiff University

Date (year) the data became publicly available: 2020

Data format: .xls, .pdf

Estimated total storage size of the dataset: Less than 100 megabytes

Number of files in the dataset: 4

DOI: 10.17035/d.2020.0120164107

DOI URL: http://doi.org/10.17035/d.2020.0120164107

Related URL: https://www.corcencc.org

Description

Yr Amliadur contains the following sample frequency lists of contemporary Welsh language usage:

All frequency data, sorted alphabetically (Excel file)
All frequency data, in frequency order (Excel file)
The 5000 most frequent words, with a separate sheet for each 500-word frequency band (Excel file)
A PDF file containing the following lists:
- Top 100 words in CorCenCC (rank ordered list)
- Top 1000 words in CorCenCC (ordered alphabetically)
- Top 100 lemmas in CorCenCC (rank ordered list)
- Top 1000 lemmas in CorCenCC (ordered alphabetically)
- Top 100 lemmas in CorCenCC (open-class words only)
- Top 1000 words in CorCenCC (open-class words only; ordered alphabetically)
- Top 500 nouns in CorCenCC (rank ordered list)
- Top 500 verbs in CorCenCC (rank ordered list)
- Top 500 adjectives in CorCenCC (rank ordered list)
- Top 50 adverbs in CorCenCC (rank ordered list)
- Top 50 interjections in CorCenCC (rank ordered list)
- Top 100 open-class words in the written component of CorCenCC (rank ordered list)
- Top 100 open-class words in the spoken component of CorCenCC (rank ordered list)
- Top 100 open-class words in the e-language component of CorCenCC (rank ordered list)

The sample frequency lists are based on CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes - National Corpus of Contemporary Welsh; Knight et al., 2020), which includes 14,338,149 tokens (circa 11.2 million words). The data in CorCenCC represents a wide range of contexts, genres and topics. As far as possible, it has been anonymised using a combination of manual and automated techniques, and it is fully tagged for part-of-speech (POS) and semantic categories.
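When comparing entries from these lists with other corpora, raw counts are usually normalised to a rate per million tokens. The sketch below shows that arithmetic using the token total quoted above, together with a helper that maps a rank to a 500-word band. The function names are hypothetical, and two points are assumptions rather than facts about the dataset: that the 14,338,149-token figure is the appropriate denominator, and that the bands in the Excel file run 1-500, 501-1000, and so on.

```python
# Token total reported for CorCenCC in the description above.
CORCENCC_TOKENS = 14_338_149
# Band width taken from the "500-word frequency band" wording (assumed layout).
BAND_SIZE = 500


def per_million(raw_count: int, corpus_size: int = CORCENCC_TOKENS) -> float:
    """Convert a raw count into occurrences per million tokens."""
    if raw_count < 0 or corpus_size <= 0:
        raise ValueError("raw_count must be >= 0 and corpus_size must be > 0")
    return raw_count / corpus_size * 1_000_000


def band_for_rank(rank: int, band_size: int = BAND_SIZE) -> tuple[int, int]:
    """Return the (first_rank, last_rank) of the band containing a 1-based rank."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    band_index = (rank - 1) // band_size
    first_rank = band_index * band_size + 1
    return first_rank, first_rank + band_size - 1


if __name__ == "__main__":
    # Hypothetical raw count, used only to illustrate the conversion.
    print(round(per_million(25_000), 2))
    print(band_for_rank(501))
```

The counts passed in here are illustrative only; real values would come from the Excel files in the dataset.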

The research on which this frequency list dataset is based was funded by the UK Economic and Social Research Council (ESRC) and the Arts and Humanities Research Council (AHRC) through the project "Corpws Cenedlaethol Cymraeg Cyfoes (The National Corpus of Contemporary Welsh): A community driven approach to linguistic corpus construction" (Grant Number ES/M011348/1).

All outputs from the CorCenCC project are licensed under Creative Commons CC-BY-SA v4 and thus are freely available for use by professional communities and individuals with an interest in language. Bespoke applications and instructions are provided for each tool. When reporting information derived by using the CorCenCC corpus data and/or tools, CorCenCC should be appropriately acknowledged.

Research Areas

Related Projects

Corpws Cenedlaethol Cymraeg Cyfoes (The National Corpus of Contemporary Welsh): A community driven approach to linguistic corpus construction (01.03.2016 - 30.11.2020)