This repository contains frequency dictionaries in the form of text files, with one word per line.
The repository is organized into two folders:
freq_dicts_dirty
: Contains dictionaries with words that may not appear in a "standard" dictionary.
freq_dicts_clean
: Contains dictionaries that have been cleaned and supplemented to include only words found in a "standard" dictionary.
The files in the freq_dicts_dirty folder were derived from the LuminosoInsight/wordfreq project. These dictionaries were converted into .txt files with one word per line, ordered by frequency (most frequent words first). Only words longer than two characters were retained.
The conversion process involved:
- Using the jakm/msgpack-cli tool to convert .msgpack files to .json format.
- Transforming the .json files into .txt files with one word per line using sed and grep.
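The second step can also be sketched in Python instead of sed and grep. This is a minimal illustration, not the pipeline actually used here: it assumes the decoded .json file is a list whose first element is a header object and whose remaining elements are lists of words, ordered from the most frequent bucket to the least frequent. The function names flatten_wordlist, keep_long_words, and json_to_txt are hypothetical.

```python
import json
from typing import List


def flatten_wordlist(data: list) -> List[str]:
    """Flatten decoded JSON into a single list of words in frequency order.

    Assumption: data[0] is a header object and every later element is a
    list of words sharing one frequency bucket, most frequent bucket first.
    """
    words: List[str] = []
    for bucket in data[1:]:
        words.extend(bucket)
    return words


def keep_long_words(words: List[str], min_len: int = 3) -> List[str]:
    """Keep only words longer than two characters, preserving order."""
    return [w for w in words if len(w) >= min_len]


def json_to_txt(json_path: str, txt_path: str) -> None:
    """Write one word per line, most frequent first."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    with open(txt_path, "w", encoding="utf-8") as f:
        for word in keep_long_words(flatten_wordlist(data)):
            f.write(word + "\n")
```

If the JSON layout differs from the assumed one, only flatten_wordlist needs to change; the length filter and the output format stay the same.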
The files in the freq_dicts_clean folder were created by cleaning the dictionaries in the freq_dicts_dirty folder. This process involved removing words not found in the corresponding dictionaries from titoBouzout/Dictionaries.
- Files named short_xx.txt retain their original names.
- Files originally named long_xx.txt have been renamed to medium_xx.txt.
- New long_xx.txt files are created from medium_xx.txt (or short_xx.txt when applicable). These are supplemented by appending, in alphabetical order, all words present in the "standard" dictionary but absent from the "frequency" dictionary.
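The cleaning and supplementing steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the script used to build the files: the function names clean and build_long are hypothetical, the "standard" dictionary is assumed to be already loaded as a set of plain words, and "alphabetical order" is approximated with Python's default string sort (code point order), which may differ from a locale-aware sort.

```python
from typing import Iterable, List, Set


def clean(freq_words: Iterable[str], standard: Set[str]) -> List[str]:
    """Drop words missing from the standard dictionary, keeping frequency order."""
    return [w for w in freq_words if w in standard]


def build_long(cleaned: List[str], standard: Set[str]) -> List[str]:
    """Append standard-dictionary words absent from the cleaned list.

    The appended words are sorted with the default string ordering
    (an assumption; the original files may use a different collation).
    """
    missing = sorted(standard - set(cleaned))
    return cleaned + missing


if __name__ == "__main__":
    freq = ["the", "lol", "and", "cat"]
    standard = {"and", "apple", "cat", "the", "zebra"}
    medium = clean(freq, standard)
    print(medium)
    print(build_long(medium, standard))
```

The frequency-ordered words stay at the top of the resulting list, so a consumer that only reads the first N lines still sees the most common words first.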
This repository is licensed under the Apache License, Version 2.0. See the LICENSE file for details.
This repository is based on two primary sources:
- The rspeer/wordfreq project by Robyn Speer.
- Dictionaries from the titoBouzout/Dictionaries repository, originally derived from the OpenOffice dictionary list.
- Robyn Speer must be credited as specified in NOTICE.md.
- For a detailed list of data sources and their licenses, see the original wordfreq NOTICE.md.
- Data from wordfreq/wordfreq is redistributed under terms compatible with their original licenses, including the Creative Commons Attribution-ShareAlike 4.0 license.
- The dictionaries included in this repository are derived from the OpenOffice dictionary list, as referenced in titoBouzout/Dictionaries.
- While no formal license is provided in the source, credits to the original contributors are acknowledged in the respective LANG.txt files in the titoBouzout/Dictionaries repository.
- For more details about the dictionaries' origins and attribution requirements, see NOTICE.md.
The combined content of this repository complies with the terms of the Apache License 2.0 and respects the attribution requirements of the original sources. See NOTICE.md for further details.