Commit
1 parent 2706f08, commit bee5948
Showing 1 changed file with 11 additions and 11 deletions.
@@ -1,24 +1,24 @@
 # Unicode Cookbook for Linguists

-Steven Moran <bambooforest@gmail.com> & Michael Cysouw <cysouw@mac.com>
+[Steven Moran](http://www.comparativelinguistics.uzh.ch/de/moran.html) & [Michael Cysouw](http://cysouw.de/home/index.html)

-## Update
+## Getting the cookbook

-The Unicode Cookbook for Linguists has been accepted for publication in the [Translation and Multilingual Natural Language Processing](http://langsci-press.org/catalog/book/176) series by [Language Science Press](http://langsci-press.org/).
+The Unicode Cookbook for Linguists is published in the [Translation and Multilingual Natural Language Processing](http://langsci-press.org/catalog/book/176) series by [Language Science Press](http://langsci-press.org/).

-## About the cookbook
+The cookbook is available in its most up-to-date form in this directory as [unicode-cookbook.pdf](https://github.com/unicode-cookbook/cookbook/blob/master/unicode-cookbook.pdf).

-The cookbook is available in its most up-to-date form in this directory as [unicode-cookbook.pdf](https://github.com/unicode-cookbook/cookbook/blob/master/unicode-cookbook.pdf).
+[![License: CC BY 4.0](https://licensebuttons.net/l/by/4.0/80x15.png)](http://creativecommons.org/licenses/by/4.0/)

-The text is meant as a practical guide for linguists, and programmers, who work with data in multilingual computational environments. We introduce the basic concepts needed to understand how writing systems and character encodings function, and how they work together.
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1296780.svg)](https://doi.org/10.5281/zenodo.1296780)

-The intersection of the Unicode Standard and the International Phonetic Alphabet is often not met without frustration by users. Nevertheless, the two standards have provided language researchers with a consistent computational architecture needed to process, publish and analyze data from many different languages. We bring to light common, but not always transparent, pitfalls that researchers face when working with Unicode and IPA.
+## Preface

-Our research uses quantitative methods to compare languages and uncover and clarify their phylogenetic relations. However, the majority of lexical data available from the world's languages is in author- or document-specific orthographies. Having identified and overcome the pitfalls involved in making writing systems and character encodings syntactically and semantically interoperable (to the extent that they can be), we created a suite of open-source Python and R tools to work with languages using profiles that adequately describe their orthographic conventions. Using orthography profiles and these tools allows users to segment text, analyze it, identify errors, and to transform it into different written forms.
+This text is meant as a practical guide for linguists and programmers who work with data in multilingual computational environments. We introduce the basic concepts needed to understand how writing systems and character encodings function, and how they work together.

-We welcome comments and corrections of this text or the code in the case studies. Please use the issue tracker in this book's repository or email us directly.
+The intersection of the Unicode Standard and the International Phonetic Alphabet is often met with frustration by users. Nevertheless, the two standards have provided language researchers with the computational architecture needed to process, publish and analyze data from many different languages. We bring to light common, but not always transparent, pitfalls that researchers face when working with Unicode and IPA.

-[![License: CC BY 4.0](https://licensebuttons.net/l/by/4.0/80x15.png)](http://creativecommons.org/licenses/by/4.0/)
+In our research, we use quantitative methods to compare languages to uncover and clarify their phylogenetic relationships. However, the majority of lexical data available from the world's languages is in author- or document-specific orthographies. Having identified and overcome the pitfalls involved in making writing systems and character encodings syntactically and semantically interoperable (to the extent that they can be), we have created a suite of open-source Python and R software packages to work with languages using profiles that adequately describe their orthographic conventions. Using these tools in combination with orthography profiles allows users to tokenize and transliterate text from diverse sources, so that they can be meaningfully compared and analyzed.

-[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.290662.svg)](https://doi.org/10.5281/zenodo.290662)
+We welcome comments and corrections regarding this book, our source code, and the [supplemental case studies](https://github.com/unicode-cookbook/) that we provide online. Please use the [issue tracker](https://github.com/unicode-cookbook/cookbook/issues/), email us directly, or make suggestions on [PaperHive](https://paperhive.org/).