This Java package implements conversion between Unicode Tibetan text and the Extended Wylie transliteration scheme (EWTS). It also provides convenience methods for converting from the Diacritics Transliteration Scheme (DTS) and ALA-LC romanization to EWTS, and from EWTS to ALA-LC romanization.
It is based on the equivalent Perl module, Lingua::BO::Converter.
See Change log for version notes.
Using maven:
<dependency>
<groupId>io.bdrc.ewtsconverter</groupId>
<artifactId>ewts-converter</artifactId>
<version>1.4.2</version>
</dependency>
We provide one Maven option, -DperformRelease=true, which GPG-signs the jar file.
import io.bdrc.ewtsconverter.EwtsConverter;
EwtsConverter wl = new EwtsConverter();
System.out.println(wl.toUnicode("sems can thams cad"));
System.out.println(wl.toWylie("\u0f66\u0f44\u0f66\u0f0b\u0f62\u0f92\u0fb1\u0f66\n"));
You can pass some options to the constructor:
EwtsConverter(boolean check, boolean check_strict, boolean print_warnings, boolean fix_spacing, Mode mode)
check: generate warnings for illegal consonant sequences; default is true.
check_strict: stricter checking, examine the whole stack; default is true.
print_warnings: print generated warnings to System.out; default is false.
fix_spacing: remove spaces after newlines, collapse multiple tseks into one, fix case, etc.; default is true.
mode: an EwtsConverter.Mode value, one of EWTS (default), ALALC (ALA-LC transliteration scheme) or DTS (close to ALA-LC, not publicly documented).
Converts from Wylie (EWTS) to Unicode.
Converts from Wylie (EWTS) to Unicode; puts the generated warnings in the list.
Converts from Unicode to Wylie. Anything that is not Tibetan Unicode is converted to EWTS comment blocks [between brackets].
Converts from Unicode to Wylie; puts the generated warnings in the list. If escape is false, anything that is not Tibetan Unicode is passed through unchanged.
Returns a string normalizing common errors in EWTS.
Returns true if the character is a Tibetan combining character.
Converts a string from DTS to EWTS.
Converts a string from ALA-LC to EWTS.
Converts a string from EWTS to ALA-LC (in NFKD, lower-case). If sloppy is true, also normalizes common errors in EWTS.
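Because the ALA-LC output is described as NFKD and lower-case, a string you compare it against may need the same treatment first. The sketch below uses only the JDK's java.text.Normalizer to bring an arbitrary string into that form. The class and method names (AlalcCompare, toNfkdLower) are hypothetical helpers for illustration and are not part of this package.

```java
import java.text.Normalizer;
import java.util.Locale;

final class AlalcCompare {
    // Hypothetical helper: decomposes precomposed letters (NFKD) and then
    // lower-cases the result, so that "Ā" becomes "a" followed by U+0304.
    static String toNfkdLower(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        return decomposed.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String expected = toNfkdLower("\u0100"); // LATIN CAPITAL LETTER A WITH MACRON
        // One precomposed character becomes a base letter plus a combining mark.
        System.out.println(expected.length());
    }
}
```

Applying the same helper to both sides of a comparison avoids mismatches between precomposed and decomposed diacritics.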
This code should perform quite decently. When converting from EWTS to Unicode, the entire string is split into tokens, which are themselves strings. If this takes too much memory, consider converting your text in smaller chunks. On current hardware, converting several megabytes of Tibetan text in one call should not be a problem. Otherwise, it could be worthwhile to tokenize the input on the fly, rather than all at once.
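Converting in smaller chunks can be sketched as follows. This is a minimal illustration, not code from this package: ChunkedConversion and convertInChunks are hypothetical names, and the converter is passed in as a UnaryOperator<String> so the sketch runs with the JDK alone. It assumes that cutting the input just after a newline does not change the result of the conversion, which has not been verified against options such as fix_spacing.

```java
import java.util.function.UnaryOperator;

final class ChunkedConversion {
    // Hypothetical helper: feeds the input to `convert` in pieces of at most
    // maxChunkChars characters, cutting just after a newline when one exists
    // in the current window. Without a newline it falls back to a hard cut,
    // which may split a syllable.
    static String convertInChunks(String input, int maxChunkChars,
                                  UnaryOperator<String> convert) {
        if (maxChunkChars <= 0) {
            throw new IllegalArgumentException("maxChunkChars must be positive");
        }
        StringBuilder out = new StringBuilder(input.length());
        int start = 0;
        while (start < input.length()) {
            int end = Math.min(start + maxChunkChars, input.length());
            if (end < input.length()) {
                int newline = input.lastIndexOf('\n', end - 1);
                if (newline >= start) {
                    end = newline + 1; // keep the newline with the chunk it ends
                }
            }
            out.append(convert.apply(input.substring(start, end)));
            start = end;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in converter; with this package on the classpath one would
        // pass a method reference such as wl::toUnicode instead.
        UnaryOperator<String> standIn = String::toUpperCase;
        System.out.println(convertInChunks("sems can\nthams cad\n", 10, standIn));
    }
}
```

Choosing the chunk size in characters rather than lines bounds the memory used per call even when the input contains very long lines.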
This class is entirely thread-safe. In a multi-threaded environment, multiple threads can share the same instance without any problems.
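Sharing one instance across threads might look like the sketch below. It is an illustration under assumptions: SharedConverterDemo and convertAll are hypothetical names, and the shared converter is represented by a UnaryOperator<String> so the example compiles with the JDK alone. In real use the operator would be a method reference on a single EwtsConverter instance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.UnaryOperator;

final class SharedConverterDemo {
    // Hypothetical helper: every task calls the same shared converter.
    // Results are returned in the order of the inputs.
    static List<String> convertAll(List<String> inputs,
                                   UnaryOperator<String> sharedConvert,
                                   int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String input : inputs) {
                futures.add(pool.submit(() -> sharedConvert.apply(input)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while converting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("conversion failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a shared converter such as wl::toUnicode.
        UnaryOperator<String> standIn = String::toUpperCase;
        System.out.println(convertAll(List.of("sems can", "thams cad"), standIn, 2));
    }
}
```

Since no per-thread instance or external locking is needed, the converter can be created once and reused for the lifetime of the application.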
For simplicity, we distribute our modifications only under the Apache 2.0 License, but the original version carried this statement:
This library is Free Software. You can redistribute it or modify it, under
the terms of, at your choice, the GNU General Public License (version 2 or
higher), the GNU Lesser General Public License (version 2 or higher), the
Mozilla Public License (any version) or the Apache License version 2 or
higher.
Please contact the author if you wish to use it under some terms not covered
here.