HOWTO: Updating character attributes to the latest Unicode standard

Gauche uses some of the character attributes defined in the Unicode standard. They are extracted from the files provided as the Unicode Character Database (UCD), but we keep them in a packed format for space efficiency.

We use a two-stage build process for this:

             [UCD data files]
                    |
                    |
              make char-data
                    |
                    v
           [src/unicode-data.scm]
                    |
                    |
                   make
                    |
                    v
             [src/char_attr.c]
      [src/gauche/priv/unicode_attr.h]

The first step reads the UCD data files and generates src/unicode-data.scm, which includes only the necessary information. This step needs to be done only when Unicode is updated, so we don’t require developers to do it. We check in src/unicode-data.scm to the repo.

The second step runs when you build Gauche from the repository source tree. When creating a tarball, though, we pre-generate the final char_attr.c and unicode_attr.h so that those who build from the tarball won’t need a preinstalled gosh.

The rest of this document explains what to do when you need to regenerate src/unicode-data.scm.

Preparing UCD data files

You need to obtain the following files from the Unicode site.

UnicodeData.txt
SpecialCasing.txt
PropList.txt
EastAsianWidth.txt
auxiliary/GraphemeBreakProperty.txt
auxiliary/WordBreakProperty.txt

Save those files in some directory (we call it $UNICODEDIR). Note that you need to keep the subdirectory (auxiliary/).

The list of files may grow in future versions of Gauche.
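If you collect the files by hand, a quick check along the following lines can confirm that nothing is missing before you go further. This is only a sketch: the function name check_ucd_files is hypothetical, and the file list is copied from this document, so it may lag behind what the current src/gen-unicode.scm actually reads.

```shell
#!/bin/sh
# Sketch: report which of the UCD files listed above are missing from a directory.
# check_ucd_files is a hypothetical helper, not part of Gauche.
check_ucd_files() {
    dir=$1
    missing=0
    for f in UnicodeData.txt SpecialCasing.txt PropList.txt \
             EastAsianWidth.txt auxiliary/GraphemeBreakProperty.txt \
             auxiliary/WordBreakProperty.txt
    do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    # Exit status 0 means every file was found.
    return "$missing"
}
```

It could be used as check_ucd_files "$UNICODEDIR" && echo "all files present".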

For your convenience, the script src/gen-unicode.scm can download those files:

$ gosh src/gen-unicode.scm --fetch $UNICODEDIR [$UNICODE_VERSION]

This downloads the files into $UNICODEDIR. By default it downloads the latest version; however, if changes in a newer version of Unicode trip the build process, you can give a version number as the second argument to download an older version of the data files.

Review the changes

Property values may be added in a newer Unicode version, so check src/gauche/char_attr.h and the beginning of src/gen-unicode.scm to see whether the code needs updating. We take advantage of the distribution of property values over codepoints for space-efficient encoding (e.g. using flat tables on some ranges and binary trees on others). You may need to tweak the code if new ranges of characters are added.

Check for changes in TR29, too. The state-transition table is hardcoded in ext/text/unicode.scm (it can’t be extracted from the UCD files). Any change in TR29 must be manually reflected in the code.

Generate unicode-data.scm

There’s a make rule for that.

$ cd src
$ make UNICODEDATA=$UNICODEDIR char-data

And it’s done!
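Because make runs inside src/, a relative $UNICODEDIR would be resolved against src/ rather than against the directory you started in. The sketch below wraps the rule above and resolves the directory to an absolute path first. The function name regen_char_data is hypothetical, and it assumes you run it from the top of the Gauche source tree.

```shell
#!/bin/sh
# Sketch: run the char-data rule with an absolute UCD directory.
# regen_char_data is a hypothetical helper, not part of Gauche.
regen_char_data() {
    # Resolve to an absolute path, because make runs inside src/.
    unicodedir=$(cd "$1" 2>/dev/null && pwd) || {
        echo "not a directory: $1" >&2
        return 1
    }
    # ${MAKE:-make} lets you substitute another make command, e.g. gmake.
    (cd src && "${MAKE:-make}" UNICODEDATA="$unicodedir" char-data)
}
```

Since src/unicode-data.scm is checked in, reviewing its diff after the run is a reasonable way to see what the new Unicode version changed.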

Dealing with unicode-data.scm

The module text.unicode.ucd provides utilities to read and write unicode-data.scm and to access the information in it. Its format is described in the source. However, the format may change as Unicode evolves and/or Gauche’s Unicode support is enhanced, so do not count on the current format.

In case you need to hack the Unicode data extraction step, here’s a brief description of text.unicode.ucd.

The file unicode-data.scm holds a serialized unichar-db record. You can read the file with (ucd-load-db <iport>) to get a unichar-db instance, and write the serialized format to the current output port with (ucd-save-db <unichar-db>).

Once you have a unichar-db record, you can query it with the following procedures:

  • (ucd-get-entry <db> <codepoint>) returns a ucd-entry instance that holds the main character attributes, such as the character’s general category, case mapping, etc.

  • (ucd-get-break-property <db> <codepoint>) returns a ucd-break-property instance that holds the character’s Grapheme_Break and Word_Break categories.

  • (ucd-get-east-asian-width <db> <codepoint>) returns a symbol representing the character’s East Asian Width property.
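As a rough illustration of how these calls fit together, the shell sketch below writes a tiny Scheme script that loads unicode-data.scm and prints the East Asian Width of one codepoint. The helper name write_ucd_query and the output path are hypothetical. It assumes that gosh can load text.unicode.ucd (for example, from a built source tree) and that the data file path contains no double quotes or backslashes.

```shell
#!/bin/sh
# write_ucd_query DATAFILE HEXCODEPOINT OUTFILE
# Hypothetical helper: writes a Scheme script to OUTFILE; it does not run gosh.
write_ucd_query() {
    cat > "$3" <<EOF
(use text.unicode.ucd)
;; ucd-load-db takes an input port, so open the data file first.
(define db (call-with-input-file "$1" ucd-load-db))
;; The codepoint is given as hex digits, e.g. 3042.
(print (ucd-get-east-asian-width db #x$2))
EOF
}
```

For example, write_ucd_query src/unicode-data.scm 3042 query.scm followed by gosh query.scm would run the query. The same pattern should apply to ucd-get-entry and ucd-get-break-property, though the accessors for the records they return are described only in the module source.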