Named after Nisaba — the Sumerian goddess of writing and scribe of the gods (𒀭𒉀).
Nisaba is a collection of finite-state transducer (FST) based tools for visual normalization, well-formedness checking, transliteration and NFC normalization of various scripts from South Asia and beyond. The APIs are available in Python and C++. Currently supported script families:
- Brahmic scripts (documentation).
- Alphabets and abjads (documentation).
- Natural transliteration for Brahmic scripts (documentation).
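To make the term "NFC normalization" concrete, here is a minimal sketch using Python's standard `unicodedata` module. It does not use Nisaba's FST-based API; it only illustrates the input/output behaviour that NFC normalization is expected to have on a Brahmic-script string. The helper name `to_nfc` is hypothetical.

```python
import unicodedata

# Bengali KA followed by the two-part vowel sign O, written as two
# separate code points: U+09C7 (vowel sign E) + U+09BE (vowel sign AA).
decomposed = "\u0995\u09c7\u09be"

# The same syllable using the single precomposed vowel sign O (U+09CB).
composed = "\u0995\u09cb"


def to_nfc(text: str) -> str:
    """Return `text` in Unicode Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


# The two strings render identically but differ as code point sequences.
print(decomposed == composed)          # the raw strings are not equal
print(to_nfc(decomposed) == composed)  # after NFC they compare equal
```

Strings that look the same on screen can compare unequal until they are brought to a common normal form, which is why normalization is usually applied before any further text processing.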
Nisaba primarily relies on OpenGrm Pynini, which is a Python toolkit for finite-state grammar development. OpenGrm Pynini, like its C++ counterpart Thrax, compiles grammars expressed as strings, regular expressions, and context-dependent rewrite rules into weighted finite-state transducers (WFSTs). It uses the OpenFst library and its Python extension to create, access and manipulate compiled grammars.
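For readers unfamiliar with context-dependent rewrite rules, the sketch below mimics one with plain regular expressions. It is only an analogy and does not use Pynini: Pynini compiles such rules into weighted transducers, whereas this code applies a regex substitution directly. The function `rewrite` and its parameters are hypothetical names chosen for illustration.

```python
import re


def rewrite(text: str, source: str, target: str,
            left: str = "", right: str = "") -> str:
    """Apply the rule `source -> target / left _ right` to `text`.

    `left` and `right` are regex fragments describing the required
    context; they are matched with lookarounds so they are not consumed.
    `left` must be fixed-width, a limitation of Python lookbehinds.
    """
    pattern = ""
    if left:
        pattern += f"(?<={left})"
    pattern += re.escape(source)
    if right:
        pattern += f"(?={right})"
    # A callable replacement keeps `target` literal (no backslash escapes).
    return re.sub(pattern, lambda _match: target, text)


# Toy rule: "n" becomes "m" before a labial stop ("p" or "b").
print(rewrite("input", "n", "m", right="[pb]"))   # n is followed by p
print(rewrite("intake", "n", "m", right="[pb]"))  # n is followed by t
```

The first call rewrites the "n" because its right context matches; the second leaves the string unchanged because the context does not match.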
This library will build on any system that supports Bazel, a versatile multiplatform build and test tool. The following examples assume a Debian Linux distribution, but should also apply, with minor modifications, to other Linux and non-Linux platforms that Bazel supports.
Your operating system may offer a pre-built Bazel package that is easy to install, as in the Debian-specific example below:
sudo apt-get install bazel
Alternatively, e.g., on macOS, a user-friendly Bazel launcher called Bazelisk can be installed:
BAZEL=bazelisk-darwin-amd64
curl -LO "https://github.com/bazelbuild/bazelisk/releases/latest/download/$BAZEL"
chmod +x $BAZEL
When using Bazelisk, simply replace the command bazel in the examples below with $BAZEL.
Nisaba requires a modern C++ compiler that supports the C++17 standard (e.g., the GCC 10 release series) and Python 3. Assuming these are already present, the remaining dependencies are the Python 3 development headers and the Python 3 package installer pip:
sudo apt-get install python3-dev
sudo apt-get install python3-pip
Example Debian configuration: gcc (10.2.0), bazel (3.7.2), python3 (3.8.6) and pip (20.1.1).
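Before building, it can help to confirm that the prerequisites are visible from your shell and that a virtual environment is active. The following is a minimal preflight sketch, not part of Nisaba. The tool names in `REQUIRED_TOOLS` are assumptions (your compiler may be `clang++`, and your Bazel launcher may be Bazelisk), and the function names are hypothetical.

```python
import os
import shutil
import sys

# Assumed executable names; adjust them to match your system.
REQUIRED_TOOLS = ["g++", "bazel", "python3", "pip3"]


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def in_virtual_env():
    """Best-effort check for an active venv, virtualenv or conda env."""
    in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    in_conda = "CONDA_PREFIX" in os.environ
    return in_venv or in_conda


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
    if not in_virtual_env():
        print("No virtual environment detected.")
```

The check only confirms that the executables exist; it does not verify their versions or C++17 support.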
- Locally, make sure you are in some sort of a virtual environment (venv, virtualenv, conda, etc.).
- Clone the repository (this example clones the main repository rather than a fork, but a fork can be used as well):
git clone https://github.com/google-research/nisaba.git
cd nisaba
- Build all the targets using Bazel (this example uses optimized mode):
bazel build -c opt ...
The above command will build Nisaba artifacts using all the remote repository dependencies, including OpenFst, Pynini and Thrax, that are specified in the Bazel WORKSPACE file. The resulting artifacts are located in the bazel-bin/nisaba directory.
If the above command fails due to missing Python prerequisites, please install them using the pip Python package manager and try again:
pip3 install --upgrade pip
pip3 install -r requirements.txt
- Make sure the small unit tests are passing:
bazel test -c opt --test_size_filters=-large,-enormous ...
The above command should produce something along the following lines:
...
//nisaba/scripts/brahmic:cc_test PASSED in 0.4s
//nisaba/scripts/brahmic:far_cc_test PASSED in 0.2s
//nisaba/scripts/brahmic:far_test PASSED in 2.0s
//nisaba/scripts/brahmic:fixed_test PASSED in 0.2s
//nisaba/scripts/brahmic:fst_properties_test PASSED in 2.3s
//nisaba/scripts/brahmic:iso_test PASSED in 0.3s
//nisaba/scripts/brahmic:nfc_test PASSED in 0.2s
//nisaba/scripts/brahmic:nfc_utf8_test PASSED in 0.2s
//nisaba/scripts/brahmic:py_test PASSED in 2.1s
//nisaba/scripts/brahmic:util_test PASSED in 1.9s
//nisaba/scripts/brahmic:visual_norm_test PASSED in 0.3s
//nisaba/scripts/brahmic:visual_norm_utf8_test PASSED in 0.3s
//nisaba/scripts/brahmic:wellformed_test PASSED in 0.2s
//nisaba/scripts/brahmic:wellformed_utf8_test PASSED in 0.2s
...
You may also want to run all the tests, including the large ones, but depending on your host configuration this may take a long time:
bazel test -c opt ...
NOTE: We don't accept pull requests (PRs) at the moment.
Nisaba is licensed under the terms of the Apache license. See LICENSE for more information.
If you use this software in a publication, please cite the accompanying paper from EACL 2021:
@inproceedings{nisaba-eacl2021,
title = {Finite-state script normalization and processing utilities: The {N}isaba {B}rahmic library},
author = {Cibu Johny and Lawrence Wolf-Sonkin and Alexander Gutkin and Brian Roark},
booktitle = {16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021): System Demonstrations},
address = {[Online], Kyiv, Ukraine},
month = apr,
year = {2021},
pages = {14--23},
publisher = {Association for Computational Linguistics},
doi = {10.18653/v1/2021.eacl-demos.3},
url = {https://www.aclweb.org/anthology/2021.eacl-demos.3},
}
This is not an official Google product.