This repository was archived by the owner on Mar 9, 2023. It is now read-only.

Feature/default dict package #46

Merged
merged 19 commits into from
Jul 7, 2019
1 change: 0 additions & 1 deletion MANIFEST.in
@@ -1,2 +1 @@
include README.md LICENSE requirements.txt
recursive-include resources *.def *.json *.dic
103 changes: 88 additions & 15 deletions README.md
Expand Up @@ -7,42 +7,75 @@ Sudachi & SudachiPy are developed in [WAP Tokushima Laboratory of AI and NLP](ht

**Warning: SudachiPy is still under development, and some of the functions are still not complete. Please use it at your own risk.**

## Breaking changes
### v0.3.0

## Setup
- `resources/` directory was moved to `sudachipy/`.

### v0.2.2

- Distribute SudachiPy package via PyPI
- `pip install SudachiPy`

### v0.2.0

- User dictionary feature added


## Easy Setup

SudachiPy requires Python3.5+.

SudachiPy is not registered to PyPI just yet, so you may not install it via `pip` command at the moment.
You can install SudachiPy and SudachiDict_core packages together from PyPI.

```bash
$ pip install SudachiPy
```
$ pip install -e git+git://github.com/WorksApplications/SudachiPy@develop#egg=SudachiPy
```
The dictionary file is not included in the repository. You can get the built dictionary from [Releases · WorksApplications/Sudachi](https://github.com/WorksApplications/Sudachi/releases). Please download either `sudachi-x.y.z-dictionary-core.zip` or `sudachi-x.y.z-dictionary-full.zip`, unzip and rename it to `system.dic`, then place it under `SudachiPy/resources/`. In the end, we would like to make a flow to get these resources via the code, like [NLTK](https://www.nltk.org/data.html) (e.g., `import nltk; nltk.download()`) or [spaCy](https://spacy.io/usage/models) (e.g., `$python -m spacy download en`).

SudachiPy (>= v0.3.0) uses the `system.dic` of the SudachiDict_core package by default.
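
As a rough illustration of what "by default" means here, the sketch below locates `resources/system.dic` inside an installed dictionary package without importing it. The helper name `find_system_dic` is hypothetical and not part of SudachiPy; the `resources/system.dic` layout is taken from `sudachipy/config.py` in this pull request.

```python
from importlib.util import find_spec
from pathlib import Path


def find_system_dic(package_name="sudachidict_core"):
    """Return the expected path of system.dic inside a package, or None.

    Hypothetical helper: it only computes the path and does not check
    that the dictionary file actually exists.
    """
    spec = find_spec(package_name)
    # find_spec returns None for a missing top-level package;
    # namespace packages may have no origin file.
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent / "resources" / "system.dic"


if __name__ == "__main__":
    print(find_system_dic())  # None when SudachiDict_core is not installed
```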

## Usage

### As a command

After installing SudachiPy, you may also use it in the terminal via command `sudachipy`.
`sudachipy` has 3 subcommands (in default `tokenize`)

You can execute `sudachipy` on standard input like this:
```bash
$ sudachipy
```
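
Reading from standard input becomes possible because the `file` argument is now optional (`nargs=argparse.ZERO_OR_MORE` in `command_line.py`). The following is a minimal sketch of that pattern, not SudachiPy's actual implementation; `build_parser` and `read_lines` are hypothetical names.

```python
import argparse
import sys


def build_parser():
    # Positional argument that accepts zero or more files.
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("in_files", metavar="file",
                        nargs=argparse.ZERO_OR_MORE,
                        help="text written in utf-8")
    return parser


def read_lines(in_files, stdin=sys.stdin):
    """Yield lines from the given files, or from stdin when no file is given."""
    if not in_files:
        for line in stdin:
            yield line.rstrip("\n")
        return
    for name in in_files:
        with open(name, encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n")


if __name__ == "__main__":
    args = build_parser().parse_args()
    for text in read_lines(args.in_files):
        print(text)
```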

`sudachipy` has 4 subcommands (`tokenize` by default)

```bash
$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-a] [-d]
file [file ...]
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-a] [-d] [-v]
[file [file ...]]

Tokenize Text

positional arguments:
file text written in utf-8
file text written in utf-8

optional arguments:
-h, --help show this help message and exit
-r file the setting file in JSON format
-m {A,B,C} the mode of splitting
-o file the output file
-a print all of the fields
-d print the debug information
-h, --help show this help message and exit
-r file the setting file in JSON format
-m {A,B,C} the mode of splitting
-o file the output file
-a print all of the fields
-d print the debug information
-v, --version print sudachipy version
```
```bash
$ sudachipy link -h
usage: sudachipy link [-h] [-t {small,core,full}] [-u]

Link Default Dict Package

optional arguments:
-h, --help show this help message and exit
-t {small,core,full} dict dict
-u unlink sudachidict
```
```bash
$ sudachipy build -h
Expand Down Expand Up @@ -126,6 +159,46 @@ tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
# => 'シミュレーション'
```

## Install dict packages

You can download and install the built dictionaries from [Python packages · WorksApplications/SudachiDict](https://github.com/WorksApplications/SudachiDict#python-packages).

```bash
$ pip install SudachiDict_full-20190531.tar.gz
```

You can change the default dict package by executing the `link` command.

```bash
$ sudachipy link -t full
```

You can remove the default dict setting.

```bash
$ sudachipy link -u
```
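
Per `sudachipy/config.py` in this pull request, linking creates a `sudachidict` symlink next to the chosen dictionary package, and unlinking removes it. The sketch below reproduces that idea on an arbitrary directory; `link_default` and `unlink_default` are hypothetical names, and the real functions locate packages via `import_module` rather than taking a directory argument.

```python
from pathlib import Path


def link_default(site_dir: Path, dict_package: str) -> Path:
    """Point site_dir/sudachidict at site_dir/<dict_package> via a symlink."""
    src = site_dir / dict_package
    if not src.is_dir():
        raise FileNotFoundError("{} not installed".format(dict_package))
    dst = site_dir / "sudachidict"
    if dst.is_symlink():
        dst.unlink()  # replace an existing link
    elif dst.exists():
        raise OSError("sudachidict exists and is not a symlink")
    dst.symlink_to(src, target_is_directory=True)
    return dst


def unlink_default(site_dir: Path) -> bool:
    """Remove the sudachidict symlink; return True if one was removed."""
    dst = site_dir / "sudachidict"
    if dst.is_symlink():
        dst.unlink()
        return True
    return False
```

Creating symlinks may require extra privileges on some platforms (for example Windows), so treat this as an illustration of the mechanism rather than a portable recipe.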

## Customized dictionary

If you need to use a customized `system.dic`,
place [sudachi.json](https://github.com/WorksApplications/Sudachi/blob/develop/src/main/resources/sudachi.json) anywhere you like,
and overwrite the `systemDict` value with the relative path from `sudachi.json` to your `system.dic`.

```
{
"systemDict" : "relative/path/to/system.dic",
...
}
```

Then you can specify `sudachi.json` with the `-r` option.
```bash
$ sudachipy -r path/to/sudachi.json
```
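
Because `systemDict` is resolved relative to the setting file, it can be convenient to compute that path programmatically. A minimal sketch, assuming you already hold the other settings as a dict; `write_settings` is a hypothetical helper, not a SudachiPy function.

```python
import json
import os


def write_settings(settings_path, system_dic_path, base_settings=None):
    """Write a setting file whose `systemDict` is relative to the file itself."""
    settings = dict(base_settings or {})
    settings_dir = os.path.dirname(os.path.abspath(settings_path))
    # Relative path from the directory of sudachi.json to system.dic.
    settings["systemDict"] = os.path.relpath(
        os.path.abspath(system_dic_path), settings_dir)
    with open(settings_path, "w", encoding="utf-8") as f:
        json.dump(settings, f, ensure_ascii=False, indent=4)
    return settings["systemDict"]
```

The directory that will hold `sudachi.json` must already exist; the dictionary file itself is not touched.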

Eventually, we would like to provide a way to fetch these resources from code, like [NLTK](https://www.nltk.org/data.html) (e.g., `import nltk; nltk.download()`) or [spaCy](https://spacy.io/usage/models) (e.g., `$python -m spacy download en`).

## For developer

### Code format
Expand Down
1 change: 1 addition & 0 deletions requirements.txt
Expand Up @@ -2,3 +2,4 @@ sortedcontainers >= 2.1.0, < 2.2.0
# flake8 >= 3.7.7, < 3.8.0
# flake8-import-order >= 0.18.1, < 0.19.0
# flake8-buitins >= 1.4.1, < 1.5.0
https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/SudachiDict_core-20190531.tar.gz
10 changes: 7 additions & 3 deletions setup.py
@@ -1,7 +1,8 @@
from setuptools import setup, find_packages
from sudachipy import SUDACHIPY_VERSION

setup(name="SudachiPy",
version="0.2.1",
version=SUDACHIPY_VERSION,
description="Python version of Sudachi, the Japanese Morphological Analyzer",
long_description=open('README.md').read(),
long_description_content_type="text/markdown",
Expand All @@ -10,9 +11,12 @@
author="Works Applications",
author_email="takaoka_k@worksap.co.jp",
packages=find_packages(include=["sudachipy", "sudachipy.*"]),
package_data={"": ["resources/*.json", "resources/*.dic", "resources/*.def"]},
entry_points={
"console_scripts": ["sudachipy=sudachipy.command_line:main"],
},
install_requires=["sortedcontainers>=2.1.0,<2.2.0"],
include_package_data=True,
install_requires=[
"sortedcontainers>=2.1.0,<2.2.0",
"SudachiDict_core @ https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/SudachiDict_core-20190531.tar.gz",
],
)
2 changes: 2 additions & 0 deletions sudachipy/__init__.py
@@ -1,3 +1,5 @@
from . import utf8inputtextbuilder
from . import tokenizer
from . import config

SUDACHIPY_VERSION = '0.3.0'
40 changes: 33 additions & 7 deletions sudachipy/command_line.py
Expand Up @@ -6,6 +6,8 @@

from . import dictionary
from . import tokenizer
from . import SUDACHIPY_VERSION
from .config import set_default_dict_package, unlink_default_dict_package
from .dictionarylib import BinaryDictionary
from .dictionarylib import SYSTEM_DICT_VERSION, USER_DICT_VERSION_2
from .dictionarylib.dictionarybuilder import DictionaryBuilder
Expand Down Expand Up @@ -70,10 +72,6 @@ def _system_dic_checker(args, print_usage):


def _input_files_checker(args, print_usage):
if not args.in_files:
print_usage()
print('{}: error: no input files'.format(__name__))
exit()
for file in args.in_files:
if not os.path.exists(file):
print_usage()
Expand Down Expand Up @@ -111,7 +109,26 @@ def _command_build(args, print_usage):
builder.build(args.in_files, rf, wf)


def _command_link(args, print_usage):
output = sys.stdout
if args.unlink:
unlink_default_dict_package(output=output)
return

dict_package = 'sudachidict_' + args.dict_type
try:
return set_default_dict_package(dict_package, output=output)
except ImportError:
print_usage()
print('{} not installed'.format(dict_package))
exit()


def _command_tokenize(args, print_usage):
if args.version:
print_version()
return

_input_files_checker(args, print_usage)

if args.mode == "A":
Expand Down Expand Up @@ -140,23 +157,32 @@ def _command_tokenize(args, print_usage):
output.close()


def print_version():
print('sudachipy v{}'.format(SUDACHIPY_VERSION))


def main():
parser = argparse.ArgumentParser(description="Japanese Morphological Analyzer")

subparsers = parser.add_subparsers(description='')

parser.add_argument("-v", "--version", action="version", version="%(prog)s v0.2.0")

# root, tokenizer parser
parser_tk = subparsers.add_parser('tokenize', help='(default) see `tokenize -h`', description='Tokenize Text')
parser_tk.add_argument("-r", dest="fpath_setting", metavar="file", help="the setting file in JSON format")
parser_tk.add_argument("-m", dest="mode", choices=["A", "B", "C"], default="C", help="the mode of splitting")
parser_tk.add_argument("-o", dest="fpath_out", metavar="file", help="the output file")
parser_tk.add_argument("-a", action="store_true", help="print all of the fields")
parser_tk.add_argument("-d", action="store_true", help="print the debug information")
parser_tk.add_argument("in_files", metavar="file", nargs=argparse.ONE_OR_MORE, help='text written in utf-8')
parser_tk.add_argument("-v", "--version", action="store_true", dest="version", help="print sudachipy version")
parser_tk.add_argument("in_files", metavar="file", nargs=argparse.ZERO_OR_MORE, help='text written in utf-8')
parser_tk.set_defaults(handler=_command_tokenize, print_usage=parser_tk.print_usage)

# link default dict package
parser_ln = subparsers.add_parser('link', help='see `link -h`', description='Link Default Dict Package')
parser_ln.add_argument("-t", dest="dict_type", choices=["small", "core", "full"], default="core", help="dict dict")
parser_ln.add_argument("-u", dest="unlink", action="store_true", help="unlink sudachidict")
parser_ln.set_defaults(handler=_command_link, print_usage=parser_ln.print_usage)

# build dictionary parser
parser_bd = subparsers.add_parser('build', help='see `build -h`', description='Build Sudachi Dictionary')
parser_bd.add_argument('-o', dest='out_file', metavar='file', default='system.dic',
Expand Down
60 changes: 57 additions & 3 deletions sudachipy/config.py
@@ -1,9 +1,59 @@
from importlib import import_module
import json
import os
from pathlib import Path
from typing import List

DEFAULT_SETTINGFILE = os.path.join(os.path.dirname(os.path.abspath(__file__)), os.pardir, "resources/sudachi.json")
DEFAULT_RESOURCEDIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), os.pardir, "resources")
DEFAULT_SETTINGFILE = os.path.join(os.path.dirname(os.path.abspath(__file__)), "resources/sudachi.json")
DEFAULT_RESOURCEDIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "resources")


def unlink_default_dict_package(output):
try:
dst_path = Path(import_module('sudachidict').__file__).parent
except ImportError:
print('sudachidict not exists', file=output)
return

if dst_path.is_symlink():
print('unlinking sudachidict', file=output)
dst_path.unlink()
print('sudachidict unlinked', file=output)
if dst_path.exists():
raise IOError('unlink failed (directory exists)')


def set_default_dict_package(dict_package, output):
unlink_default_dict_package(output)

src_path = Path(import_module(dict_package).__file__).parent
dst_path = src_path.parent / 'sudachidict'
dst_path.symlink_to(src_path)
print('default dict package = {}'.format(dict_package), file=output)

return dst_path


def create_default_link_for_sudachidict_core(output):
try:
dict_path = Path(import_module('sudachidict').__file__).parent
except ImportError:
try:
import_module('sudachidict_core')
except ImportError:
raise KeyError('`systemDict` must be specified if `SudachiDict_core` not installed')
try:
import_module('sudachidict_full')
raise KeyError('Multiple packages of `SudachiDict_*` installed. Set default dict with link command.')
except ImportError:
pass
try:
import_module('sudachidict_small')
raise KeyError('Multiple packages of `SudachiDict_*` installed. Set default dict with link command.')
except ImportError:
pass
dict_path = set_default_dict_package('sudachidict_core', output=output)
return dict_path / 'resources' / 'system.dic'


class _Settings(object):
Expand Down Expand Up @@ -37,7 +87,11 @@ def __contains__(self, item):
def system_dict_path(self) -> str:
if 'systemDict' in self.__dict_:
return os.path.join(self.resource_dir, self.__dict_['systemDict'])
raise KeyError('`systemDict` not defined in setting file')
else:
with open(os.devnull, 'w') as f:
dict_path = create_default_link_for_sudachidict_core(output=f)
self.__dict_['systemDict'] = dict_path
return dict_path

def char_def_path(self) -> str:
if 'characterDefinitionFile' in self.__dict_:
Expand Down
File renamed without changes.
File renamed without changes.
@@ -1,5 +1,4 @@
{
"systemDict" : "system.dic",
"characterDefinitionFile" : "char.def",
"inputTextPlugin" : [
{ "class" : "sudachipy.plugin.input_text.DefaultInputTextPlugin" }
Expand Down
File renamed without changes.
1 change: 1 addition & 0 deletions tests/mock_grammar.py
Expand Up @@ -19,6 +19,7 @@ def mocked_get_character_category():
test_resources_dir = os.path.join(
os.path.dirname(os.path.abspath(__file__)),
os.pardir,
'sudachipy',
'resources')
try:
cat.read_character_definition(os.path.join(test_resources_dir, 'char.def'))
Expand Down