Improve kanji-to-hiragana conversion #32
Differences between Janome and pykakasi
Note: cases where both Janome and pykakasi are wrong are not detected by this comparison ❗
$ cat diff_janome_pykakasi.txt | wc -l
503 (501 + Markdown table header 2 lines)
$ cat diff_janome_pykakasi.uniq.txt | wc -l
283 (281 + Markdown table header 2 lines)
$ rg ':white_check_mark:' diff_janome_pykakasi.uniq.txt | wc -l
179
$ rg ':heavy_check_mark:' diff_janome_pykakasi.uniq.txt | wc -l
17
$ rg ':x:' diff_janome_pykakasi.uniq.txt | wc -l
85
$ python
>>> (179 + 17) / 283
0.692
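The tally above can also be reproduced in a single pass. The following is a minimal sketch, assuming each data row of the Markdown table carries exactly one of the three markers. The function names `summarize_markers` and `checked_ratio` and the sample rows are hypothetical, and the meaning of each marker is not defined in this issue.

```python
from collections import Counter

# Markers counted with rg in the transcript above.
MARKERS = (":white_check_mark:", ":heavy_check_mark:", ":x:")


def summarize_markers(lines):
    """Count rows per marker; rows without a marker (e.g. the header) are skipped."""
    counts = Counter()
    for line in lines:
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
                break  # assumption: one marker per row
    return counts


def checked_ratio(counts, total):
    """Share of rows tagged with either check mark, relative to `total`."""
    checked = counts[":white_check_mark:"] + counts[":heavy_check_mark:"]
    return checked / total


# Hypothetical rows, only to show the counting; the real table layout may differ.
sample = [
    "| input | janome | pykakasi | result |",
    "|---|---|---|---|",
    "| a | x | x | :white_check_mark: |",
    "| b | y | z | :x: |",
]
print(summarize_markers(sample))

# Counts reported in the transcript above.
reported = Counter({":white_check_mark:": 179, ":heavy_check_mark:": 17, ":x:": 85})
print(f"{checked_ratio(reported, 283):.4f}")  # 0.6926 (283 lines, header included)
print(f"{checked_ratio(reported, 281):.4f}")  # 0.6975 (281 data rows only)
```

The denominator 283 used above includes the two header lines; dividing by the 281 data rows changes the ratio only slightly.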
cf. (very experimental) How to build janome with the NEologd dictionary bundled · mocobeta/janome Wiki
Setup and steps for building Janome with neologd. Note that the build takes a long time.
FROM python:3.8-slim-buster
ENV DEBIAN_FRONTEND=noninteractive
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
RUN apt-get update && \
    apt-get -y install --no-install-recommends \
    wget \
    jq \
    git curl make zip xz-utils file sudo mecab libmecab-dev mecab-ipadic-utf8 && \
    apt-get autoclean && \
    apt-get clean && \
    apt-get autoremove -y && \
    rm -rf /var/lib/apt/lists/*
RUN git clone --depth=1 https://github.com/neologd/mecab-ipadic-neologd.git /build/mecab-ipadic-neologd
WORKDIR /build/mecab-ipadic-neologd
RUN ./bin/install-mecab-ipadic-neologd -n -a -y
WORKDIR /build
COPY requirements.txt ./requirements.txt
RUN python3 -m pip install --no-cache-dir --upgrade pip && \
    python3 -m pip install --no-cache-dir -r ./requirements.txt && \
    python3 -m pip check
RUN git clone --depth=1 https://github.com/mocobeta/janome.git -b 0.4.1 /build/janome
WORKDIR /build/janome
RUN python setup.py develop && \
    cd ipadic && \
    ./build.sh /build/mecab-ipadic-neologd/build/mecab-ipadic-2.7.0-20070801-neologd-???????? utf8
RUN rm -rf /build
RUN wget -q 'https://raw.githubusercontent.com/yagays/emoji-ja/20190726/data/emoji_ja.json' \
    -O /root/emoji_ja.json
WORKDIR /src
mecab-ipadic-neologd/README.ja.md at master · neologd/mecab-ipadic-neologd
docker run --rm -i -t --cpus 6 -v ${PWD}:/src docker.pkg.github.com/peaceiris/emoji-ime-dictionary/emoji-dev:latest bash
apt-get update && apt-get -y install --no-install-recommends wget jq git curl make zip xz-utils file patch sudo mecab libmecab-dev mecab-ipadic-utf8
git clone --depth=1 https://github.com/neologd/mecab-ipadic-neologd.git /build/mecab-ipadic-neologd
cd /build/mecab-ipadic-neologd
./bin/install-mecab-ipadic-neologd -n -a -y
git clone --depth=1 https://github.com/mocobeta/janome.git -b 0.4.1 /build/janome
cd /build/janome
python setup.py develop && cd ipadic
./build.sh /build/mecab-ipadic-neologd/build/mecab-ipadic-2.7.0-20070801-neologd-???????? utf8 &
jobs
# Detach from the container and wait about 1.5 h
vim janome/version.py
# JANOME_VERSION = '0.4.1-neologd-20200910'
python setup.py sdist
cp dist/Janome-0.4.1-neologd-20200910.tar.gz /src/
pip install dist/Janome-0.4.1-neologd-20200910.tar.gz --no-compile
$ echo "異星人" | janome
異 接頭詞,名詞接続,*,*,*,*,異,イ,イ
星人 名詞,固有名詞,人名,名,*,*,星人,セイジン,セイジン
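To turn this kind of output into hiragana, the katakana reading field can be shifted into the hiragana block. This is a minimal sketch, assuming each line consists of the surface form, whitespace, and comma-separated features with the reading in the eighth field, as in the output above. The helper names `kata_to_hira` and `reading_from_line` are hypothetical and not part of Janome.

```python
def kata_to_hira(text):
    """Convert katakana (U+30A1..U+30F6) to hiragana by shifting code points by 0x60."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


def reading_from_line(line):
    """Extract the reading from one line of janome-style output as hiragana."""
    surface, features = line.split(None, 1)
    fields = features.strip().split(",")
    # Assumption: field index 7 holds the katakana reading; fall back to the
    # surface form when the reading is missing or "*".
    reading = fields[7] if len(fields) > 7 and fields[7] != "*" else surface
    return kata_to_hira(reading)


lines = [
    "異\t接頭詞,名詞接続,*,*,*,*,異,イ,イ",
    "星人\t名詞,固有名詞,人名,名,*,*,星人,セイジン,セイジン",
]
print("".join(reading_from_line(line) for line in lines))  # いせいじん
```

Characters outside the converted range, such as the prolonged sound mark `ー`, are passed through unchanged.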
Differences between Janome + neologd and pykakasi
$ cat diff_janome-neologd_pykakasi.txt | wc -l
593 (591 + Markdown table header 2 lines)
$ cat diff_janome-neologd_pykakasi.txt | uniq > diff_janome-neologd_pykakasi.uniq.txt
$ cat diff_janome-neologd_pykakasi.uniq.txt | wc -l
328
$ rg ':white_check_mark:' diff_janome-neologd_pykakasi.uniq.txt | wc -l
203 (201 + Markdown table header 2 lines)
$ rg ':heavy_check_mark:' diff_janome-neologd_pykakasi.uniq.txt | wc -l
58
$ rg ':x:' diff_janome-neologd_pykakasi.uniq.txt | wc -l
65
$ python
>>> (201 + 58) / 328
0.789
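One caveat about the `uniq` step in this transcript: `uniq` only collapses adjacent duplicate lines, so repeated rows that are not next to each other survive unless the input is sorted first. A minimal sketch of an order-preserving de-duplication follows; the helper name `dedupe_lines` and the sample rows are hypothetical.

```python
def dedupe_lines(lines):
    """Drop repeated lines while keeping the first occurrence of each, in order."""
    # dict preserves insertion order, so the keys are the unique lines in order.
    return list(dict.fromkeys(lines))


rows = ["| a |", "| b |", "| a |", "| b |", "| c |"]
print(dedupe_lines(rows))  # ['| a |', '| b |', '| c |']
```

Whether the two approaches give different counts for the files in this issue depends on whether non-adjacent duplicates actually occur in them.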
Conclusion
TODO
#31 was closed, #40 opened.