A data format for manga transcriptions, and ways to render it into readable formats, statistics and more.
The goal of this project is to be able to write manga transcriptions as data rather than as the final result. The program should parse this data and construct various forms of output from it, such as readables, language reports and statistics.
A transcription should contain the literal text from the manga. Along with it, it can contain auxiliary information such as the text with the kanji replaced, romanized text, a translation or notes.
This lets you keep the transcription alongside the manga while reading, or go back to it later and avoid redoing work such as figuring out the reading of a kanji.
An extra tool for learning Japanese while reading and transcribing manga. It could contain lists such as common kanji and their mappings into hiragana. Simple frequency statistics and sorted output may reveal patterns in the language used, such as the most common words and kanji, which is useful for learning.
A tool of novelty interest. Knowing which locations show up most in the manga, who speaks most, which characters converse a lot with each other, and other trivia may be interesting to fans of the manga. In my experience, when you go through a chapter to transcribe it, transcribing, translating and understanding the language and dealing with kanji take much more effort than simply stating the location and which characters are present. So you might as well record these things too: they add little effort and make it possible to compute insights that are just really fun.
Transcribing into a data format and not into a format that is to be consumed directly has some advantages. Configuration may allow different output to be generated from the data. The reason could be different people's preferences or different goals. If you have done much work and realize you want different formatting, instead of reformatting your work you can just generate a different output. This is one of the main goals of the project since I just kept changing my formatting in the beginning of my transcribing journey.
- parse data
  - picture data: location, characters, nr, page
  - transcription
  - translation
  - kanji map
  - from/to (which character the text comes from and which character it is directed to)
  - 'One Or More' pattern
- general
  - report: volumes and chapters included, picture and morae counts
  - customization with config file
  - recursive locations
- transcribe
  - original text and translation
  - kanji replacement
  - automatic romanization
  - automatic indentation
  - text consistency improvements
  - page headers
- statistics
  - locations ranked on appearances and morae spoken in
  - characters ranked on number of appearances
  - characters ranked on morae spoken
  - characters ranked on morae spoken to by other characters
  - character pairs ranked on number of morae spoken in their interactions
  - characters ranked on overall prominence
- language report
  - hiragana/katakana characters ranked by count
  - kanji ranked by count
  - words ranked by count
The data is written down in TOML files.
Every chapter should be in its own TOML file.
Every file starts with some information about the chapter:
# this is a comment and will be ignored by the program
manga = "日常"
author = "あらゐけいいち"
volume = 1
chapter = 1
subchapter = 0.5
title = "日常の1.5"
The `subchapter` field is optional; the rest are expected to be there.
After that you supply an array of pictures like this:
[[pic]]
# picture data
[[pic]]
# picture data
[[pic]]
# picture data
An example of picture data:
[[pic]]
nr = 2
page = 1
location = "shinonome house"
characters = ["nano"]
[[pic.text]]
# text data
[[pic.text]]
# text data
[[pic.text]]
# text data
The picture number (`nr`) is expected to be there.
`page` is optional and sets the page number. If `page` is not present, the picture is assumed to be on the same page as the last time you declared a page number. The first picture of every chapter must be assigned a page number so the program knows how to continue.
`location`, `characters` and the array of texts are optional. Every chapter's first page must have an initial location.
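
The page rule above can be illustrated with a short Python sketch. This is not the program's actual code: `resolve_pages` is a hypothetical helper, and the pictures are assumed to be already parsed into dictionaries with the field names used in this readme.

```python
def resolve_pages(pics):
    """Return one page number per picture, carrying the last declared page forward."""
    pages = []
    current = None  # no page declared yet
    for pic in pics:
        if "page" in pic:
            current = pic["page"]
        if current is None:
            # The first picture of a chapter has to declare a page.
            raise ValueError("first picture of a chapter must have a page number")
        pages.append(current)
    return pages


# Hypothetical picture data: only pictures 1 and 3 declare a page.
pics = [{"nr": 1, "page": 1}, {"nr": 2}, {"nr": 3, "page": 2}, {"nr": 4}]
print(resolve_pages(pics))  # [1, 1, 2, 2]
```
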
An example of text data:
[[pic.text]]
from = "nano"
to = "hakase"
lines = ["今日", "日直 でしたー"]
kmap = [
["今日", "きょう"],
["日", "にっ"],
["直", "ちょく"],
]
transl = ["Today", "is my shift!"]
notes = ["This is an optional note, the transcriber may want to say something about this text."]
todo = true
`from` describes which character says the current text. I personally regard the narrator as a "character" in this field.
`to` holds the character to whom this text is directed. I personally regard the audience as a "character" in this field. The field is optional: if a character speaks to themselves or thinks internally, you can just leave it out.
`lines` is a mandatory field and should be an array of strings, each containing one line of transcribed text. The text should be transcribed literally from the manga, apart from spaces, and is used for many calculations and transformations. To separate words you can insert ASCII spaces " " between characters. These are removed in the transcription output but are used for things such as the romanized version.
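
As a rough illustration of the two uses of ASCII spaces described above, here is a minimal Python sketch. The helper names `display_line` and `split_words` are made up for this example and are not taken from the program.

```python
def display_line(line):
    """Text as it would appear in the transcription output: ASCII spaces removed."""
    return line.replace(" ", "")


def split_words(line):
    """Word units marked by the transcriber, e.g. as input for romanization."""
    return [word for word in line.split(" ") if word]


line = "日直 でしたー"
print(display_line(line))  # 日直でしたー
print(split_words(line))   # ['日直', 'でしたー']
```
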
`kmap` is an optional field that defines a mapping of kanji to hiragana or katakana. It is used for the kanji substitution.
`transl` holds the translation of `lines`. It's optional, and the `transl` array may have a different number of entries than the `lines` array.
`notes` is optional; you may write down notes about this text here.
`todo` is optional. When it's set to true, it will be logged that this text needs work. You may want to set `todo = true` if there is an error, something is incomplete, etc. This way it is logged every time, so you won't forget it after a while and you know which items need some work. Setting `todo = false` doesn't do anything; it's the same as leaving `todo` out.
When a kanji appears multiple times in a text, you must give the correct mapping as many times as it appears. This is so that it is possible to have sentences that have the same kanji with different mappings. An example of this:
[[pic.text]]
from = "yukko"
lines = ["そう考 えると", "不幸中 の 幸いって", "ヤツだね"]
kmap = [
["考", "かんが"],
["不", "ふ"],
["幸", "こう"],
["中", "ちゅう"],
["幸", "さいわ"],
]
transl = ["If you think about it that way it's a blessing in disguise."]
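
One way to picture why every occurrence needs its own entry: the sketch below walks through the text once and consumes the `kmap` entries in order. This is only an assumption about how such a substitution could work, not a description of the program's internals, and `apply_kmap` is a hypothetical name.

```python
def apply_kmap(lines, kmap):
    """Replace kanji by their readings, using each kmap entry once, in order."""
    result = []
    idx = 0  # index of the next unused kmap entry
    for line in lines:
        out = []
        pos = 0
        while pos < len(line):
            if idx < len(kmap) and line.startswith(kmap[idx][0], pos):
                kanji, reading = kmap[idx]
                out.append(reading)
                pos += len(kanji)
                idx += 1
            else:
                out.append(line[pos])
                pos += 1
        result.append("".join(out))
    return result


lines = ["そう考 えると", "不幸中 の 幸いって", "ヤツだね"]
kmap = [["考", "かんが"], ["不", "ふ"], ["幸", "こう"], ["中", "ちゅう"], ["幸", "さいわ"]]
for line in apply_kmap(lines, kmap):
    print(line)
```

In this sketch the first 幸 becomes こう and the second becomes さいわ, because each entry is used only once and in the order given.
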
The 'One Or More' pattern has been implemented with backwards compatibility. This means that in fields that take multiple values as an array, you can also leave out the array if you have just one value.
For example, when denoting which characters appear in a picture you could write `characters = ["yukko", "mai"]`. For a single character you can write `characters = ["nano"]`. With the 'One Or More' pattern you can write the latter as `characters = "nano"`.
Fields that support the pattern are: `characters`, `from`, `to`, `lines`, `kmap`, `transl` and `notes`.
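
A reader of such files can normalize these fields to lists before using them. The Python sketch below shows the idea on already parsed values. Both helper names are hypothetical, and the treatment of `kmap` assumes that a single mapping would be written as one `[kanji, reading]` pair, which this readme does not spell out.

```python
def one_or_more(value):
    """Normalize a 'One Or More' field: a missing value, a single value or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def one_or_more_pairs(value):
    """Variant for kmap, where a single value is itself a two-element list (assumption)."""
    if value is None:
        return []
    if len(value) == 2 and all(isinstance(v, str) for v in value):
        return [value]
    return value


print(one_or_more("nano"))                   # ['nano']
print(one_or_more(["yukko", "mai"]))         # ['yukko', 'mai']
print(one_or_more_pairs(["今日", "きょう"]))  # [['今日', 'きょう']]
```
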
For an example, check out `example.toml`.
For more information on the TOML language, visit https://toml.io/en/
All example material, such as the example lines and characters in `example.toml` and this readme, is from the manga 日常 by あらゐけいいち. The material is used for educational purposes.
The program is used through a command line interface (CLI).
USAGE:
mangatrans [OPTIONS] <INPUTFILES>...
ARGS:
<INPUTFILES>...
OPTIONS:
-d, --outputdir <OUTPUTDIR>
-h, --help Print help information
-l, --log <log> [default: true]
-m, --mode <MODE> [default: transcribe] [possible values: transcribe, stats, language]
-o, --outputmode <OUTPUTMODE> [default: stdout] [possible values: stdout, file]
-V, --version Print version information
Sample output generated from chapter 1 of the manga 日常.
- picture 65
  - text 1
    - なのちゃん
      休みなんて
      めずらしいね
    - なのちゃん
      やすみなんて
      めずらしいね
    - nano chan
      yasuminante
      mezurashii ne
    - It's rare for Nano chan not to be present, right?
  - text 2
    - 故障かな?
    - こしょうかな?
    - koshou kana?
    - Malfunction maybe?
  - text 3
    - ちょっとやめなよー
      本人バレてないと
      思ってるんだから
    - ちょっとやめなよー
      ほんにんバレてないと
      おもってるんだから
    - chotto yamenayo-
      honnin baretenaito
      omotterundakara
    - Stop it, she doesn't know we found out yet.
Manga: 日常
Volumes: 1
Chapters: 1
Pictures: 74
Morae spoken: 880
Locations:
street: 30 appearances, 339 morae spoken in.
roof: 19 appearances, 130 morae spoken in.
classroom: 17 appearances, 311 morae spoken in.
shinonome house: 3 appearances, 73 morae spoken in.
chimney: 2 appearances, 7 morae spoken in.
school hallway: 1 appearances, 12 morae spoken in.
school grounds: 1 appearances, 8 morae spoken in.
aioi lookalike family garden: 1 appearances, 0 morae spoken in.
Character appearances:
yukko: 33
nano: 24
mio: 19
mai: 9
kokeshi: 6
person: 5
akabeko: 5
headphones guy: 4
aioi mom lookalike: 2
nakanojo: 2
izumi: 1
crow wug: 1
hakase: 1
mono: 1
chissan lookalike: 1
Morae spoken:
yukko: 309
nano: 238
mio: 143
narator: 74
izumi: 65
headphones guy: 16
mai: 14
person: 11
nakanojo: 9
hakase: 1
Morae spoken to:
audience: 74
yukko: 73
class: 65
mio: 55
hakase: 48
mai: 42
person: 17
nakanojo: 3
Conversation pairs in morae:
mio, yukko: 114
audience, narator: 74
class, izumi: 65
hakase, nano: 48
mai, yukko: 40
mai, mio: 16
nakanojo, person: 12
person, person: 8
Kanji frequencies:
降: ふ: 5
生: せい: 4
高: こう: 4
子: こ: 3
麻: ま: 3
今日: きょう: 3
赤: あか: 3
衣: い: 3
雲: のめ: 2
行: い: 2
先: せん: 2
下: くだ: 2
思: おも: 2
日: につ: 2
本: ほん: 2
直: ちょく: 2
人: にん: 2
東: しの: 2
当: あ: 2
原: はら: 2
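
A ranking like the "Conversation pairs in morae" list above could be built roughly as in the following Python sketch. It is an illustration under assumptions, not the program's code: `rank_pairs` is a hypothetical helper, the mora counts are made-up numbers supplied as input, and texts without a `to` field are simply skipped.

```python
from collections import Counter


def rank_pairs(texts):
    """Sum morae per unordered (from, to) pair and rank the pairs by total."""
    totals = Counter()
    for text in texts:
        if "to" not in text:
            continue  # self-talk or thoughts: no conversation partner
        pair = tuple(sorted((text["from"], text["to"])))
        totals[pair] += text["morae"]
    return totals.most_common()


# Hypothetical input: one entry per text, with its mora count already known.
texts = [
    {"from": "yukko", "to": "mio", "morae": 12},
    {"from": "mio", "to": "yukko", "morae": 7},
    {"from": "nano", "to": "hakase", "morae": 9},
    {"from": "nano", "morae": 4},
]
for (a, b), morae in rank_pairs(texts):
    print(f"{a}, {b}: {morae}")
```

Sorting the two names makes "yukko to mio" and "mio to yukko" count toward the same pair, which matches how the pairs are printed in the sample output.
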
Copyright (C) 2024 Cody Bloemhard
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.