feat(gatsby-transformer-remark): Better timeToRead for Chinese/Japanese texts #21312
Looks good! Thanks @jlkiri!
Holy buckets, @jlkiri — we just merged your PR to Gatsby! 💪💜 Gatsby is built by awesome people like you. Let us say “thanks” in two ways:
If there’s anything we can do to help, please don’t hesitate to reach out to us: tweet at @gatsbyjs and we’ll come a-runnin’. Thanks again!
Description
This PR addresses issue #21311.
It provides better Chinese/Japanese character counting heuristics for `gatsby-transformer-remark`, replacing the current one, which outputs `timeToRead` values about twice as high as a native reader would expect.
I considered using actual morphological parsers like kuromoji, but they only target one language at a time, so handling both Chinese and Japanese automatically would require two new libraries and possibly dictionary files for morphological analysis. I feel that is too much for this particular function.
Instead, we can use the fact that most words in both Chinese and Japanese consist of two characters (slightly more for Japanese). After playing with different texts, I found that simply multiplying the non-Latin character count by `0.56` gives almost the same result as analyzing the text with an actual morphological parser (±10 words on average). No libraries needed. This is what I'm doing in this PR.
Note that Korean (which uses whitespace) is already perfectly countable by `_.words`, so I am not dealing with it.
Here is a codesandbox that shows how different approaches count words (`gatsby` is the current one, `smart` is the one in this PR, and morphological is the most correct one): https://codesandbox.io/s/better-word-count-2uziu
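For readers who want to see the heuristic in code, here is a minimal sketch of the idea described above. It is not the code merged in this PR: the names `estimateWordCount`, `CJK_WORD_RATIO`, and `CJK_CHAR` are hypothetical, the choice of Han, Hiragana, and Katakana as the "non-Latin" characters is an assumption, rounding the scaled count is an assumption, and a simple regex stands in for lodash's `_.words` so the snippet runs without dependencies.

```javascript
// Hypothetical sketch of the heuristic; not the implementation from the PR.

// Multiplier quoted in the PR description.
const CJK_WORD_RATIO = 0.56;

// Assumption: Han, Hiragana, and Katakana characters are the ones to scale.
// Hangul is deliberately left out because Korean is whitespace-delimited.
const CJK_CHAR = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]/gu;

function estimateWordCount(text) {
  // Count the characters that cannot be split into words by whitespace.
  const cjkChars = (text.match(CJK_CHAR) || []).length;

  // Replace those characters with spaces so neighboring tokens stay separate.
  const rest = text.replace(CJK_CHAR, " ");

  // Simplified stand-in for _.words: count runs of letters and digits.
  const otherWords = (rest.match(/[\p{L}\p{N}]+/gu) || []).length;

  // Assumption: round the scaled character count to a whole number of words.
  return otherWords + Math.round(cjkChars * CJK_WORD_RATIO);
}

console.log(estimateWordCount("Hello world"));
console.log(estimateWordCount("日本語のテキスト"));
console.log(estimateWordCount("안녕하세요 세계"));
```

Whitespace-delimited text, including Korean, goes through the ordinary word-matching path, and only the Chinese/Japanese characters are scaled by the ratio. Characters the script classes do not cover (for example the Katakana prolonged sound mark) are not handled specially in this sketch.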
Documentation
https://www.gatsbyjs.org/packages/gatsby-transformer-remark/
Related Issues
#21311
#17988