feat(gatsby-transformer-remark): Better timeToRead for Chinese/Japanese texts #21312

jlkiri · 2020-02-09T14:09:00Z

Description

This PR addresses issue #21311.
It provides better Chinese/Japanese character-counting heuristics for gatsby-transformer-remark, replacing the current one, which outputs timeToRead values roughly twice as high as a native reader would expect.

I considered using actual morphological parsers like kuromoji but they only target one language at a time, so dealing with both Chinese/Japanese automatically would require two new libraries and possibly dictionary files for morphological analysis. I feel like that is too much for this particular function.

Instead, we can use the fact that most words in both Chinese and Japanese consist of two characters (slightly more for Japanese). After playing with different texts, I found that simply multiplying the non-Latin character count by 0.56 gives almost the same result as analyzing the text with an actual morphological parser (±10 words on average). No libraries needed. This is what I'm doing in this PR.
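The heuristic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: the function names, the use of Unicode script properties (Han, Hiragana, Katakana) to detect Chinese/Japanese characters, and the 265 words-per-minute reading speed are all hypothetical choices made for the example. Only the 0.56 multiplier comes from the description.

```javascript
// Multiplier from the PR description: CJ character count * 0.56 ≈ word count.
const CJ_WORD_RATIO = 0.56;
// Assumed average reading speed in words per minute (illustrative only).
const AVG_WPM = 265;

// Assumption: treat Han, Hiragana, and Katakana characters as "CJ characters".
const CJ_CHAR = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]/gu;

// Hypothetical helper: estimate the number of words in mixed text.
function estimateWordCount(text) {
  const cjChars = (text.match(CJ_CHAR) || []).length;
  // Blank out CJ characters, then count the remaining whitespace-separated
  // tokens that contain at least one letter or digit (skips bare punctuation).
  const otherWords = text
    .replace(CJ_CHAR, " ")
    .split(/\s+/)
    .filter(token => /[\p{L}\p{N}]/u.test(token)).length;
  return Math.round(cjChars * CJ_WORD_RATIO) + otherWords;
}

// Hypothetical helper: reading time in whole minutes, never less than 1.
function estimateTimeToRead(text) {
  return Math.max(1, Math.round(estimateWordCount(text) / AVG_WPM));
}
```

For example, `estimateWordCount("日本語のテキスト")` sees 8 CJ characters and returns `Math.round(8 * 0.56)`, which is 4, while plain English text is still counted by whitespace-separated tokens.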

Note that Korean (which uses whitespace) is already perfectly countable by _.words so I am not dealing with it.

Here is a CodeSandbox that shows how the different approaches count words (gatsby is the current one, smart is the one in this PR, and morphological is the most correct one):
https://codesandbox.io/s/better-word-count-2uziu

Documentation

https://www.gatsbyjs.org/packages/gatsby-transformer-remark/

Related Issues

#21311
#17988

packages/gatsby-transformer-remark/src/extend-node-type.js

jlkiri · 2020-02-11T03:55:52Z

What are starters_validate tests and why are they failing?

pieh · 2020-02-11T12:25:14Z

What are starters_validate tests and why are they failing?

We run npm audit on our starters, and it sometimes fails on unrelated pull requests when a new advisory is published. This one has already been fixed in master (#21354), so don't worry about it, but you can merge master in to get rid of the failing check here.

pieh

Looks good! Thanks @jlkiri!

gatsbot · 2020-02-11T12:58:27Z

Holy buckets, @jlkiri — we just merged your PR to Gatsby! 💪💜

Gatsby is built by awesome people like you. Let us say “thanks” in two ways:

We’d like to send you some Gatsby swag. As a token of our appreciation, you can go to the Gatsby Swag Store and log in with your GitHub account to get a coupon code good for one free piece of swag. We’ve got Gatsby t-shirts, stickers, hats, scrunchies, and much more. (You can also unlock even more free swag with 5 contributions — wink wink nudge nudge.) See gatsby.dev/swag for details.
We just invited you to join the Gatsby organization on GitHub. This will add you to our team of maintainers. Accept the invite by visiting https://github.com/orgs/gatsbyjs/invitation. By joining the team, you’ll be able to label issues, review pull requests, and merge approved pull requests.

If there’s anything we can do to help, please don’t hesitate to reach out to us: tweet at @gatsbyjs and we’ll come a-runnin’.

Thanks again!

jlkiri added 3 commits February 9, 2020 21:59

Recognize CJ characters

62ddd19

Update test and snapshot

1cc9a99

Add comments

c9b6e19

jlkiri requested a review from a team as a code owner February 9, 2020 14:09

jlkiri changed the title ~~feat(gatsby-transformer-remark): Better time to read~~ feat(gatsby-transformer-remark): Better timeToRead for Chinese/Japanese texts Feb 9, 2020

pieh reviewed Feb 10, 2020


packages/gatsby-transformer-remark/src/extend-node-type.js (outdated)

Move timeToRead to its own file

9cebafa

jlkiri requested a review from pieh February 11, 2020 10:35

pieh approved these changes Feb 11, 2020


Merge remote-tracking branch 'upstream/master' into better-time-to-read

0ddd1b8

pieh added the "bot: merge on green" label (Gatsbot will merge these PRs automatically when all tests pass) Feb 11, 2020

gatsbybot merged commit d677deb into gatsbyjs:master Feb 11, 2020

This was referenced Feb 11, 2020

[gatsby-transformer-remark] timeToRead is wrong for Chinese/Japanese texts #21311

Closed

fix:(gatsby-transformer-remark) Add utils to .gitignore #21441

Closed

pieh mentioned this pull request Feb 17, 2020

chore(gatsby-transformer-remark): gitignore built file #21532

Merged

This was referenced Jul 5, 2020

[gatsby-plugin-mdx] timeToRead is wrong for Chinese/Japanese texts #25532

Closed

handle timeToRead for Chinese/Japanese on mdx #25533

Closed
