confidences of language detection #39

Closed
fergald opened this issue Feb 10, 2025 · 0 comments · Fixed by #40

Language detection returns a map from language to confidence. There are many languages: Chrome's current language detector outputs about 120 entries in this map. Usually a small number will have high confidence and most will have a confidence below 1%. The long tail is just noise. However, it is good to have the confidences sum to 1.0.

If we mask low-confidence languages, we should add their weight to the undetermined language ("und") so the sum continues to be 1.0.

Approaches to cutting off the output:

Fixed cutoff

We set a fixed cutoff and hide any language whose confidence is below it. This could be problematic when a text is genuinely multilingual and contains many small portions of different languages: any that fall below the cutoff will not be mentioned. It's even possible for every language to be below the cutoff.
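
A minimal sketch of the fixed cutoff, to make the failure mode concrete. The function name `applyFixedCutoff`, the plain-object shape of the confidence map, and the sample values are all assumptions for illustration, not part of any existing API. Masked weight is folded into "und" so the total is preserved.

```typescript
// Hypothetical helper: hide every language below `cutoff` and move the
// hidden weight (plus any existing "und" weight) into "und".
function applyFixedCutoff(
  confidences: Record<string, number>,
  cutoff: number
): Record<string, number> {
  const kept: Record<string, number> = {};
  let und = 0;
  for (const lang of Object.keys(confidences)) {
    const confidence = confidences[lang];
    if (lang !== "und" && confidence >= cutoff) {
      kept[lang] = confidence;
    } else {
      und += confidence;
    }
  }
  if (und > 0) {
    kept["und"] = und;
  }
  return kept;
}

// Made-up sample: "es" falls below the 5% cutoff and is folded into "und".
const fixedCutoffResult = applyFixedCutoff(
  { en: 0.6, fr: 0.3, de: 0.06, es: 0.04 },
  0.05
);
```

With ten languages at 0.1 each and a cutoff of 0.2, this sketch returns only "und", which is the "every language is below the cutoff" case described above.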

Fixed cumulative cutoff

We set a fixed cumulative cutoff and sum the weights from highest to lowest until the running total reaches the cutoff. We merge all subsequent languages into "und". E.g. with a cumulative cutoff of 0.99, the returned languages make up at least 99% of the weight and the omitted languages make up at most 1%. If the text contains equal amounts of many different languages, all or most of them will be present in the output.
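
A minimal sketch of the cumulative approach (sort, then accumulate). The function name `applyCumulativeCutoff`, the plain-object map shape, and the sample values are assumptions for illustration. It also assumes that a language is kept while the running total is still below the cutoff, so the kept languages sum to at least the cutoff; the exact boundary behaviour is a design choice the issue does not pin down.

```typescript
// Hypothetical helper: keep the highest-confidence languages until their
// running total reaches `cumulativeCutoff`; merge everything else into "und".
function applyCumulativeCutoff(
  confidences: Record<string, number>,
  cumulativeCutoff: number
): Record<string, number> {
  // Rank real languages from highest to lowest confidence.
  const ranked = Object.keys(confidences)
    .filter((lang) => lang !== "und")
    .sort((a, b) => confidences[b] - confidences[a]);

  const kept: Record<string, number> = {};
  let cumulative = 0;
  let und = confidences["und"] || 0; // carry over any existing "und" weight
  for (const lang of ranked) {
    if (cumulative < cumulativeCutoff) {
      kept[lang] = confidences[lang];
      cumulative += confidences[lang];
    } else {
      und += confidences[lang];
    }
  }
  if (und > 0) {
    kept["und"] = und;
  }
  return kept;
}

// Made-up sample: "en", "fr" and "de" reach 0.995, so "es" and "it"
// are merged into "und".
const cumulativeResult = applyCumulativeCutoff(
  { en: 0.7, fr: 0.25, de: 0.045, es: 0.003, it: 0.002 },
  0.99
);
```

For a text split evenly between four languages at 0.25 each, the same call keeps all four and emits no "und" entry, which matches the multilingual behaviour described above.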

Conclusion

I think fixed cumulative is simple enough to implement (sort, then accumulate). We still need to pick a cutoff. A 1% tail (i.e. a cumulative cutoff of 0.99) seems reasonable: if the tail sums to less than 1%, it seems like it cannot be impactful.

domenic added a commit that referenced this issue Feb 18, 2025
domenic added a commit that referenced this issue Feb 19, 2025
domenic added a commit that referenced this issue Feb 26, 2025