Language detection returns a map of languages to confidences. There are many languages. Chrome's current language detector outputs about 120 entries in this map. Usually a small number will have high confidence and most will have a confidence below 1%. The long tail is just noise. However, it is good for the confidences to sum to 1.0.
If we mask low-confidence languages, we should add their weight to the undefined language ("und") so the sum continues to be 1.0.
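For illustration, here is a hypothetical shape of such a result. The map type and values are assumptions for this sketch, not Chrome's actual API surface:

```ts
// Hypothetical detection output: BCP 47 language tag -> confidence.
// Shape and values are illustrative only.
type DetectionConfidences = Record<string, number>;

const example: DetectionConfidences = {
  en: 0.82,
  fr: 0.11,
  de: 0.04,
  // ...plus ~120 more entries, each well under 1%, padding the sum to 1.0
};

// Invariant any masking step should preserve: the confidences sum to 1.0,
// with dropped tail weight folded into "und".
```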
Approaches to cutting off the output:
Fixed cutoff
We set a fixed cutoff and hide any languages below that cutoff. This could be problematic when a text is genuinely multilingual and contains many smaller portions of different languages: if they are below the cutoff they will not be mentioned. It's possible for every language to be below the cutoff.
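A minimal sketch of this approach, assuming the detector output is a language-tag-to-confidence map (the `applyFixedCutoff` name and the 1% default are illustrative):

```ts
// Fixed per-language cutoff: drop any language below `cutoff` and fold
// the dropped weight into "und" so the confidences still sum to 1.0.
function applyFixedCutoff(
  confidences: Record<string, number>,
  cutoff = 0.01,
): Record<string, number> {
  const out: Record<string, number> = {};
  let masked = 0;
  for (const [lang, conf] of Object.entries(confidences)) {
    if (conf >= cutoff) {
      out[lang] = conf;
    } else {
      masked += conf;
    }
  }
  if (masked > 0) {
    out["und"] = (out["und"] ?? 0) + masked;
  }
  return out;
}
```

Note that if the text splits its weight evenly across, say, 200 languages, every entry falls below a 1% cutoff and the whole result collapses into "und".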
Fixed cumulative cutoff
We set a fixed cutoff and sum the weights from highest to lowest until we exceed the cutoff. We merge all subsequent languages into "und". E.g. with a cumulative cut-off of 0.99, the returned languages make up at least 99% of the weight and the omitted languages make up at most 1%. If the text contains equal amounts of many different languages, all or most of them will be present in the output.
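A minimal sketch of the sort-then-accumulate version, under the same assumed map shape (`applyCumulativeCutoff` and the 0.99 default are illustrative):

```ts
// Cumulative cutoff: keep languages from highest to lowest confidence
// until their running sum exceeds `cumulativeCutoff`, then merge the
// remaining tail into "und". The kept languages cover at least the
// cutoff's share of the weight; the tail covers at most the rest.
function applyCumulativeCutoff(
  confidences: Record<string, number>,
  cumulativeCutoff = 0.99,
): Record<string, number> {
  const sorted = Object.entries(confidences).sort((a, b) => b[1] - a[1]);
  const out: Record<string, number> = {};
  let covered = 0;
  let tail = 0;
  for (const [lang, conf] of sorted) {
    if (covered < cumulativeCutoff) {
      out[lang] = conf;
      covered += conf;
    } else {
      tail += conf;
    }
  }
  if (tail > 0) {
    out["und"] = (out["und"] ?? 0) + tail;
  }
  return out;
}
```

With this shape, a genuinely multilingual text that spreads its weight across many languages keeps all or most of them in the output, which is the behavior argued for above.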
Conclusion
I think fixed cumulative is simple enough to implement (sort, then accumulate). We still need to pick a cutoff. 1% seems reasonable: if the tail sums to less than 1%, it seems like it cannot be impactful.