Language detection returns a map of languages to confidences. There are many languages. Chrome's current language detector outputs about 120 entries in this map. Usually a small number will have high confidence and most will have a confidence below 1%. The long tail is just noise. However, it is good for the confidences to sum to 1.0.
If we mask low-confidence languages, we should add their weight to the undefined language ("und") so the sum continues to be 1.0.
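For illustration, here is a hypothetical shape of such a result. The map type and values are assumptions for this sketch, not Chrome's actual API surface:

```ts
// Hypothetical detection output: BCP 47 language tag -> confidence.
// Shape and values are illustrative only.
type DetectionConfidences = Record<string, number>;

const example: DetectionConfidences = {
  en: 0.82,
  fr: 0.11,
  de: 0.04,
  // ...plus ~120 more entries, each well under 1%, padding the sum to 1.0
};

// Invariant any masking step should preserve: the confidences sum to 1.0,
// with dropped tail weight folded into "und".
```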
Approaches to cutting off the output:
Fixed cutoff
We set a fixed cutoff and hide any languages below that cutoff. This could be problematic when a text is genuinely multilingual and contains many smaller portions of different languages: if they are below the cutoff they will not be mentioned. It's possible for every language to be below the cutoff.
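A minimal sketch of this approach, assuming the detector output is a language-tag-to-confidence map (the `applyFixedCutoff` name and the 1% default are illustrative):

```ts
// Fixed per-language cutoff: drop any language below `cutoff` and fold
// the dropped weight into "und" so the confidences still sum to 1.0.
function applyFixedCutoff(
  confidences: Record<string, number>,
  cutoff = 0.01,
): Record<string, number> {
  const out: Record<string, number> = {};
  let masked = 0;
  for (const [lang, conf] of Object.entries(confidences)) {
    if (conf >= cutoff) {
      out[lang] = conf;
    } else {
      masked += conf;
    }
  }
  if (masked > 0) {
    out["und"] = (out["und"] ?? 0) + masked;
  }
  return out;
}
```

Note that if the text splits its weight evenly across, say, 200 languages, every entry falls below a 1% cutoff and the whole result collapses into "und".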
Fixed cumulative cutoff
We set a fixed cutoff and sum the weights from highest to lowest until we exceed the cutoff. We merge all subsequent languages into "und". E.g. with a cumulative cut-off of 0.99, the returned languages make up at least 99% of the weight and the omitted languages make up at most 1%. If the text contains equal amounts of many different languages, all or most of them will be present in the output.
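A minimal sketch of the sort-then-accumulate version, under the same assumed map shape (`applyCumulativeCutoff` and the 0.99 default are illustrative):

```ts
// Cumulative cutoff: keep languages from highest to lowest confidence
// until their running sum exceeds `cumulativeCutoff`, then merge the
// remaining tail into "und". The kept languages cover at least the
// cutoff's share of the weight; the tail covers at most the rest.
function applyCumulativeCutoff(
  confidences: Record<string, number>,
  cumulativeCutoff = 0.99,
): Record<string, number> {
  const sorted = Object.entries(confidences).sort((a, b) => b[1] - a[1]);
  const out: Record<string, number> = {};
  let covered = 0;
  let tail = 0;
  for (const [lang, conf] of sorted) {
    if (covered < cumulativeCutoff) {
      out[lang] = conf;
      covered += conf;
    } else {
      tail += conf;
    }
  }
  if (tail > 0) {
    out["und"] = (out["und"] ?? 0) + tail;
  }
  return out;
}
```

With this shape, a genuinely multilingual text that spreads its weight across many languages keeps all or most of them in the output, which is the behavior argued for above.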
Conclusion
I think fixed cumulative is simple enough to implement (sort, then accumulate). We still need to pick a cutoff. 1% seems reasonable: if the tail sums to less than 1%, it seems like it cannot be impactful.