
Ideas for pythainlp.lm function #1048

Open
wannaphong opened this issue Dec 27, 2024 · 6 comments
Labels
enhancement enhance functionalities

Comments

@wannaphong
Member

wannaphong commented Dec 27, 2024

I think the pythainlp.lm module should collect functions for preprocessing and post-processing Thai text from LLMs, and include a small language model that can run on a home computer for simple NLP jobs.

Preprocessing

Post-processing
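
The thread does not yet pin down what these helpers would look like, so here is a minimal, hypothetical sketch of the kind of post-processing utilities the module could collect (the function names `remove_markdown_fences` and `normalize_spaces` are illustrative, not an existing PyThaiNLP API):

```python
import re


def remove_markdown_fences(text: str) -> str:
    """Strip a ``` code-fence wrapper that LLMs often add around answers.

    Hypothetical helper for pythainlp.lm post-processing; not a real API.
    """
    # Drop a leading fence line such as ```json and a trailing ``` line.
    text = re.sub(r"^```[^\n]*\n", "", text)
    text = re.sub(r"\n```\s*$", "", text)
    return text.strip()


def normalize_spaces(text: str) -> str:
    """Collapse runs of whitespace in LLM output into single spaces."""
    return re.sub(r"\s+", " ", text).strip()
```

A preprocessing counterpart could live alongside these, e.g. prompt-template cleaning, but that depends on decisions not yet made in this issue.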

@wannaphong wannaphong moved this to In progress in PyThaiNLP Dec 27, 2024
@bact
Member

bact commented Dec 27, 2024

If we're going to have a small language model as well, should we call the module just "lm"?
Just to make it more generic.

@wannaphong
Member Author

If we're going to have a small language model as well, should we call the module just "lm"? Just to make it more generic.

Agree 👍

@wannaphong wannaphong changed the title Ideas for pythainlp.llm function Ideas for pythainlp.lm function Dec 28, 2024
@bact bact added the enhancement enhance functionalities label Dec 30, 2024
@matichon-vultureprime

How about leveraging NVIDIA-Curator to do pre-processing and post-processing?

We already have some examples from the NVIDIA team:

@wannaphong
Member Author

Add pythainlp.lm.calculate_ngram_counts #1054
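
For readers unfamiliar with the feature referenced in #1054: the idea of counting n-grams over a token sequence can be sketched as below. This is an illustrative stand-in; the actual signature and behaviour of `pythainlp.lm.calculate_ngram_counts` may differ.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count n-grams over a token sequence.

    Illustrative sketch only; see pythainlp.lm.calculate_ngram_counts
    (#1054) for the real implementation.
    """
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
```

For example, `ngram_counts(["กิน", "ข้าว", "กิน", "ข้าว"], 2)` counts the bigram ("กิน", "ข้าว") twice and ("ข้าว", "กิน") once.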

@bact
Member

bact commented Jan 5, 2025

For the "small language model", what about having that model as a core (or cores) for most of the basic tasks in PyThaiNLP that don't require a larger model? That way we would have fewer dependencies as well.

Related to

@wannaphong
Copy link
Member Author

For the "small language model", what about having that model as a core (or cores) for most of the basic tasks in PyThaiNLP that don't require a larger model? That way we would have fewer dependencies as well.

Related to

* [Porting model to ONNX model #639](https://github.com/PyThaiNLP/pythainlp/issues/639)

* [Porting Thai2fit from fastai v1 to fastai v2 #716](https://github.com/PyThaiNLP/pythainlp/issues/716)

* [Remove all python-crfsuite models from PyThaiNLP #655](https://github.com/PyThaiNLP/pythainlp/issues/655)

* [Consider reduce dependencies #935](https://github.com/PyThaiNLP/pythainlp/issues/935)

Just llama-cpp-python or an ONNX model. I think that is OK.
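
As a toy illustration of the "small core model" idea discussed above (not PyThaiNLP code, and much simpler than an ONNX or llama.cpp model), even a character-bigram model can back simple tasks such as scoring which of two strings looks more like the training text:

```python
import math
from collections import Counter


class CharBigramLM:
    """Tiny character-bigram language model with add-one smoothing.

    A toy stand-in for the "small core model" idea; a real PyThaiNLP
    core model would be an ONNX or llama.cpp artifact, not this class.
    """

    def __init__(self, corpus: str):
        self.bigrams = Counter(zip(corpus, corpus[1:]))
        self.unigrams = Counter(corpus)
        self.vocab = len(self.unigrams) or 1

    def logprob(self, text: str) -> float:
        """Sum of smoothed log P(b | a) over character bigrams in text."""
        score = 0.0
        for a, b in zip(text, text[1:]):
            score += math.log(
                (self.bigrams[(a, b)] + 1)
                / (self.unigrams[a] + self.vocab)
            )
        return score
```

The point is only that a small, dependency-free model can already serve basic scoring tasks; whether such tasks belong to the proposed core model is still open in this thread.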
