Support translation with reasoning models #650

Open
highkay opened this issue Feb 19, 2025 · 7 comments · May be fixed by #653
Labels
enhancement New feature or request

Comments

@highkay
Contributor

highkay commented Feb 19, 2025

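In what scenario do you need the requested feature?

Reasoning models produce noticeably better translations than the original (non-reasoning) models.

Solution

The main motivation is that Groq offers free distilled reasoning models, chiefly deepseek-r1-distill-qwen-32b.

I am working on this feature; the code is below.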

# Filter out the content inside <think> tags
if "<think>" in content and "</think>" in content:
    content = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)

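I added it inside the do_translate method of OpenAITranslator and then ran locally with python pdf2zh.pdf2zh -i -d. The problem is that it has no effect: the translated output still contains the tags. I also wrote a separate unittest, and there it works fine: the tags are removed.

Please take a look. I will open a PR once this is resolved.

Additional context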

No response

@highkay highkay added the enhancement New feature or request label Feb 19, 2025
@awwaawwa
Collaborator

#637 added this for ollama; take a look at it for reference.

@awwaawwa
Collaborator

Is groq perhaps a separate class in the code?

@highkay
Contributor Author

highkay commented Feb 20, 2025

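#637 added this for ollama; take a look at it for reference.

Its core code is the same as mine, but the condition differs: it restricts the filter to specific models. Distilled models also emit think tags, though, and nobody is going to translate articles with a full-size reasoning model: it is too expensive and the gain is limited (my impression is that the context window extracted per request is too short, so the ceiling is low). The free distilled models are a very good fit. I also compared the results: they are clearly much better than glm4-flash (roughly equivalent to chatglm4-9b), and much faster.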
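
The difference between the two conditions can be sketched as follows. This is only an illustration: the allow-list REASONING_MODELS and both function names are hypothetical and are not taken from #637 or from pdf2zh; the sample response is made up.

```python
import re

# Hypothetical allow-list, for illustration only (not the condition used in #637).
REASONING_MODELS = {"deepseek-r1"}

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_if_model_listed(content: str, model: str) -> str:
    """Model-gated: strip only when the model name is on the allow-list."""
    if model in REASONING_MODELS:
        return THINK_BLOCK.sub("", content)
    return content


def strip_if_tags_present(content: str) -> str:
    """Content-gated: strip whenever the response contains a think block."""
    if "<think>" in content and "</think>" in content:
        return THINK_BLOCK.sub("", content)
    return content


# Made-up response from a distilled model whose name is not on the allow-list.
response = "<think>reasoning steps</think>translated text"
distilled_model = "deepseek-r1-distill-qwen-32b"

print(strip_if_model_listed(response, distilled_model))  # tags survive
print(strip_if_tags_present(response))                   # tags removed
```

With an exact-match allow-list, a distilled model slips through unless its name is added explicitly, whereas the content-based check does not need to know the model name at all.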

@highkay
Contributor Author

highkay commented Feb 20, 2025

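Is groq perhaps a separate class in the code?

groq inherits from OpenAITranslator and does not override do_translate, so I should modify do_translate in the parent class OpenAITranslator, right? My unittest is also based on the Groq translator, and it runs fine: there are no think tags in the output.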
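
The inheritance reasoning above can be checked with a minimal sketch. ParentTranslator, ChildTranslator and fetch_completion are hypothetical stand-ins, not the real pdf2zh classes, and the canned reply replaces the actual API call.

```python
import re


class ParentTranslator:
    """Hypothetical stand-in for OpenAITranslator."""

    def fetch_completion(self, text: str) -> str:
        # Placeholder for the real API call; returns a canned reasoning-style reply.
        return f"<think>working it out</think>{text.upper()}"

    def do_translate(self, text: str) -> str:
        content = self.fetch_completion(text)
        # The filter from the issue, applied once in the parent class.
        if "<think>" in content and "</think>" in content:
            content = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)
        return content


class ChildTranslator(ParentTranslator):
    """Hypothetical stand-in for GroqTranslator: no do_translate override."""


# The child resolves do_translate to the parent's implementation.
print(ChildTranslator.do_translate is ParentTranslator.do_translate)
print(ChildTranslator().do_translate("hello"))
```

Under these assumptions, a filter placed in the parent's do_translate is picked up by every subclass that does not override the method.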

import unittest
from pdf2zh.translator import GroqTranslator
from pdf2zh import cache

class TestGroqTranslator(unittest.TestCase):
    def setUp(self):
        self.test_db = cache.init_test_db()
        # Mock environment variables and config 
        self.test_env = {
            "GROQ_API_KEY": "xxxxxx",
            "GROQ_MODEL": "deepseek-r1-distill-qwen-32b"
        }
        
    def tearDown(self):
        cache.clean_test_db(self.test_db)
        
    def test_do_translate_success(self):

        # Create translator instance
        translator = GroqTranslator(
            lang_in="en",
            lang_out="zh",
            model=None, 
            envs=self.test_env
        )
        
        text = """Get personalized book picks and up-to-date news about this author."""

        # Test translation
        result = translator.do_translate(text)

        print(result)
        # Fail if the reasoning block leaked into the translation
        self.assertNotIn("<think>", result)


if __name__ == "__main__":
    unittest.main()

@awwaawwa
Collaborator

Please open a draft PR so that everyone can see your code.

@awwaawwa
Collaborator

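Also, re.sub(r"<think>.*?</think>" would also remove content that is not at the start of the response. For the regular expression, use the regex from the other PR as a reference.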
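
A minimal sketch of the concern, assuming the reasoning block always comes first in the response. The anchored pattern here is one possible fix and is not necessarily the regex used in the other PR; strip_leading_think and the sample text are hypothetical.

```python
import re

# Unanchored pattern from the issue: matches a think block anywhere in the text.
UNANCHORED = re.compile(r"<think>.*?</think>", re.DOTALL)

# Anchored variant (an assumption, not the other PR's regex): only a block
# at the very start of the response, plus surrounding whitespace, is matched.
ANCHORED = re.compile(r"^\s*<think>.*?</think>\s*", re.DOTALL)


def strip_leading_think(content: str) -> str:
    """Remove a single leading think block and leave the rest untouched."""
    return ANCHORED.sub("", content, count=1)


# Made-up response: a leading reasoning block, then a translation that
# legitimately mentions the tag as literal text.
body = "<think>plan</think>The <think>tag</think> is literal text."

print(UNANCHORED.sub("", body))   # also deletes the tag inside the translation
print(strip_leading_think(body))  # keeps the translation intact
```

The unanchored call prints "The  is literal text." while the anchored helper prints "The <think>tag</think> is literal text.".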

@highkay
Contributor Author

highkay commented Feb 20, 2025

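Also, re.sub(r"<think>.*?</think>" would also remove content that is not at the start of the response. For the regular expression, use the regex from the other PR as a reference.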

#653

@awwaawwa awwaawwa linked a pull request Feb 20, 2025 that will close this issue