
Unable to reproduce the PhoMT results with the HuggingFace Model #2

Closed

justinphan3110 opened this issue Oct 10, 2022 · 5 comments

@justinphan3110 commented Oct 10, 2022

Hi, I'm trying to reproduce the En2Vi result reported in the paper on the PhoMT test set.
I used the generation setup shown in the example:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = 'vinai/vinai-translate-en2vi'

tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="en_XX")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
.....
outputs = model.generate(
    input_ids=batch['input_ids'].to('cuda'),
    max_length=max_target_length,
    do_sample=True,
    top_k=100,
    top_p=0.8,
    decoder_start_token_id=tokenizer.lang_code_to_id["vi_VN"],
    num_return_sequences=1,
)

Yet, the test-set result I got from the HuggingFace model was around 42.2 (the result reported in the paper is 44.29).

Do you plan to release the eval code/pipeline to reproduce the result discussed in the paper?

@datquocnguyen (Collaborator)

Are you using sacreBLEU?

@justinphan3110 (Author)

I'm using the sacreBLEU metric from HuggingFace. Is this different from the sacreBLEU you used in the paper? If so, can you share the command line you used with sacreBLEU?

@datquocnguyen (Collaborator) commented Oct 10, 2022

Our training and inference stages (an example below) were originally performed using fairseq. We then computed the detokenized, case-sensitive BLEU score using SacreBLEU (with the signature “BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.5.1”).
The HuggingFace versions are variants converted from our original fairseq models, so I am not sure what causes the differences in scores between the two libraries at the moment.

SOURCE_LANG=vi_VN
TARGET_LANG=en_XX
LANGS=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN

fairseq-generate $DATA_DIR \
    --path $MODEL_DIR/checkpoint_best.pt \
    --task translation_from_pretrained_bart \
    --gen-subset valid \
    -t $TARGET_LANG -s $SOURCE_LANG \
    --bpe 'sentencepiece' --sentencepiece-model $MODEL_DIR/sentence.bpe.model \
    --sacrebleu --remove-bpe 'sentencepiece' \
    --batch-size 32 --langs $LANGS > vi_en

cp $SOURCE_DATA_DIR/val_tourism_finance.en_XX vi_en.ref
#cp $SOURCE_DATA_DIR/test_tourism.en_XX vi_en.ref

cat vi_en | grep -P "^H" | sort -V | cut -f 3- | sed 's/\[en_XX\]//g' > vi_en.hyp
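
For completeness, a minimal sketch of computing that score with the sacrebleu Python package, assuming vi_en.hyp and vi_en.ref from the commands above (the package defaults of mixed case, exp smoothing, and the 13a tokenizer correspond to the signature quoted earlier):

# Minimal scoring sketch, assuming vi_en.hyp and vi_en.ref each hold one
# detokenized sentence per line, as produced by the commands above.
import sacrebleu

with open("vi_en.hyp") as f:
    hyps = [line.strip() for line in f]
with open("vi_en.ref") as f:
    refs = [line.strip() for line in f]

# corpus_bleu takes the hypotheses plus a list of reference streams;
# its defaults correspond to case.mixed, smooth.exp, and tok.13a.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(bleu.score)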

@datquocnguyen (Collaborator) commented Oct 30, 2023

@justinphan3110 I just had a bit of time to redo the evaluation. Using the simple script below, you'd obtain a sacreBLEU score of 44.2.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer_en2vi = AutoTokenizer.from_pretrained(
    "vinai/vinai-translate-en2vi", src_lang="en_XX"
)
model_en2vi = AutoModelForSeq2SeqLM.from_pretrained("vinai/vinai-translate-en2vi")
device_en2vi = torch.device("cuda")
model_en2vi.to(device_en2vi)

def translate_en2vi(en_texts: list[str]) -> list[str]:
    # Tokenize a batch of English sentences (input_ids plus attention_mask).
    inputs = tokenizer_en2vi(en_texts, padding=True, return_tensors="pt").to(
        device_en2vi
    )
    # Deterministic beam-search decoding into Vietnamese.
    output_ids = model_en2vi.generate(
        **inputs,
        decoder_start_token_id=tokenizer_en2vi.lang_code_to_id["vi_VN"],
        num_return_sequences=1,
        num_beams=5,
        early_stopping=True,
    )
    vi_texts = tokenizer_en2vi.batch_decode(output_ids, skip_special_tokens=True)
    return vi_texts

with open("PhoMT-detokenization-test/test.en", "r") as input_file:
    lines = [line.strip() for line in input_file.readlines()]
    index = 0
    writer = open("PhoMT-detokenization-test/test.vi_generated.v1", "w")
    while index < len(lines):
        texts = lines[index : index + 8]
        outputs = translate_en2vi(texts)
        print(outputs)
        for output in outputs:
            writer.write(output.strip() + "\n")
        index = index + 8
    writer.close()
    
import evaluate
references = [[line.strip()] for line in open("PhoMT-detokenization-test/test.vi", "r").readlines()]
predictions = [
    line.strip() for line in open("PhoMT-detokenization-test/test.vi_generated.v1", "r").readlines()
]
sacrebleu = evaluate.load("sacrebleu")
results = sacrebleu.compute(predictions=predictions, references=references)
print(results)
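
Note that this script decodes deterministically with beam search (num_beams=5), whereas the generation call in the first post uses sampling (do_sample=True, top_k=100, top_p=0.8); sampling is stochastic and typically scores lower on BLEU, which likely explains much of the 42.2 vs. 44.2 gap.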

@datquocnguyen (Collaborator)

Evaluation for VietAI/envit5-translation:

import torch
import evaluate
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("VietAI/envit5-translation")
model = AutoModelForSeq2SeqLM.from_pretrained("VietAI/envit5-translation")
device = torch.device("cuda")
model.to(device)


def translate(texts: list[str]) -> list[str]:
    # envit5 expects each input prefixed with "en: " or "vi: ";
    # the callers below prepend the prefixes.
    inputs = tokenizer(texts, padding=True, return_tensors="pt").to(device)
    output_ids = model.generate(
        **inputs,
        num_return_sequences=1,
        num_beams=5,
        early_stopping=True,
        max_length=512,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)


with open("PhoMT-detokenization-test/test.vi", "r") as input_file:
    lines = ["vi: " + line.strip() for line in input_file.readlines()]
    index = 0
    writer = open("PhoMT-detokenization-test/test.en_generated.vietai", "w")
    while index < len(lines):
        texts = lines[index : index + 8]
        outputs = translate(texts)
        print(outputs)
        for output in outputs:
            writer.write(output[4:].strip() + "\n")

        index = index + 8

    writer.close()

with open("PhoMT-detokenization-test/test.en", "r") as input_file:
    lines = ["en: " + line.strip() for line in input_file.readlines()]
    index = 0
    writer = open("PhoMT-detokenization-test/test.vi_generated.vietai", "w")
    while index < len(lines):
        texts = lines[index : index + 8]
        outputs = translate(texts)
        print(outputs)
        for output in outputs:
            writer.write(output[4:].strip() + "\n")

        index = index + 8

    writer.close()
    
# Score the vi -> en translations.
references = [
    [line.strip()]
    for line in open("PhoMT-detokenization-test/test.en", "r").readlines()
]
predictions = [
    line.strip()
    for line in open(
        "PhoMT-detokenization-test/test.en_generated.vietai", "r"
    ).readlines()
]
sacrebleu = evaluate.load("sacrebleu")
results = sacrebleu.compute(predictions=predictions, references=references)
print(results)

# Score the en -> vi translations.
references = [
    [line.strip()]
    for line in open("PhoMT-detokenization-test/test.vi", "r").readlines()
]
predictions = [
    line.strip()
    for line in open(
        "PhoMT-detokenization-test/test.vi_generated.vietai", "r"
    ).readlines()
]
sacrebleu = evaluate.load("sacrebleu")
results = sacrebleu.compute(predictions=predictions, references=references)
print(results)
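
Regarding the earlier question about metric differences: the evaluate "sacrebleu" metric wraps the sacrebleu package, and to the best of my knowledge its defaults match the signature quoted above. A minimal sketch that spells those options out explicitly (the sentences are placeholders):

import evaluate

sacrebleu = evaluate.load("sacrebleu")
results = sacrebleu.compute(
    predictions=["Xin chào thế giới ."],   # placeholder hypothesis
    references=[["Xin chào thế giới ."]],  # one list of references per hypothesis
    tokenize="13a",        # default tokenizer, i.e. tok.13a in the signature
    lowercase=False,       # case-sensitive, i.e. case.mixed
    smooth_method="exp",   # i.e. smooth.exp
)
print(results["score"])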
