Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Bug in BPE algorithm #318

Closed
xbelonogov opened this issue Apr 17, 2019 · 5 comments
Closed

Bug in BPE algorithm #318

xbelonogov opened this issue Apr 17, 2019 · 5 comments
Labels

Comments

@xbelonogov
Copy link

xbelonogov commented Apr 17, 2019

I think the BPE algorithm is not working properly. This code snippet reproduces the bug.

import sentencepiece as spm

vocab_size= 9
model_prefix = 'model'
train_data_file = 'corpus.txt'
text = "bc a aaa"

with open(train_data_file, 'w') as fout:
    fout.write(text)

spm.SentencePieceTrainer.Train(f'--input={train_data_file} --model_prefix={model_prefix} --vocab_size={vocab_size} --model_type=bpe')

sp = spm.SentencePieceProcessor();
sp.Load(model_prefix + '.model')
for i in range(vocab_size):
    s = sp.IdToPiece(i)
    s = s.replace(chr(9601), '_')
    print(f'i: {i} piece: {s}') 

The input for BPE is "bc a aaa"
I got the following output.

i: 0 piece: <unk>
i: 1 piece: <s>
i: 2 piece: </s>
i: 3 piece: _a
i: 4 piece: bc
i: 5 piece: a
i: 6 piece: _
i: 7 piece: b
i: 8 piece: c

The first merged pair should be _ + a = _a and it's correct. (i: 3 piece: _a)
After that the second pair should be a + a = aa but algorithm produced pair bc. (i: 4 piece: bc)
I used the debug output and found out that at the second iteration here symbol->freq for aa is 0 and symbol->freq for bc is 1.

It happened because logic in this if statement is incorrect. You can't simply remove this positions.

@taku910
Copy link
Collaborator

taku910 commented Apr 24, 2019

Thank you for the report.

Do you think we can fix this issue just by removing the condition in the if statement?
Let me run large test to make sure it biring no side effects. (In real corpus, I believe that it will not have huge different)

@tombosc
Copy link

tombosc commented Apr 22, 2020

@xbelonogov , I agree that after the first merge, the count for "aa" should be 1. Indeed, if we note "X=aa", the corpus becomes "bcXXaa". Then both "bc" and "aa" have a count of 1, in fact, all the bigrams occur only once. So why would you prefer "aa" to be merged instead of "bc" as they should have the same counts? Thanks.

@taku910 taku910 added the bug label Jan 10, 2021
@xiefangqi
Copy link
Contributor

xiefangqi commented Dec 7, 2021

I see that the v0.1.96 output is still "bc", not "aa".
I put some logs to watch the process,
the first iteration:
symbol: ▁a freq: 2
symbol: aa freq: 1
symbol: ▁b freq: 1
symbol: bc freq: 1

▁a put into the final_pieces_, and the freq of "aa" reset to 0.

the second iteration:
symbol: aa freq: 0
symbol: ▁b freq: 1
symbol: bc freq: 1
symbol: _aa freq: 1

sentencepiece put "bc" into the final_pieces.

Is there a problem with this process?

@taku910
Copy link
Collaborator

taku910 commented Apr 24, 2023

Sorry for the late response. This bug is going to be fixed in the next release.

@taku910
Copy link
Collaborator

taku910 commented May 2, 2023

@taku910 taku910 closed this as completed May 2, 2023
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants