-
Notifications
You must be signed in to change notification settings - Fork 490
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Why do we not set the ignore_index
of FusedCrossEntropy
to bos_id
?
#83
Comments
Yes I think you are right here. TinyLlama/scripts/prepare_slimpajama.py Line 85 in c53075b
If you have 64 CPU cores. prepare_slimpajama.py will initiate 64 processes, each with a PackedDatasetBuilder and call TinyLlama/scripts/prepare_slimpajama.py Line 26 in c53075b
TinyLlama/scripts/prepare_slimpajama.py Line 69 in c53075b
That means each process will leave a chunk file that is not fully filled with text tokens but rather with some sep tokens. TinyLlama/lit_gpt/packed_dataset.py Line 77 in c53075b
TinyLlama/lit_gpt/packed_dataset.py Line 91 in c53075b
So we may have 64 files with some sep tokens remaining.
I think it is because 64 files with some sep tokens remaining is a relatively small portion compared with the entire pretraining corpus(450k small bin files after processing), especially when you consider the small size of each chunk. So I do not know why your second checkpoint became really bad. Maybe it is just this specific prompt? Does the benchmark performance degrade significantly? These are some of my preliminary thoughts. Haven't looked very deeply into it yet. Thanks for spotting this out. We will fix it soon. For example, we can opt to not call builder.write_reminder() at all. |
Thanks for your reply Peiyuan!
It's true that the % of files with some bos token remaining is relatively small. The chunk size is actually quite big (i.e. TinyLlama/scripts/prepare_slimpajama.py Line 76 in 072536c
But you are right that as the % of files is relatively small, it shouldn't affect. I'll let you know if I managed to fix the bug. Thanks for your help anyway!
It degraded significantly across the instruct-eval benchmark. |
Hi,
Can I check why did you not ignore the loss for the bos token (
<s>
)?TinyLlama/pretrain/tinyllama.py
Line 197 in c53075b
I noticed that the preprocessing causes the remainder of the binary file to be the bos token (
<s>
).TinyLlama/scripts/prepare_slimpajama.py
Line 69 in c53075b
Consequently, my model checkpoint (not TinyLlama's) outputs poor qualitative results:
Interestingly, my model of previous checkpoint (100b tokens before) performed okay.
I'm trying to fix this by specifying the loss function to ignore the
<s>
idx (i.e. 1). I think this is a correct fix, but i'm not sure if it fixes the underlying issue (the issue should have plagued our model from the start, why did it only happen at this iter step?).The text was updated successfully, but these errors were encountered: