Finetuning Florence2 Token Length Limit #289
Unanswered
kevinjeswani asked this question in Q&A
Hi,
I've been trying to follow the fine-tuning notebooks below, and I'm getting stuck on a token length issue.
https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/how-to-finetune-florence-2-on-detection-dataset.ipynb?ref=blog.roboflow.com#scrollTo=zqDWEWDcaSxN
https://colab.research.google.com/drive/1Y8GVjwzBIgfmfD3ZypDX5H1JA_VG0YDL?usp=sharing
https://colab.research.google.com/drive/1hKDrJ5AH_o7I95PtZ9__VlCTNAo1Gjpf?usp=sharing
I am typically getting token lengths in the 1040-1200 range, which throws an error during training:

```
Training Epoch 1/1:   0%| | 0/10 [00:00<?, ?it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (1038 > 1024). Running this sequence through the model will result in indexing errors
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [174,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
```
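For reference, this is roughly how I'm finding the offending entries: tokenize each example's prefix + suffix and flag anything over 1024. It's just a sketch; it assumes the JSONL format from the Roboflow notebook ("image"/"prefix"/"suffix" fields) and a hypothetical `train_annotations.jsonl` path.

```python
# Sketch: flag training examples whose text exceeds Florence-2's
# 1024-token limit. Assumes the Roboflow notebook's JSONL format
# with "image", "prefix", and "suffix" fields.
import json
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base-ft", trust_remote_code=True
)

with open("train_annotations.jsonl") as f:  # hypothetical path
    for i, line in enumerate(f):
        entry = json.loads(line)
        text = entry["prefix"] + entry["suffix"]
        n_tokens = len(processor.tokenizer(text)["input_ids"])
        if n_tokens > 1024:
            print(f"line {i}: {n_tokens} tokens ({entry['image']})")
```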
I was initially trying to use two image datasets, with images capped at 960px and 1600px respectively (the originals can be up to 4000px before scaling). The images are large because they are construction drawings, and I would ideally avoid tiling them so that related objects stay in the same frame for context. The larger drawings carry considerably more annotations (up to 150 per frame), so I experimented with the smaller-image dataset, which has at most 40 annotations in a single frame. Resizing those images from 960px down to a 320px max didn't seem to change much. Capping the number of annotations per frame did work, and the model was at least able to fine-tune (a sketch of how I'm capping is below).
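I suspect resizing doesn't help because the `<loc_*>` tokens are quantized to a 0-999 grid regardless of pixel size, so the suffix length depends only on the number of annotations, not the image resolution. Here's a rough sketch of the capping, assuming each annotation in the suffix is a label followed by four `<loc_*>` tokens, as the Roboflow notebook builds it:

```python
# Sketch: truncate an entry's suffix so prefix + suffix stays under
# the 1024-token budget. Each annotation chunk is assumed to look
# like "label<loc_x1><loc_y1><loc_x2><loc_y2>".
import re

MAX_TOKENS = 1024  # leave some headroom below this for special tokens

def cap_suffix(prefix: str, suffix: str, tokenizer, budget: int = MAX_TOKENS) -> str:
    # split the suffix into per-annotation chunks
    chunks = re.findall(r"[^<]+(?:<loc_\d+>){4}", suffix)
    kept, text = [], prefix
    for chunk in chunks:
        candidate = text + chunk
        if len(tokenizer(candidate)["input_ids"]) > budget:
            break  # drop the remaining annotations
        kept.append(chunk)
        text = candidate
    return "".join(kept)
```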
Are there any suggestions on how to get around the 1024-token maximum? Would it be technically sound to keep multiple copies of each image in the dataset, with each copy annotated for only a single class, to reduce the target token length (rough sketch below)? I fear this would cause issues, both from having multiple copies of the same image and from the other trained classes being unlabeled in each copy.
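Something like this is what I have in mind; again just a sketch against the same assumed prefix/suffix format:

```python
# Sketch: split one JSONL entry into several, each carrying the
# annotations for a single class. Same assumed suffix format as above.
import re
from collections import defaultdict

def split_by_class(entry: dict) -> list[dict]:
    # capture (label, four loc tokens) pairs from the suffix
    pairs = re.findall(r"([^<]+)((?:<loc_\d+>){4})", entry["suffix"])
    by_class = defaultdict(str)
    for label, locs in pairs:
        by_class[label] += label + locs
    return [
        {"image": entry["image"], "prefix": entry["prefix"], "suffix": s}
        for s in by_class.values()
    ]
```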
Beyond detection, I actually want to train richer descriptions for each class, so the model understands the construction/engineering context of these novel classes (similar to this question: https://huggingface.co/microsoft/Florence-2-large/discussions/32). I initially tried adding two kinds of annotations to the JSONL: one line with the "OD" prefix and a second line with "DENSE_REGION_CAPTION" for each image (example below). However, if I can't even get the annotation counts to work for "OD", this definitely won't work. Any suggestions?
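For illustration, the paired lines look roughly like this (file names, class names, captions, and coordinates are made-up placeholders):

```jsonl
{"image": "sheet_001.png", "prefix": "<OD>", "suffix": "steel_beam<loc_12><loc_88><loc_410><loc_260>column<loc_500><loc_90><loc_560><loc_700>"}
{"image": "sheet_001.png", "prefix": "<DENSE_REGION_CAPTION>", "suffix": "wide-flange steel beam spanning two gridlines<loc_12><loc_88><loc_410><loc_260>concrete column at grid intersection<loc_500><loc_90><loc_560><loc_700>"}
```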