I want to fine-tune a complete text encoder model, but it seems that the model trained by ft-B-train-OpenAI-CLIP-ViT-L-14.py is a visual encoder model. #16
Comments
The fine-tune is actually a text-vision model, consisting of a text transformer AND a vision transformer. For the "TE only" / text encoder only models on my HuggingFace, I fine-tuned the entire CLIP model (text + vision) and then simply "detached" the vision transformer (i.e. deleted the keys / associated parameters).

CLIP's objective is in the name - Contrastive Language-Image Pretraining. Learning both text and image, optimizing for a high dot product for matching pairs vs. a low one for negative examples, is the optimization goal. By definition, it needs both image and text to be a "CLIP".

So, the question is - what are you trying to achieve? Or do you mean that you only want to train the text encoder, with a frozen visual encoder (no parameter updates)? In that case: The vision transformer is
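The "detach" step described above might look roughly like this - a minimal sketch assuming an OpenAI-format CLIP checkpoint where all vision-tower parameters are stored under the visual. prefix; the file names are placeholders, not scripts from this repo:

```python
# Minimal sketch: strip the vision tower from a fine-tuned OpenAI-format CLIP
# checkpoint, keeping only the text encoder. Paths/filenames are placeholders.
import torch
from safetensors.torch import save_file

ckpt = torch.load("clip-finetune-final.pt", map_location="cpu")  # hypothetical filename
state_dict = ckpt.state_dict() if hasattr(ckpt, "state_dict") else ckpt

# Drop every key under the vision transformer ("visual.*"); what remains is the
# text side: token_embedding, positional_embedding, transformer.*, ln_final,
# text_projection, logit_scale.
text_only = {k: v.contiguous() for k, v in state_dict.items() if not k.startswith("visual.")}

save_file(text_only, "ViT-L-14-text-encoder-only.safetensors")
```

Note that HuggingFace-format text encoders (as loaded by Diffusers/Flux pipelines) use different key names (e.g. text_model.*), so depending on the loader an additional key-renaming / conversion step may be needed.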
For the "TE only" / text encoder only models on my HuggingFace, I fine-tuned the entire CLIP model (text + vision) and then simply "detached" the vision transformer (i.e. delete the keys / associated parameters). Can you please give me the code for this, I want to use it with the flux model, I tested the text only encoder model you provided on HF and it works with the flux model, and now I want to train the CLIP model as a multi-lingual model, but I am not familiar with the steps to "separate" the vision transformer. I would like your help, thank you very much. |
I just committed
Thank you very much. I think the code you provided is what I want, but I encountered some problems when converting; the error message is below. I would like to ask if you have encountered the same problem. I am running several of your training programs separately and will try each one:
Can you open
I can't reproduce your error, but somebody else reported the same; I am assuming it might be related to the venv / conda and trying to load a torch.jit scripted archive. I don't use a venv. However, torch.jit is just for "interoperability, speed and production environments", so it's not needed, and we can just put the map_location on CPU in any case. If that doesn't work, here's my other random guess at a fix (as I can't reproduce the problem):
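For reference, the workaround described above (map everything to CPU, and only fall back to torch.jit when the checkpoint actually is a TorchScript archive) could look roughly like this - a sketch with placeholder paths, not code from the repo:

```python
# Sketch: load a checkpoint onto CPU regardless of the device it was saved from.
# If the file turns out to be a torch.jit scripted/traced archive, fall back to
# torch.jit.load, which also accepts map_location.
import torch

ckpt_path = "clip-ft-checkpoint.pt"  # placeholder filename

try:
    # weights_only=False is needed on newer PyTorch versions to unpickle a full model object
    model = torch.load(ckpt_path, map_location="cpu", weights_only=False)
except RuntimeError:
    # torch.load refuses TorchScript archives and points to torch.jit.load instead
    model = torch.jit.load(ckpt_path, map_location="cpu")
```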
I found the cause of this error discussed in other forums and tried to fix the problem that way. It worked, but I'm not sure whether that was the decisive factor.
Thank you for the suggestion, and glad you got it to work! I'll try it and consider implementing it as a Bool switch with my next update - True if you want to script the model, otherwise save with a normal torch.save. 👍
I updated the code with a new model saver; you can now choose to either save as GmP (legacy behavior) or directly convert back to .weight (original OpenAI/CLIP; no extra conversion script needed anymore!). Plus, you can save the model as 1. a full model object (legacy behavior), 2. a state_dict, or 3. a torch.jit.trace() - or all of those combined. Hope it's useful to you! 👍
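To illustrate those three output formats (this is not the repository's actual saver code - the function name and paths are made up, and the GmP-to-.weight conversion itself is not shown), a saver could be sketched roughly like this:

```python
# Sketch of a checkpoint saver offering the three output formats mentioned above.
import torch

def save_all_formats(model, example_image, example_text, out_prefix="clip-ft"):
    model.eval()

    # 1. Full model object (legacy behavior): easiest to load back, but ties the
    #    checkpoint to the exact class code used at save time.
    torch.save(model, f"{out_prefix}-full.pt")

    # 2. state_dict: weights only, the most portable option across environments.
    torch.save(model.state_dict(), f"{out_prefix}-state_dict.pt")

    # 3. torch.jit.trace: a traced TorchScript archive for deployment;
    #    requires example inputs (an image batch and tokenized text).
    traced = torch.jit.trace(model, (example_image, example_text))
    torch.jit.save(traced, f"{out_prefix}-jit.pt")
```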
First of all, thank you for your work. I have a question for you.
I want to fine-tune a complete text encoder model, but it seems that the model trained by ft-B-train-OpenAI-CLIP-ViT-L-14.py is a visual encoder model. How can I get a pure text encoder model like the ViT-L-14-TEXT-detail-improved-hiT-GmP-HF.safetensors you provide on your HF?