Greedy Soup selects only best individual model #10
Comments
I too have had a different experience trying to get a greedy soup with higher performance. My model soup in this case consists of only the 3 best models:
from then onwards, no more models get added.
I have read the paper. As far as I can tell, the greedy soup is built sequentially, starting from the best individual model: each remaining model, in order of validation accuracy, is tentatively averaged into the current soup, and it is kept only if the averaged model's held-out performance does not drop. In other words, if no combination that includes the top model beats the top model alone, the greedy soup is just the top model by itself; otherwise, the greedy soup is the set of models, containing the top model, that this greedy procedure finds to perform best.
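In code, that procedure is roughly the following sketch (Python/NumPy; `models_sorted_by_val_acc` is a hypothetical list of weight sets sorted by validation accuracy, best first, and `evaluate_soup` a hypothetical function that loads an averaged weight set into the architecture and returns held-out accuracy):

```python
import numpy as np

def average_weights(weight_sets):
    # Element-wise average of a list of weight sets (one list of arrays per model).
    return [np.mean(layer_group, axis=0) for layer_group in zip(*weight_sets)]

def greedy_soup(models_sorted_by_val_acc, evaluate_soup):
    soup = [models_sorted_by_val_acc[0]]                # start from the best model
    best_score = evaluate_soup(average_weights(soup))   # its solo validation accuracy
    for candidate in models_sorted_by_val_acc[1:]:
        trial = soup + [candidate]
        score = evaluate_soup(average_weights(trial))
        if score >= best_score:                         # keep the candidate only if it does not hurt
            soup, best_score = trial, score
    return average_weights(soup), best_score
```

Note that this sketch uses `>=`, so a candidate that merely ties the current held-out accuracy is still added, which is one way the greedy soup can grow beyond the single best model.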
Dear @Djoels, since you are using a different codebase, a different dataset, and are experiencing a different issue, it might be best to create a new issue in this repository yourself.
@Djoels hmm.. are you getting that plot by following the steps in the repository, or by doing something custom? Because when we follow the steps, we get the figure you see here: https://github.com/mlfoundations/model-soups
@ivovdongen would love to help if possible! Can you give me some more detail about what your task is, what your network is, and how you are fine-tuning? One thing it could be is introducing new parameters when fine-tuning. We always start from a shared zero-shot model, or a model with a shared learned linear-probe head; we believe this helps. Concretely, if your model does not already have a classification head, it is good to first learn a linear probe and then initialize with that when fine-tuning (as in the LP-FT paper). When fine-tuning multiple times for model soups, it is best to start from the same learned linear-probe head.
Dear @mitchellnw, The task is to classify four different brain tumor types on MRI scans. The dataset contains 7023 samples in total. However, my experiment described above was executed on only 35% of the dataset and had a model pool of size 12. My network is created as follows:
I create random hyperparameter configurations from the grid in my initial post and train them using the following function. For callbacks, I use EarlyStopping and ReduceLROnPlateau from Keras.
I am not familiar with zero-shot or linear-probe models, but I can try to look into this. More of my code for the experiment above can be found in this Kaggle repository. I am currently running an experiment that fine-tunes 24 models instead of 12 and uses 100% of the dataset (75% for training) instead of 35%. In addition, I included the dropout_probability [0.1, 0.2, 0.3] of the dropout layer in my network as an additional param for my grid. Thanks in advance.
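A minimal sketch of that sampling-and-training loop, assuming a hypothetical `build_model(config)` that returns a compiled Keras model and placeholder `train_ds`/`val_ds` datasets (the callbacks are the standard Keras ones named above):

```python
import random
import tensorflow as tf

# Hypothetical grid for illustration; the real grid is the one in the initial post.
grid = {
    "learning_rate": [1e-5, 3e-5, 1e-4],
    "label_smoothing": [0.0, 0.1],
    "dropout_probability": [0.1, 0.2, 0.3],
}

def sample_config(grid):
    # One random value per hyperparameter.
    return {name: random.choice(values) for name, values in grid.items()}

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=5,
                                     restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2),
]

fine_tuned_models = []
for _ in range(12):                      # model pool of size 12
    config = sample_config(grid)
    model = build_model(config)          # hypothetical: builds and compiles the network
    model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=callbacks)
    fine_tuned_models.append(model)
```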
Thanks! Ok, as suspected, I think the issue in terms of model soups performance is that you're introducing new parameters. Looking at your model definition:
It seems that you have two new layers. If you run the following experiment, my guess is that things would work again:
1. First, fine-tune the model once to get a starting point, let's call this model A.
2. Now, use model A as your initialization of your future fine-tunings. Now averaging things should work.
@mitchellnw Thank you for the fast reply. I am afraid that I don't understand what you mean. "1. First, fine-tune the model once to get a starting point, let's call this model A." Do you mean that I can use the model architecture that I already had and fine-tune it once with an arbitrary hyperparameter setting? For example:
First-time fine-tuning, with arbitrary params...
Now should I save the weights and biases of the first dense layer after this first-time fine-tuning (Model A)?
"2. Now, use model A as your initialization of your future fine-tunings. Now averaging things should work."
This should load the weights and biases of the first dense layer, "layers.Dense(128, ...)", of Model A into that same layer of the new network using the function above. Or should I instead take the weights and biases of both dense layers in my network and initialize them in their corresponding layers? I would love to hear whether I understand you correctly. Unfortunately, I have no experience with PyTorch, which is why I am trying to implement this in TensorFlow/Keras. Thanks in advance.
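A minimal Keras sketch of the two-step recipe being discussed, assuming a hypothetical `build_model` that rebuilds and compiles the identical architecture, plus placeholder `train_ds`/`val_ds` datasets:

```python
import tensorflow as tf

# Step 1: fine-tune once with arbitrary hyperparameters and keep the result as model A.
model_a = build_model()                  # hypothetical constructor for the architecture
model_a.fit(train_ds, validation_data=val_ds, epochs=10)
model_a.save_weights("model_a.h5")       # saves ALL weights, including both dense layers

# Step 2: every subsequent fine-tuning run starts from model A's weights.
def fine_tune_from_model_a(config):
    model = build_model()                # identical architecture
    model.load_weights("model_a.h5")     # shared initialization: featurizer and head
    model.compile(
        optimizer=tf.keras.optimizers.Adam(config["learning_rate"]),
        loss=tf.keras.losses.CategoricalCrossentropy(
            label_smoothing=config["label_smoothing"]),
        metrics=["accuracy"])
    model.fit(train_ds, validation_data=val_ds, epochs=10)
    return model

# If only specific layers should be copied instead, per-layer weights can be moved with
# model_a.get_layer("dense_128").get_weights() and
# new_model.get_layer("dense_128").set_weights(...), where "dense_128" is a hypothetical layer name.
```

With this setup, every soup ingredient shares the same starting point, so averaging the fine-tuned weights stays meaningful.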
I'm not familiar with TF/Keras, but at a high level this looks good! Just trying to make your experimental setting more similar to what we consider. Concretely, in the paper we use LP-FT (https://arxiv.org/abs/2202.10054), so we first train a linear probe, then use [featurizer, linear_probe] as the common initialization and fine-tune end-to-end (so even changing the weights of the featurizer). Basically, the "soup starter" (= the initialization you use when you start fine-tuning with different seeds, etc.) should already be a good model. We accomplish this through LP-FT.
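In Keras terms, LP-FT might look roughly like the following sketch (assuming a hypothetical pretrained `featurizer` model, a 4-class head, a 224x224 input, and placeholder `train_ds`/`val_ds` datasets; a sketch of the idea, not the exact recipe from the paper):

```python
import tensorflow as tf

num_classes = 4

# Phase 1: linear probe. Freeze the featurizer and train only a new linear head.
featurizer.trainable = False                 # hypothetical pretrained backbone, e.g. ViT-B/32
inputs = tf.keras.Input(shape=(224, 224, 3))
features = featurizer(inputs, training=False)
outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(features)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=5)

# Phase 2: fine-tune end-to-end. Unfreeze the featurizer, recompile with a smaller
# learning rate, and train everything (so even the featurizer weights change).
featurizer.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)

# This checkpoint is the "soup starter": every soup-ingredient fine-tuning run
# (different seeds, learning rates, augmentations, ...) starts from these weights.
model.save_weights("soup_starter.h5")
```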
@mitchellnw Thanks, I will try to implement this. By the way, is the initialization model (or 'soup starter') also part of the model pool on which the soups are constructed, or is this excluded? |
Great! It should be fine to include; this is typically what we do.
Dear M. Wortsman,
I am experimenting with Model Soups for four-class brain tumor classification. I use ViT-B32 with AdamW and CategoricalCrossentropy (with label_smoothing). I randomly created 12 model configurations from the hyperparameter grid below. From my 12 models, the best and worst models have a validation accuracy of 91.964% and 84.226%, respectively. The Uniform Soup has a validation accuracy of 88.393%. My Greedy Soup, however, only includes the best individual model (i.e. no combination of weights yields accuracy > 91.964%). What can I do to have my Greedy Soup outperform the best individual model, besides creating a bigger model pool?
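For reference, the Uniform Soup here is just the element-wise average of all fine-tuned models' weights; a minimal Keras sketch of that step (with a hypothetical `build_model` constructor, and `fine_tuned_models` as a placeholder for the pool of 12 models):

```python
import numpy as np

def uniform_soup(models, build_model):
    # Average the weights of all fine-tuned models (identical architectures assumed).
    weight_sets = [m.get_weights() for m in models]
    averaged = [np.mean(layer_group, axis=0) for layer_group in zip(*weight_sets)]
    soup = build_model()          # hypothetical constructor for the same architecture
    soup.set_weights(averaged)
    return soup

# soup_model = uniform_soup(fine_tuned_models, build_model)
# soup_model.evaluate(val_ds)     # compare against the best individual model
```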
Many thanks in advance.
Where the different data augmentation intensities are defined as: