Sparsemax not actually used in COMET-KIWI, XCOMET-XL/XXL #195
Comments
Follow-up on that... I am also wondering whether you realized that Roberta-XL and Roberta-XXL are pre-norm, while the base model you used for Comet-KIWI is post-norm, yet you treat them the same during training and inference. The Hugging Face implementation collects the hidden states without normalization for the XL models, with the exception of the very last hidden state, which is normed. That seems to mean that the hidden states you use for your layer mixing have wildly different magnitudes across layers: the first and the last one (the most important one?) have very small norms, while the ones in between are unnormed. I am wondering whether that gave you a really hard time when training the xComet-XXL models, and whether it skews the weighting during layer mixing?
@emjotde nothing like a re-implementation challenge to find bugs 😄... I just confirmed and you are right. It's defaulting to softmax instead of sparsemax.
>>> from comet import download_model, load_from_checkpoint
>>> model = load_from_checkpoint(download_model("Unbabel/wmt23-cometkiwi-da-xxl"))
>>> model.layerwise_attention.transform_fn
<built-in method softmax of type object at 0x7fda5cbd2460>
>>> model.layerwise_attention.layer_norm
False
Same thing for the XCOMET models. Regarding Roberta-XL and XXL, I realised the change from post-norm to pre-norm, but I did not realise the impact on the embeddings returned from HF. Actually, HF took a long, long time to integrate Roberta-XL/XXL because of this issue... but I never inspected the magnitudes across layers. Btw, the rationale for using sparsemax instead of softmax was not performance-related. Our goal when integrating sparsemax was to study whether all layers are relevant or not. The performance of sparsemax and softmax is usually the same. Yet for wmt22-comet-da, because of sparsemax, we can clearly observe which layers are relevant, e.g.:
>>> import torch
>>> model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
>>> weights = torch.cat([parameter for parameter in model.layerwise_attention.scalar_parameters])
>>> normed_weights = model.layerwise_attention.transform_fn(weights, dim=0)
>>> normed_weights
tensor([0.0849, 0.0738, 0.0504, 0.0463, 0.0166, 0.0125, 0.0103, 0.0027, 0.0000,
0.0000, 0.0007, 0.0088, 0.0151, 0.0463, 0.0591, 0.0466, 0.0516, 0.0552,
0.0581, 0.0621, 0.0666, 0.0609, 0.0621, 0.0645, 0.0448],
grad_fn=<SparsemaxFunctionBackward>)
Here we can see that some layers are set to 0 and thus ignored. This provides some degree of interpretability... Ideally, the model would ignore the top layers and we could prune them after training (unfortunately this usually does not happen). With XCOMET, the learned weights are all very similar... but, like you said, probably because of the different norms?
>>> model = load_from_checkpoint(download_model("Unbabel/XCOMET-XL"))
>>> weights = torch.cat([parameter for parameter in model.layerwise_attention.scalar_parameters])
>>> normed_weights = model.layerwise_attention.transform_fn(weights, dim=0)
>>> normed_weights
tensor([0.0285, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267,
0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0268, 0.0268,
0.0268, 0.0268, 0.0268, 0.0269, 0.0270, 0.0271, 0.0271, 0.0272, 0.0273,
0.0273, 0.0273, 0.0273, 0.0273, 0.0273, 0.0273, 0.0273, 0.0273, 0.0272,
0.0287], grad_fn=<SoftmaxBackward0>)
Also, not sure if you noticed, but we only use the layerwise attention for creating the sentence embeddings that are used for regression. The embeddings used for classifying the individual tokens as error spans are those from the word_layer (model.hparams.word_layer). We have not played a lot with this hyperparameter, but our goal was to make an individual layer more specialised on that task (usually a top layer, because it's closer to the MLM objective), while for regression we would like to pool information from all layers.
It did not... I was actually surprised, but the training was very stable from the get-go... I had some issues with distributed training and pytorch-lightning and ended up implementing something without lightning, but after that was done, training was smooth.
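For readers following the softmax vs. sparsemax point, here is a minimal pure-Python sketch of the two transforms. It assumes the standard definition of sparsemax as the Euclidean projection onto the probability simplex; the function names and the toy scores are illustrative and are not taken from the COMET code base.

```python
import math


def softmax(scores):
    """Standard softmax: every output is strictly positive."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Euclidean projection of `scores` onto the probability simplex."""
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        # The k-th largest score is in the support while
        # 1 + k * z_(k) > sum of the k largest scores.
        if 1.0 + k * value > cumulative:
            tau = (cumulative - 1.0) / k  # threshold for the current support size
    # Everything at or below the threshold is clipped to exactly zero.
    return [max(s - tau, 0.0) for s in scores]


# Toy layer scores (made-up numbers, not learned COMET parameters).
toy_scores = [1.2, 0.9, 0.1, -0.5]
soft_weights = softmax(toy_scores)
sparse_weights = sparsemax(toy_scores)
print("softmax  :", soft_weights)
print("sparsemax:", sparse_weights)
```

Both outputs sum to 1, but only sparsemax can assign exactly zero weight to a layer, which is what makes the zeroed entries in the wmt22-comet-da weights readable as "this layer is ignored".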
Yeah, I am currently not looking at the word-level predictions yet; I stopped at the regressor implementation. Regarding the weights above, the fact that they are near-uniform after softmax, despite the fact that the norms of the hidden states are so different, is what made me wonder whether proper learning happens or rather some form of saturation (always hard to tell with those neural models). I would have expected the model to strongly push down the weights for the layers with high norms. On the other hand, if this basically becomes an unweighted arithmetic average, then the two very small vectors pull everything down by a lot, considering that averages reward outliers. Who knows...
It's the black magic art of NN 🙂
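To make the magnitude concern concrete, here is a small pure-Python sketch with invented toy vectors (not real Roberta-XL hidden states). It mixes four "layers" with uniform weights and uses `weight * norm` as a rough proxy for how much each layer contributes to the mix. It also shows, as an assumption about one possible mitigation and not as a description of what COMET does, that normalising each layer first puts all layers on the same scale.

```python
import math


def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def mix_layers(hidden_states, weights):
    """Weighted sum of per-layer vectors, as in a scalar layer mix."""
    dim = len(hidden_states[0])
    return [sum(w * h[i] for w, h in zip(weights, hidden_states)) for i in range(dim)]


def layer_normalize(vec, eps=1e-12):
    """Rescale a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


# Invented toy data: small-norm first and last layer, large-norm middle layers.
layers = [
    [0.1, -0.2, 0.1, 0.0],
    [8.0, -6.0, 4.0, -6.0],
    [-9.0, 7.0, 5.0, -3.0],
    [0.2, 0.1, -0.1, -0.2],
]
uniform = [1.0 / len(layers)] * len(layers)

# Rough contribution proxy: weight times norm, renormalised to sum to 1.
raw = [w * l2_norm(h) for w, h in zip(uniform, layers)]
shares = [r / sum(raw) for r in raw]

mixed = mix_layers(layers, uniform)
normalized_norms = [l2_norm(layer_normalize(h)) for h in layers]

print("contribution shares:", [round(s, 3) for s in shares])
print("norm of mixed vector:", round(l2_norm(mixed), 3))
print("norms after per-layer normalisation:", [round(n, 3) for n in normalized_norms])
```

In this toy setting, near-uniform weights do not mean near-uniform influence: the two large-norm layers account for almost all of the mix, and the small-norm first and last layers barely register unless the layers are rescaled first.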
Hi,
I have been playing around with re-implementing some of your models in Marian and while progressing through the code I noticed that you are not actually using sparsemax for Comet-KIWI and Comet-XL/XXL, instead you are falling back to a softmax.
In both cases you forgot to pass the layer_transformation parameter to its base class.
See here for UnifiedMetric: COMET/comet/models/multitask/unified_metric.py, line 106 in 2bcf666
and here for XCOMETMetric: COMET/comet/models/multitask/xcomet_metric.py, line 54 in 2bcf666
In both cases the layer_transformation parameter does not appear in the argument list passed to the base class, but the base class has softmax as the default.
In my re-implementation I am reproducing your exact numbers for Comet-KIWI with a softmax, not the sparsemax, while the sparsemax works fine for the reference-based COMET-22.
It's not clear to me if the model was trained with a softmax or sparsemax, but you might either have a train/inference mismatch here or at the very least your models are doing something different than you expected/described.
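The failure mode described above can be sketched in a few lines. The class names and constructor signatures below are hypothetical stand-ins, not the actual COMET classes; the sketch only shows how an argument that is accepted by a subclass but not forwarded to the base class silently falls back to the base-class default.

```python
class BaseMetric:
    """Hypothetical base class whose default transformation is softmax."""

    def __init__(self, layer="mix", layer_transformation="softmax"):
        self.layer = layer
        self.layer_transformation = layer_transformation


class BuggyMetric(BaseMetric):
    """Accepts layer_transformation but never forwards it."""

    def __init__(self, layer="mix", layer_transformation="sparsemax"):
        # layer_transformation is dropped here, so the base default wins.
        super().__init__(layer=layer)


class FixedMetric(BaseMetric):
    """Forwards layer_transformation to the base class."""

    def __init__(self, layer="mix", layer_transformation="sparsemax"):
        super().__init__(layer=layer, layer_transformation=layer_transformation)


buggy = BuggyMetric(layer_transformation="sparsemax")
fixed = FixedMetric(layer_transformation="sparsemax")
print("buggy:", buggy.layer_transformation)
print("fixed:", fixed.layer_transformation)
```

No error is raised in the buggy variant, which is why this kind of mismatch tends to surface only when someone re-implements the model and compares numbers.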