Addressing issue #753 by adding feature mapping for the Fourier Encoder #759
base: main
Conversation
@mobra7 thank you for giving this a go.

I think we should consider generalizing the `FourierEncoder` even more than what you propose here. In its current implementation, the method does quite a bit more than just sinusoidal embedding:

1. Sinusoidal embedding of the features and of the unpadded sequence length
2. Embedding of the boolean "auxiliary" flag into a vector representation
3. MLP projection of 1. and 2.

In this refactoring, we can consider 2. and 3. to be model- and problem-specific treatments of the embeddings. If we move this functionality inside the model itself instead of having it in the `FourierEncoder`, it becomes straightforward to re-use the `FourierEncoder` in different problems. Here is an example:
```python
# Imports are assumed; paths follow the existing graphnet code base.
from typing import Dict, Optional

import torch
from torch import Tensor
from pytorch_lightning import LightningModule

from graphnet.models.components.embedding import SinusoidalPosEmb


class FourierEncoder(LightningModule):
    """Apply sinusoidal positional encodings to sequence representations.

    This method will embed [B,K,D]-dimensional sequences, where B is the
    batch size, K is the sequence length and D is the number of features
    for each step, into sinusoidal positions.
    """

    def __init__(
        self,
        schema: Dict[int, int],
        seq_length: int = 128,
        scaled: bool = False,
        add_sequence_length: bool = True,
    ):
        """Construct `FourierEncoder`.

        Args:
            schema: A dictionary that maps an input column index to an
                embedding frequency. E.g. `schema = {1: 1024, 3: 4012}`.
            seq_length: Desired dimensionality of the base sinusoidal
                positional embeddings.
            scaled: Whether or not to scale the embeddings.
            add_sequence_length: If True, the unpadded length of each
                sequence is also embedded. Defaults to True.
        """
        super().__init__()
        # Member variables
        self.schema = schema
        self._add_sequence_length = add_sequence_length
        self.sin_feature = SinusoidalPosEmb(dim=seq_length, scaled=scaled)
        if add_sequence_length:
            self.sin_length = SinusoidalPosEmb(
                dim=seq_length // 2, scaled=scaled
            )

    def forward(
        self,
        x: Tensor,
        seq_length: Optional[Tensor] = None,
    ) -> Tensor:
        """Apply positional encoding to `x`.

        `x` is assumed to be a [B,K,D]-dimensional array, where B is the
        batch size, K is the padded sequence length and D is the number of
        time-step features.

        Args:
            x: [B,K,D]-dimensional sequence representation of an event.
            seq_length: Unpadded length of each sequence in `x`.
                Defaults to None.

        Returns:
            Embedded [B,K,J]-dimensional sequence `x`.
        """
        embeddings = []
        # Embed features
        for col in self.schema.keys():
            embeddings.append(self.sin_feature(self.schema[col] * x[:, :, col]))
        # Embed sequence length if specified
        if self._add_sequence_length:
            if seq_length is not None:
                length = torch.log10(seq_length.to(dtype=x.dtype))
                emb_length = self.sin_length(length).unsqueeze(1)
                embeddings.append(emb_length.expand(-1, max(seq_length), -1))
            else:
                raise ValueError(
                    "Must pass `seq_length` when `add_sequence_length` = True."
                )
        return torch.cat(embeddings, -1)
```
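For illustration, a minimal usage sketch of this proposed class (the shapes and schema values below are made up, not taken from the existing code base):

```python
import torch

# Hypothetical usage: embed columns 1 and 3 of a padded [B,K,D] batch.
encoder = FourierEncoder(schema={1: 1024, 3: 4012}, seq_length=128)

x = torch.rand(8, 192, 6)                 # [B=8, K=192, D=6] padded sequences
seq_length = torch.randint(1, 193, (8,))  # unpadded length of each sequence
seq_length[0] = 192                       # padded length K equals the longest sequence

# Two feature embeddings of dim 128 plus a length embedding of dim 64 -> J = 320.
out = encoder(x, seq_length)              # shape [8, 192, 320]
```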
By introducing the argument `schema = {1: 1024, 3: 4012}`, we can give the user complete freedom over which columns of the sequences are embedded and how. Notice here that the embedding of boolean values and the MLP projection are not included. Instead, we can place these in the first few lines of the forward pass of `DeepIce`:
```python
def forward(self, data: Data) -> Tensor:
    """Apply learnable forward pass."""
    x0, mask, seq_length = array_to_sequence(
        data.x, data.batch, padding_value=0
    )
    # Here `fourier_ext` is the new, proposed version without boolean embedding and MLP
    x = self.fourier_ext(x0, seq_length)
    # New MLP layer
    x = self._sinusoidal_mlp(x)
```
Notice that the boolean embedding is not included here either. I think we should omit it in the refactor, as I struggle to see the benefit of projecting an integer into a 64-dimensional vector (the default value, I think).
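To make the `_sinusoidal_mlp` idea concrete, a rough sketch of the projection that would move out of `FourierEncoder` and into `DeepIce.__init__` could look as follows (the layer sizes and structure are placeholders, not part of the actual proposal):

```python
import torch
from torch import nn

# Hypothetical sketch: the MLP projection built in DeepIce.__init__ rather
# than inside FourierEncoder. All sizes below are illustrative placeholders.
emb_dim = 320      # output dimension J of the refactored FourierEncoder
hidden_dim = 384   # placeholder hidden size of the model

sinusoidal_mlp = nn.Sequential(
    nn.Linear(emb_dim, 2 * hidden_dim),
    nn.GELU(),
    nn.Linear(2 * hidden_dim, hidden_dim),
)

# Example: project a [B,K,emb_dim] batch of embeddings to [B,K,hidden_dim].
x = torch.rand(8, 192, emb_dim)
x = sinusoidal_mlp(x)
```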
So in summary, I would propose that we:

- Rename the existing `FourierEncoder` and `DeepIce` methods to `FourierEncoderEPJC` and `DeepIceEPJC`, and mention in their docstrings that these are the graphnet implementations of the Kaggle solutions as presented in the EPJ-C publication, and that they contain assumptions that make them hard to use outside the Kaggle dataset.
- Introduce a refactored version of `FourierEncoder` and `DeepIce` similar to what I proposed above (feel free to deviate a bit), and mention in their respective docstrings that these are versions of the original Kaggle solutions with superficial modifications that make them easier to use outside the scope of the Kaggle competition.
- Have the example training script `examples/06_train_icemix_model.py` use the new, refactored version of the models.
@mobra7 what do you think?
Fix for issue #753:

- `FourierEncoder` (which is used in the IceMix model) previously assumed a specific structure for the feature dimension, which caused issues with certain datasets (e.g. using standard `FEATURES.ICECUBE86`).
- `FourierEncoder` now accepts a simple list of `int` or `None` to map the corresponding features of the data to the features used in `FourierEncoder`.
- In the `DeepIce` model, specifying `n_features` is no longer required.

Minor fixes:

- `IceMixNodes`.
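As a rough illustration of the described behaviour, such a mapping might be passed along these lines (the keyword name and values below are hypothetical; see the diff for the exact signature introduced by this PR):

```python
# Hypothetical illustration only; the actual argument name added by this PR
# may differ. A list of data-column indices, with `None` marking features
# that are not present in the dataset, as described above.
mapping = [0, 1, 2, 3, None, None, 4]

encoder = FourierEncoder(seq_length=128, mapping=mapping)
```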