
Addressing issue #753 by adding feature mapping for the Fourier Encoder #759

Open · wants to merge 13 commits into main

Conversation


@mobra7 commented Oct 24, 2024

Fix for issue #753:

  • FourierEncoder (which is used in the IceMix model) previously assumed a specific structure for the feature dimension, which caused issues with certain datasets (e.g. when using the standard FEATURES.ICECUBE86).
  • FourierEncoder now accepts a simple list of int or None entries that maps the features of the data to the features used in FourierEncoder.
  • The Fourier mapping can be passed during initialization of the DeepIce model (see the sketch below). Specifying n_features is no longer required.
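
A hypothetical call could look like the following sketch. The keyword name fourier_mapping and the feature layout are illustrative assumptions, not necessarily the exact API in this PR:

# Illustrative sketch only: the keyword name and mapping below are assumed.
# Each list entry maps an encoder feature to a column of the input data;
# `None` marks encoder features that the dataset does not provide.
model = DeepIce(
    hidden_dim=768,
    fourier_mapping=[0, 1, 2, 3, None, 4],
)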

Minor Fixes:

  • Fixed small typos within IceMixNodes.

@RasmusOrsoe (Collaborator)

@mobra7 The problem with pre-commit is now fixed in main. If you update the branch, the failing checks in this PR are likely to pass (see #760).

@RasmusOrsoe (Collaborator) left a comment

@mobra7 thank you for giving this a go.

I think we should consider generalizing the FourierEncoder even more than what you propose here. In its current implementation, the method does quite a bit more than just sinusoidal embedding:

  1. Sinusoidal embedding of the features and the unpadded sequence length
  2. Embedding of the boolean "auxiliary" flag into a vector representation
  3. MLP projection of 1. and 2.

In this refactoring, we can consider 2. and 3. to be model- and problem-specific treatments of the embeddings. If we move this functionality inside the model itself instead of having it in the FourierEncoder, it becomes straightforward to re-use the FourierEncoder in different problems. Here is an example:

from typing import Dict, Optional

import torch
from torch import Tensor
from pytorch_lightning import LightningModule

# Re-using graphnet's existing sinusoidal embedding component.
from graphnet.models.components.embedding import SinusoidalPosEmb


class FourierEncoder(LightningModule):
    """Apply sinusoidal positional encodings to sequence representations.

    This method will embed [B,K,D]-dimensional sequences, where B is the
    batch size, K is the sequence length and D is the number of features
    for each step, into sinusoidal positions.
    """

    def __init__(
        self,
        schema: Dict[int, int],
        seq_length: int = 128,
        scaled: bool = False,
        add_sequence_length: bool = True,
    ):
        """Construct `FourierEncoder`.

        Args:
            schema: A dictionary that maps an input column index to an
                embedding frequency. E.g. `schema = {1: 1024, 3: 4012}`.
            seq_length: Desired dimensionality of the base sinusoidal
                positional embeddings.
            scaled: Whether or not to scale the embeddings.
            add_sequence_length: If True, the unpadded length of each
                sequence is also embedded. Defaults to True.
        """
        super().__init__()

        # Member variables
        self.schema = schema
        self._add_sequence_length = add_sequence_length
        self.sin_feature = SinusoidalPosEmb(dim=seq_length, scaled=scaled)
        if add_sequence_length:
            self.sin_length = SinusoidalPosEmb(
                dim=seq_length // 2, scaled=scaled
            )

    def forward(
        self,
        x: Tensor,
        seq_length: Optional[Tensor] = None,
    ) -> Tensor:
        """Apply positional encoding to `x`.

        `x` is assumed to be a [B,K,D]-dimensional array, where B is the
        batch size, K is the padded sequence length and D is the number of
        time-step features.

        Args:
            x: [B,K,D]-dimensional sequence representation of an event.
            seq_length: Unpadded length of each sequence in `x`.
                Defaults to None.

        Returns:
            Embedded [B,K,J]-dimensional sequence `x`.
        """
        embeddings = []
        # Embed each column in the schema, scaled by its frequency
        for col, frequency in self.schema.items():
            embeddings.append(self.sin_feature(frequency * x[:, :, col]))

        # Embed sequence length if specified
        if self._add_sequence_length:
            if seq_length is None:
                raise ValueError(
                    "Must pass `seq_length` when "
                    "`add_sequence_length` = True."
                )
            length = torch.log10(seq_length.to(dtype=x.dtype))
            emb_length = self.sin_length(length).unsqueeze(1)
            # Broadcast the per-event length embedding along the padded
            # sequence dimension K.
            embeddings.append(emb_length.expand(-1, x.shape[1], -1))
        return torch.cat(embeddings, -1)

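A quick usage sketch of this proposed class (shapes and frequency values are arbitrary; the output width follows from the construction above, i.e. 128 + 128 + 64 = 320):

encoder = FourierEncoder(schema={1: 1024, 3: 4012}, seq_length=128)
x = torch.rand(8, 64, 6)               # B=8 events, K=64 padded steps, D=6 features
lengths = torch.randint(1, 65, (8,))   # unpadded length of each sequence
out = encoder(x, seq_length=lengths)   # shape [8, 64, 320]
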
By introducing the argument schema = {1: 1024, 3: 4012}, we give the user complete freedom over which columns of the sequences are embedded and how. Notice that the embedding of boolean values and the MLP projection are not included. Instead, we can place these in the first few lines of the forward pass of DeepIce:

def forward(self, data: Data) -> Tensor:
    """Apply learnable forward pass."""
    x0, mask, seq_length = array_to_sequence(
        data.x, data.batch, padding_value=0
    )
    # `fourier_ext` is the new, proposed version without boolean
    # embedding and MLP projection.
    x = self.fourier_ext(x0, seq_length)
    x = self._sinosodial_mlp(x)  # new MLP layer

Notice here that the boolean embedding is not included. I think we should omit it in the refactor, as I struggle to see the benefit of projecting an integer into a 64-dimensional vector (the default value, I think).
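
For completeness, the _sinosodial_mlp layer referenced above could be a plain projection along these lines. This is a minimal sketch with placeholder sizes; the input width must match the concatenated embedding width (320 in the usage sketch above):

# Inside DeepIce.__init__ (minimal sketch; all sizes are placeholders):
self._sinosodial_mlp = nn.Sequential(
    nn.Linear(320, 2 * hidden_dim),  # 320 = concatenated embedding width
    nn.GELU(),
    nn.Linear(2 * hidden_dim, hidden_dim),
)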

So in summary, I would propose that we:

  1. Rename the existing FourierEncoder and DeepIce methods to FourierEncoderEPJC and DeepIceEPJC, and mention in their docstrings that these are the graphnet implementations of the Kaggle solutions as presented in the EPJ-C publication, and that they contain assumptions that make them hard to use outside the Kaggle dataset.
  2. Introduce refactored versions of FourierEncoder and DeepIce similar to what I proposed above (feel free to deviate a bit), and mention in their respective docstrings that these are versions of the original Kaggle solutions with superficial modifications that make them easier to use outside the scope of the Kaggle competition.
  3. Have the example training script examples/06_train_icemix_model.py use the new, refactored versions of the models.

@mobra7 what do you think?
