
Addressing issue #753 by adding feature mapping for the Fourier Encoder #759

Open · wants to merge 13 commits into main

Conversation


@mobra7 commented Oct 24, 2024

Fix for issue #753:

  • FourierEncoder (which is used in the IceMix model) previously assumed a specific structure for the feature dimension, which caused issues with certain datasets (e.g. when using the standard FEATURES.ICECUBE86).
  • FourierEncoder now accepts a simple list of int or None entries that maps the features of the data to the features used in FourierEncoder.
  • The Fourier mapping can be passed during initialization of the DeepIce model (see the sketch below). Specifying n_features is no longer required.
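
A hypothetical call could look like the following sketch. The keyword name fourier_mapping and the feature layout are illustrative assumptions, not necessarily the exact API in this PR:

# Illustrative sketch only: the keyword name and mapping below are assumed.
# Each list entry maps an encoder feature to a column of the input data;
# `None` marks encoder features that the dataset does not provide.
model = DeepIce(
    hidden_dim=768,
    fourier_mapping=[0, 1, 2, 3, None, 4],
)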

Minor Fixes:

  • Fixed small typos within IceMixNodes.

@RasmusOrsoe (Collaborator)

@mobra7 The problem with pre-commit is now fixed in main. If you update the branch, the failing checks in this PR are likely to pass (see #760).

@RasmusOrsoe (Collaborator) left a comment

@mobra7 thank you for giving this a go.

I think we should consider generalizing the FourierEncoder even more than what you propose here. In its current implementation, the method does quite a bit more than just sinusoidal embedding:

  1. Sinusoidal embedding of the features and the unpadded sequence length
  2. Embedding of the boolean "auxiliary" flag into a vector representation
  3. MLP projection of 1. and 2.

In this refactoring, we can consider 2. and 3. to be model- and problem-specific treatments of the embeddings. If we move this functionality inside the model itself instead of having it in the FourierEncoder, it becomes straightforward to re-use the FourierEncoder in different problems. Here is an example:

from typing import Dict, Optional

import torch
from torch import Tensor
from pytorch_lightning import LightningModule

# Re-using graphnet's existing sinusoidal embedding component.
from graphnet.models.components.embedding import SinusoidalPosEmb


class FourierEncoder(LightningModule):
    """Apply sinusoidal positional encodings to sequence representations.

    This method will embed [B,K,D]-dimensional sequences, where B is the
    batch size, K is the sequence length and D is the number of features
    for each step, into sinusoidal positions.
    """

    def __init__(
        self,
        schema: Dict[int, int],
        seq_length: int = 128,
        scaled: bool = False,
        add_sequence_length: bool = True,
    ):
        """Construct `FourierEncoder`.

        Args:
            schema: A dictionary that maps an input column index to an
                embedding frequency. E.g. `schema = {1: 1024, 3: 4012}`.
            seq_length: Desired dimensionality of the base sinusoidal
                positional embeddings.
            scaled: Whether or not to scale the embeddings.
            add_sequence_length: If True, the unpadded length of each
                sequence is also embedded. Defaults to True.
        """
        super().__init__()

        # Member variables
        self.schema = schema
        self._add_sequence_length = add_sequence_length
        self.sin_feature = SinusoidalPosEmb(dim=seq_length, scaled=scaled)
        if add_sequence_length:
            self.sin_length = SinusoidalPosEmb(
                dim=seq_length // 2, scaled=scaled
            )

    def forward(
        self,
        x: Tensor,
        seq_length: Optional[Tensor] = None,
    ) -> Tensor:
        """Apply positional encoding to `x`.

        `x` is assumed to be a [B,K,D]-dimensional array, where B is the
        batch size, K is the padded sequence length and D is the number of
        time-step features.

        Args:
            x: [B,K,D]-dimensional sequence representation of an event.
            seq_length: Unpadded length of each sequence in `x`.
                Defaults to None.

        Returns:
            Embedded [B,K,J]-dimensional sequence `x`.
        """
        embeddings = []
        # Embed each column in the schema, scaled by its frequency
        for col, frequency in self.schema.items():
            embeddings.append(self.sin_feature(frequency * x[:, :, col]))

        # Embed sequence length if specified
        if self._add_sequence_length:
            if seq_length is None:
                raise ValueError(
                    "Must pass `seq_length` when "
                    "`add_sequence_length` = True."
                )
            length = torch.log10(seq_length.to(dtype=x.dtype))
            emb_length = self.sin_length(length).unsqueeze(1)
            # Broadcast the per-event length embedding along the padded
            # sequence dimension K.
            embeddings.append(emb_length.expand(-1, x.shape[1], -1))
        return torch.cat(embeddings, -1)

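A quick usage sketch of this proposed class (shapes and frequency values are arbitrary; the output width follows from the construction above, i.e. 128 + 128 + 64 = 320):

encoder = FourierEncoder(schema={1: 1024, 3: 4012}, seq_length=128)
x = torch.rand(8, 64, 6)               # B=8 events, K=64 padded steps, D=6 features
lengths = torch.randint(1, 65, (8,))   # unpadded length of each sequence
out = encoder(x, seq_length=lengths)   # shape [8, 64, 320]
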
By introducing the argument schema = {1: 1024, 3: 4012}, we give the user complete freedom over which columns of the sequences are embedded and how. Notice that the embedding of boolean values and the MLP projection are not included. Instead, we can place these in the first few lines of the forward pass of DeepIce:

def forward(self, data: Data) -> Tensor:
    """Apply learnable forward pass."""
    x0, mask, seq_length = array_to_sequence(
        data.x, data.batch, padding_value=0
    )
    # `fourier_ext` is the new, proposed version without boolean
    # embedding and MLP projection.
    x = self.fourier_ext(x0, seq_length)
    x = self._sinosodial_mlp(x)  # new MLP layer

Notice here that the boolean embedding is not included. I think we should omit it in the refactor, as I struggle to see the benefit of projecting an integer into a 64-dimensional vector (the default value, I think).
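
For completeness, the _sinosodial_mlp layer referenced above could be a plain projection along these lines. This is a minimal sketch with placeholder sizes; the input width must match the concatenated embedding width (320 in the usage sketch above):

# Inside DeepIce.__init__ (minimal sketch; all sizes are placeholders):
self._sinosodial_mlp = nn.Sequential(
    nn.Linear(320, 2 * hidden_dim),  # 320 = concatenated embedding width
    nn.GELU(),
    nn.Linear(2 * hidden_dim, hidden_dim),
)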

So in summary, I would propose that we:

  1. Rename the existing FourierEncoder and DeepIce methods to FourierEncoderEPJC and DeepIceEPJC, and mention in their docstrings that these are the graphnet implementations of the Kaggle solutions as presented in the EPJ-C publication, and that they contain assumptions that make them hard to use outside the Kaggle dataset.
  2. Introduce refactored versions of FourierEncoder and DeepIce similar to what I proposed above (feel free to deviate a bit), and mention in their respective docstrings that these are versions of the original Kaggle solutions with superficial modifications that make them easier to use outside the scope of the Kaggle competition.
  3. Have the example training script examples/06_train_icemix_model.py use the new, refactored versions of the models.

@mobra7 what do you think?
