This directory holds the implementation of the model architecture described in the original paper.
```mermaid
flowchart LR;
in(Spacetime Frames);
featureExtractor["Feature Extractor (ResNet18)"];
lstm[LSTM];
mlp[MLP Head];
out(Cochleagram);
in -->|Nx45x3x244x244|featureExtractor;
featureExtractor -->|Nx45x512|lstm;
lstm -->lstm;
lstm -->|Nx45x256|mlp;
mlp-->|Nx45x42|out;
```
We initially designed the architecture exactly as described in the paper.
The input to the feature extractor is a set of 45 spacetime frames. Each of the resulting 45 embeddings is then concatenated with the features of the first color frame, resulting in an augmented embedding for each frame.
We also trained the model with the loss described in the paper (VISLoss).
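The exact definition of VISLoss is not reproduced here. As a hedged placeholder only — not the paper's formula — a simple per-band regression loss on the predicted cochleagram could look like:

```python
import torch
import torch.nn.functional as F

def cochleagram_loss(pred, target):
    """Placeholder regression loss, NOT the paper's exact VISLoss.

    pred, target: (N, T, B) cochleagrams, e.g. (N, 45, 42).
    Returns the mean squared error over all frames and bands.
    """
    return F.mse_loss(pred, target)
```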