
Add AudioToText and AudioToTextPreprocessor class stubs to enable auto class functionality #2265


Merged: 3 commits into keras-team:master on May 23, 2025

Conversation

@harshaljanjani (Collaborator) commented on May 22, 2025

Description of the change

This PR was originally intended as a design diff to make MoonshineAudioConverter and WhisperAudioConverter independent of tf ops, leaving the tf-specific handling to the AudioConverter base class. However, it is now restricted to adding class stubs for auto class functionality. Please review the following considerations, which should be kept in mind for the design diff regardless.

We have two choices in the matter:

a. Make AudioConverter a layer with specific arguments: an __init__(), a call(), and a get_config().

I believe this option is a bit too strict given the current state of the art: we have two classes with highly divergent parameters, and the preprocessing in one fundamentally differs from the other. Adding an argument such as use_spectrogram_features might also be premature, since Whisper itself uses a specific variant, log-mel spectrogram features with Whisper-specific audio preprocessing.

  1. Moonshine: the input is a waveform, and the output is an audio waveform tensor as well. There is some manipulation in terms of padding and normalization, but effectively no feature extraction with STFTs as in Whisper.
  2. Whisper: raw waveform to spectrogram, i.e. a time → time-frequency domain transformation.

b. The choice this PR is based on makes AudioConverter a core ops class that future AudioConverters can subclass. Compatibility issues like these will be ubiquitous when writing AudioConverters against tf-specific ops; with this design, subclasses can use the base class ops without worrying about the tf-specific handling.
I've made an effort here to eliminate the need for developers to worry about the low-level utils the converters share. This would also be a great addition in its own right, because it'll clean up a lot of messy backend-specific work in the audio converters, in my opinion!

This cleans up MoonshineAudioConverter and WhisperAudioConverter.
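
A rough sketch of the shape option (b) implies (hedged; the helper name and padding logic below are illustrative, not the actual PR code):

```python
import keras
from keras import ops

try:
    import tensorflow as tf
except ImportError:
    tf = None


class AudioConverter(keras.layers.Layer):
    """Base class that owns the backend-agnostic low-level ops."""

    def _pad_waveform(self, audio, target_len):
        # Hypothetical shared helper: subclasses never touch tf directly.
        # The tf branch covers graph tracing inside tf.data; assumes a
        # rank-1 waveform for simplicity.
        if tf is not None and isinstance(audio, tf.Tensor):
            pad = tf.maximum(target_len - tf.shape(audio)[-1], 0)
            return tf.pad(audio, [[0, pad]])
        pad = max(target_len - int(ops.shape(audio)[-1]), 0)
        return ops.pad(audio, [[0, pad]])
```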

Checklist

  • I have added all the necessary unit tests for my change (here, the fact that none of the existing tests break is evidence enough that the logic is sound).
  • I have verified that my change does not break existing code and works with all backends (TensorFlow, JAX, and PyTorch).
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have followed the Keras Hub Model contribution guidelines in making these changes.
  • I have followed the Keras Hub API design guidelines in making these changes.
  • I have signed the Contributor License Agreement.

@harshaljanjani added the kokoro:force-run (Runs Tests on GPU) label on May 22, 2025
@kokoro-team removed the kokoro:force-run (Runs Tests on GPU) label on May 22, 2025
@harshaljanjani added the kokoro:force-run (Runs Tests on GPU) label on May 22, 2025
@kokoro-team removed the kokoro:force-run (Runs Tests on GPU) label on May 22, 2025
@harshaljanjani self-assigned this on May 22, 2025
@harshaljanjani (Collaborator, Author) commented on May 22, 2025

@mattdangerw / @divyashreepathihalli CI is green on the refactor! I've gone ahead and added the class stubs. Since it's a subtle design change, it shouldn't break CI. Feel free to review the diff, thanks!

@mattdangerw (Member) left a comment:

Thanks!

Functionality requirements...

  • We should be able to run preprocessing through tf.data on all backends (torch, tf, and jax). We should also be able to run it outside of tf.data, eagerly, batch by batch (see the sketch after this list).
  • We should be able to run the models without jax or torch installed. No unconditional imports of these libraries.
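
As a concrete check, both modes should work (a hedged sketch; `converter` stands in for any AudioConverter instance):

```python
import numpy as np
import tensorflow as tf

waveforms = np.random.uniform(-1.0, 1.0, size=(8, 16000)).astype("float32")

# Inside tf.data: the layer must trace with tf ops, on any Keras backend.
ds = tf.data.Dataset.from_tensor_slices(waveforms).batch(4)
ds = ds.map(lambda x: converter(x))

# Outside tf.data: eager execution, batch by batch, on the active backend.
features = converter(waveforms[:4])
```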

Design considerations...

  • Try to think about keeping our modeling experience consistent as you switch from moonshine to whisper and vice versa. This is one of the big things we want to solve for.
  • There is a lot of backend-specific code here. Consider if we can use a

    ```python
    def preprocessing_function(fn):
        """Wraps a preprocessing function to handle tf tensor conversion."""
    ```

    annotation; see the other preprocessing layers as an example. This handles a lot of cases: converting tensors from any backend, leaving tf.Tensors as-is when running in a graph context in tf.data, and converting tf.RaggedTensors to lists when running eagerly and not on the tf backend.

I think that might cut a lot of code complexity.
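
Usage would look roughly like this (hedged; modeled on how other keras-hub preprocessing layers apply the decorator, and `_extract_features` is a hypothetical method):

```python
from keras_hub.layers import AudioConverter
from keras_hub.src.utils.tensor_utils import preprocessing_function


class MyAudioConverter(AudioConverter):
    @preprocessing_function
    def call(self, audio):
        # The decorator has already normalized the inputs: backend
        # tensors converted, tf.Tensors left as-is inside tf.data
        # graphs, ragged tensors turned into lists when eager off-tf.
        return self._extract_features(audio)
```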

```python
tf = None

import keras
import torch
```
@mattdangerw (Member) commented:

We can't assume torch is installed; no unconditional imports like this. The entire library would break without torch available.
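
For reference, the guarded-import pattern already used for tensorflow could apply here too (a minimal sketch; the `_to_cpu` helper is hypothetical):

```python
try:
    import torch
except ImportError:
    torch = None


def _to_cpu(audio):
    # Only touch torch APIs when torch is actually importable.
    if torch is not None and isinstance(audio, torch.Tensor):
        audio = audio.cpu()
    return audio
```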

@harshaljanjani (Collaborator, Author) replied:

Right, will take care of this the next time around, thanks!

"TensorFlow is required for computing the log mel spectrogram."
)
if isinstance(audio, torch.Tensor):
audio = audio.cpu()
@mattdangerw (Member) commented:

`ops.convert_to_numpy` will already move torch tensors to CPU.
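
A quick illustration (hedged sketch):

```python
import numpy as np
import keras

# convert_to_numpy accepts tensors from any backend and returns a
# NumPy array, moving GPU-resident torch tensors to host itself.
x = keras.ops.ones((2, 3))
x_np = keras.ops.convert_to_numpy(x)
assert isinstance(x_np, np.ndarray)
```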

@harshaljanjani (Collaborator, Author) replied:

I see, I'll remove the redundancy after verifying nothing breaks, thanks!

```
@@ -35,10 +43,6 @@ class AudioConverter(PreprocessingLayer):

    backbone_cls = None

    def audio_shape(self):
```
@mattdangerw (Member) commented:

We need this for task.summary(). Fine to make changes here, but tokenizers show the vocab size, and image converters show the output image size.
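
By analogy, the audio equivalent might look something like this (a hedged sketch; the subclass name and returned shape are hypothetical):

```python
from keras_hub.layers import AudioConverter


class MelSpectrogramConverter(AudioConverter):
    def audio_shape(self):
        # Reported by task.summary(), analogous to a tokenizer's
        # vocabulary size or an image converter's image size.
        return (None, 80)  # e.g. (frames, num_mels), hypothetical
```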

@harshaljanjani (Collaborator, Author) replied:

Will make sure to include this, thanks!


```python
        return log_spec

    def _process_audio_tensor(self, audio, num_samples):
```
@mattdangerw (Member) commented:

Let's think about the design of this base class a little more. Is there a somewhat standard form of audio processing that we can expose on the base class so this class is actually usable directly? Is there a clean way we can expose a very small surface to override in subclasses?

This is a lot of private methods just tossed onto the class here. Are all of these used by both subclasses? Is there a way to expose fewer methods on the base class?

Try to think of a durable design that will hold up well as we add more audio models over the months and years; one possible shape is sketched after the questions below.

  • What is generic to most audio models?
  • What is model specific?
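
One possible answer, sketched (hedged; all names and defaults here are hypothetical, not a settled design):

```python
import keras


class AudioConverter(keras.layers.Layer):
    """Generic steps live on the base class; model hooks stay small."""

    def __init__(self, sampling_rate=16000, max_audio_length=30, **kwargs):
        super().__init__(**kwargs)
        self.sampling_rate = sampling_rate
        self.max_audio_length = max_audio_length

    def call(self, audio):
        # Generic to most audio models: length normalization.
        audio = audio[..., : self.sampling_rate * self.max_audio_length]
        # Model specific: identity for Moonshine, log-mel for Whisper.
        return self.extract_features(audio)

    def extract_features(self, audio):
        return audio  # default: pass the waveform through unchanged
```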

```python
    import tensorflow as tf
except ImportError:
    tf = None


@keras_hub_export("keras_hub.layers.WhisperAudioConverter")
class WhisperAudioConverter(AudioConverter):
```
@mattdangerw (Member) commented:

Try to think about how we can make the interface of these layers more cohesive. Why does Whisper have max_audio_length while Moonshine has max_length, padding, and pad_to_multiple_of? Are these inherent to the models themselves, or did we just copy a bunch of design decisions from a different upstream implementation for each model?
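
To make the divergence concrete (a hedged sketch; argument values are illustrative only):

```python
from keras_hub.layers import MoonshineAudioConverter, WhisperAudioConverter

# Two surfaces for the same underlying idea, as described above:
whisper = WhisperAudioConverter(max_audio_length=30)
moonshine = MoonshineAudioConverter(
    max_length=None, padding=False, pad_to_multiple_of=None
)

# A more cohesive design might settle on one padding vocabulary, e.g.
# max_audio_length + pad_to_multiple_of, shared by every converter.
```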


```python
    def _cond(self, condition, true_fn, false_fn, tensor):
        """Conditional execution based on backend and tensor type."""
        if self._use_tf_graph_ops(tensor) and keras.config.backend() != "torch":
```
@mattdangerw (Member) commented:

This check seems weird. What if you are running preprocessing with tf.data and the backend is torch? Don't we need graph ops then?

@harshaljanjani (Collaborator, Author) replied:

This is something I thought I had learned from the Moonshine PR: there I faced a CI issue with Torch-GPU and fixed it in 92dc884, which involved a conditional check for the torch backend around tf.cond.
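
A graph-aware variant of that check might dispatch on the tracing context instead of the backend (a hedged sketch of the reviewer's point, not the PR's code):

```python
import tensorflow as tf


def cond(condition, true_fn, false_fn):
    # Inside tf.data (or any tf graph) we are tracing regardless of
    # the Keras backend, so tf.cond is required even on torch.
    if isinstance(condition, tf.Tensor) and not tf.executing_eagerly():
        return tf.cond(condition, true_fn, false_fn)
    return true_fn() if condition else false_fn()
```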

@harshaljanjani changed the title from "Add AudioConverter and cleanup MoonshineAudioConverter & WhisperAudioConverter" to "Add AudioToText and AudioToTextPreprocessor class stubs to enable auto class functionality" on May 23, 2025
@harshaljanjani (Collaborator, Author) commented on May 23, 2025

Thank you for the reviews @mattdangerw! For now, the scope of this PR is limited to the class stubs for auto class functionality, as you mentioned. However, we can address the AudioConverter in a future PR after a thorough design discussion with all of you, thanks!

@harshaljanjani marked this pull request as ready for review on May 23, 2025, 11:44
@mattdangerw (Member) left a comment:

Nice! This looks great

@mattdangerw merged commit 536ec93 into keras-team:master on May 23, 2025
8 checks passed
@harshaljanjani deleted the audio-converter branch on May 23, 2025, 16:20