
Add AudioToText and AudioToTextPreprocessor class stubs to enable auto class functionality #2265


Merged: 3 commits into keras-team:master on May 23, 2025

Conversation

@harshaljanjani (Collaborator) commented on May 22, 2025

Description of the change

This PR was originally intended as a design diff to make MoonshineAudioConverter and WhisperAudioConverter independent of tf ops, leaving the tf-specific handling to the AudioConverter base class. However, it is now restricted to adding class stubs for auto class functionality. Please review the following considerations, which should be kept in mind for the design diff regardless.

We have two choices in the matter:

a. Make AudioConverter a layer with specific arguments: an __init__(), a call(), and a get_config().

I believe this option is a bit too strict given the current state of the art: we have two classes with highly divergent parameters, and the preprocessing in one fundamentally differs from the other. Adding an argument such as use_spectrogram_features might also be premature, since Whisper itself uses a specific variant, log-mel spectrogram features with Whisper-specific audio preprocessing.

  1. Moonshine: the input is a waveform, and the output is an audio waveform tensor as well. There is some manipulation in terms of padding and normalization, but effectively no feature extraction with STFTs as in Whisper.
  2. Whisper: raw waveform to spectrogram, i.e. a time → time-frequency domain transformation.

b. The choice this PR is based on makes AudioConverter a core ops class that future AudioConverters can subclass. Compatibility issues like these will be ubiquitous when writing AudioConverters against tf-specific ops; with this design, subclasses can use the base class ops without worrying about the tf-specific handling.
I've made an effort here to eliminate the need for developers to worry about the low-level utils the converters share. This would also be a great addition in its own right, because it'll clean up a lot of messy backend-specific work in the audio converters, in my opinion!

This cleans up MoonshineAudioConverter and WhisperAudioConverter.
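
A rough sketch of the shape option (b) implies (hedged; the helper name and padding logic below are illustrative, not the actual PR code):

```python
import keras
from keras import ops

try:
    import tensorflow as tf
except ImportError:
    tf = None


class AudioConverter(keras.layers.Layer):
    """Base class that owns the backend-agnostic low-level ops."""

    def _pad_waveform(self, audio, target_len):
        # Hypothetical shared helper: subclasses never touch tf directly.
        # The tf branch covers graph tracing inside tf.data; assumes a
        # rank-1 waveform for simplicity.
        if tf is not None and isinstance(audio, tf.Tensor):
            pad = tf.maximum(target_len - tf.shape(audio)[-1], 0)
            return tf.pad(audio, [[0, pad]])
        pad = max(target_len - int(ops.shape(audio)[-1]), 0)
        return ops.pad(audio, [[0, pad]])
```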

Checklist

  • I have added all the necessary unit tests for my change (here, the fact that none of the existing tests break is evidence enough that the logic is sound).
  • I have verified that my change does not break existing code and works with all backends (TensorFlow, JAX, and PyTorch).
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have followed the Keras Hub Model contribution guidelines in making these changes.
  • I have followed the Keras Hub API design guidelines in making these changes.
  • I have signed the Contributor License Agreement.

@harshaljanjani added the kokoro:force-run (Runs Tests on GPU) label on May 22, 2025
@kokoro-team removed the kokoro:force-run (Runs Tests on GPU) label on May 22, 2025
@harshaljanjani added the kokoro:force-run (Runs Tests on GPU) label on May 22, 2025
@kokoro-team removed the kokoro:force-run (Runs Tests on GPU) label on May 22, 2025
@harshaljanjani self-assigned this on May 22, 2025
@harshaljanjani (Collaborator, Author) commented on May 22, 2025

@mattdangerw / @divyashreepathihalli CI is green on the refactor! I've gone ahead and added the class stubs. Since it's a subtle design change, it shouldn't break CI. Feel free to review the diff, thanks!

@mattdangerw (Member) left a comment:

Thanks!

Functionality requirements...

  • We should be able to run preprocessing through tf.data on all backends (torch, tf, and jax). We should also be able to run it outside of tf.data, eagerly, batch by batch (see the sketch after this list).
  • We should be able to run the models without jax or torch installed. No unconditional imports of these libraries.
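
As a concrete check, both modes should work (a hedged sketch; `converter` stands in for any AudioConverter instance):

```python
import numpy as np
import tensorflow as tf

waveforms = np.random.uniform(-1.0, 1.0, size=(8, 16000)).astype("float32")

# Inside tf.data: the layer must trace with tf ops, on any Keras backend.
ds = tf.data.Dataset.from_tensor_slices(waveforms).batch(4)
ds = ds.map(lambda x: converter(x))

# Outside tf.data: eager execution, batch by batch, on the active backend.
features = converter(waveforms[:4])
```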

Design considerations...

  • Try to think about keeping our modeling experience consistent as you switch from moonshine to whisper and vice versa. This is one of the big things we want to solve for.
  • There is a lot of backend-specific code here. Consider if we can use a

    ```python
    def preprocessing_function(fn):
        """Wraps a preprocessing function to handle tf tensor conversion."""
    ```

    annotation; see the other preprocessing layers as an example. This handles a lot of cases: converting tensors from any backend, leaving tf.Tensors as-is when running in a graph context in tf.data, and converting tf.RaggedTensors to lists when running eagerly and not on the tf backend.

I think that might cut a lot of code complexity.
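
Usage would look roughly like this (hedged; modeled on how other keras-hub preprocessing layers apply the decorator, and `_extract_features` is a hypothetical method):

```python
from keras_hub.layers import AudioConverter
from keras_hub.src.utils.tensor_utils import preprocessing_function


class MyAudioConverter(AudioConverter):
    @preprocessing_function
    def call(self, audio):
        # The decorator has already normalized the inputs: backend
        # tensors converted, tf.Tensors left as-is inside tf.data
        # graphs, ragged tensors turned into lists when eager off-tf.
        return self._extract_features(audio)
```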

```python
tf = None

import keras
import torch
```
@mattdangerw (Member) commented:

We can't assume torch is installed; no unconditional imports like this. The entire library would break without torch available.
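
For reference, the guarded-import pattern already used for tensorflow could apply here too (a minimal sketch; the `_to_cpu` helper is hypothetical):

```python
try:
    import torch
except ImportError:
    torch = None


def _to_cpu(audio):
    # Only touch torch APIs when torch is actually importable.
    if torch is not None and isinstance(audio, torch.Tensor):
        audio = audio.cpu()
    return audio
```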

@harshaljanjani (Collaborator, Author) replied:

Right, will take care of this the next time around, thanks!

"TensorFlow is required for computing the log mel spectrogram."
)
if isinstance(audio, torch.Tensor):
audio = audio.cpu()
@mattdangerw (Member) commented:

`ops.convert_to_numpy` will already move torch tensors to CPU.
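
A quick illustration (hedged sketch):

```python
import numpy as np
import keras

# convert_to_numpy accepts tensors from any backend and returns a
# NumPy array, moving GPU-resident torch tensors to host itself.
x = keras.ops.ones((2, 3))
x_np = keras.ops.convert_to_numpy(x)
assert isinstance(x_np, np.ndarray)
```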

@harshaljanjani (Collaborator, Author) replied:

I see, I'll remove the redundancy after verifying nothing breaks, thanks!

```
@@ -35,10 +43,6 @@ class AudioConverter(PreprocessingLayer):

    backbone_cls = None

    def audio_shape(self):
```
@mattdangerw (Member) commented:

We need this for task.summary(). Fine to make changes here, but tokenizers show the vocab size, and image converters show the output image size.
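
By analogy, the audio equivalent might look something like this (a hedged sketch; the subclass name and returned shape are hypothetical):

```python
from keras_hub.layers import AudioConverter


class MelSpectrogramConverter(AudioConverter):
    def audio_shape(self):
        # Reported by task.summary(), analogous to a tokenizer's
        # vocabulary size or an image converter's image size.
        return (None, 80)  # e.g. (frames, num_mels), hypothetical
```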

@harshaljanjani (Collaborator, Author) replied:

Will make sure to include this, thanks!


```python
        return log_spec

    def _process_audio_tensor(self, audio, num_samples):
```
@mattdangerw (Member) commented:

Let's think about the design of this base class a little more. Is there a somewhat standard form of audio processing that we can expose on the base class so this class is actually usable directly? Is there a clean way we can expose a very small surface to override in subclasses?

This is a lot of private methods just tossed onto the class here. Are all of these used by both subclasses? Is there a way to expose fewer methods on the base class?

Try to think of a durable design that will hold up well as we add more audio models over the months and years; one possible shape is sketched after the questions below.

  • What is generic to most audio models?
  • What is model specific?
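
One possible answer, sketched (hedged; all names and defaults here are hypothetical, not a settled design):

```python
import keras


class AudioConverter(keras.layers.Layer):
    """Generic steps live on the base class; model hooks stay small."""

    def __init__(self, sampling_rate=16000, max_audio_length=30, **kwargs):
        super().__init__(**kwargs)
        self.sampling_rate = sampling_rate
        self.max_audio_length = max_audio_length

    def call(self, audio):
        # Generic to most audio models: length normalization.
        audio = audio[..., : self.sampling_rate * self.max_audio_length]
        # Model specific: identity for Moonshine, log-mel for Whisper.
        return self.extract_features(audio)

    def extract_features(self, audio):
        return audio  # default: pass the waveform through unchanged
```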

```python
    import tensorflow as tf
except ImportError:
    tf = None


@keras_hub_export("keras_hub.layers.WhisperAudioConverter")
class WhisperAudioConverter(AudioConverter):
```
@mattdangerw (Member) commented:

Try to think about how we can make the interface of these layers more cohesive. Why does Whisper have max_audio_length while Moonshine has max_length, padding, and pad_to_multiple_of? Are these inherent to the models themselves, or did we just copy a bunch of design decisions from a different upstream implementation for each model?
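
To make the divergence concrete (a hedged sketch; argument values are illustrative only):

```python
from keras_hub.layers import MoonshineAudioConverter, WhisperAudioConverter

# Two surfaces for the same underlying idea, as described above:
whisper = WhisperAudioConverter(max_audio_length=30)
moonshine = MoonshineAudioConverter(
    max_length=None, padding=False, pad_to_multiple_of=None
)

# A more cohesive design might settle on one padding vocabulary, e.g.
# max_audio_length + pad_to_multiple_of, shared by every converter.
```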


```python
    def _cond(self, condition, true_fn, false_fn, tensor):
        """Conditional execution based on backend and tensor type."""
        if self._use_tf_graph_ops(tensor) and keras.config.backend() != "torch":
```
@mattdangerw (Member) commented:

This check seems weird. What if you are running preprocessing with tf.data and the backend is torch? Don't we need graph ops then?

@harshaljanjani (Collaborator, Author) replied:

This is something I thought I had learned from the Moonshine PR: there I faced a CI issue with Torch-GPU and fixed it in 92dc884, which involved a conditional check for the torch backend around tf.cond.
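
A graph-aware variant of that check might dispatch on the tracing context instead of the backend (a hedged sketch of the reviewer's point, not the PR's code):

```python
import tensorflow as tf


def cond(condition, true_fn, false_fn):
    # Inside tf.data (or any tf graph) we are tracing regardless of
    # the Keras backend, so tf.cond is required even on torch.
    if isinstance(condition, tf.Tensor) and not tf.executing_eagerly():
        return tf.cond(condition, true_fn, false_fn)
    return true_fn() if condition else false_fn()
```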

@harshaljanjani changed the title from "Add AudioConverter and cleanup MoonshineAudioConverter & WhisperAudioConverter" to "Add AudioToText and AudioToTextPreprocessor class stubs to enable auto class functionality" on May 23, 2025
@harshaljanjani (Collaborator, Author) commented on May 23, 2025

Thank you for the reviews @mattdangerw! For now, the scope of this PR is limited to the class stubs for auto class functionality, as you mentioned. However, we can address the AudioConverter in a future PR after a thorough design discussion with all of you, thanks!

@harshaljanjani marked this pull request as ready for review on May 23, 2025, 11:44
@mattdangerw (Member) left a comment:

Nice! This looks great

@mattdangerw merged commit 536ec93 into keras-team:master on May 23, 2025
8 checks passed
@harshaljanjani deleted the audio-converter branch on May 23, 2025, 16:20