Add AudioToText and AudioToTextPreprocessor class stubs to enable auto class functionality #2265
Conversation
@mattdangerw / @divyashreepathihalli CI is green on the refactor! I've gone ahead and added the class stubs. Since it's a subtle design change, it shouldn't break CI. Feel free to review the diff, thanks!
Thanks!
Functionality requirements...
- We should be able to run preprocessing through `tf.data` on all backends (torch, tf, and jax). We should also be able to run outside of `tf.data` eagerly, batch by batch.
- We should be able to run the models without jax or torch installed. No unconditional imports of these libraries.
Design considerations...
- Try to think about keeping our modeling experience consistent as you switch from moonshine to whisper and vice versa. This is one of the big things we want to solve for.
- There is a lot of backend-specific code here. Consider if we can use `preprocessing_function` from `keras-hub/keras_hub/src/utils/tensor_utils.py` (lines 43 to 44 in d1d014c):

      def preprocessing_function(fn):
          """Wraps a preprocessing function to handle tf tensor conversion."""

  I think that might cut a lot of code complexity.
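For context, here is a rough sketch of how that wrapper might be applied, assuming it is used as a decorator on `call` (the subclass and method body below are hypothetical, only to illustrate the pattern):

```python
from keras_hub.layers import AudioConverter
from keras_hub.src.utils.tensor_utils import preprocessing_function


class MyAudioConverter(AudioConverter):  # hypothetical subclass for illustration
    @preprocessing_function
    def call(self, audio):
        # The wrapper handles conversion to tf tensors (and back), so the
        # body can use tf ops uniformly regardless of the Keras backend.
        log_spec = ...  # model-specific tf ops would go here
        return log_spec
```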
    tf = None

    import keras
    import torch
We can't assume torch is installed; no unconditional import like this. The entire library would break without torch available.
Right, will take care of this the next time around, thanks!
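For reference, a minimal sketch of the guarded-import pattern being asked for, mirroring how `tf` is already handled in this file (`audio` stands in for the input being processed):

```python
try:
    import torch
except ImportError:
    torch = None  # the library must keep working when torch is absent

# Any torch-specific branch then checks availability first:
if torch is not None and isinstance(audio, torch.Tensor):
    audio = audio.cpu()
```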
"TensorFlow is required for computing the log mel spectrogram." | ||
) | ||
if isinstance(audio, torch.Tensor): | ||
audio = audio.cpu() |
`ops.convert_to_numpy` will already move torch tensors to CPU.
I see, I'll remove the redundancy after verifying nothing breaks, thanks!
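A sketch of the simplification being suggested: `keras.ops.convert_to_numpy` already handles the device transfer, so the explicit `.cpu()` branch (and the torch dependency it carries) can be dropped:

```python
from keras import ops

# convert_to_numpy moves torch tensors to the CPU as part of the
# conversion, so no isinstance(audio, torch.Tensor) check is needed.
audio = ops.convert_to_numpy(audio)
```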
    @@ -35,10 +43,6 @@ class AudioConverter(PreprocessingLayer):

        backbone_cls = None

        def audio_shape(self):
We need this for `task.summary()`. Fine to make changes here, but tokenizers show vocab size, image converters the output image size.
Will make sure to include this, thanks!
        return log_spec

    def _process_audio_tensor(self, audio, num_samples):
Let's think about the design of this base class a little more. Is there a somewhat standard form of audio processing that we can expose on the base class so this class is actually usable directly? Is there a clean way we can expose a very small surface to override in subclasses?
This is a lot of private methods just tossed onto the class here. Are all of these used by both subclasses? Is there a way to expose fewer methods on the base class?
Try to think of a durable design that will hold up well as we add more audio models over months and years (one possible split is sketched after this list):
- What is generic to most audio models?
- What is model specific?
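One hypothetical answer to those questions, purely as a sketch (all method names below are illustrative, not from the PR; `PreprocessingLayer` is the existing base from the diff): keep the generic sample handling in the base class and expose a single model-specific hook.

```python
class AudioConverter(PreprocessingLayer):
    """Base class owning what is generic to most audio models."""

    def call(self, audio):
        audio = self._to_float_tensor(audio)  # generic: dtype/shape handling
        audio = self._pad_or_trim(audio)      # generic: length normalization
        return self.extract_features(audio)   # the only model-specific hook

    def extract_features(self, audio):
        # Subclasses override just this one method: Whisper would return a
        # log-mel spectrogram, Moonshine raw padded samples.
        raise NotImplementedError
```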
    try:
        import tensorflow as tf
    except ImportError:
        tf = None

    @keras_hub_export("keras_hub.layers.WhisperAudioConverter")
    class WhisperAudioConverter(AudioConverter):
Try to think about how we can make the interface of these layers more cohesive. Why does whisper have `max_audio_length` and moonshine have `max_length`, `padding`, and `pad_to_multiple_of`? Are these inherent to the models themselves? Or did we just copy a bunch of design decisions from an upstream implementation differently for each model?
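Purely to illustrate the kind of convergence being asked about (a hypothetical shared signature, not something proposed in the thread):

```python
# If both converters shared one padding vocabulary, switching between
# models would not mean relearning the knobs:
whisper_converter = WhisperAudioConverter(max_length=480_000, padding="max_length")
moonshine_converter = MoonshineAudioConverter(max_length=480_000, padding="max_length")
```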
    def _cond(self, condition, true_fn, false_fn, tensor):
        """Conditional execution based on backend and tensor type."""
        if self._use_tf_graph_ops(tensor) and keras.config.backend() != "torch":
This check seems weird. What if you are running preprocessing with `tf.data` and the backend is torch? Don't we need graph ops then?
This is something I thought I learnt from the Moonshine PR, where I faced a CI issue with Torch-GPU and fixed it like this: 92dc884, which involved a conditional check for the Torch backend for `tf.cond`.
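For context, a rough sketch of the dispatch being debated, choosing graph ops whenever a `tf.data` graph is being traced, independent of the active backend (`_use_tf_graph_ops` is the PR's helper; the body here is illustrative):

```python
def _cond(self, condition, true_fn, false_fn, tensor):
    if self._use_tf_graph_ops(tensor):
        # Inside a tf.data pipeline the condition is symbolic, so a
        # graph-compatible tf.cond is required even on the torch backend.
        return tf.cond(condition, true_fn, false_fn)
    # Eager path: a plain Python conditional suffices.
    return true_fn() if condition else false_fn()
```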
Thank you for the reviews @mattdangerw! For now, the scope of this PR is limited to the class stubs for auto class functionality, as you mentioned. However, we can address the AudioConverter design in a future PR after a thorough design discussion with all of you, thanks!
Nice! This looks great
Description of the change
This PR was originally intended as a design diff to make MoonshineAudioConverter and WhisperAudioConverter independent of tf ops and leave the tf-specific handling to the AudioConverter base class. However, it is now restricted to adding class stubs for auto class functionality. Please review the following considerations that should be kept in mind for the design diff regardless.
We have two choices in the matter:

a. Make AudioConverter a layer with specific arguments, an `__init__()`, `call()`, and `get_config()`.

I believe this option is a bit too strict given that, at present, we have two classes with highly divergent parameters. The preprocessing of one fundamentally differs from the other, and an argument such as `use_spectrogram_features` might be premature, since Whisper uses a specific variant: log-mel spectrogram features with Whisper-specific audio preprocessing.

b. The choice this PR is based on makes AudioConverter a core ops class that future AudioConverters can subclass. Compatibility issues are ubiquitous when writing an AudioConverter with tf-specific ops; instead, subclasses can use the base class ops without worrying too much about the tf-specific handling.

I've taken efforts here to eliminate the need for the developer to worry about the low-level utils they share. This would also be a great addition because it'll clean up a lot of messy backend-specific work in the audio converters, in my opinion!
This cleans up MoonshineAudioConverter and WhisperAudioConverter.
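For reference, a sketch of what the new stubs enable once registered with the auto classes (the preset name below is hypothetical):

```python
import keras_hub

# With the stubs in place, the auto classes can resolve
# audio-to-text models from a preset:
preprocessor = keras_hub.models.AudioToTextPreprocessor.from_preset("moonshine_base_en")
task = keras_hub.models.AudioToText.from_preset("moonshine_base_en")
```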
Checklist