I have the log mel spectrograms of a few audio clips, and I am trying to augment the spectrograms using tfa.image.sparse_image_warp so that time warping can be applied as in Google's SpecAugment.
But I am confused about how to achieve time warping, because the documentation does not explain how to initialize the arguments to sparse_image_warp.
The method declaration looks like this:
```
tfa.image.sparse_image_warp(
    image: tfa.types.TensorLike,
    source_control_point_locations: tfa.types.TensorLike,
    dest_control_point_locations: tfa.types.TensorLike,
    interpolation_order: int = 2,
    regularization_weight: tfa.types.FloatTensorLike = 0.0,
    num_boundary_points: int = 0,
    name: str = 'sparse_image_warp'
) -> tf.Tensor
```
Can someone point out how to initialize source_control_point_locations, dest_control_point_locations and num_boundary_points?
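For reference, here is my best guess at a minimal time-warp call. The shape [1, time, freq, 1], the helper name time_warp, the parameter W, and the single mid-frequency control point are my own assumptions (the paper warps a whole line of points along the frequency axis), so this may well be wrong:

```
import tensorflow as tf
import tensorflow_addons as tfa

def time_warp(spec, W=5):
    # spec: one log mel spectrogram, shape [1, time, freq, 1], float32,
    # with time > 2 * W. W is the maximum warp distance along the time axis.
    num_time = tf.shape(spec)[1]
    mid_freq = tf.cast(tf.shape(spec)[2] // 2, tf.float32)
    # Pick a random source point on the time axis, away from the edges,
    # and a random shift w in [-W, W].
    t = tf.cast(tf.random.uniform([], W, num_time - W, dtype=tf.int32), tf.float32)
    w = tf.cast(tf.random.uniform([], -W, W + 1, dtype=tf.int32), tf.float32)
    # Control points have shape [batch, num_points, 2] in (row, col) order,
    # i.e. (time, freq) here: (t, mid_freq) should move to (t + w, mid_freq).
    src = tf.reshape(tf.stack([t, mid_freq]), [1, 1, 2])
    dst = tf.reshape(tf.stack([t + w, mid_freq]), [1, 1, 2])
    # num_boundary_points=2 pins the four corners plus edge midpoints with
    # zero flow, so the boundary stays fixed while the interior warps.
    warped, _ = tfa.image.sparse_image_warp(
        spec, src, dst, num_boundary_points=2)
    return warped
```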
Bhack
June 19, 2021, 10:45am
#2
I suppose you need to use the DeepSpeech fork of this.
See this thread:
opened 10:25PM - 05 Mar 20 UTC
This is about the API design of the audio processing in tfio. Comments and discussions are welcome.
When we started to get into reading audio for tfio, we began with a feature request from the community for 24-bit WAV files, which was not possible in the tf core repo. Since then, we have also supported `Dataset`-style access, which is essentially a sequential read, and random access (AudioIOTensor), which allows python `__getitem__()` and `__len__()` style access in a tensorflow graph.
Later, we added support for additional audio formats such as `Flac`, `Ogg`, `MP3`, and `MP4`, in addition to the already supported `WAV`. Note that `WAV`, `Flac`, `Ogg`, and `MP3` do not require an external .so dependency. For `MP4`, we use AVFoundation on macOS, but we need FFmpeg's .so files on Linux. We also recently added `resample` as a basic op (not a python class) upon request from the community.
Recently, new requests have come in to add support for decoding in-memory audio into tensors (see PR #815), so the APIs for audio processing in tfio will expand even further.
Initially we did not truly come up with a good audio API design, as those features were added gradually over more than a year. Now that the features are taking shape, I think it is time to lay out tfio.audio and revisit the APIs so that we:
1. Capture the usage from the community for different use cases.
2. Allow expansion in the future (our focus has been on decoding, but encoding will be added soon).
The audio processing in the tf repo is pretty limited and is really not a good template for us to follow. I also looked into pytorch's audio package. While pytorch's APIs are typically very clean, the audio-related APIs are not as clean as some other parts. pytorch also relies on sox, which could be an issue (or sometimes a benefit, depending on the scenario).
We can summarize the use cases that have been covered:
1. AudioIODataset, which is a subclass of tf.data.Dataset and can be passed to tf.keras.
A Dataset is sequential access (not random access) that allows iteration from beginning to end.
The shape and dtype are probed automatically in eager mode; the dtype has to be provided by the user in graph mode.
It is normally file-based (or callback-based) and will not load the whole clip into memory (lazy-loaded).
2. AudioIOTensor, which exposes `__getitem__`, `__len__`, `shape`, `dtype`, and `rate` APIs in python and can be used in graph mode (certain restrictions apply); see the sketch after this list.
An IOTensor is random access that allows reading data at any index (`__getitem__`). This is very useful in situations where the user wants to access just a small clip of the audio. The `shape` API lets the user get the number of channels and samples: `shape = [samples, channels]`.
The shape and dtype are probed automatically in eager mode; the dtype has to be provided by the user in graph mode.
It is file-based (or callback-based) and will not load the whole clip into memory (lazy-loaded).
3. `decode`/`decode_mp3`/`decode_mp4`/etc. (see the discussion in #815), where the API can be used to decode a string (in memory) into a tensor.
This is useful in situations where we have already loaded a small audio clip into memory and just want to decode it.
As a separate discussion, we may want to expose a `metainfo()` API in audio to return the shape, dtype, and rate of the audio clip from kernel ops.
4. `encode`/`encode_mp3`/`encode_mp4`/etc., which is the counterpart of 3.
We haven't added any ops yet, though the need is obvious.
5. `resample`, an audio processing API we recently added (also shown in the sketch below). We may want to expand this further in the future.
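For illustration, here is a rough sketch of how use cases 2 and 5 compose today; the file name and the rates are made up, and the int16-to-float scaling assumes a 16-bit source:

```
import tensorflow as tf
import tensorflow_io as tfio

# Random access into a clip (use case 2), then resample (use case 5).
# 'sample.wav' and the 44100 -> 16000 rates are placeholder values.
audio = tfio.audio.AudioIOTensor('sample.wav')
print(audio.shape, audio.dtype, audio.rate)  # [samples, channels], dtype, rate
clip = audio[16000:32000]                    # lazily reads only this slice
clip = tf.cast(clip, tf.float32) / 32768.0   # int16 samples -> floats in [-1, 1)
resampled = tfio.audio.resample(clip, rate_in=44100, rate_out=16000)
```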
Given the above, I think the layout could be:
```
tfio.audio
- AudioIODataset
- AudioIOTensor
- decode
- decode_mp3
- decode_mp4
- decode_wav
- encode
- encode_mp3
- encode_mp4
- encode_wav
- resample
```
The list itself could be easy, though the details could be challenging. For example, in tf:
```
tf.audio.decode_wav(
contents, desired_channels=-1, desired_samples=-1, name=None
)
```
Do we really want to provide the `desired_channels` and `desired_samples` options? My understanding is that we already provide AudioIODataset and AudioIOTensor, which allow lazy-loaded partial access (either sequential or completely random). I don't see why we would want to manipulate further; any additional manipulation can be done after a tensor has been returned.
For example, reducing stereo to mono is just dropping one dim of the tensor (after the tensor is returned from decode_wav), as in the sketch below.
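A minimal sketch of that point, using the existing tf.audio.decode_wav (the file name is a placeholder):

```
import tensorflow as tf

# Decode first, then do channel manipulation as a plain tensor op.
contents = tf.io.read_file('stereo.wav')            # placeholder path
audio, sample_rate = tf.audio.decode_wav(contents)  # audio: [samples, channels]
mono = audio[:, :1]                                 # keep the first channel
# ...or average the channels instead:
mono = tf.reduce_mean(audio, axis=1, keepdims=True)
```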
Another thing is the dtype. For a custom op to work in graph mode, the shape is not needed because it can be passed as unknown dimensions, but at a minimum a dtype has to be provided (unless it is known beforehand). For example, even in the `decode_wav` case, the above will not work for a 24-bit WAV file. We would have to provide an API with minimal input:
```
tfio.audio.decode_wav(input, dtype, name=None)
```
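Usage would then look something like the following; this is a hypothetical call against the proposed signature (not a shipped API at the time of writing), with a 24-bit file mapped to tf.int32:

```
import tensorflow as tf
import tensorflow_io as tfio

# Hypothetical usage of the proposed tfio.audio.decode_wav: the caller
# supplies the dtype, e.g. tf.int32 to hold 24-bit samples.
contents = tf.io.read_file('sample_24bit.wav')  # placeholder path
audio = tfio.audio.decode_wav(contents, dtype=tf.int32)
```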
There might be some other details to sort out. /cc @jjedele @terrytangyuan @BryanCutler, also cc @faroit @lieff in case you are interested.