Simple audio recognition: Recognizing keywords | TensorFlow Core

I have been trying to follow this guide. However, I feel like it’s severely lacking.
[Simple audio recognition: Recognizing keywords  |  TensorFlow Core](https://Simple audio recognition: Recognizing keywords)

First and foremost, why is there no import for tensorflow-io?

Reading audio files and their labels
def decode_audio(audio_binary):
  audio, _ =
  return tf.squeeze(audio, axis=-1)

Regarding this method, does it allow MP3? Is it possible to load an MP3? And what does tf.squeeze actually do to the audio_binary I provided?

files_ds =
waveform_ds =, num_parallel_calls=AUTOTUNE)

Regarding this piece of code, I would love to know what it does.
def get_spectrogram(waveform):
  # Padding for files with less than 16000 samples
  zero_padding = tf.zeros([16000] - tf.shape(waveform), dtype=tf.float32)

  # Concatenate audio with padding so that all audio clips will be of the 
  # same length
  waveform = tf.cast(waveform, tf.float32)
  equal_length = tf.concat([waveform, zero_padding], 0)
  spectrogram = tf.signal.stft(
      equal_length, frame_length=255, frame_step=128)

  spectrogram = tf.abs(spectrogram)

  return spectrogram

Can I create a spectrogram from an MP3?
What does this do? zero_padding = tf.zeros([16000] - tf.shape(waveform), dtype=tf.float32)
Why are we casting it to float32? waveform = tf.cast(waveform, tf.float32)
Do I need a degree in Sound Engineering to use this? Because it all seems like gibberish to me.

Now the worst part, “Run inference on an audio file”.
sample_file = data_dir/'no/01bb6a2a_nohash_0.wav'
sample_ds = preprocess_dataset([str(sample_file)])
for spectrogram, label in sample_ds.batch(1):
  prediction = model(spectrogram)
  , tf.nn.softmax(prediction[0]))
  plt.title(f'Predictions for "{commands[label[0]]}"')

So let me see: I generated a model and now it’s time to use it! According to this guide, I need to create a dataset with just one entry, call sample_ds.batch(1) because, again, I have just one entry, and then, like magic, I use the model I just created!
Shouldn’t the tutorial instead be explicit about how to correctly save the model (including its classes) and how to use it afterwards? For example, to beep every time I say a trained word into the mic, or to count the occurrences of a trained word in an MP3 file. As it is, I don’t think I could possibly use this to turn my house lights on.

I also checked @tensorflow-models/speech-commands, but again, it’s kind of useless when you can only use a predefined model; it doesn’t explain how to convert a given .h5 model into JSON and then load it.

Sadly, I’m really disappointed. I wanted to be able to do much more with this, and I think it has a future, but as it is, it’s really hard and impractical to use, and my understanding of audio itself is really limited. I also noticed that even when I train this model, the thresholds are ridiculous: if I say a word the model was not trained on, it still assigns it to one of the classes with a really high score. I was expecting more entropy in the classification. Again, I’m no expert in this field, and I was just doing this for fun.
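A note on the high scores for untrained words: a softmax output always sums to 1 across the trained classes, so a closed-set classifier has to put its probability mass somewhere. One common workaround is to reject predictions whose top probability falls below a cutoff. Below is a minimal NumPy sketch of that idea, not something taken from the tutorial. The helper name classify_with_threshold, the 0.8 cutoff, the logit values, and the label list are all assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def classify_with_threshold(logits, labels, threshold=0.8):
    """Return (label, probability), or ("unknown", probability) when the
    top probability is below the (hypothetical) threshold."""
    probs = softmax(np.asarray(logits, dtype=np.float64))
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return "unknown", float(probs[best])
    return labels[best], float(probs[best])

# Assumed label order, for illustration only.
labels = ["down", "go", "left", "no", "right", "stop", "up", "yes"]

# Made-up logits: one clearly dominant class vs. a nearly flat output.
confident = classify_with_threshold(
    [0.1, 0.2, 0.0, 6.0, 0.3, 0.1, 0.2, 0.0], labels)
uncertain = classify_with_threshold(
    [1.0, 1.2, 0.9, 1.4, 1.1, 1.0, 0.8, 1.3], labels)

print(confident)  # top class "no" with probability above the cutoff
print(uncertain)  # falls back to "unknown"
```

A threshold is only a heuristic, since networks can be confidently wrong on sounds they have never seen. Training with an explicit "unknown" or background-noise class is another option worth trying.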

I’m sorry the text isn’t formatted the way I wanted ;_; Still, I would like to hear some feedback. Maybe a good tutorial or something along those lines would be great. I have always been fascinated with sound, and streamlining this would greatly increase TensorFlow usage.

I can recommend a few resources that use TF and signal processing libraries for audio/music, as a start:


Regarding AUTOTUNE, check out the Parallelizing data transformation section of the Better performance with the API guide.

Yes. With the librosa library for audio and music processing in Python, you can:

To learn about short-time Fourier transforms/spectrograms with NumPy, SciPy, matplotlib, and librosa, this Music Information Retrieval (MIR): STFT notebook may be useful: stft. There are more MIR notebooks here:


Adding some annotation, hope this helps:

def get_spectrogram(waveform):
  # Zero-padding for audio waveforms with less than 16,000 samples
  zero_padding = tf.zeros(
      [16000] - tf.shape(waveform),
      dtype=tf.float32)
  # Cast the waveform tensors' dtype to float32.
  waveform = tf.cast(waveform, dtype=tf.float32)
  # Concatenate the waveforms with `zero_padding`, which ensures all audio
  # clips are of the same length.
  equal_length = tf.concat([waveform, zero_padding], 0)
  # Convert the waveforms to spectrograms via a STFT.
  spectrogram = tf.signal.stft(
      equal_length, frame_length=255, frame_step=128)
  # Obtain the magnitude of the STFT.
  spectrogram = tf.abs(spectrogram)

  return spectrogram
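
To see what the padding and the STFT do to the shapes without running TensorFlow, here is a rough NumPy sketch of the same steps. It assumes tf.signal.stft's documented defaults (a periodic Hann window, an FFT length of 256 as the smallest power of two enclosing frame_length=255, and no end padding). The helper names pad_to_length and magnitude_stft are made up for this example, and the random clip stands in for real audio.

```python
import numpy as np

def pad_to_length(waveform, target_len=16000):
    """Right-pad a 1-D waveform with zeros up to target_len samples.

    Like the TensorFlow version, this fails for clips longer than target_len.
    """
    waveform = np.asarray(waveform, dtype=np.float32)
    padding = np.zeros(target_len - waveform.shape[0], dtype=np.float32)
    return np.concatenate([waveform, padding], axis=0)

def magnitude_stft(signal, frame_length=255, frame_step=128, fft_length=256):
    """Magnitude STFT: windowed frames -> real FFT -> absolute value."""
    num_frames = 1 + (len(signal) - frame_length) // frame_step
    n = np.arange(frame_length)
    # Periodic Hann window (assumed to match the tf.signal.stft default).
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / frame_length)
    frames = np.stack([
        signal[i * frame_step: i * frame_step + frame_length] * window
        for i in range(num_frames)
    ])
    # rfft zero-pads each 255-sample frame to 256 and keeps 129 bins.
    return np.abs(np.fft.rfft(frames, n=fft_length, axis=-1))

# A synthetic clip shorter than one second at 16 kHz.
short_clip = np.random.default_rng(0).uniform(-1.0, 1.0, 13654)
padded = pad_to_length(short_clip)
spectrogram = magnitude_stft(padded)

print(padded.shape)       # (16000,)
print(spectrogram.shape)  # (124, 129): 124 time frames x 129 frequency bins
```

So the zero padding only makes every clip exactly 16,000 samples long, which in turn makes every spectrogram the same size, something a fixed-input model needs.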

You can read about precision and running computations on ML hardware here:

Today, most models use the float32 dtype, which takes 32 bits of memory. However, there are two lower-precision dtypes, float16 and bfloat16, each which take 16 bits of memory instead. Modern accelerators can run operations faster in the 16-bit dtypes, as they have specialized hardware to run 16-bit computations and 16-bit dtypes can be read from memory faster.

and here: Reduce costs and increase throughput with NVIDIA T4s, P100s, V100s | Google Cloud Blog.
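
As a quick illustration of the memory side of that trade-off, this small NumPy sketch compares the storage needed for one 16,000-sample clip in different float dtypes. It only shows sizes and says nothing about speed on any particular accelerator. bfloat16 is left out because plain NumPy has no such dtype.

```python
import numpy as np

num_samples = 16000  # one second of audio at 16 kHz

sizes = {}
for dtype in (np.float64, np.float32, np.float16):
    clip = np.zeros(num_samples, dtype=dtype)
    sizes[np.dtype(dtype).name] = clip.nbytes
    print(np.dtype(dtype).name, clip.itemsize * 8, "bits per sample,",
          clip.nbytes, "bytes per clip")
```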

Check out this notebook on Kaggle: Audio Data Analysis Using librosa 📈 | Kaggle

librosa.load()              # load the MP3
librosa.stft()              # convert to the time-frequency domain using the short-time Fourier transform
librosa.display.specshow()  # display a spectrogram

(LibROSA docs)

More useful resources from the world of research:


The original dataset’s waveforms are in mono (single channel):

Each utterance is stored as a one-second (or less) WAVE format file, with the sample data encoded as linear 16-bit single-channel PCM values, at a 16 KHz rate.
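
To make that format concrete, here is a sketch that writes and reads a mono 16-bit PCM WAV using only Python's standard-library wave module plus NumPy, not TensorFlow. The helpers write_mono_pcm16 and decode_pcm16 are hypothetical names, and the 440 Hz tone is synthetic data. Dividing by 32768 is assumed to approximate how a decoder normalizes 16-bit samples to floats in [-1, 1].

```python
import io
import wave

import numpy as np

SAMPLE_RATE = 16000

def write_mono_pcm16(samples_int16, sample_rate=SAMPLE_RATE):
    """Encode int16 samples as an in-memory mono 16-bit PCM WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(samples_int16.astype("<i2").tobytes())
    return buffer.getvalue()

def decode_pcm16(wav_bytes):
    """Decode 16-bit PCM WAV bytes to float32 with shape (samples, channels)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    pcm = np.frombuffer(raw, dtype="<i2").reshape(-1, channels)
    return pcm.astype(np.float32) / 32768.0, rate

# Half a second of a 440 Hz tone at half of full scale.
t = np.arange(SAMPLE_RATE // 2) / SAMPLE_RATE
tone = (0.5 * 32767 * np.sin(2 * np.pi * 440.0 * t)).astype(np.int16)

audio, rate = decode_pcm16(write_mono_pcm16(tone))
print(audio.shape, audio.dtype, rate)  # (8000, 1) float32 16000
```

Note the trailing dimension of size 1: even a mono file decodes to a 2-D array of (samples, channels).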

If you decode the WAV file to a normalized tensor with dtype float32:

test_audio ='data/mini_speech_commands/down/0a9f9af7_nohash_0.wav')
test_decoded_audio, _ =

the resulting tensor shape will be:

tf.Tensor(... shape=(13654, 1), dtype=float32)

where 1 is the mono channel. Before you perform a short-time Fourier transform with the get_spectrogram() function to obtain a spectrogram (a 2D image), you remove the last dimension with tf.squeeze:

test_squeezed_decoded_audio = tf.squeeze(test_decoded_audio)

The input to tf.keras.layers.Conv2D has a shape of (BATCH_SIZE, {IMAGE_SIZE}, CHANNELS), where {IMAGE_SIZE} stands for the height and width dimensions. Therefore, you add the mono channel (1) back in the get_spectrogram_and_label_id() function with spectrogram = tf.expand_dims(spectrogram, axis=-1) (thanks @markdaoust).
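
The shape bookkeeping described above can be traced with NumPy stand-ins for tf.squeeze and tf.expand_dims. This is only a sketch of the shapes. The arrays are zeros, not real audio, and the (124, 129) spectrogram size is an assumption based on a 16,000-sample clip with frame_length=255 and frame_step=128.

```python
import numpy as np

# Decoded mono audio: (samples, channels), as in the tensor shown in the thread.
decoded = np.zeros((13654, 1), dtype=np.float32)

# Drop the channel axis so the STFT sees a plain 1-D signal.
squeezed = np.squeeze(decoded, axis=-1)
print(squeezed.shape)      # (13654,)

# Stand-in for the STFT magnitude of a padded clip: (time frames, frequency bins).
spectrogram = np.zeros((124, 129), dtype=np.float32)

# Add the channel axis back so the spectrogram looks like a one-channel image.
with_channel = np.expand_dims(spectrogram, axis=-1)
print(with_channel.shape)  # (124, 129, 1)

# Batching adds the leading axis: (BATCH_SIZE, height, width, CHANNELS).
batch = np.stack([with_channel])
print(batch.shape)         # (1, 124, 129, 1)
```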

You can also learn more about short-time Fourier transforms in the following external notebooks (they were created to accompany a book):

Here’s another potentially useful resource:


The tutorial has recently been updated :+1:


@joao_marques Spotify recently open-sourced a TensorFlow 2 library which you may find helpful: GitHub - spotify/realbook: Easier audio-based machine learning with TensorFlow.

Realbook is a Python library for easier training of audio deep learning models with Tensorflow made by Spotify’s Audio Intelligence Lab. Realbook provides callbacks (e.g., spectrogram visualization) and well-tested Keras layers (e.g., STFT, ISTFT, magnitude spectrogram) that we often use when training. These functions have helped standardize consistency across all of our models, and we hope realbook will do the same for the open source community.

Realbook contains a number of layers that convert audio data (i.e.: waveforms) into various spectral representations (i.e.: spectrograms). For convenience, the amount of memory required for the most commonly used layers is provided below.