Simple audio recognition: Recognizing keywords | TensorFlow Core

I have been trying to follow the guide “Simple audio recognition: Recognizing keywords” (TensorFlow Core). However, I feel like it’s severely lacking.

First and foremost, why is there no import for tensorflow-io?

Reading audio files and their labels
def decode_audio(audio_binary):
  audio, _ =
  return tf.squeeze(audio, axis=-1)

Regarding this method, does it allow mp3? Is it possible to load an mp3? What does tf.squeeze actually do to the audio_binary I provided?
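On the tf.squeeze question: assuming the elided call above is a WAV decoder that returns the audio as a [samples, channels] tensor (an assumption, since the right-hand side is missing from the post), a mono clip comes back with shape [samples, 1]. Squeezing with axis=-1 would then only drop that trailing channel axis and leave the sample values unchanged. The sketch below uses NumPy's np.squeeze as a stand-in for tf.squeeze, on a made-up four-sample clip.

```python
import numpy as np

# Hypothetical decoded mono clip: 4 samples, 1 channel -> shape (4, 1).
decoded = np.array([[0.0], [0.5], [-0.5], [0.25]], dtype=np.float32)

# Drop the trailing channel axis; the sample values stay the same.
waveform = np.squeeze(decoded, axis=-1)

print(decoded.shape)   # (4, 1)
print(waveform.shape)  # (4,)
```

If that decoder is WAV-only, an mp3 would need to be converted to WAV or decoded by another library before it reaches this function.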

files_ds =
waveform_ds =, num_parallel_calls=AUTOTUNE)

Regarding this piece of code, I would love to know what it does.
def get_spectrogram(waveform):
  # Padding for files with less than 16000 samples
  zero_padding = tf.zeros([16000] - tf.shape(waveform), dtype=tf.float32)

  # Concatenate audio with padding so that all audio clips will be of the 
  # same length
  waveform = tf.cast(waveform, tf.float32)
  equal_length = tf.concat([waveform, zero_padding], 0)
  spectrogram = tf.signal.stft(
      equal_length, frame_length=255, frame_step=128)

  spectrogram = tf.abs(spectrogram)

  return spectrogram
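For readers wondering what the STFT step computes, here is a minimal NumPy sketch of a magnitude spectrogram: the signal is cut into overlapping frames, each frame is windowed and Fourier-transformed, and the absolute value is kept. The frame_length=255 and frame_step=128 values come from the code above; the FFT size of 256, the symmetric Hann window, and the 440 Hz test tone are assumptions made for illustration, so the numbers need not match tf.signal.stft exactly.

```python
import numpy as np

def magnitude_stft(signal, frame_length=255, frame_step=128, fft_length=256):
    """Slice into overlapping frames, apply a window, FFT each frame, take abs."""
    num_frames = 1 + (len(signal) - frame_length) // frame_step
    window = np.hanning(frame_length)  # symmetric Hann window (assumed here)
    frames = np.stack([
        signal[i * frame_step : i * frame_step + frame_length] * window
        for i in range(num_frames)
    ])
    # rfft zero-pads each frame to fft_length and keeps fft_length // 2 + 1 bins.
    return np.abs(np.fft.rfft(frames, n=fft_length, axis=-1))

# One second of a 440 Hz tone sampled at 16 kHz (made-up test input).
t = np.arange(16000) / 16000.0
tone = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)

spectrogram = magnitude_stft(tone)
print(spectrogram.shape)  # (124, 129): 124 time frames, 129 frequency bins
```

Each row is one short time slice and each column is one frequency band, which is why the result can be treated like an image by a convolutional model.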

Can I create a spectrogram from an mp3?
What does this do? zero_padding = tf.zeros([16000] - tf.shape(waveform), dtype=tf.float32)
Why are we casting it to float32? waveform = tf.cast(waveform, tf.float32)
Do I need a degree in Sound Engineering to use this? Because it all seems like gibberish to me.
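On the padding and cast questions: tf.shape(waveform) is a one-element vector holding the clip length n, so [16000] - tf.shape(waveform) evaluates to [16000 - n], which is the shape of the block of zeros needed to bring the clip up to 16000 samples. The cast is there because tf.concat needs both pieces to have the same dtype. A rough NumPy equivalent, using a made-up 12000-sample clip:

```python
import numpy as np

TARGET_LEN = 16000  # samples; one second if the clips are sampled at 16 kHz (assumed)

# Hypothetical short clip of 12000 samples, stored as float64.
waveform = np.random.default_rng(0).uniform(-1.0, 1.0, size=12000)

# Number of zeros needed to reach the target length: 16000 - 12000 = 4000.
zero_padding = np.zeros(TARGET_LEN - waveform.shape[0], dtype=np.float32)

# TensorFlow requires matching dtypes for concat; NumPy would silently
# upcast instead, so the cast is kept here only to mirror the original code.
waveform = waveform.astype(np.float32)
equal_length = np.concatenate([waveform, zero_padding], axis=0)

print(zero_padding.shape)  # (4000,)
print(equal_length.shape)  # (16000,)
print(equal_length.dtype)  # float32
```

Note that a clip longer than 16000 samples would make the padding size negative, so this approach only works for clips at or under the target length.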

Now the worst part, “Run inference on an audio file”.
sample_file = data_dir/'no/01bb6a2a_nohash_0.wav'
sample_ds = preprocess_dataset([str(sample_file)])
for spectrogram, label in sample_ds.batch(1):
  prediction = model(spectrogram)
  , tf.nn.softmax(prediction[0]))
  plt.title(f'Predictions for "{commands[label[0]]}"')

So let me see: I generated a model and now it’s time to use it! According to this guide, I need to create a dataset with just one entry, call sample_ds.batch(1) because, again, I have just one entry, and then, as if by magic, I use the model I just created!
Shouldn’t the tutorial instead be explicit about how to correctly save the model (including its classes) and then how to use it? For example, to beep every time I say a trained word into the mic, or to count the occurrences of a trained word in an mp3 file. As it is, I don’t think I could possibly use this to turn my house lights on.
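One way to keep the class names together with a trained model is to write them to a small JSON file next to the saved model and read them back at inference time. The sketch below covers only that label bookkeeping in plain Python. The file name labels.json, the helper names, the command list, and the example logits are all hypothetical, and saving the network itself (for example with Keras model.save) is left out.

```python
import json
import os
import tempfile

def save_labels(commands, path):
    """Write the class names in the same order the model was trained with."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(list(commands), f)

def load_labels(path):
    """Read the class names back, preserving their order."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def decode_prediction(logits, labels):
    """Map the highest-scoring output index to its class name."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return labels[best]

# Hypothetical class list and model output for one clip.
commands = ["down", "go", "left", "no", "right", "stop", "up", "yes"]
labels_path = os.path.join(tempfile.mkdtemp(), "labels.json")
save_labels(commands, labels_path)

restored = load_labels(labels_path)
logits = [0.1, 0.3, -1.2, 4.0, 0.0, 0.2, -0.5, 1.1]
print(decode_prediction(logits, restored))  # no
```

The important detail is the ordering: the index the model outputs only means something if the label list is restored in exactly the order used during training.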

I also checked @tensorflow-models/speech-commands, but again it’s kind of useless when you can only use a pre-defined model, and it doesn’t explain how to convert a given .h5 model into JSON and then how to load it.

Sadly, I’m really disappointed. I wanted to be able to do much more with this, and I think it has a future, but as it is, it is really hard and impractical to use. My understanding of audio itself is really limited. I also noticed that even after I train this model, the scores are ridiculous: if I say a word the model was not trained on, it assigns it to one of the classes with a really high score. I was expecting more entropy in the classification. Again, I’m no expert in this field and I was just doing this for fun.
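Part of this behavior follows from softmax itself: the scores over the trained classes always sum to 1, so a word the model has never seen still has to land in one of them. A partial mitigation is to reject predictions whose top probability is below a cutoff. The sketch below shows that idea in plain Python; the 0.8 cutoff, the function names, and the example logits are hypothetical, and since a model can still be confidently wrong on unfamiliar input, training with an explicit "unknown" class may work better.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_with_rejection(logits, labels, min_prob=0.8):
    """Return (label, probability), or ("unknown", probability) below the cutoff."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] < min_prob:
        return "unknown", probs[best]
    return labels[best], probs[best]

labels = ["no", "yes", "stop"]
print(classify_with_rejection([5.0, 0.5, 0.2], labels)[0])  # no
print(classify_with_rejection([1.0, 0.9, 0.8], labels)[0])  # unknown
```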

I’m sorry the text is not formatted the way I wanted ;_; Still, I would like to hear some feedback. Maybe a good tutorial or something along those lines would be great. I have always been fascinated with sound, and streamlining this would greatly increase TensorFlow usage.

I can recommend a few resources that use TF and signal processing libraries for audio/music, as a start:


Regarding AUTOTUNE, check out the “Parallelizing data transformation” section of the “Better performance with the tf.data API” guide.

Yes. With the librosa library for audio and music processing in Python, you can:

To learn about short-time Fourier transforms/spectrograms with NumPy, SciPy, matplotlib, and librosa, this Music Information Retrieval (MIR): STFT notebook may be useful: stft. There are more MIR notebooks here: