Implementing a CNN LSTM architecture for audio segmentation

Hi everyone,
I’m trying to implement a part of this paper: https://people.kth.se/~ghe/pubs/pdf/szekely2019casting.pdf

This part specifically:

Mel-spectrograms were extracted using the Librosa Python package with a window width of 20 ms and 2.5 ms hop length. The resulting spectrograms for two seconds of audio have 128×800 pixels. Zero crossing rates were calculated on the same windows. The neural network was implemented in Keras following the architecture in Figure 1. The first convolutional layer used 16 2D filters (size 3×3, stride 1×1) and ReLU nonlinearities, followed by batch normalisation and 5×4 max-pooling in both time and frequency. The second 2D convolutional layer used 8 filters in the frequency domain (4×1) and ReLU, followed by batch norm and 6×5 max pooling. Due to downsampling by the pooling layers, this produced 40 1×1 cells with 8 channels at a rate of 20 times per second. These were fed into a bidirectional LSTM layer of 8 hidden units in each direction, followed by a softmax output layer. The network was randomly initialised and trained for 40 epochs to minimise cross-entropy using Adadelta (with default parameters) batches of 16 two-second spectrogram excerpts. The softmax outputs can be interpreted as estimated per-frame class probabilities and used to automatically annotate the held-out episodes. Prior to further processing by either method, the temporal coherence of the automatic annotations was improved by merging mixed speech after a single-speaker segment into that speaker’s speech.

This is what I have:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential(
    [
        keras.Input(shape=(128, 800, 2)),
        layers.Conv2D(16, (3, 3), activation='relu'),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(5, 4)),
        layers.Conv2D(8, (4, 1), activation='relu'),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(6, 5)),
        layers.Bidirectional(layers.LSTM(8)),
        layers.Dense(7),
    ]
)

model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adadelta(),
    metrics=["accuracy"],
)

model.fit(x_train, y_train, epochs=40, batch_size=16)

Can someone please help?

What do you need in particular?
Do you need help preparing the audio data?

Thanks for the response @Bhack
The code that I have shared above should implement the model architecture specified in the block quotes.

But my code clearly does not do that. I’d really be grateful if you could look at my code and suggest edits to it so that it represents the architecture described in the block quotes.

The architecture of the model is also given as an image in the paper whose link I have shared above. I'm not able to attach images (because of permissions, I guess); otherwise I would have added that as well.

I am able to upload the image of the architecture now @Bhack

Some quick updates: I added a TimeDistributed layer.

Now the code looks like this:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential(
    [
        keras.Input(shape=(128, 800, 2)),
        layers.Conv2D(16, (3, 3), activation='relu', padding='same'),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(5, 4)),
        layers.Conv2D(8, (4, 1), activation='relu', padding='same'),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(6, 5)),
        layers.TimeDistributed(layers.Flatten()),
        layers.Bidirectional(layers.LSTM(8)),
        layers.Dense(7),
    ]
)

model.summary()

model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adadelta(),
    metrics=["accuracy"],
)

This is some progress, but still far from what is described in the image and the description above.

Maybe because you don't have strides in the MaxPooling layers; the paper does not specify them.
Also, the input in the paper is 128x800, while yours is 128x800x2.
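(Although, checking the Keras docs: when strides is None, MaxPooling2D defaults it to pool_size, so the pooling should already be non-overlapping. For example:

from tensorflow.keras import layers

# With strides=None (the default), Keras sets strides equal to pool_size,
# so these two layers behave identically:
layers.MaxPooling2D(pool_size=(5, 4))
layers.MaxPooling2D(pool_size=(5, 4), strides=(5, 4))

)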

@Kzyh I don't think so. They say:

However, we also investigated augmenting the frequency-domain spectra with time-domain information to improve classification. In particular, the zero-crossing rate (ZCR) has been shown to be an effective feature for differentiating breath events from unvoiced fricatives [25, 13] and has also been of interest for detecting overlapped speech [26]. It is defined as the number of times the audio waveform changes sign divided by the total number of samples in the window.

and more importantly

We added ZCR as another image channel to each spectrogram cell

So the input shape is fine I think.
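To be concrete, this is how I picture the two-channel input being built (a dummy-shape sketch, not their actual code):

import numpy as np

log_mel = np.random.rand(128, 800)   # placeholder log-mel spectrogram (128 bins x 800 frames)
zcr = np.random.rand(1, 800)         # placeholder ZCR, one value per window

# Repeat the single ZCR row across all 128 mel bins, then stack it as a
# second image channel next to the spectrogram
x = np.stack((log_mel, np.tile(zcr, (128, 1))), axis=-1)
print(x.shape)                       # (128, 800, 2)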

@Kzyh the paper does say, however, that

The second 2D convolutional layer used 8 filters in the frequency domain

and

…this produced 40 1×1 cells with 8 channels at a rate of 20 times per second. These were fed into a bidirectional LSTM layer of 8 hidden units in each direction…

I am unsure about these parts.

I think the last pooling layer should output [x, 1, 40, 8]; then, after a squeeze to [x, 40, 8], you can feed this to the LSTM.
But with the parameters used in the paper, the last pooling outputs [x, 3, 39, 8].
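My arithmetic, for reference (valid convolutions shrink each axis by kernel_size - 1, and pooling floor-divides by pool_size):

# Frequency/time sizes through the stack with padding='valid':
f, t = 128, 800
f, t = f - 2, t - 2    # Conv2D 3x3, valid -> 126 x 798
f, t = f // 5, t // 4  # MaxPool 5x4       -> 25 x 199
f, t = f - 3, t        # Conv2D 4x1, valid -> 22 x 199
f, t = f // 6, t // 5  # MaxPool 6x5       -> 3 x 39
print(f, t)            # 3 39 -> (batch, 3, 39, 8), not (batch, 1, 40, 8)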

@Kzyh when I set padding='valid' instead of padding='same', model.summary() gives:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 128, 800, 16)      304       
_________________________________________________________________
batch_normalization (BatchNo (None, 128, 800, 16)      64        
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 25, 200, 16)       0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 25, 200, 8)        520       
_________________________________________________________________
batch_normalization_1 (Batch (None, 25, 200, 8)        32        
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 4, 40, 8)          0         
_________________________________________________________________
time_distributed (TimeDistri (None, 4, 320)            0         
_________________________________________________________________
bidirectional (Bidirectional (None, 16)                21056     
_________________________________________________________________
dense (Dense)                (None, 7)                 119       
=================================================================
Total params: 22,095
Trainable params: 22,047
Non-trainable params: 48
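One thing I notice in this summary: TimeDistributed(Flatten()) is iterating over axis 1, which here is frequency (4 steps of 320 features), not time. My understanding is that a Permute before flattening would put the time axis first (a sketch, assuming the pooled tensor is (freq=4, time=40, channels=8)):

from tensorflow import keras
from tensorflow.keras import layers

probe = keras.Sequential([
    keras.Input(shape=(4, 40, 8)),             # (freq, time, channels) after pooling
    layers.Permute((2, 1, 3)),                 # -> (40, 4, 8): time axis first
    layers.TimeDistributed(layers.Flatten()),  # -> (40, 32): one vector per time step
])
print(probe.output_shape)                      # (None, 40, 32)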

After some reading I came up with these two models, @Kzyh @Bhack. I'm still unsure whether this is the right implementation of the architecture described above (the image shared in the replies and the description in the intro).

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential(
    [
        keras.Input(shape=(128, 800, 2)),
        layers.Conv2D(16, (3, 3), activation='relu', padding='same'),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(5, 4)),
        layers.Conv2D(8, kernel_size=(4, 1), strides=(4, 1), activation='relu', padding='same'),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(6, 5)),
        layers.Reshape((40, 8)),
        layers.Bidirectional(layers.LSTM(8, return_sequences=True)),
        layers.Dense(7),
    ]
)

model.summary()
And the second one:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential(
    [
        keras.Input(shape=(128, 800, 2)),
        layers.Conv2D(16, (3, 3), activation='relu', padding='same'),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(5, 4)),
        layers.Conv2D(8, (4, 1), strides=(4, 1), activation='relu', padding='same'),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(6, 5)),
        layers.TimeDistributed(layers.Flatten()),
        layers.Bidirectional(layers.LSTM(8, return_sequences=True)),
        layers.Dense(7),
    ]
)

model.summary()

model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adadelta(),
    metrics=["accuracy"],
)
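As a quick sanity check against the paper's "40 cells with 8 channels", I fed a dummy batch through (just checking shapes, no training):

import numpy as np

dummy = np.zeros((1, 128, 800, 2), dtype=np.float32)
# With return_sequences=True I expect one 7-way prediction per LSTM step,
# i.e. (1, 40, 7) if the downsampling really yields 40 time steps.
print(model(dummy).shape)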

If someone could look at the code and the architecture described above and let me know if I am missing something, that would be really helpful.


First model seems good. Don't know about the second one. Did you try training them?

Nope. I have never worked with speech input before, so I am unsure about the data pipeline for the model as well.

So as of now:

  1. I have a 52-minute-long audio file which I have annotated into 7 different categories by preparing a .txt file whose rows look like:
    start-time, end-time, class

and

  2. for feature extraction I have this code:
import numpy as np
import librosa

sr = 44100
frame_length = 4096
hop_length = 1024

# Stream the long recording in blocks of 800 STFT frames each
stream = librosa.stream('final.wav', block_length=800,
                        frame_length=frame_length, hop_length=hop_length)

mel_specs_log_zcr = []
for y in stream:
    # Log-mel spectrogram of the block: (128, n_frames)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, center=False,
                                         n_fft=frame_length, hop_length=hop_length)
    log_mel = librosa.power_to_db(mel)
    # ZCR over the same windows, tiled across the 128 mel bins
    zcr = librosa.feature.zero_crossing_rate(y, center=False,
                                             frame_length=frame_length,
                                             hop_length=hop_length)
    zcr = np.tile(zcr, (128, 1))
    # Stack as two image channels: (128, n_frames, 2)
    mel_specs_log_zcr.append(np.stack((log_mel, zcr), axis=2))
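As an aside, I'm not sure my STFT parameters match the paper: 20 ms windows with a 2.5 ms hop at 44.1 kHz would be roughly n_fft=882 and hop_length=110 samples (so two seconds gives about 800 frames), whereas I'm using 4096/1024. A sketch with paper-style values, in case that matters:

import librosa

sr = 44100
win = int(0.020 * sr)     # 20 ms window  -> 882 samples
hop = round(0.0025 * sr)  # 2.5 ms hop    -> ~110 samples

y, _ = librosa.load('final.wav', sr=sr, duration=2.0)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128,
                                     n_fft=win, hop_length=hop)
print(mel.shape)  # approximately (128, 800) for two seconds of audio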

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential(
    [
        keras.Input(shape=(128, 800, 2)),
        layers.Conv2D(16, (3, 3), activation='relu', padding='same'),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(5, 4)),
        layers.Conv2D(8, kernel_size=(4, 1), strides=(4, 1), activation='relu', padding='same'),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(6, 5)),
        layers.Reshape((40, 8)),
        layers.Bidirectional(layers.LSTM(8, return_sequences=True)),
        layers.Dense(7),
    ]
)

model.summary()

model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adadelta(),
    metrics=["accuracy"],
)

# Parse the annotation file: one tab-separated row per segment
with open("final.txt", "r") as f:
    text = f.read()

labels = []
for t in text.split("\n"):
    labels.append(t.split("\t"))

x_train = mel_specs_log_zcr
y_train = np.array(labels)

model.fit(x_train, y_train, batch_size=64, epochs=10)

This is what I have. But obviously this is not how the data should be fed into the CNN LSTM model.

So I get the error:

ValueError: Data cardinality is ambiguous:
x sizes: 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128
y sizes: 1007
Make sure all arrays contain the same number of samples.

@Kzyh

Can you share your final.txt file?

@Kzyh final.txt is basically the annotation file used to produce the labels.

There are 1007 rows and every row has 3 values: start time, end time, and the category.
These values are tab-separated. Something like this.

Can't upload .txt files on the forum :(

I think your input and labels should be something like this:
x_train shape: (batch, 128, 800, 2)
y_train shape: (batch, 7)

y_train is calculated from your final.txt file using the timestamps.
Let's say your first spectrogram starts at 0 and ends at 72.13. The label would look like [n, s1, i1, s1, i1, s1, i1]; then you change it to class ids: [1, 2, 3, 2, 3, 2, 3].