Deep Learning NN for audio: data preparation, loss, evaluation question

Hi All!

I work with audio-processing neural networks for tasks such as noise suppression.
Currently, I am investigating existing NNs for such tasks and trying to reproduce the whole training flow to get similar results.

I have some questions about data preparation and loss calculation, because it is hard to find advice or published results on these topics.

Most applications use an STFT or mel-cepstrum representation of the audio and process that with a NN, but my approach works in the time domain. Simplified, the structure is: Audio_Signal_With_Noise → NN → Audio_Signal

So the NN block should take a time-domain audio signal as input and produce a new time-domain signal with the noise removed. Let's say that for the loss function I will use the SDR metric.
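To make the question concrete, here is what I mean by an SDR-based loss, as a minimal negative scale-invariant SDR (SI-SDR) sketch in TensorFlow. The `(batch, samples)` layout, the function name, and the `eps` stabilizer are my own assumptions, not taken from any specific paper:

```python
import tensorflow as tf

def negative_si_sdr(y_true, y_pred, eps=1e-8):
    """Negative SI-SDR, averaged over the batch.

    y_true, y_pred: float32 tensors of shape (batch, samples).
    """
    # Remove the mean so the measure is invariant to any DC offset.
    y_true = y_true - tf.reduce_mean(y_true, axis=-1, keepdims=True)
    y_pred = y_pred - tf.reduce_mean(y_pred, axis=-1, keepdims=True)

    # Project the estimate onto the target to get the scaled reference.
    dot = tf.reduce_sum(y_true * y_pred, axis=-1, keepdims=True)
    energy = tf.reduce_sum(tf.square(y_true), axis=-1, keepdims=True)
    target = dot / (energy + eps) * y_true

    # Everything left over counts as error.
    noise = y_pred - target
    ratio = tf.reduce_sum(tf.square(target), axis=-1) / (
        tf.reduce_sum(tf.square(noise), axis=-1) + eps)
    si_sdr = 10.0 * tf.math.log(ratio + eps) / tf.math.log(10.0)

    # Negate, since maximizing SI-SDR means minimizing the loss.
    return -tf.reduce_mean(si_sdr)
```

Because SI-SDR is scale-invariant, a loss like this would not by itself force the network output into any particular amplitude range, which is part of why I am asking about normalization.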

My questions are:

  1. Should I train my NN to produce an output signal normalized to the -1:1 range? If yes, what is the most efficient training approach: normalize the output of the neural network to -1:1 and then calculate the loss during training, or just pass normalized X and Y to the NN and train it like that?
  2. What about the mean of the audio? Should I normalize the audio to -1:1 and expect a zero-mean output signal, or fix the output of the NN by subtracting the mean, i.e. nn_output = nn_output - mean(nn_output)? Or should I train the NN on data normalized from 0 to 1? (See the sketch after this list for what I mean by each option.)
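For concreteness, these are the preprocessing variants I am asking about; the helper names and the `eps` guard are just mine, to illustrate the question:

```python
import tensorflow as tf

def peak_normalize(x, eps=1e-8):
    """Question 1: scale each waveform into [-1, 1] by its peak amplitude."""
    return x / (tf.reduce_max(tf.abs(x), axis=-1, keepdims=True) + eps)

def remove_dc(x):
    """Question 2, option A: force zero mean by subtracting the mean."""
    return x - tf.reduce_mean(x, axis=-1, keepdims=True)

def minmax_01(x, eps=1e-8):
    """Question 2, option B: rescale each waveform into [0, 1]."""
    lo = tf.reduce_min(x, axis=-1, keepdims=True)
    hi = tf.reduce_max(x, axis=-1, keepdims=True)
    return (x - lo) / (hi - lo + eps)
```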

Hi @Andrii_Tsemko

Welcome to the TensorFlow Forum!

You can prepare the audio signal data by trimming the noise and then use it in the model. Please refer to this Simple Audio Recognition tutorial for more information. Thank you.
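As a rough illustration only (the amplitude threshold, the helper names, and the assumption that the waveform contains at least one sample above the threshold are mine; the linked tutorial shows a complete pipeline), loading and trimming could look like:

```python
import tensorflow as tf

def load_wav(path):
    """Decode a 16-bit PCM WAV file into a float32 waveform in [-1, 1]."""
    audio, sample_rate = tf.audio.decode_wav(
        tf.io.read_file(path), desired_channels=1)
    return tf.squeeze(audio, axis=-1), sample_rate

def trim_silence(waveform, threshold=0.01):
    """Drop leading/trailing samples whose amplitude is below the threshold.

    Assumes at least one sample exceeds the threshold; otherwise the
    reductions below would fail on an empty index set.
    """
    loud = tf.where(tf.abs(waveform) > threshold)
    start = tf.reduce_min(loud)
    end = tf.reduce_max(loud)
    return waveform[start:end + 1]
```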