Deep learning NN for audio: questions about data preparation, loss, and evaluation

Hi All!

I work with audio-processing neural networks for tasks such as noise suppression.
Currently, I am investigating existing NNs for such tasks and trying to reproduce the whole training flow to get similar results.

I have some questions about data preparation and loss calculation principles, because it is hard to find advice/results about them.

Most applications use an STFT or Mel-cepstrum representation of the audio and process that with a NN, but my approach works directly in the time domain. Simplified, the structure is: Audio_Signal_With_Noise → NN → Audio_Signal

So the NN block should take a time-domain audio signal as input and output a new time-domain signal without the noise. Let's say that for the loss function I will use the SDR metric.
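
For reference, here is a minimal sketch of what I mean by that, assuming the scale-invariant SI-SDR variant (which many time-domain models train with) as the loss; the PyTorch choice, function name, and shapes are just my own, not from any particular paper:

```python
import torch

def si_sdr_loss(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative SI-SDR averaged over the batch; inputs have shape [batch, samples].

    Zero-meaning both signals makes the metric invariant to DC offset, and the
    projection onto the target makes it invariant to the output gain.
    """
    # Remove the mean so a DC offset in the output does not affect the loss
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)

    # Project the estimate onto the target: s_target = (<est, t> / ||t||^2) * t
    dot = torch.sum(estimate * target, dim=-1, keepdim=True)
    energy = torch.sum(target ** 2, dim=-1, keepdim=True) + eps
    s_target = dot / energy * target

    # Everything not explained by the scaled target counts as error
    e_noise = estimate - s_target

    ratio = torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps)
    si_sdr = 10 * torch.log10(ratio + eps)

    # Negate: higher SI-SDR is better, but a loss should be minimized
    return -si_sdr.mean()
```

Since this variant removes the mean and is invariant to output scale, I am not sure how much the normalization questions below even matter for this particular loss, which is part of why I am asking.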

My questions are:

  1. Should I train my NN to produce an output signal normalized to the [-1, 1] range? If yes, what is the most efficient approach: normalize the output of the neural network to [-1, 1] and then calculate the loss during training? Or just pass normalized X and Y to the NN and train it like that?
  2. What about the mean of the audio? Should I normalize the audio to [-1, 1] and expect a zero-mean output signal, or fix the output of the NN by subtracting the mean, like this: nn_output = nn_output - mean(nn_output)? Or train the NN on data normalized from 0 to 1? (A small sketch of these options is below.)
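
For concreteness, this is roughly what I mean by the alternatives in questions 1 and 2; the function names are just illustrative:

```python
import torch

def peak_normalize(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Option from question 1: scale each waveform to the [-1, 1] range."""
    peak = x.abs().max(dim=-1, keepdim=True).values
    return x / (peak + eps)

def remove_dc(x: torch.Tensor) -> torch.Tensor:
    """Option from question 2: force a zero-mean output by subtracting the mean."""
    return x - x.mean(dim=-1, keepdim=True)

# Variant A: normalize the inputs/targets before the NN and train on them as-is
# noisy, clean = peak_normalize(noisy), peak_normalize(clean)

# Variant B: leave the NN output unconstrained and fix it up before the loss
# estimate = remove_dc(peak_normalize(nn_output))
```

Any pointers on which of these variants is standard practice for time-domain denoising would be very helpful.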