Hello everyone, I’m currently building a neural network to predict whether a speaker sounds male or female. I have a dataset of about 150,000 five-second voice clips, all labelled for gender, with androgynous-sounding voices removed. The networks I’ve built have been pretty good so far, generally around 96% accuracy. However, I’ve also noticed a strange phenomenon: very similar recordings getting very different predictions. AnnaFriel_4 might come out as 100% female while AnnaFriel_5 comes out as 20% female, even though their features look practically indistinguishable (one clip has a pitch of 140 Hz, the other 141 Hz). Both recordings sound completely female too; it’s not as if she’s putting on a manly-sounding voice in the wrongly predicted clip.
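For context, this is roughly the sanity check I did to convince myself the two clips really are near-identical in feature space (a sketch with made-up feature names and values; my real features come from my extraction pipeline):

```python
import numpy as np

def feature_difference(clip_a, clip_b):
    """Return the per-feature absolute differences and the overall
    L2 distance between two feature vectors."""
    a = np.asarray(clip_a, dtype=float)
    b = np.asarray(clip_b, dtype=float)
    return np.abs(a - b), np.linalg.norm(a - b)

# Hypothetical feature vectors, e.g. [pitch_hz, formant1_hz, formant2_hz]
anna_4 = [140.0, 550.0, 1650.0]
anna_5 = [141.0, 548.0, 1655.0]

diffs, dist = feature_difference(anna_4, anna_5)
print(diffs, dist)
```

The per-feature gaps are tiny relative to the male/female ranges, which is why the 100% vs 20% split surprised me.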
Is this the result of an overly complicated model? Currently my setup is just the following, nothing that seems too elaborate (as far as I can tell):
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

model = Sequential()
model.add(Dense(1000, activation='relu', input_shape=(n_cols,)))
# Add the output layer
model.add(Dense(2, activation='softmax'))

sgd = SGD(learning_rate=0.0025)
# Compile the model
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
```
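One probe I’m considering, in case it helps anyone diagnose this: nudge each input feature slightly and measure how far the softmax output moves. If a 1 Hz pitch change can flip the prediction, the model is very sensitive to small perturbations. This is a sketch, not something I’ve validated against my real model (`eps` and the feature layout are assumptions):

```python
import numpy as np

def prediction_sensitivity(model, x, eps=0.01):
    """Perturb each feature of x by a small relative amount eps and
    return the largest resulting change in predicted probabilities."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    base = model.predict(x, verbose=0)
    max_shift = 0.0
    for i in range(x.shape[1]):
        x_pert = x.copy()
        x_pert[0, i] *= (1.0 + eps)  # e.g. 140 Hz pitch -> 141.4 Hz
        shifted = model.predict(x_pert, verbose=0)
        max_shift = max(max_shift, float(np.max(np.abs(shifted - base))))
    return max_shift
```

A well-behaved model should give a small number here; a value anywhere near 0.8 (the 100% vs 20% gap I’m seeing) would point at the model rather than the data.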
Or is this likely to be something else? Gaps in the dataset, maybe? I wondered if somebody more experienced in this had a better idea…