How can I classify the language of voice data?

I have recordings of myself speaking English, Japanese, French, Italian, and Russian. I want to train a model on these recordings so that it can classify which language is being spoken in new voice data. What kind of preprocessing, feature extraction, and model selection should I use?
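
To make the three stages concrete, here is a toy sketch of the pipeline shape I am asking about: preprocessing, feature extraction, then a classifier. It is only an illustration under loud assumptions. The features (mean log-energy and zero-crossing rate) and the nearest-centroid classifier are placeholders chosen to keep the example dependency-free, and they are almost certainly too weak to separate real languages. All function and class names are hypothetical, and audio is assumed to be already loaded as a list of float samples.

```python
import math
from collections import defaultdict

def preprocess(samples):
    """Peak-normalize so loudness differences between recordings matter less."""
    peak = max((abs(s) for s in samples), default=0.0)
    return [s / peak for s in samples] if peak > 0 else list(samples)

def frames(samples, size=400, hop=160):
    """Split samples into overlapping frames (assumed 25 ms / 10 ms at 16 kHz)."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, hop)]

def features(samples):
    """Clip-level vector: [mean log-energy, mean zero-crossing rate].

    Placeholder features only; they stand in for whatever a real
    system would extract from each frame.
    """
    fs = frames(preprocess(samples))
    if not fs:
        raise ValueError("clip is shorter than one frame")
    log_energy = [math.log(sum(s * s for s in f) / len(f) + 1e-10) for f in fs]
    zcr = [
        sum(1 for a, b in zip(f, f[1:]) if (a < 0) != (b < 0)) / (len(f) - 1)
        for f in fs
    ]
    return [sum(log_energy) / len(fs), sum(zcr) / len(fs)]

class NearestCentroid:
    """Placeholder model: predicts the label whose mean feature vector is closest."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for x, label in zip(X, y):
            groups[label].append(x)
        # Average each feature dimension over all clips with the same label.
        self.centroids = {
            label: [sum(col) / len(col) for col in zip(*vecs)]
            for label, vecs in groups.items()
        }
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda lab: math.dist(x, self.centroids[lab]))
```

In use, each recording would go through `features(...)`, the resulting vectors and language labels would go to `NearestCentroid().fit(X, y)`, and a new clip would be classified with `predict(features(new_clip))`. My question is what should replace each of these placeholder stages.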