Need help with Audio Detection

Brandon_Myers · March 6, 2024, 6:14am

Hello, I am finishing a B.S. in Digital Forensics and am working on a research project on deepfake audio detection. The scope of the project is whether it is reasonable for an individual using a personal device to quantify audio detection using ML algorithms without outsourcing hardware.
From my understanding, I would need to classify both real and fake audio samples using something such as a CNN, but I am not sure. (I feel as though the conversions from audio to image to data would result in loss and possibly saturate the training with noise.) I’ve gotten the main Keras tutorials to run on my personal PC but am having trouble converting a local dataset of .wav files to spectrograms in order to classify the waveforms.
Regretfully, I am not sure where to go after this that doesn’t lead off scope.
(Generating ML audio for the deepfake training is an entirely new project in itself and quite possibly unethical.)
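
For the .wav to spectrogram step, one option is to skip the image round trip and keep everything as arrays. The following is a minimal sketch, assuming 16-bit PCM WAV files and using only Python's standard `wave` module plus NumPy. The function names `load_wav_mono` and `log_spectrogram`, the frame length of 512, and the hop of 256 are illustrative choices, not taken from any of the tutorials mentioned here.

```python
import wave
import numpy as np


def load_wav_mono(path):
    """Read a 16-bit PCM WAV file and return (samples in [-1, 1], sample_rate)."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wf.getframerate()
        channels = wf.getnchannels()
        raw = wf.readframes(wf.getnframes())
    data = np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0
    if channels > 1:
        # Average interleaved channels down to mono.
        data = data.reshape(-1, channels).mean(axis=1)
    return data, rate


def log_spectrogram(samples, frame_len=512, hop=256):
    """Return a (num_frames, frame_len // 2 + 1) log-magnitude spectrogram."""
    if len(samples) < frame_len:
        # Zero-pad very short clips so at least one frame exists.
        samples = np.pad(samples, (0, frame_len - len(samples)))
    num_frames = 1 + (len(samples) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([
        samples[i * hop: i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Small offset avoids log(0) on silent frames.
    return np.log(magnitude + 1e-6)
```

The result is a plain 2-D float array (time frames by frequency bins) that can be fed to a model directly, so nothing has to be rendered to an image and read back.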

tagoma · March 6, 2024, 8:57pm

Hi @Brandon_Myers
It is not clear to me what you have tried so far and what data you are using.
Apologies if the following comment is dumb because, yes, obviously, you already did it, but did you browse through past Kaggle challenges and datasets? There are tons of relevant insights you can get from reading through their stuff.

Brandon_Myers · March 7, 2024, 2:46pm

Thank you. From Kaggle, I attempted to use the Deep Audio Classification notebook with the Capuchin bird dataset as a test and found that some of the dependencies of the modules used in the provided Jupyter Notebook were not easily accessible or not available.
I have also made various attempts with the AudioMNIST dataset from git. This dataset is also on Kaggle, although I just noticed there are more imports in the documentation for the Kaggle notebook than in the git repo, and I will be trying it later.
These models are useful because the bird recordings/0-9 voice recordings can be replaced/reorganized with voice data in the two categories, though that is where I am still unclear about the classification of deepfaked voice recordings.

For example:
If there are 10,000 voice recordings of real people
Then do I also need a generated audio sample equivalent of each for a proper classification model?
If so, does deepfaking those (public domain) samples make the project unethical if I did not collect them myself as originals?
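
On the data layout side of this question: a two-class classifier is trained on labeled examples of each class, and it does not by itself require a generated counterpart for every individual real clip. A minimal sketch of organizing such a dataset, assuming a hypothetical folder layout with `real/` and `fake/` subfolders of .wav files (the folder names, function names, and the 20% test fraction are all illustrative):

```python
import random
from pathlib import Path


def list_labeled_wavs(root):
    """Return [(path, label)] with label 0 for real/ and 1 for fake/ (hypothetical layout)."""
    root = Path(root)
    items = []
    for label, folder in enumerate(["real", "fake"]):
        for path in sorted((root / folder).glob("*.wav")):
            items.append((path, label))
    return items


def stratified_split(items, test_fraction=0.2, seed=0):
    """Split per class so train and test keep roughly the same real/fake ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted({lbl for _, lbl in items}):
        group = [item for item in items if item[1] == label]
        rng.shuffle(group)
        # Hold out at least one clip per class.
        n_test = max(1, int(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

Splitting per class keeps the held-out set from ending up with only one kind of clip when the two folders are different sizes.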

I have tried to build the various TensorFlow models on my personal device and cannot get tensorflow-io to work properly, as the TensorRT dependency does not function correctly on what I’m assuming is incompatible hardware on my device.

Asus TUF Gaming Laptop FX705DT AMD Ryzen 7 3750h
Nvidia GTX 1650 4GB
512 SSD + 2TB m.2
32GB RAM
Dual boot Ubuntu 22.04 / Windows 11 Pro (WSL2 activated running Ubuntu)
Additionally, I have VS/VSCode and a multitude of Anaconda environments, all with various dependencies I was able to find in the conda-forge sources.

The entire project is based around the idea of “can this be done with base-level CPUs, and if so, what is the quality of the detection success using minimal datasets?”
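
One way to put numbers on that question is to start from a very small CPU-only baseline and record both training time and held-out accuracy. The sketch below is only that: a logistic regression written in plain NumPy, run on synthetic stand-in features (two Gaussian blobs), not on real or fake audio. In an actual experiment the feature rows would come from the clips, for example a time-averaged log-spectrogram per clip; the function names, the 16 features, the epoch count, and the learning rate are assumptions for illustration.

```python
import time
import numpy as np


def train_logreg(x, y, epochs=200, lr=0.1):
    """Fit logistic regression by full-batch gradient descent; returns (weights, bias)."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probability of class 1
        grad = p - y
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def accuracy(x, y, w, b):
    """Fraction of rows whose predicted label (1 = fake) matches y."""
    return float(np.mean(((x @ w + b) > 0).astype(int) == y))


# Synthetic stand-in features, NOT real audio: two Gaussian blobs so the script runs on its own.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-1.0, 1.0, (200, 16)), rng.normal(1.0, 1.0, (200, 16))])
y = np.concatenate([np.zeros(200), np.ones(200)])
order = rng.permutation(len(y))
x, y = x[order], y[order]

start = time.perf_counter()
w, b = train_logreg(x[:300], y[:300])
elapsed = time.perf_counter() - start
print(f"train time: {elapsed:.3f}s, held-out accuracy: {accuracy(x[300:], y[300:], w, b):.2f}")
```

The accuracy printed here says nothing about deepfake detection, since the data is synthetic. The point is the measurement pattern: the same timing and held-out accuracy can be reported for each model and dataset size tried on the laptop CPU.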