Fine-tuning a speech-to-text model

I have a large dataset of 25-45 second audio files with transcriptions in a low-resource language (it shares some vocabulary with English), and I want to fine-tune an existing model on my own data. All the tutorials I find use Common Voice, and adapting them to my use case isn’t straightforward. This is the tutorial I tried following; it uses torchaudio, but I would prefer TensorFlow: Fine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers

I was wondering if there are any references I could follow. My dataset has just two columns: a ‘filename’ column containing the audio file name (I prepend the directory path when I load the file) and a ‘sentence’ column containing the transcription. The audio is in MP3 format.
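
For reference, the layout described above could be loaded with something like the sketch below. It uses only the Python standard library and assumes the two columns live in a CSV file with a header row; the function name `load_manifest` and the `audio_dir` argument are placeholders for illustration, not part of any tutorial.

```python
import csv
import os


def load_manifest(csv_path, audio_dir):
    """Read a two-column CSV and return a list of (audio_path, sentence) pairs.

    Assumes a header row with 'filename' and 'sentence' columns
    (the column names from the description above).
    """
    pairs = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Prepend the audio directory to the bare file name.
            audio_path = os.path.join(audio_dir, row["filename"])
            sentence = row["sentence"].strip()
            if sentence:  # skip rows with an empty transcription
                pairs.append((audio_path, sentence))
    return pairs
```

The resulting list of pairs is framework-agnostic, so it can be handed to whichever input pipeline ends up being used. One thing to keep in mind for a TensorFlow pipeline: `tf.audio.decode_wav` only reads WAV, so the MP3 files would probably need to be converted first or decoded with a separate library.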

I’m really stuck on how to proceed from here.

The Simple audio recognition tutorial can be a good reference for getting started with speech-to-text in TensorFlow.