Hi! I'm trying to convert two models from TensorSpeech/TensorflowTTS into TensorFlow.js, but the expected input dimensions seem to be fixed to a specific number instead of being allowed to vary. For example, I expect the input shape for my mel spectrogram generator to be [-1, -1], but tensorflowjs_wizard converts it to [-1, 10], which only lets me input exactly 10 phonemes. The spectrogram generator's output is also a different size from what the conversion wizard makes the vocoder model accept. Is there a setting in the wizard or in TensorFlow.js that I am overlooking?
Are you able to share the input SavedModel with us so we can check what may be happening here? If you could share it via Google Drive or similar, I can ask our team to look into what may be causing this. If you cannot share publicly via a link here, let me know and I can drop you a message on LinkedIn/Twitter with my work email so you can share it privately, if that works better for you for the purpose of debugging this. Thanks!
Sure! Thanks a ton for helping
Here in this folder are the original SavedModel and the TensorFlow.js model converted with tensorflowjs_wizard. It contains the architecture information, but for reference, it is an instance of FastSpeech2 from TensorflowTTS.
Hey, I got it working, so here's the demo: TensorflowTTS in TensorFlow.js
Unfortunately, it seems to run extremely slowly, even after quantizing it to a much smaller size. I've already tried all the different backends (it is not compatible with WASM) and had no luck making it faster than WebGL. I don't know very much about FastSpeech2, MB-MelGAN, or TensorFlow, so I'm not sure there's anything I can do to speed it up. I guess my dream of turning this into a useful browser extension won't work.
Latency aside, this is so cool! I have some questions: Can you customize the voice easily for this, e.g. can I make it sound like me? I managed to clone my voice once with a third-party service, and I'm curious whether it would now be possible via your conversion too.
Even though it takes some time (about 8 seconds on my old laptop), it is still a really interesting demo. I will try it later on my desktop, which is more modern and has a dedicated GPU, to see if that improves things. I will also share it with our team to see if they have any ideas on optimization here.
For the WASM compatibility, is it a missing-op issue?
@Taylor I have tried your demo, it is pretty amazing.
The performance can be better. Since your model takes variable-length input, the backend needs to recompile the op shaders on each inference whenever the input shapes change.
You can see that for the exact same input string, the first inference can take up to 6 seconds, while the following inferences take only about 600 ms.
We also have a flag you can try that parameterizes the shapes in the op shaders; it should significantly improve the first inference. Can you try adding the following flag to your demo?
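The flag in question is named later in this thread as WEBGL_USE_SHAPES_UNIFORMS. A minimal sketch of setting it before loading the model, assuming a recent @tensorflow/tfjs (exact behavior depends on the version):

```javascript
import * as tf from '@tensorflow/tfjs';

// Pass shapes to shaders as uniforms instead of baking them into the
// shader source, so compiled WebGL programs can be reused when input
// shapes change. The flag name is taken from later in this thread.
tf.env().set('WEBGL_USE_SHAPES_UNIFORMS', true);

await tf.setBackend('webgl');
await tf.ready();
```

Note this is an environment/config fragment: it must run before the first inference so the flag takes effect when the shaders are first compiled.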
Wow! As a newbie to TensorFlow (and to ML in general) this is a reaction I did not expect!
TF.JS show and tell seems really cool. I’d be honored to be on! But, now I must ask you, you know this is a port of someone else’s model, right? I feel like I did only one tenth of the work they did to get this model working. That said, and motivated by the slowness of the previous model, I have been studying up on ML with plans to create a native TensorFlow.js model of FastSpeech2! Perhaps we could talk about that.
In terms of cloning your voice with this: while this architecture is not designed to facilitate that, I wouldn't rule it out completely. As I understand it, FastSpeech2 trains significantly faster than auto-regressive models such as Tacotron 2 (and, I believe, with less data, but I'm not sure). A noisy but somewhat acceptable voice is possible from Tacotron 2 with only 10 minutes of training data, so I'm sure FastSpeech2 could do just as well, if not better! I wouldn't bet too much on it being fast enough in the browser, though, since the whole model would need to be trained at once.
The WASM compatibility error seems to be a missing implementation of Softplus, an activation function:
Error: Kernel 'Softplus' not registered for backend 'wasm'
Wow, if that's true then my extension idea might really work with this model! Unfortunately, I tried it both on a lower-end (512 MB VRAM) and a higher-end (6 GB VRAM) machine, and this flag made no difference. Perhaps I am using it wrong, but I could see with tf.getBackend() that the backend was set to 'webgl', and tf.env().get('WEBGL_USE_SHAPES_UNIFORMS') returns true. I have tried with and without tf.enableProdMode().
Sure, we have had people talk about their conversion experiences in the past too, especially when they have built something with the resulting model or done optimizations, which I figured you had done given the discussion above. Not to mention this is the first time I have seen a successful TTS conversion. It's always exciting to see how others may take and use such a conversion in their work.
Maybe we can wait until you have done some optimization, re-made a TFJS-native version, or used it for something novel in the browser - e.g. a demo with a text chat that reads messages aloud, or an accessibility use case that reads highlighted text aloud as part of a Chrome extension. Ideally it would solve a problem for many people, and that is always good content to talk about.
No rush here so just let me know when you feel you have made some progress and we can have a chat to see where you are at.
Feel free to drop me a direct message on the forum, or via my other social media if you are following me there, when you feel it is in a good state and you can talk about your learnings.
Knowledge sharing is the key part of this show, and it inspires others to convert more models so the JS community can take them and use them in ways others have never dreamt of! JS engineers are a very creative bunch of people.
Ping is out of office today, but he will hopefully get back to you next week about your usage of his suggestion.