NLP Translation Between Alphabets

I’m trying to create a translation model, but the two languages have completely different alphabets, and are written in opposite directions. Thus, the embedding of one language will not help for the other. Any ideas on how to do this?

You can find an example of a character-level translation model for English and French here: Character-level recurrent sequence-to-sequence model
Probably, if you reverse the order of characters in the sentences of the language that is written right to left, you should be able to use this code without other significant changes.
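A minimal sketch of that preprocessing idea (the Hebrew sample sentence is just an illustrative stand-in for any right-to-left language):

```python
def reverse_chars(sentences):
    # Reverse the character order of each sentence so the RTL language
    # is fed to the model in the same logical direction as the LTR one.
    # Note: the naive [::-1] slice reverses Unicode code points, which
    # can split combining marks or other multi-codepoint graphemes.
    return [s[::-1] for s in sentences]

rtl_sentences = ["שלום עולם"]  # Hebrew: "hello world"
print(reverse_chars(rtl_sentences))
```

You would apply this to the RTL side only, before building the vocabulary, and reverse the model's output again at inference time to recover readable text.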


Just to add to Ekaterina’s answer, this tutorial might also be helpful:

@markdaoust can you share any insight here?


+1, and AFAIK text is usually encoded “start to end”, not “left to right” or “right to left”. If you use an RNN, be sure you’re running it in the direction you think you’re running it in.

But also note that transformer layers only see token order at all thanks to the positional encoding.
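To make that concrete, here is a sketch of the standard sinusoidal positional encoding from “Attention Is All You Need” (NumPy, not tied to any particular framework); without adding something like this to the embeddings, self-attention is permutation-invariant and has no notion of order or writing direction:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]    # shape (max_len, 1)
    i = np.arange(d_model)[None, :]      # shape (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (i // 2)) / d_model)
    angles = pos * angle_rates
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions
    return pe

pe = positional_encoding(50, 16)  # one encoding vector per position
```

These vectors are simply added to the token embeddings, so reversing the input sequence changes which encoding each token receives, and that is the only way the model learns direction.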

That’s not a big problem. Gus is linking to the right tutorial there. Even though the two languages in that tutorial use the same alphabet, it still builds a separate tokenizer and embedding for each language. I tried doing this with a single tokenizer, and it clearly performed worse (though that run may have used a shared embedding as well).
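A toy sketch of that setup, with a separate vocabulary and embedding table per language (the Hebrew strings are just placeholder RTL text, and the random tables stand in for learned embedding layers):

```python
import numpy as np

def build_vocab(sentences):
    # Character-level vocabulary; id 0 is reserved for padding.
    chars = sorted({c for s in sentences for c in s})
    return {c: i + 1 for i, c in enumerate(chars)}

src_sentences = ["hello", "world"]
tgt_sentences = ["שלום", "עולם"]  # stand-in right-to-left text

# Each language gets its OWN vocabulary, so the two alphabets
# never share token ids.
src_vocab = build_vocab(src_sentences)
tgt_vocab = build_vocab(tgt_sentences)

rng = np.random.default_rng(0)
d_model = 8
# Independent embedding matrices, one per language (+1 row for padding).
src_embedding = rng.normal(size=(len(src_vocab) + 1, d_model))
tgt_embedding = rng.normal(size=(len(tgt_vocab) + 1, d_model))

def embed(sentence, vocab, table):
    ids = [vocab[c] for c in sentence]
    return table[ids]  # shape (len(sentence), d_model)

print(embed("hello", src_vocab, src_embedding).shape)
```

In a real model the two tables would be trainable embedding layers (one on the encoder side, one on the decoder side), but the point is the same: nothing forces the two writing systems into one shared id space.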

See this tutorial for a walkthrough of how those tokenizers were created:
