Language translation

I have been exploring NLP recently.
There's one project that I have been planning to do for a while, but I keep putting it off.
It's translating English to Chinese and Chinese to English.
The biggest challenge I am encountering is preprocessing Chinese characters. I have been looking around, but have been unable to find any information.
How do I preprocess and tokenize text in Chinese and similar languages?

We had a thread at:


Hi @Samu_2505, to convert Chinese characters to tokens you can use CEDICT (a Chinese-English dictionary). A CEDICT-based tokenizer looks for the longest word in the dictionary that matches the input. For more details, please refer to this document. Thank you.
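To make the longest-match idea concrete, here is a minimal Python sketch. It is only an illustration, not the implementation from the linked document: the function name `longest_match_tokenize` and the tiny `toy_vocab` set are made up for this example, and in practice the vocabulary would be built from the entries of a CEDICT dictionary file. Characters that are not covered by any dictionary word fall back to single-character tokens.

```python
def longest_match_tokenize(text, vocab):
    """Greedy forward longest-match segmentation (illustrative sketch)."""
    # Longest dictionary entry bounds how far ahead we need to look.
    max_len = max([len(word) for word in vocab] + [1])
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary; a real one would come from CEDICT entries.
toy_vocab = {"我", "喜欢", "自然", "语言", "自然语言", "处理", "自然语言处理"}

tokens = longest_match_tokenize("我喜欢自然语言处理", toy_vocab)
print(tokens)  # ['我', '喜欢', '自然语言处理']
```

Note that greedy matching always prefers the longest entry, so "自然语言处理" comes out as one token even though "自然", "语言" and "处理" are also in the vocabulary. That can be wrong for ambiguous sentences, which is one reason statistical segmenters are often used instead of a plain dictionary lookup.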