@VChang_WMF you have been testing Taiwanese as well, right?
Right now, the existing machine translation tools do not work on Taiwanese Hokkien: they treat it as Mandarin if it is written in Chinese characters, or as Vietnamese if it is written in POJ or the Taiwan MOE’s Tâi-lô system. Some Chinese readers who could not recognize Taiwanese Hokkien written in POJ used the embedded machine translation and concluded that it was Vietnamese.
I am glad that WMF is optimistic about machine translation technology, but for small or regional languages like Taiwanese Hokkien, it doesn’t work right now.
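To make the confusion concrete, here is a minimal Python sketch of a guesser that looks only at the writing system. It is an assumption for illustration, not how any real translation tool works, and the names `script_profile` and `naive_guess` are made up for this post. It shows that Han-character Hokkien lands in the same bucket as Mandarin, and POJ lands in the same bucket as Vietnamese.

```python
import unicodedata


def script_profile(text):
    """Count Han characters, Latin letters, and combining diacritics in text."""
    counts = {"han": 0, "latin": 0, "marks": 0}
    # NFD splits accented letters into a base letter plus combining marks.
    for ch in unicodedata.normalize("NFD", text):
        if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
            counts["han"] += 1
        elif unicodedata.category(ch) == "Mn":  # combining (non-spacing) mark
            counts["marks"] += 1
        elif ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN"):
            counts["latin"] += 1
    return counts


def naive_guess(text):
    """Hypothetical script-only guess; it cannot tell languages apart."""
    c = script_profile(text)
    if c["han"] > c["latin"]:
        return "han"  # a tool would likely assume Mandarin here
    if c["latin"] and c["marks"]:
        return "latin-diacritics"  # easily mistaken for Vietnamese
    if c["latin"]:
        return "latin"
    return "unknown"


# Sample strings: Hokkien vs. Mandarin in Han characters, POJ vs. Vietnamese.
print(naive_guess("我知影台灣話"), naive_guess("我知道台灣話"))
print(naive_guess("Góa chai-iáⁿ Tâi-oân-ōe"), naive_guess("Tôi biết tiếng Việt"))
```

Both pairs get the same label, so the script alone is not enough to identify the language; a tool would need language-specific vocabulary or training data.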
Would you like to contribute your findings to Supporting automatic translations of languages existing on wiki but not supported by google translate?
@Supaplex is there any tool or approach that comes to mind that could potentially address this issue? To-siā!
I am not a language expert, so I can only point you to the existing resources on Wikipedia. The more Hokkien-specific grammar a sentence uses, the more likely a machine translation that relies on Mandarin will go wrong. Likewise, the more Hokkien-only characters a sentence contains, or the more its usage differs from Mandarin, the more likely such a translation will go wrong.
Góa bo̍k-chêng chai-iáⁿ Tâi-oân ū chi̍t-tīn lâng teh chhòng Tâi-oân-ōe ê Common Voice, m̄-koh in chú-iàu chhái-iōng ê sī Hàn-jī kap Tâi-lô, án-ne khióng-kiaⁿ sī khah bô hāu-lut ê. In-ūi chhin-chhiūⁿ lí kóng ê, ko͘ Hàn-jī, ke-khì to̍h pháiⁿ liáu-kái ah, koh khah bián kóng sī Hàn-jī hām Lô-má-jī lām-ēng.
As far as I know, a group in the community here in Taiwan is already working on a Common Voice project for Taiwanese. However, they mainly work with text in Han characters and Tâi-lô romanization, which might not be efficient enough for a machine to understand the language. As you mentioned, sentences written only in Han characters are already hard for machines to process, not to mention sentences that mix Han characters and romanization (I’m not sure, but that seems even more complicated to me).
Siūⁿ-beh chhiáⁿ-mn̄g lán che ke-khì hoan-e̍k ê būn-tê, kám mā ū khan-siap tio̍h ISO gí-giân hoan-hō (ISO639-3 code). Nā sī ū, lán Tâi-oân-ōe ê hoan-hō khó-lêng tī chit 2 nî ē ū piàn-tōng. In-ūi kū-nî, ū lâng chiam-tùi Bân-lâm-gú ê “nan” hoan-hō, hiòng ISO639-3 úi-oân-hōe the̍h-chhut siong-koan ê sin-chhéng. Chhiáⁿ-khòaⁿ chit-ê liân-kiat.
I was wondering whether this machine translation issue is also related to the language code (ISO 639-3). If so, the code for Taigi (Taiwanese) might change within the next couple of years, because last year someone submitted a change request to the ISO 639-3 committee that aims to separate Taigi from the Minnan code (nan). Please see this link.