Option for replacement char #2
Comments
One goal is to support all characters in Unicode, which would make replacements unnecessary. In practice, users will already almost never encounter unsupported characters. Currently the only missing characters are in the following blocks/scripts: CJK, CUNEIFORM, BAMUM_SUPPLEMENT, TANGUT, KHITAN_SMALL_SCRIPT, DUPLOYAN, BYZANTINE_MUSICAL_SYMBOLS, MUSICAL_SYMBOLS, SUTTON_SIGNWRITING. I am planning on adding support for many of these. I don't want to confuse users into thinking they have to worry about unsupported characters. Other cases, involving unassigned or special code points, are out of scope I think and should be handled elsewhere if necessary.
Ah, I see. I still think it could be useful, as you never know what strange characters might appear in a string, and it's better to replace than to strip IMO. But I didn't realize so much of Unicode was covered, that's great.
This would also be useful for different languages. For instance, my name Володимир in Ukrainian should be Volodymyr (not Volodimir), but the corresponding name Владимир in Russian should be Vladimir. Even the test is wrong: check("Володимир Горбулін", "Volodimir Gorbulin"); this is a Ukrainian name, since Russian doesn't have the letter і. So the right transliteration should be Volodymyr Gorbulin. (https://en.wikipedia.org/wiki/Volodymyr_Horbulin) PS. I am happy to contribute.
If you would like custom replacements for specific characters, I would suggest doing them yourself before calling anyascii:

    public static String transliterate(String s) {
        // Apply the language-specific overrides first.
        s = s.replace('Г', 'H').replace('и', 'y'); // etc.
        // Then let anyascii handle everything that is left.
        return AnyAscii.transliterate(s);
    }

However, Ukrainian, like most languages, requires context such as look-ahead for correct romanization and can't be fully supported by the simple model used by anyascii (context-free 1-to-1 replacements). You should use a separate language-specific method to romanize the Ukrainian Cyrillic and then call anyascii afterwards if you still need to:

    AnyAscii.transliterate(romanizeUkrainian(s))

I don't want to add custom replacement logic to anyascii because it can easily be done beforehand, and anyone who wants language-specific replacements is probably better off using a language-specific library.

The test cases do not check whether the result is perfect, just that it stays consistent with the examples given in the readme. The readme examples are there to highlight the limitations of anyascii. There's a 4th column in the table that compares the output to the correct romanization; you may need to scroll to see it.
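The same pre-processing idea can be sketched in Python. This is only an illustration under stated assumptions: `UKRAINIAN_OVERRIDES`, `TOY_BASE`, `toy_base_transliterate`, and `romanize_with_overrides` are hypothetical names, and the toy table merely stands in for the real library call. As the comment above points out, context-free overrides like these do not amount to a correct Ukrainian romanization.

```python
# Hypothetical per-character overrides, applied before the generic pass.
UKRAINIAN_OVERRIDES = str.maketrans({"Г": "H", "г": "h", "И": "Y", "и": "y"})

# Toy stand-in for a generic 1-to-1 transliteration table (NOT anyascii's data).
# It is built from the example pair quoted in this thread.
TOY_BASE = dict(zip("Володимир", "Volodimir"))


def toy_base_transliterate(text):
    """Map each non-ASCII character through TOY_BASE, dropping unknown ones."""
    return "".join(ch if ch.isascii() else TOY_BASE.get(ch, "") for ch in text)


def romanize_with_overrides(text, base=toy_base_transliterate):
    """Apply the overrides first, then hand the rest to the generic function."""
    return base(text.translate(UKRAINIAN_OVERRIDES))


print(toy_base_transliterate("Володимир"))   # Volodimir
print(romanize_with_overrides("Володимир"))  # Volodymyr
```

Passing the generic transliterator in as `base` keeps the override step independent of any particular library, so a real function could presumably be swapped in for the toy one.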
To get back to the original issue of characters that can't be transliterated: would it be possible to have an option to keep characters that can't be or aren't transliterated, rather than drop them? In my use case, I'm still in UTF but want transliteration. However, if something can't be or isn't transliterated, I want the original character kept (and not simply removed). The problem with simple removal is that there is no way to know a character wasn't transliterated (without checking the code tables).
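Until such an option exists, one workaround is to transliterate one character at a time and fall back to the original character when nothing comes back. The sketch below assumes, as described in this thread, that unsupported characters are dropped (an empty result); `transliterate_keeping_unmapped` and `toy_transliterate` are hypothetical names, and the toy function only stands in for a real transliterator.

```python
def transliterate_keeping_unmapped(text, transliterate):
    """Return (result, unmapped).

    Characters for which `transliterate` yields an empty string are kept
    unchanged in the result and also collected in `unmapped`.
    """
    pieces = []
    unmapped = []
    for ch in text:
        if ch.isascii():
            pieces.append(ch)  # ASCII passes through untouched
            continue
        mapped = transliterate(ch)
        if mapped:
            pieces.append(mapped)
        else:
            pieces.append(ch)  # keep the original character
            unmapped.append(ch)
    return "".join(pieces), unmapped


def toy_transliterate(text):
    """Toy stand-in: knows two characters and drops everything else."""
    table = {"\u00e9": "e", "\u00df": "ss"}
    return "".join(ch if ch.isascii() else table.get(ch, "") for ch in text)


result, unmapped = transliterate_keeping_unmapped("caf\u00e9 \U00012000", toy_transliterate)
print(result == "cafe \U00012000", len(unmapped))  # True 1
```

One caveat of this approach: a character that is intentionally mapped to an empty string would be indistinguishable from an unsupported one, so it would also be kept and reported.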
Instead of removing characters that can't be translated, it'd be nice to have an option to replace them with a character.
For some languages (like Python) this could be added as a new argument with a default value, like replace="". For others (like Go) this would have to be a new function.
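A minimal sketch of what the proposed Python argument might look like, written as a wrapper rather than as a change to the library. The names `transliterate_with_replacement` and `toy_transliterate` are hypothetical, the toy function only stands in for a real transliterator, and the sketch assumes unsupported characters come back as an empty string.

```python
def transliterate_with_replacement(text, transliterate, replace=""):
    """Transliterate `text`, substituting `replace` for unsupported characters.

    With the default replace="" the behaviour matches plain removal.
    """
    pieces = []
    for ch in text:
        if ch.isascii():
            pieces.append(ch)  # ASCII passes through untouched
            continue
        mapped = transliterate(ch)
        pieces.append(mapped if mapped else replace)
    return "".join(pieces)


def toy_transliterate(text):
    """Toy stand-in: knows one character and drops everything else."""
    table = {"\u00e9": "e"}
    return "".join(ch if ch.isascii() else table.get(ch, "") for ch in text)


print(transliterate_with_replacement("caf\u00e9 \U00012000", toy_transliterate))               # cafe 
print(transliterate_with_replacement("caf\u00e9 \U00012000", toy_transliterate, replace="?"))  # cafe ?
```

Keeping the default at an empty string means existing callers would see no change, which is the reason the issue suggests a default value in the first place.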