Option for replacement char #2

makew0rld · 2021-03-26T20:52:16Z

Instead of removing characters that can't be translated, it'd be nice to have an option to replace them with a character.

For some languages (like Python) this could be added as a new argument with a default value, like replace="". For others (like Go) this would have to be a new function.
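
For illustration, here is a minimal sketch in Java of how such an option could be layered on top of a per-character transliteration call. The class name, the method signature, and the `perCodePoint` parameter are hypothetical and not part of anyascii; `perCodePoint` stands in for whatever per-code-point lookup an implementation exposes, and it is assumed to return an empty string for characters it can't translate.

```java
import java.util.function.IntFunction;

// Hypothetical wrapper, not part of anyascii.
final class ReplacingTransliterator {

    /**
     * Transliterates s one code point at a time. ASCII is copied through unchanged.
     * Any non-ASCII code point for which perCodePoint returns null or "" is
     * replaced with the given replacement string instead of being dropped.
     */
    static String transliterate(String s, String replacement, IntFunction<String> perCodePoint) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp); // already ASCII
                return;
            }
            String ascii = perCodePoint.apply(cp);
            boolean missing = ascii == null || ascii.isEmpty();
            out.append(missing ? replacement : ascii);
        });
        return out.toString();
    }
}
```

Passing an empty `replacement` reproduces the current strip-on-failure behaviour, which is why an empty default would be backwards compatible.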

hunterwb · 2021-03-28T20:30:56Z

A goal is to support all characters in Unicode, which would make replacements unnecessary. It is already practically the case that users will never encounter unsupported characters. Currently the only missing characters are in the following blocks/scripts: CJK, CUNEIFORM, BAMUM_SUPPLEMENT, TANGUT, KHITAN_SMALL_SCRIPT, DUPLOYAN, BYZANTINE_MUSICAL_SYMBOLS, MUSICAL_SYMBOLS, SUTTON_SIGNWRITING. I am planning on adding support for many of these. I don't want to confuse users into thinking they have to worry about unsupported characters. Other cases involving unassigned or Special code points I think are out of scope and should be handled elsewhere if necessary.

makew0rld · 2021-03-28T22:07:32Z

Ah, I see. I still think it could be useful, as you never know what strange characters might appear in a string, and it's better to replace than to strip IMO. But I didn't realize so much of Unicode was covered, that's great.

vovikdrg · 2021-06-08T06:31:09Z

This would also be useful for different languages. For instance, my name Володимир in Ukrainian should be Volodymyr (not Volodimir), but the same name Владимир in Russian should be Vladimir.

Even the test is wrong: check("Володимир Горбулін", "Volodimir Gorbulin"); this is a Ukrainian name, since Russian doesn't have the letter і. So the right transliteration should be Volodymyr Gorbulin. (https://en.wikipedia.org/wiki/Volodymyr_Horbulin)

PS. I am happy to contribute

hunterwb · 2021-08-08T06:37:34Z

If you would like custom replacements for specific characters I would suggest doing them yourself before calling anyascii.

public static String transliterate(String s) {
    s = s.replace('Г', 'H').replace('и', 'y'); // etc
    return AnyAscii.transliterate(s);
}

However, Ukrainian, like most languages, requires context such as look-ahead for correct romanization and can't be fully supported by the simple model used by anyascii (context-free 1-to-1 replacements). You should use a separate language-specific method to romanize the Ukrainian Cyrillic and then call anyascii afterwards if you still need to.

AnyAscii.transliterate(romanizeUkrainian(s))

I don't want to add custom replacement logic to anyascii because it can easily be done beforehand, and anyone who wants language-specific replacements is probably better off using a language-specific library.

The test cases are not checking whether the result is perfect, just that it stays consistent with the examples given in the readme. The readme examples are for highlighting the limitations of anyascii. There's a 4th column in the table that compares it to the correct romanization; you may need to scroll to see it.

stephenwilcoxon · 2023-09-19T18:23:14Z

To get back to the original issue of characters that can't be translated: would it be possible to have an option to keep characters that can't be or aren't translated rather than drop them? In my use case, I'm still in UTF but want transliteration. However, if something can't be or isn't transliterated, I want the original character kept (and not simply removed). The problem with simple removal is that there is no way to know a character wasn't transliterated (without checking the code tables).
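
Roughly what I have in mind, as a sketch in Java rather than a proposal for the actual API. Everything named here is hypothetical: `KeepingTransliterator`, `Result`, and the `perCodePoint` parameter are not part of anyascii. `perCodePoint` stands in for a per-code-point lookup and is assumed to return an empty string when it has no mapping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical helper, not part of anyascii.
final class KeepingTransliterator {

    /** Output text plus the code points that were kept because no mapping was found. */
    static final class Result {
        final String text;
        final List<Integer> untranslated;

        Result(String text, List<Integer> untranslated) {
            this.text = text;
            this.untranslated = untranslated;
        }
    }

    static Result transliterate(String s, IntFunction<String> perCodePoint) {
        StringBuilder out = new StringBuilder(s.length());
        List<Integer> untranslated = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // step over surrogate pairs as one unit
            if (cp < 0x80) {
                out.appendCodePoint(cp); // already ASCII
                continue;
            }
            String ascii = perCodePoint.apply(cp);
            if (ascii == null || ascii.isEmpty()) {
                out.appendCodePoint(cp); // keep the original character
                untranslated.add(cp);    // and record that it was not translated
            } else {
                out.append(ascii);
            }
        }
        return new Result(out.toString(), untranslated);
    }
}
```

The `untranslated` list is the part that matters for my use case: the caller can see which characters passed through unchanged without consulting the code tables.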
