Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Unicode tables #745

Merged
merged 3 commits into from
Jan 3, 2023
Merged

Conversation

3dyd
Copy link
Contributor

@3dyd 3dyd commented Nov 28, 2022

Fixes #744

Old Unicode data files had this comment:

# It is ok to redistribute this file "solely for informational 
# purposes in the creation of products supporting the Unicode Standard".
# We don't nee to add a Boost License to this file: boostinspect:nolicense. 

Original terms of use no longer contain quoted text from this comment, so I did not modify new Unicode data files but just added boost-no-inspect file to ignore the entire directory with these files. Also added https://www.unicode.org/license.txt there as this seems to be necessary.

@djowel djowel requested a review from Kojoley November 29, 2022 00:29
@djowel
Copy link
Collaborator

djowel commented Nov 29, 2022

I'll defer to @Kojoley. He's doing the heavy-lifting of spirit maintenance.

@Kojoley
Copy link
Collaborator

Kojoley commented Nov 29, 2022

I'm actually not informed on the subject. One time I though about updating these, but had never found time to dig into.

@djowel is the script handles everything automatically? I see new blocks added, but I have no idea whether it really fixed #744 or not. and IIRC we don't have proper unicode tests so it is also unclear that everything ok after the update.

Low priority thing: do we really need these .txt files in the repository? I guess anyone who wants to update tables will download them anyway, so it would be better to replace them with a download script.

@3dyd
Copy link
Contributor Author

3dyd commented Nov 29, 2022

I have no idea whether it really fixed #744 or not.

I believe they do. I had tests in place, which failed after switching from internal table to x3::unicode::format for symbols 0x604, 0x605, 0x61C, 0x8E2, 0x180E, 0x2066-0x2069. After regenerating tables I verified if they fix #744. All tests passed.

@djowel
Copy link
Collaborator

djowel commented Nov 29, 2022

@djowel is the script handles everything automatically? I see new blocks added, but I have no idea whether it really fixed #744 or not. and IIRC we don't have proper unicode tests so it is also unclear that everything ok after the update.

Yes, they do. Maybe @3dyd might be interested to contribute some tests ;-)

Low priority thing: do we really need these .txt files in the repository? I guess anyone who wants to update tables will download them anyway, so it would be better to replace them with a download script.

Makes sense. I trust your judgment, @Kojoley

@Kojoley
Copy link
Collaborator

Kojoley commented Nov 29, 2022

It took me a while to connect a few things, and there is an issue. The create_tables.cpp is hardcoded to specific list of General Categories, Derived Properties and Scripts. I've checked and found that the first two didn't change (and only a subset is used in Spirit) but Scripts did change (there are additions and removals). IIUC these things are defined in PropertyValueAliases.txt which is not processed by create_tables.cpp and is not presented in the repository either.

Manually processed data I was able to identify is in:

  • create_tables.cpp
  • boost/spirit/home/support/char_encoding/unicode/query.hpp
  • boost/spirit/home/support/char_encoding/unicode.hpp
  • boost/spirit/home/support/char_class.hpp
  • boost/spirit/home/support/common_terminals.hpp
  • boost/spirit/home/x3/char/unicode.hpp

@djowel
Copy link
Collaborator

djowel commented Nov 30, 2022

It's been a decade now (I think) since those were added. It's good to have someone have a fresh look at it. @3dyd it would be really nice if you can contribute some tests, esp. since you have a keen interest on spirit unicode support.

@3dyd
Copy link
Contributor Author

3dyd commented Dec 2, 2022

@djowel, do you say about unit tests, e.g. one positive and one negative test for all main, sub, derived categories, and several scripts? Or some more comprehensive, but not automated testing, to be done just on my end?

I guess anyone who wants to update tables will download them anyway, so it would be better to replace them with a download script.

@Kojoley, this still leaves manual work (compile and run create_tables, move generated header files, manually update scripts). Wouldn't it be enough then to just add "readme" file with complete description where to get data files and what to do with them? Or maybe go in opposite direction, i.e. not only download files, but also compile and run create_tables and move generated files to proper location?

Would you be ok with Scripts updated manually? Or some sort of automation is expected here?

@djowel
Copy link
Collaborator

djowel commented Dec 3, 2022

@djowel, do you say about unit tests, e.g. one positive and one negative test for all main, sub, derived categories, and several scripts? Or some more comprehensive, but not automated testing, to be done just on my end?

Let's start with something more basic. I think one positive and one negative test makes sense for now.

@3dyd
Copy link
Contributor Author

3dyd commented Dec 5, 2022

Tests revealed two issues, both were fixed:

  • All code points not explicitly listed in UnicodeData.txt belong to general category "Unassigned". In generated table such code points were defaulted to zero, which stands for "Uppercase_Letter".

  • All code points not explicitly listed in Scripts.txt belong to script "Unknown". In generated table they were defaulted to zero, i.e. to first script in the list (this was Arabic).

@djowel
Copy link
Collaborator

djowel commented Jan 3, 2023

LGTM. Good to merge?

@djowel djowel merged commit 2f65e6e into boostorg:develop Jan 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

x3::unicode::format does not recognize certain format control characters
3 participants