Fixed length unmarshaller - Make it possible to customize calculating of record length #159
Comments
Are you saying you'd like to compute the number of bytes instead of the number of characters?

import java.nio.charset.StandardCharsets;

class UtfTest {
    public static void main(String[] args) {
        String str = "hello 世界";
        System.out.println(str.length());                                // 8  characters
        System.out.println(str.getBytes(StandardCharsets.UTF_8).length); // 12 bytes in UTF-8
    }
}

Why do you think the current behavior is incorrect?
Hi @bjansen. In Japan (and in Asia generally) it is common to treat a double-byte character as having a length of 2, while a single-byte character counts as 1. I found a similar discussion in the Google group from about 8 years ago. I hope this request is understandable.
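To illustrate the counting rule @hfuruich describes, here is a small sketch (not BeanIO code; the helper name and the choice of Shift_JIS are assumptions for the example): a character counts as 2 when it needs more than one byte in the record's charset, otherwise as 1.

import java.nio.charset.Charset;

class DoubleByteLength {
    // Hypothetical helper: count a character as 2 if it encodes to more than
    // one byte in the given charset, otherwise as 1.
    static int displayLength(String s, Charset charset) {
        int length = 0;
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            String ch = new String(Character.toChars(codePoint));
            length += ch.getBytes(charset).length > 1 ? 2 : 1;
            i += Character.charCount(codePoint);
        }
        return length;
    }

    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");
        System.out.println(displayLength("hello 世界", sjis)); // 10 (6 single-byte + 2 * 2)
        System.out.println("hello 世界".length());              // 8
    }
}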
Thanks @bjansen for jumping in here. To understand this better, we are in the process of putting together a real-world example, to make sure that what we discuss and potentially improve in BeanIO is on the right track.
@hfuruich thanks for the explanation, I think I understand the problem. I'm intrigued, though: how do you know how many bytes a given character takes? Do you need a UTF-8 table on hand, or do you assume that characters in the Hiragana block, for example, are always 3 bytes long? For an actual solution, I can think of several possibilities, for example a parser property in the mapping file:
<parser class="org.beanio.stream.fixedlength.FixedLengthRecordParserFactory">
  <property name="countMode" value="chars"/>
</parser>
What kind of granularity would be needed? Would the last suggestion be enough?
Those are some really good suggestions. I like all of them (sorry for that), but having it in the mapping file makes it easy for non-developers to specify. Java code is needed when you must do it via Java or have some special code that controls this. And the global option makes it easy to set instead of having to change a lot of mapping files.
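For the global option, BeanIO can, as far as I recall, read default settings from a beanio.properties file on the classpath, so a global switch could hypothetically look like the line below (the key name is invented for illustration; no such setting exists today):

# beanio.properties -- hypothetical key, not an existing BeanIO setting
org.beanio.fixedlength.countMode=bytes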
@bjansen thank you so much for your excellent idea. The number of bytes a character takes depends on the encoding the user uses, so the following code will count the expected length.
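Presumably something along these lines, counting bytes against an explicit charset instead of relying on String.length() (a sketch under that assumption, not the exact code from the thread):

import java.nio.charset.Charset;

class EncodingAwareLength {
    public static void main(String[] args) {
        String str = "hello 世界";
        Charset sjis = Charset.forName("Shift_JIS");
        System.out.println(str.length());              // 8  characters
        System.out.println(str.getBytes(sjis).length); // 10 bytes in Shift_JIS (each kanji is 2 bytes)
    }
}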
It looks like "<stream>" has an attribute named "encoding". Would it be possible to use this "encoding" attribute to calculate a character's byte length? If so, your excellent idea covers the users' scenario.
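Putting the two ideas together, a mapping file might hypothetically look like the snippet below. Both the countMode property (proposed above) and using the stream's encoding to drive the length calculation are proposals in this thread, not existing BeanIO behavior:

<stream name="employees" format="fixedlength" encoding="Shift_JIS">
  <parser class="org.beanio.stream.fixedlength.FixedLengthRecordParserFactory">
    <property name="countMode" value="bytes"/> <!-- hypothetical property -->
  </parser>
  <!-- record and field definitions omitted -->
</stream>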
As @davsclaus says, all of these cover most real-world use cases. I like all of your ideas too.
Hello.
This is a rather confusing use case, but please consider it as well.
When BeanIO unmarshals some data (like a line of text), the mapping to fields is based on the string length. However, in Asian countries single-byte and double-byte characters can be mixed in a string, which causes the length to be miscalculated.
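For example, assuming a made-up record layout of a 10-byte NAME field followed by a 4-byte CODE field in a Shift_JIS encoded file, slicing by characters and slicing by bytes give different results:

import java.nio.charset.Charset;

class LengthMismatch {
    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");
        // Hypothetical layout: NAME = first 10 bytes, CODE = next 4 bytes
        String record = "田中太郎  A001"; // "田中太郎" is 8 bytes in Shift_JIS, padded with 2 spaces

        // Character-based slicing, which is what a length-10 field amounts to today:
        System.out.println(record.substring(0, 10));        // "田中太郎  A001" -- spills into CODE

        // Byte-based slicing against the file's encoding:
        byte[] bytes = record.getBytes(sjis);
        System.out.println(new String(bytes, 0, 10, sjis)); // "田中太郎  " -- the intended NAME field
    }
}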
In the BeanIO source code, the setRecordValue method:
https://github.com/beanio/beanio/blob/main/src/org/beanio/internal/parser/format/fixedlength/FixedLengthUnmarshallingContext.java#L34
hardcodes the record length as follows:
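Roughly (paraphrased, not copied verbatim from the repository), the essential point is that the length comes straight from the Java string length, i.e. a character count rather than a byte count:

// paraphrased from FixedLengthUnmarshallingContext.setRecordValue(Object)
String text = (String) value;
recordLength = text == null ? 0 : text.length(); // always a character count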
And recordLength is private, so you cannot override it or calculate it via a custom implementation of the unmarshalling context.
I wonder if BeanIO could make this pluggable, so end users can provide their own implementation and calculate the record length the way they need.