Fixed length unmarshaller - Make it possible to customize calculating of record length #159
Comments
Are you saying you'd like to compute the number of bytes instead of the number of characters?

import java.nio.charset.StandardCharsets;

class UtfTest {
    public static void main(String[] args) {
        String str = "hello 世界";
        System.out.println(str.length());                                // 8  characters
        System.out.println(str.getBytes(StandardCharsets.UTF_8).length); // 12 bytes in UTF-8
    }
}

Why do you think the current behavior is incorrect?
Hi @bjansen. In Japan (and in Asia generally) it is common to treat a double-byte character as having a length of 2, while a single-byte character counts as 1. I found a similar discussion in the Google group from about 8 years ago. I hope this request is understandable.
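To illustrate the counting rule @hfuruich describes, here is a small sketch (not BeanIO code; the helper name and the choice of Shift_JIS are assumptions for the example): a character counts as 2 when it needs more than one byte in the record's charset, otherwise as 1.

import java.nio.charset.Charset;

class DoubleByteLength {
    // Hypothetical helper: count a character as 2 if it encodes to more than
    // one byte in the given charset, otherwise as 1.
    static int displayLength(String s, Charset charset) {
        int length = 0;
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            String ch = new String(Character.toChars(codePoint));
            length += ch.getBytes(charset).length > 1 ? 2 : 1;
            i += Character.charCount(codePoint);
        }
        return length;
    }

    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");
        System.out.println(displayLength("hello 世界", sjis)); // 10 (6 single-byte + 2 * 2)
        System.out.println("hello 世界".length());              // 8
    }
}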
Thanks @bjansen for jumping in here. To understand this better, we are in the process of putting together a real-world example, to make sure that what we discuss and potentially improve in BeanIO is on the right track.
@hfuruich thanks for the explanation, I think I understand the problem. I'm intrigued, though: how do you know how many bytes a given character takes? Do you need a UTF-8 table on hand, or do you assume that characters in the Hiragana block, for example, are always 3 bytes long? For an actual solution, I can think of several possibilities, for example a parser property in the mapping file:
<parser class="org.beanio.stream.fixedlength.FixedLengthRecordParserFactory">
  <property name="countMode" value="chars"/>
</parser>
What kind of granularity would be needed? Would the last suggestion be enough?
Those are some really good suggestions. I like all of them (sorry for that), but having it in the mapping file makes it easy for non-developers to specify. Java code is needed when you must do it via Java or have some special code that controls this. And the global option makes it easy to set instead of having to change a lot of mapping files.
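For the global option, BeanIO can, as far as I recall, read default settings from a beanio.properties file on the classpath, so a global switch could hypothetically look like the line below (the key name is invented for illustration; no such setting exists today):

# beanio.properties -- hypothetical key, not an existing BeanIO setting
org.beanio.fixedlength.countMode=bytes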
@bjansen thank you so much for your excellent idea. The number of bytes a character takes depends on the encoding the user uses, so the following code will count the expected length.
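Presumably something along these lines, counting bytes against an explicit charset instead of relying on String.length() (a sketch under that assumption, not the exact code from the thread):

import java.nio.charset.Charset;

class EncodingAwareLength {
    public static void main(String[] args) {
        String str = "hello 世界";
        Charset sjis = Charset.forName("Shift_JIS");
        System.out.println(str.length());              // 8  characters
        System.out.println(str.getBytes(sjis).length); // 10 bytes in Shift_JIS (each kanji is 2 bytes)
    }
}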
It looks like "<stream>" has an attribute named "encoding". Would it be possible to use this "encoding" attribute to calculate a character's byte length? If so, your excellent idea covers the users' scenario.
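Putting the two ideas together, a mapping file might hypothetically look like the snippet below. Both the countMode property (proposed above) and using the stream's encoding to drive the length calculation are proposals in this thread, not existing BeanIO behavior:

<stream name="employees" format="fixedlength" encoding="Shift_JIS">
  <parser class="org.beanio.stream.fixedlength.FixedLengthRecordParserFactory">
    <property name="countMode" value="bytes"/> <!-- hypothetical property -->
  </parser>
  <!-- record and field definitions omitted -->
</stream>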
As @davsclaus says, all of these cover most real-world use cases. I like all of your ideas too.
Hello.
This is a rather confusing use case, but please consider it as well.
When BeanIO unmarshals some data (like a line of text), the mapping to fields is based on the string length. However, in Asian countries single-byte and double-byte characters can be mixed in a string, which causes the length to be miscalculated.
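For example, assuming a made-up record layout of a 10-byte NAME field followed by a 4-byte CODE field in a Shift_JIS encoded file, slicing by characters and slicing by bytes give different results:

import java.nio.charset.Charset;

class LengthMismatch {
    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");
        // Hypothetical layout: NAME = first 10 bytes, CODE = next 4 bytes
        String record = "田中太郎  A001"; // "田中太郎" is 8 bytes in Shift_JIS, padded with 2 spaces

        // Character-based slicing, which is what a length-10 field amounts to today:
        System.out.println(record.substring(0, 10));        // "田中太郎  A001" -- spills into CODE

        // Byte-based slicing against the file's encoding:
        byte[] bytes = record.getBytes(sjis);
        System.out.println(new String(bytes, 0, 10, sjis)); // "田中太郎  " -- the intended NAME field
    }
}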
In the BeanIO source code, the setRecordValue method:
https://github.com/beanio/beanio/blob/main/src/org/beanio/internal/parser/format/fixedlength/FixedLengthUnmarshallingContext.java#L34
hardcodes the record length as follows:
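Roughly (paraphrased, not copied verbatim from the repository), the essential point is that the length comes straight from the Java string length, i.e. a character count rather than a byte count:

// paraphrased from FixedLengthUnmarshallingContext.setRecordValue(Object)
String text = (String) value;
recordLength = text == null ? 0 : text.length(); // always a character count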
And recordLength is private, so you cannot override it or calculate it via a custom implementation of the unmarshalling context.
I wonder if BeanIO could make this pluggable, so end users can provide their own implementation and calculate the record length the way they need.