[BUG] Inconsistent Handling of Unicode Line Breaks (U+2028, U+2029) #924

Open
2 of 3 tasks
AnonymouX47 opened this issue Aug 31, 2024 · 1 comment
Labels
bug Unicode Issues related to Unicode <-> bytes conversion

Comments

@AnonymouX47
Contributor

Description:

The Unicode codepoints U+2028 (Line Separator) and U+2029 (Paragraph Separator) are kind of treated as line breaks by Text.pack() but as normal zero-width symbols by Text.render(), which neither omits the codepoints nor breaks the text and pads the lines as necessary.

I said "kind of" because pack() returns the width as though the text is broken by the codepoint(s) but the height is always 1, regardless of how many of these codepoints there are in the text.

By the way, urwid.calc_width() treats them as zero-width symbols.
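
For reference, the two codepoints can be inspected with the Python standard library alone. This is a minimal sketch that does not involve urwid; the helper name `describe` is made up for illustration. It shows that both codepoints belong to the Unicode separator categories, that `str.splitlines()` treats them as line boundaries, and that their UTF-8 encodings are the byte sequences visible in the render() output in the steps to reproduce.

```python
import unicodedata

LINE_SEPARATOR = "\u2028"
PARAGRAPH_SEPARATOR = "\u2029"


def describe(ch):
    """Return (Unicode general category, UTF-8 bytes) for one codepoint.

    Hypothetical helper for illustration; not part of urwid.
    """
    return unicodedata.category(ch), ch.encode("utf-8")


print(describe(LINE_SEPARATOR))       # ('Zl', b'\xe2\x80\xa8')
print(describe(PARAGRAPH_SEPARATOR))  # ('Zp', b'\xe2\x80\xa9')

# Python's own str.splitlines() counts both codepoints as line boundaries.
print("123\u202812345".splitlines())  # ['123', '12345']
```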

Affected versions (if applicable)
  • master branch (specify commit)
  • Latest stable version from pypi
  • Other (specify source)
Steps to reproduce (if applicable)

Text.pack():

>>> urwid.Text("123\u202812345").pack()
(5, 1)
>>> urwid.Text("123\u20281234").pack()
(4, 1)
>>> urwid.Text("1234\u20281234").pack()
(4, 1)
>>> urwid.Text("12345\u20281234").pack()
(5, 1)
>>> urwid.Text("123\u20281234\u202912").pack()
(4, 1)
>>> urwid.Text("123\u202812\u20291234").pack()
(4, 1)

Text.render():

>>> urwid.Text("123\u202812345").render(()).text
[b'123\xe2\x80\xa812', b'345  ']
>>> urwid.Text("123\u202812\u20291234").render(()).text
[b'123\xe2\x80\xa81', b'2\xe2\x80\xa9123', b'4   ']
Expected/actual outcome

I honestly don't mind whether they're treated as actual line breaks or as plain zero-width symbols that are escaped (replaced with "?") like other whitespace codepoints such as U+0009 (Horizontal Tab, \t). All I care about is that they're handled consistently.
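
The two acceptable outcomes described above can be sketched as follows. This is only an illustration under stated assumptions, not urwid's implementation: the function names are hypothetical, `len()` stands in for the column width (so it only holds for plain ASCII text), and the result mimics the `(columns, rows)` shape returned by Text.pack().

```python
import re

# U+2028 and U+2029, plus the ordinary newline.
_LINE_BREAKS = re.compile("[\n\u2028\u2029]")
_SEPARATORS = re.compile("[\u2028\u2029]")


def pack_as_line_breaks(text):
    """Policy 1 (hypothetical): the separators break the line like '\\n'."""
    lines = _LINE_BREAKS.split(text)
    # Assumption: one column per character (ASCII-only input).
    return max(len(line) for line in lines), len(lines)


def pack_as_escaped(text):
    """Policy 2 (hypothetical): the separators are replaced with '?'."""
    lines = _SEPARATORS.sub("?", text).split("\n")
    return max(len(line) for line in lines), len(lines)


sample = "123\u202812\u20291234"
print(pack_as_line_breaks(sample))  # (4, 3)
print(pack_as_escaped(sample))      # (11, 1)
```

Under either policy the width and the height agree with each other, which is the property that the current `(4, 1)` result lacks.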


Thank you 😃

AnonymouX47 added a commit to AnonymouX47/urwidgets that referenced this issue Aug 31, 2024
- Fix: Use `urwid.calc_width()` instead of `urwid.Text.pack()[0]` to
  compute the screen column width of text.

  - Additionally results in a performance improvement.

Avoids buggy behaviour of `Text.pack()`
(see urwid/urwid#924) and
fixes ihabunek/toot#499.
@penguinolog penguinolog added the Unicode Issues related to Unicode <-> bytes conversion label Sep 2, 2024
@penguinolog
Collaborator

Looks like calc_string_text_pos should re-implement the full wcwidth.wcswidth internals (https://github.com/jquast/wcwidth/blob/a20c9441aaa42f3ac88be573cf8027229f1e3520/wcwidth/wcwidth.py#L160).
At the moment, it does not provide public methods for position calculation.
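
To make the idea of position calculation concrete, here is a rough sketch of a wcswidth-style loop that also reports where in the text a preferred column falls. It is neither urwid's calc_string_text_pos nor wcwidth's code: the names `char_width` and `text_pos_for_column` are hypothetical, and the width rules are a coarse approximation built on the standard `unicodedata` module, with U+2028 and U+2029 counted as zero-width.

```python
import unicodedata


def char_width(ch):
    """Approximate column width of one codepoint (illustrative rules only)."""
    if unicodedata.category(ch) in ("Zl", "Zp"):
        return 0  # U+2028 / U+2029, treated as zero-width in this sketch
    if unicodedata.combining(ch):
        return 0  # combining marks occupy no column of their own
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters
    return 1


def text_pos_for_column(text, pref_col):
    """Return (index, column): how far the text can be consumed without
    exceeding pref_col columns, and the column actually reached."""
    col = 0
    for index, ch in enumerate(text):
        width = char_width(ch)
        if col + width > pref_col:
            return index, col
        col += width
    return len(text), col


# Five columns of "123<U+2028>12345" cover six codepoints, separator included.
print(text_pos_for_column("123\u202812345", 5))  # (6, 5)
```

With the separator counted as zero-width, the first five columns end after "123", the separator and "12", which matches the first line of the render() output reported above.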

@penguinolog penguinolog self-assigned this Oct 15, 2024