Ignore linebreaks between CJK characters in source code #792

peng1999 · 2023-04-14T02:54:52Z

Typst currently supports breaking a paragraph across several consecutive lines in source code without introducing a newline in the resulting PDF. A single newline in source code just becomes a space. However, when working with scripts that do not use spaces between words, like Chinese and Japanese, we often want neither the newline nor the space.

In a paragraph like this:

中文
测试

In a Chinese context, the most sensible result is 中文测试, without any spaces or newlines.
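As a rough illustration of the desired result, here is a minimal Rust sketch that joins the source lines of one paragraph and drops the line-break space only when it sits between two Han characters. The helper names `is_han` and `join_paragraph` are hypothetical, and the code-point ranges are a simplified stand-in for a real Unicode script lookup; this is not how Typst itself is implemented.

```rust
/// Rough stand-in for a real Unicode script lookup (assumption: only the
/// main CJK ideograph blocks are covered, to keep the sketch dependency-free).
fn is_han(c: char) -> bool {
    matches!(c,
        '\u{3400}'..='\u{4DBF}'      // CJK Unified Ideographs Extension A
        | '\u{4E00}'..='\u{9FFF}'    // CJK Unified Ideographs
        | '\u{F900}'..='\u{FAFF}'    // CJK Compatibility Ideographs
        | '\u{20000}'..='\u{2A6DF}'  // CJK Unified Ideographs Extension B
    )
}

/// Joins the lines of a single paragraph. A line break normally becomes a
/// space, but it is dropped when it sits between two Han characters.
fn join_paragraph(source: &str) -> String {
    let mut out = String::new();
    for line in source.lines().map(str::trim).filter(|l| !l.is_empty()) {
        if let (Some(prev), Some(next)) = (out.chars().last(), line.chars().next()) {
            if !(is_han(prev) && is_han(next)) {
                out.push(' ');
            }
        }
        out.push_str(line);
    }
    out
}

fn main() {
    println!("{}", join_paragraph("中文\n测试"));
    println!("{}", join_paragraph("hello\nworld"));
}
```

With this rule, Latin paragraphs keep their usual behavior, and only a break with Han characters on both sides is affected.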

I propose implementing the following features in Typst:

1. Eliminate spaces between CJK characters in source code.
2. Let a trailing comment consume a linebreak in source code.
   This is needed because sometimes there is an inline frame and CJK detection will not work, so we need an opt-in method. I think the change in behavior can be minimal: a comment consumes the newline only if there is no space before it.
For example, this code
```
abc //
def
```
becomes abc def
and
```
abc//
def
```
becomes abcdef.
Users of Latin scripts usually write the former, so they will not be affected.
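The comment rule above can be sketched in Rust as follows. This is only a toy model: `join_with_comments` is a hypothetical name, it handles a single paragraph, and it assumes that `//` always starts a comment (no URLs, strings, or raw text).

```rust
/// Joins paragraph lines under the proposed rule: a trailing `//` comment
/// swallows the line break only when no space precedes the comment.
/// Assumption: `//` always starts a comment.
fn join_with_comments(source: &str) -> String {
    let mut out = String::new();
    // When true, the next line is glued to the previous one without a space.
    let mut glue = false;
    for line in source.lines() {
        let (text, has_comment) = match line.find("//") {
            Some(pos) => (&line[..pos], true),
            None => (line, false),
        };
        let body = text.trim();
        if body.is_empty() {
            // Blank or comment-only line: leave the pending decision as is.
            continue;
        }
        if !glue && !out.is_empty() {
            out.push(' ');
        }
        out.push_str(body);
        // Glue only for `text//`, i.e. no whitespace right before the comment.
        glue = has_comment && !text.ends_with(char::is_whitespace);
    }
    out
}

fn main() {
    println!("{}", join_with_comments("abc //\ndef"));
    println!("{}", join_with_comments("abc//\ndef"));
}
```

The choice to skip comment-only lines without changing the pending decision is a design assumption of this sketch, not something the proposal specifies.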


Karmac · 2023-05-12T16:18:41Z

+1 for the trailing comment

That would solve my issue at #1173 and give users more control over line breaks.

zrr1999 · 2023-05-16T17:34:31Z

+1

littzhch · 2023-09-20T14:57:11Z

Is there any workaround to this problem for now?

Zjl37 · 2023-10-06T12:51:04Z

+1

Yes, this feature is very desirable in East Asian contexts, especially when one wants to manually wrap lines in the source file.

However, instead of removing spaces between CJK characters, I propose disabling automatic space insertion at linebreaks in source code depending on the lang or script parameter of the current text.

As for the syntax for occasionally opting out of this automatic space, I think this is another topic that needs further discussion (maybe in another issue).

bb010g · 2023-12-20T09:24:29Z

If comments are going to consume a linebreak, I'd also like them to consume space at the beginning of the next line. That way,

#repr([//
  abc //
  // actual comment
  def//
])

evaluates to [abc def] and not [([ ], [abc], [ ], [ ], [ ], [def], [ ])].

I know that looks like ugly TeX, but you can't use block syntax for this purpose currently:

#repr({
  [abc ]
  // actual comment
  [def]
})

evaluates to [([abc], [ ], [def])], and you additionally have to deal with the introduction of scopes you may not want (e.g. because they throw off a #set). Variants of {} & [] that don't introduce new scopes would be a possible solution to this, but the result would still be noisy:

/{
  /[中文]/
  /[测试]/
}/

for the original example. I still think that syntax would be really useful for scripting.

laurmaedje · 2024-07-15T19:21:31Z

Somewhat related: #710

admk · 2024-08-21T16:43:58Z

Also may I suggest that spaces added between CJK characters should not be rendered. This behavior is consistent with XeLaTeX and will make vim-based motion a lot easier.

Update: I have the following hack that achieves this behavior in most cases, though it struggles with spaces around CJK punctuation and line breaks. I'm unsure how to resolve this issue, which I believe may be related to #86.

#let cjkre = regex("\s*(\p{Han}+)\s*")
#show cjkre: it => it.text.match(cjkre).captures.at(0)

= 你好，我是 你的朋友。

“你好” check测试：
hello, this is a test!
世界 你好，
我很好。

YDX-2147483647 · 2024-08-22T04:29:10Z

Well, many behaviors in LaTeX are only documented in Chinese, so let me explain here.

What it is like in LaTeX

The CTeX package bundle implements common Chinese typesetting practices. It determines how LaTeX compilers interpret spaces and \n.
CTeX provides an option to control whether to keep spaces after Han characters in the generated PDF.

An example main.tex:
```
汉字 分词
技术 English Latin
```
- auto (default): remove the space if it is followed by another Han character; keep it if not.
  
  ⇒ 汉字分词技术 English Latin
- (fallback): remove spaces generated by \n, but keep other literal spaces.
  
  ⇒ 汉字 分词技术 English Latin
- true: always keep.
  
  ⇒ 汉字 分词 技术 English Latin
  
  In this case, we need to add % before \n (% in LaTeX = // in Typst). Otherwise, \n generates an extra space, because the LaTeX compiler treats a single \n as a space.
- false: always remove.
  
  ⇒ 汉字分词技术English Latin
CTeX uses different packages for different LaTeX compilers. Consequently, the option is not fully supported by all compilers…

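| Compiler | auto (default) | (fallback) | true | false |
| --- | --- | --- | --- | --- |
| XeLaTeX | ✅ | ❌ | ✅ | ❌ |
| LuaLaTeX | ❌ | ✅ | ❌ | ❌ |
| upLaTeX | ❌ | ✅ | ❌ | ❌ |
| (pdf)LaTeX | ✅ | ❌ | ✅ | ✅ |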

(Many Chinese templates only support XeLaTeX and LuaLaTeX.)
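To make the four modes described above easier to compare, here is a toy Rust model of the rules as worded in this comment. It is not CTeX's actual implementation: the `SpaceMode` enum, `render`, and `is_han` are hypothetical names, the Han check is a simplified code-point range test, and the input is assumed to contain only single spaces and single newlines.

```rust
/// Simplified Han check (assumption: main CJK ideograph blocks only).
fn is_han(c: char) -> bool {
    matches!(c,
        '\u{3400}'..='\u{4DBF}' | '\u{4E00}'..='\u{9FFF}'
        | '\u{F900}'..='\u{FAFF}' | '\u{20000}'..='\u{2A6DF}')
}

/// The four behaviours listed above: auto, (fallback), true, false.
#[derive(Clone, Copy)]
enum SpaceMode {
    Auto,
    Fallback,
    AlwaysKeep,
    AlwaysRemove,
}

/// Applies one mode to a source fragment. Whitespace that does not follow a
/// Han character is always kept (a newline turns into a space).
fn render(source: &str, mode: SpaceMode) -> String {
    let chars: Vec<char> = source.chars().collect();
    let mut out = String::new();
    for (i, &c) in chars.iter().enumerate() {
        if c != ' ' && c != '\n' {
            out.push(c);
            continue;
        }
        let after_han = i > 0 && is_han(chars[i - 1]);
        let before_han = chars.get(i + 1).map_or(false, |&n| is_han(n));
        let remove = after_han
            && match mode {
                SpaceMode::Auto => before_han,
                SpaceMode::Fallback => c == '\n',
                SpaceMode::AlwaysKeep => false,
                SpaceMode::AlwaysRemove => true,
            };
        if !remove {
            out.push(' ');
        }
    }
    out
}

fn main() {
    let source = "汉字 分词\n技术 English Latin";
    let modes = [
        ("auto", SpaceMode::Auto),
        ("(fallback)", SpaceMode::Fallback),
        ("true", SpaceMode::AlwaysKeep),
        ("false", SpaceMode::AlwaysRemove),
    ];
    for (name, mode) in modes {
        println!("{}: {}", name, render(source, mode));
    }
}
```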

References:

CTeX doc (Chinese), §5.3 space = ⟨true|false|auto⟩.
xeCJK doc (Chinese), §3.1 CJKspace = ⟨true|false⟩.
Overleaf documents (English): Chinese, Japanese, Korean.
LuaTeX-ja doc (Japanese and English), §1.2 Linebreak after a Japanese character and Spaces related to Japanese characters, and §15 Linebreak after a Japanese Character.

admk · 2024-08-26T12:51:26Z

Using the idea from #1173, I’ve made some changes that seem to handle both line breaks and spaces around CJK characters pretty well. I'm still new to typst and haven’t tested it beyond this example yet, though.

#let cjk-char = "[\p{Han}，。；：！？‘’“”（）「」【】…—]"
#let cjk-re = regex("\s*(" + cjk-char + ")\s*")
#show cjk-re: it => it.text.match(cjk-re).captures.at(0)
#show: rest => {
  let ends-with-cjk = it => {
    it != none and it.has("text") and it.text.ends-with(regex(cjk-char))
  }
  let last-item = none
  for item in rest.children {
    if item == [ ] and ends-with-cjk(last-item) {
      item = []
    }
    last-item = item
    item
  }
}

= 你好，我是 你的 朋友。

“你好” check测试：
hello, world! this is a test!
this
is good.
世界
你 好              啊，
我很好。

which gets rendered as:

你好，我是你的朋友。

 “你好”check 测试：hello, world! this is a test! 世界你好啊，我很好。

YDX-2147483647 · 2024-08-26T16:50:49Z

Hi @admk! As described in #710, content is not easily traversable.

For example, #set text(lang: "zh") will create styled (instead of sequence), breaking your #show: rest => … rule.

It might be possible to inspect sequence/styled and traverse the content, but it will be a fragile hack…

> laurmaedje: this wasn't really an intended use case, which is also why they aren't publicly accessible as a global.
> This might change with the type rework. I'm considering to make them properly public then.

https://discord.com/channels/1054443721975922748/1088371867913572452/1205542340144537651

laurmaedje · 2024-08-26T17:03:29Z

Indeed, in particular with more usage of context (which I believe will only grow). See also: #4745

YDX-2147483647 · 2024-09-14T15:17:09Z

Well, it is not so difficult to remove spaces that are generated by a \n after a Han character… If this behaviour is accepted, maybe I can organize it into a PR.

Implementation

The “(fallback)” behaviour mentioned in #792 (comment) can be implemented by changing the function

typst/crates/typst-syntax/src/ast.rs

Lines 67 to 80 in d48293f

    /// The expressions.
    pub fn exprs(self) -> impl DoubleEndedIterator<Item = Expr<'a>> {
        let mut was_stmt = false;
        self.0
            .children()
            .filter(move |node| {
                // Ignore newline directly after statements without semicolons.
                let kind = node.kind();
                let keep = !was_stmt || node.kind() != SyntaxKind::Space;
                was_stmt = kind.is_stmt();
                keep
            })
            .filter_map(Expr::cast_with_space)
    }

to

        let mut could_ignore = false;
        self.0
            .children()
            .filter(move |node| {
                let kind = node.kind();
                // Ignore newline if directly after…
                let keep = !could_ignore || kind != SyntaxKind::Space;
                // …statements without semicolons,
                could_ignore = kind.is_stmt()
                // or texts ended with a Han character.
                || (kind == SyntaxKind::Text
                        && node.text().ends_with(|c: char| {
                            matches!(c.script(), Script::Han)
                                || matches!(c, '，' | '。' | '：') // and other Chinese punctuations
                        }));
                keep
            })
            .filter_map(Expr::cast_with_space)

and adding use unicode_script::{Script, UnicodeScript}; to the imports.
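For readers who want to try the filtering logic outside the Typst code base, the following is a standalone Rust sketch of the same idea. The `Node` enum, `exprs`, `render`, and `is_han` here are hypothetical stand-ins, not Typst's real `SyntaxNode` API, and the Han check replaces `unicode_script` with a simplified range test. It also assumes that a literal single space stays inside a text node while a line break becomes a separate space node, which is what the "(fallback)" outcome implies.

```rust
/// Simplified Han check (assumption: main CJK ideograph blocks only).
fn is_han(c: char) -> bool {
    matches!(c,
        '\u{3400}'..='\u{4DBF}' | '\u{4E00}'..='\u{9FFF}'
        | '\u{F900}'..='\u{FAFF}' | '\u{20000}'..='\u{2A6DF}')
}

/// Toy syntax nodes (hypothetical, much simpler than Typst's own).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Text(String),
    Space, // whitespace node, e.g. produced by a source line break
    Stmt,  // a statement without a trailing semicolon, e.g. `#let`
}

/// Drops a space node that directly follows a statement or a text node
/// ending in a Han character or one of a few Chinese punctuation marks.
fn exprs(children: Vec<Node>) -> Vec<Node> {
    let mut could_ignore = false;
    children
        .into_iter()
        .filter(|node| {
            let keep = !could_ignore || *node != Node::Space;
            could_ignore = match node {
                Node::Stmt => true,
                Node::Text(t) => {
                    t.ends_with(|c: char| is_han(c) || matches!(c, '，' | '。' | '：'))
                }
                Node::Space => false,
            };
            keep
        })
        .collect()
}

/// Flattens the remaining nodes into the text that would be laid out.
fn render(nodes: &[Node]) -> String {
    nodes
        .iter()
        .map(|n| match n {
            Node::Text(t) => t.as_str(),
            Node::Space => " ",
            Node::Stmt => "",
        })
        .collect()
}

fn main() {
    let nodes = vec![
        Node::Text("汉字 分词".to_string()),
        Node::Space,
        Node::Text("技术 English Latin".to_string()),
    ];
    println!("{}", render(&exprs(nodes)));
}
```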

An example

#set page(height: auto)
#set text(lang: "zh", region: "CN", font: "Noto Serif CJK is fantastic, but spaces are more visible between tofu")

汉字 分词
技术 English Latin
Continue 汉字 
And
孔乙己

鲁镇的酒店的格局，是和别处不同的：
都是当街一个曲尺形的大柜台，
柜里面预备着热水，可以随时温酒。

鲁镇的酒店的格局，是和别处不同的： //
都是当街一个曲尺形的大柜台， //
柜里面预备着热水，可以随时温酒。

Side effects

Only one test failed, and it is expected.

❌ text-chinese-basic (tests/suite/layout/inline/cjk.typ:4)
  mismatched rendering
    live      | tests/store/render/text-chinese-basic.png
    ref       | tests/ref/text-chinese-basic.png
2225 passed, 1 failed, 0 skipped

Future work

If this is the right way to go,

- '，' | '。' | '：' should be replaced by is_cjk_punctuation/is_cjk_left_aligned_punctuation/… here:

  typst/crates/typst/src/layout/inline/shaping.rs

  Line 114 in d48293f

  pub fn is_cjk_punctuation(&self) -> bool {
- Providing an option to disable the behaviour might be necessary, e.g. #set text(ignore-???-newline: false), as in "Automatically add spacing between CJK and Latin characters" (#2334).

peng1999 · 2024-09-14T15:55:30Z

> Well, it is not so difficult to remove spaces that are generated by a \n after a Han character… If this behaviour is accepted, maybe I can organize it into a PR.

I'm sure most CJK users will be happy to see this new behavior. Looking forward to your PR!

YDX-2147483647 · 2024-09-29T09:11:28Z

> providing an option to disable the behaviour

It might be quite complicated, if not impossible. According to https://github.com/typst/typst/blob/022f34c43a2fb3084b93500163a601105ab582a4/docs/dev/architecture.md, my previous approach changes parsing, which happens before evaluation. In other words, we do not know the set rules when we are parsing.

I think there are two solutions:

1. Continue to fix it during parsing, and make it an inevitable breaking change. (Same as the lexer change: Allow emphasis in CJK text without spaces #2648)
2. Find another way during layout. (Same as Automatically add spacing between CJK and Latin characters #2334)

Maybe 2 is better?

peng1999 · 2024-09-29T12:49:25Z

> I think there are two solutions:
>
> 1. Continue to fix it during parsing, and make it an inevitable breaking change. (Same as the lexer change: Allow emphasis in CJK text without spaces #2648)
> 2. Find another way during layout. (Same as Automatically add spacing between CJK and Latin characters #2334)
>
> Maybe 2 is better?

I think solution 1 is superior for the following reasons.

- Typst already merges multiple white spaces into one in the parser. This behavior is not configurable and is based on the English tradition that only one space is needed between words. In that sense, an additional non-configurable behavior that erases spaces between Chinese/Japanese characters, based on Chinese/Japanese tradition, is very reasonable and consistent.
- Users can easily work around it using #[ ] if they really need white space between characters.
- It is easier to reason about the final result in solution 1 than in solution 2. Consider the following code:
  ```
  #let keywords(arr) = arr.join(" ")

  case 1: #keywords(("数学", "语文", "物理"))
  case 2: 数学 语文 物理
  ```
  It will work as expected if we choose solution 1, but if we do the erasing at layout time, we cannot distinguish case 1 from case 2, and all the white spaces will be erased, which may surprise users.
- Finally, solution 1 is easier to implement.
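The difference between the two solutions for the `keywords` example can be illustrated with a small Rust sketch. It models the stricter "erase spaces between Han characters" rule from this comment; `strip_spaces_between_han`, `solution1`, and `solution2` are hypothetical names, and the Han check is a simplified range test rather than a real script lookup.

```rust
/// Simplified Han check (assumption: main CJK ideograph blocks only).
fn is_han(c: char) -> bool {
    matches!(c,
        '\u{3400}'..='\u{4DBF}' | '\u{4E00}'..='\u{9FFF}'
        | '\u{F900}'..='\u{FAFF}' | '\u{20000}'..='\u{2A6DF}')
}

/// Removes every single space that has a Han character on both sides.
fn strip_spaces_between_han(text: &str) -> String {
    let chars: Vec<char> = text.chars().collect();
    let mut out = String::new();
    for (i, &c) in chars.iter().enumerate() {
        let between_han = c == ' '
            && i > 0
            && is_han(chars[i - 1])
            && chars.get(i + 1).map_or(false, |&n| is_han(n));
        if !between_han {
            out.push(c);
        }
    }
    out
}

/// Solution 1: only markup source is rewritten while parsing; a string
/// value built during evaluation (case 1) is left alone.
fn solution1(case1_value: &str, case2_markup: &str) -> (String, String) {
    (case1_value.to_string(), strip_spaces_between_han(case2_markup))
}

/// Solution 2: at layout time both cases are plain text, so both are rewritten.
fn solution2(case1_value: &str, case2_markup: &str) -> (String, String) {
    (
        strip_spaces_between_han(case1_value),
        strip_spaces_between_han(case2_markup),
    )
}

fn main() {
    // Case 1: the result of `arr.join(" ")`; case 2: text typed as markup.
    let case1 = ["数学", "语文", "物理"].join(" ");
    let case2 = "数学 语文 物理";
    println!("{:?}", solution1(&case1, case2));
    println!("{:?}", solution2(&case1, case2));
}
```

Under solution 1 the explicitly joined keywords keep their separators, whereas under solution 2 both cases collapse to the same text.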

I imagine at some point in the future Typst will inevitably add something like #pragma to control global options, such as whether to use system font. After that we can add options to manage white space behavior.

YDX-2147483647 · 2024-09-29T13:25:04Z

> if we do the erase at layout time, we cannot distinguish

I am persuaded.

Typst v0.11.1:

#"a
b
c"

a
b
c

> merge multiple white spaces

I am afraid we have not reached consensus on what to implement. Is the following behavior OK?

#792 (comment)

汉字 分词
技术 English Latin
(fallback): remove spaces generated by \n (after CJ), but keep other literal spaces

⇒ 汉字 分词技术 English Latin

laurmaedje · 2024-09-29T14:59:46Z

> I imagine at some point in the future Typst will inevitably add something like #pragma to control global options, such as whether to use system font. After that we can add options to manage white space behavior.

There aren't any plans to that effect and I would like to avoid it. Can't we retain the necessary information as a field of SpaceElem?
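One possible reading of this suggestion, sketched in Rust: the space element remembers whether it came from a literal space or a source line break, so layout can decide later, once set rules are known. Everything here is hypothetical (`SpaceOrigin`, `Elem`, `layout`, and the `ignore_cjk_line_breaks` flag); the thread does not say what such a field on SpaceElem would look like.

```rust
/// Simplified Han check (assumption: main CJK ideograph blocks only).
fn is_han(c: char) -> bool {
    matches!(c,
        '\u{3400}'..='\u{4DBF}' | '\u{4E00}'..='\u{9FFF}'
        | '\u{F900}'..='\u{FAFF}' | '\u{20000}'..='\u{2A6DF}')
}

/// Where a space came from in the source (hypothetical field).
#[derive(Clone, Copy, PartialEq)]
enum SpaceOrigin {
    Literal,
    LineBreak,
}

/// Toy content elements; the space carries its origin through evaluation.
enum Elem {
    Text(String),
    Space(SpaceOrigin),
}

/// Toy layout step: with the (hypothetical) setting enabled, a space that
/// came from a line break is dropped when Han characters surround it.
fn layout(elems: &[Elem], ignore_cjk_line_breaks: bool) -> String {
    let mut out = String::new();
    for (i, elem) in elems.iter().enumerate() {
        match elem {
            Elem::Text(t) => out.push_str(t),
            Elem::Space(origin) => {
                let after_han = out.chars().last().map_or(false, is_han);
                let before_han = match elems.get(i + 1) {
                    Some(Elem::Text(t)) => t.chars().next().map_or(false, is_han),
                    _ => false,
                };
                let skip = ignore_cjk_line_breaks
                    && *origin == SpaceOrigin::LineBreak
                    && after_han
                    && before_han;
                if !skip {
                    out.push(' ');
                }
            }
        }
    }
    out
}

fn main() {
    let elems = vec![
        Elem::Text("中文".to_string()),
        Elem::Space(SpaceOrigin::LineBreak),
        Elem::Text("测试".to_string()),
        Elem::Space(SpaceOrigin::Literal),
        Elem::Text("分词".to_string()),
    ];
    println!("{}", layout(&elems, true));
    println!("{}", layout(&elems, false));
}
```

Because the decision is deferred to layout, the behavior could in principle follow a set rule, at the cost of carrying one extra field on every space.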

peng1999 · 2024-09-30T03:37:45Z

@laurmaedje says:

> There aren't any plans to that effect and I would like to avoid it. Can't we retain the necessary information as a field of SpaceElem?

Then I think making the behavior non-configurable is fine. There is no need to add additional complexity.

@YDX-2147483647 says:

> I am afraid we have not reached consensus on what to implement? Is the following behavior OK?
>
> #792 (comment)
> 汉字 分词
> 技术 English Latin
> (fallback): remove spaces generated by \n (after CJ), but keep other literal spaces
> ⇒ 汉字 分词技术 English Latin

Yes, that's what I would suggest. It matches the behavior of LuaTeX-ja and requires minimal change to the current Typst.

But we have to decide the behavior in some edge cases. What should happen in the following code?

= Case 1 // space is *not* desired here
中文
#footnote[分词]

= Case 2 // space is *not* desired here
*中文*
分词

= Case 3 // space is *not* desired here
#let a = [中文]
#let b = [分词]
#a
#b

= Case 4 // space is desired here
关于
#box[TeX]

= Case 5 // space is desired here, but we can safely
         // strip it because it will be added back during layout
关于
Typst

I think it's OK to produce some undesired results in edge cases, but there should be easy ways for users to work around them.

laurmaedje added the feature request, syntax, and text labels · 2023-04-17
