Ignore linebreaks between CJK characters in source code #792

Open
peng1999 opened this issue Apr 14, 2023 · 18 comments
Labels
feature request New feature or request syntax About syntax, parsing, etc. text Text layout, shaping, internationalization, etc.

Comments

@peng1999
Contributor

Typst currently supports breaking a paragraph across several consecutive lines in source code without introducing a line break in the resulting PDF; a single newline in the source simply becomes a space. However, when working with scripts that do not use spaces between words, such as Chinese and Japanese, we usually want neither the newline nor the space.

In a paragraph like this:

中文
测试

In a Chinese context, the most sensible result is 中文测试, without any space or newline.

I propose implementing the following features in Typst:

  1. Eliminate spaces between CJK characters in source code.
  2. Let a trailing comment consume a linebreak in source code.
    This is because sometimes there is an inline frame and CJK detection will not work, so we need an opt-in method. I think the change in behavior can be minimal: a comment consumes the newline only if there is no space before it.
    For example, this code
    abc //
    def
    
    becomes abc def
    and
    abc//
    def
    
    becomes abcdef.
    Latin-script users usually write the former, so they will not be affected.
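
The two proposed rules can be sketched outside of Typst as a plain line-joining function. This is only an illustration under stated assumptions, not Typst's parser: `is_cjk` and `join_lines` are hypothetical names, the CJK test is a rough range check over a few Unicode blocks, and any `//` on a line is treated as the start of a comment.

```rust
/// Rough CJK test (assumption: a few common blocks, not a full script lookup).
fn is_cjk(c: char) -> bool {
    matches!(c as u32,
        0x3040..=0x30FF   // Hiragana and Katakana
        | 0x3400..=0x4DBF // CJK Unified Ideographs Extension A
        | 0x4E00..=0x9FFF // CJK Unified Ideographs
    )
}

/// Joins the source lines of one paragraph according to the two proposed rules.
fn join_lines(lines: &[&str]) -> String {
    let mut out = String::new();
    // Whether the line break before the current line should become a space.
    let mut pending_space = false;
    for (i, &line) in lines.iter().enumerate() {
        let (content, space_after) = match line.find("//") {
            // Rule 2: a trailing comment consumes the line break
            // only if there is no space directly before it.
            Some(idx) => {
                let before = &line[..idx];
                (before.trim_end(), before.ends_with(' '))
            }
            None => (line.trim_end(), true),
        };
        if i > 0 && pending_space {
            let left = out.chars().last();
            let right = content.chars().next();
            // Rule 1: no space when the break sits between two CJK characters.
            let both_cjk = matches!(
                (left, right),
                (Some(l), Some(r)) if is_cjk(l) && is_cjk(r)
            );
            if !both_cjk {
                out.push(' ');
            }
        }
        out.push_str(content);
        pending_space = space_after;
    }
    out
}
```

With this sketch, `join_lines(&["abc //", "def"])` keeps the space, while `join_lines(&["abc//", "def"])` and `join_lines(&["中文", "测试"])` do not. Comment-only lines and `//` inside URLs are not handled.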
@laurmaedje laurmaedje added feature request New feature or request syntax About syntax, parsing, etc. text Text layout, shaping, internationalization, etc. labels Apr 17, 2023
@Karmac

Karmac commented May 12, 2023

+1 for the trailing comment

That would solve my issue at #1173 and give users more control over line breaks.

@zrr1999
Contributor

zrr1999 commented May 16, 2023

+1

@littzhch

Is there any workaround to this problem for now?

@Zjl37

Zjl37 commented Oct 6, 2023

+1

Yes, this feature is very desirable in an East Asian context, especially when one wants to wrap lines manually in the source file.

However, instead of removing spaces between CJK characters, I propose disabling automatic space insertion at linebreaks in source code depending on the lang or script parameter of the current text.

As for the syntax for occasionally opting out of this automatic space, I think that is another topic that needs further discussion (maybe in another issue).

@bb010g

bb010g commented Dec 20, 2023

If comments are going to consume a linebreak, I'd also like them to consume space at the beginning of the next line. That way,

#repr([//
  abc //
  // actual comment
  def//
])

evaluates to [abc def] and not [([ ], [abc], [ ], [ ], [ ], [def], [ ])].

I know that looks like ugly TeX, but you can't use block syntax for this purpose currently:

#repr({
  [abc ]
  // actual comment
  [def]
})

evaluates to [([abc], [ ], [def])], and you additionally have to deal with the introduction of scopes you may not want (e.g. because they throw off a #set). Variants of {} & [] that don't introduce new scopes would be a possible solution to this, but the result would still be noisy:

/{
  /[中文]/
  /[测试]/
}/

for the original example. I still think that syntax would be really useful for scripting.

@laurmaedje
Member

Somewhat related: #710

@admk

admk commented Aug 21, 2024

Also, may I suggest that spaces added between CJK characters not be rendered? This behavior is consistent with XeLaTeX and would make Vim-style motions a lot easier.

Update: I have the following hack that achieves this behavior in most cases, though it struggles with spaces around CJK punctuation and line breaks. I'm unsure how to resolve this issue, which I believe may be related to #86.

#let cjkre = regex("\s*(\p{Han}+)\s*")
#show cjkre: it => it.text.match(cjkre).captures.at(0)

= 你好,我是 你的朋友。

“你好” check测试:
hello, this is a test!
世界 你好,
我很好。

@YDX-2147483647
Contributor

YDX-2147483647 commented Aug 22, 2024

Well, many behaviors in LaTeX are only documented in Chinese, so let me explain here.

What it is like in LaTeX

  1. The CTeX package bundle implements common Chinese typesetting practices. It determines how LaTeX compilers interpret spaces and \n.

  2. CTeX provides an option to control whether to keep spaces after Han characters in the generated PDF.

    An example main.tex:

    汉字 分词
    技术 English Latin
    
    • auto (default): remove the space if it is followed by another Han character; keep it if not.

      汉字分词技术 English Latin

    • (fallback): remove spaces generated by \n, but keep other literal spaces

      汉字 分词技术 English Latin

    • true: always keep

      汉字 分词 技术 English Latin

      In this case, we need to add % before \n. (% in LaTeX = // in typst) Otherwise, \n will generate an extra space, because LaTeX compiler takes a single \n as a space.

    • false: always remove

      汉字分词技术English Latin

  3. CTeX uses different packages for different LaTeX compilers. Consequently, the option is not fully supported by all compilers…

    auto (default) (fallback) true false
    XeLaTeX
    LuaLaTeX
    upLaTeX
    (pdf)LaTeX

    (Many Chinese templates only support XeLaTeX and LuaLaTeX.)
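
The four behaviours listed above can be sketched as one small function. This is an illustration of the description in this comment, not CTeX's implementation: `SpaceMode`, `is_han` and `render` are hypothetical names, the Han test is an approximate range check, and the input is assumed to contain only single spaces and single newlines.

```rust
/// The four behaviours described above (names are made up for this sketch).
#[derive(Clone, Copy)]
enum SpaceMode {
    Auto,
    Fallback,
    AlwaysKeep,
    AlwaysRemove,
}

/// Approximate Han test using common CJK ideograph blocks (an assumption).
fn is_han(c: char) -> bool {
    matches!(c as u32,
        0x3400..=0x4DBF | 0x4E00..=0x9FFF | 0xF900..=0xFAFF | 0x20000..=0x2FA1F)
}

/// Renders source text, deciding for each space or newline that follows
/// a Han character whether it survives in the output.
fn render(src: &str, mode: SpaceMode) -> String {
    let chars: Vec<char> = src.chars().collect();
    let mut out = String::new();
    for (i, &c) in chars.iter().enumerate() {
        if c != ' ' && c != '\n' {
            out.push(c);
            continue;
        }
        let after_han = i > 0 && is_han(chars[i - 1]);
        let before_han = chars.get(i + 1).map_or(false, |&n| is_han(n));
        let drop = after_han
            && match mode {
                // Remove only if another Han character follows.
                SpaceMode::Auto => before_han,
                // Remove only spaces that come from a newline.
                SpaceMode::Fallback => c == '\n',
                SpaceMode::AlwaysKeep => false,
                SpaceMode::AlwaysRemove => true,
            };
        if !drop {
            // A surviving newline is rendered as an ordinary space.
            out.push(' ');
        }
    }
    out
}
```

Calling `render("汉字 分词\n技术 English Latin", mode)` with each mode reproduces the four outputs shown in the list.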

References:

  • CTeX doc (Chinese), §5.3 space = ⟨true|false|auto⟩.
  • xeCJK doc (Chinese), §3.1 CJKspace = ⟨true|false⟩.
  • Overleaf documents (English): Chinese, Japanese, Korean.
  • LuaTeX-ja doc (Japanese and English), §1.2 Linebreak after a Japanese character and Spaces related to Japanese characters, and §15 Linebreak after a Japanese Character.

@admk

admk commented Aug 26, 2024

Using the idea from #1173, I’ve made some changes that seem to handle both line breaks and spaces around CJK characters pretty well. I'm still new to typst and haven’t tested it beyond this example yet, though.

#let cjk-char = "[\p{Han},。;:!?‘’“”()「」【】…—]"
#let cjk-re = regex("\s*(" + cjk-char + ")\s*")
#show cjk-re: it => it.text.match(cjk-re).captures.at(0)
#show: rest => {
  let ends-with-cjk = it => {
    it != none and it.has("text") and it.text.ends-with(regex(cjk-char))
  }
  let last-item = none
  for item in rest.children {
    if item == [ ] and ends-with-cjk(last-item) {
      item = []
    }
    last-item = item
    item
  }
}

= 你好,我是 你的 朋友。

“你好” check测试:
hello, world! this is a test!
this
is good.
世界
你 好              啊,
我很好。

which gets rendered as:

你好,我是你的朋友。

 “你好”check 测试:hello, world! this is a test! 世界你好啊,我很好。 

@YDX-2147483647
Contributor

Hi @admk! As described in #710, content is not easily traversable.

For example, #set text(lang: "zh") will create styled (instead of sequence), breaking your #show: rest => … rule.

It might be possible to inspect sequence/styled and traverse the content, but it will be a fragile hack…

laurmaedje: this wasn't really an intended use case, which is also why they aren't publicly accessible as a global.
This might change with the type rework. I'm considering to make them properly public then.

https://discord.com/channels/1054443721975922748/1088371867913572452/1205542340144537651

@laurmaedje
Member

Indeed, in particular with more usage of context (which I believe will only grow). See also: #4745

@YDX-2147483647
Contributor

YDX-2147483647 commented Sep 14, 2024

Well, it is not so difficult to remove spaces generated by a \n after a Han character… If this behaviour is accepted, maybe I can turn it into a PR.

Implementation

The “(fallback)” behaviour mentioned in #792 (comment) can be implemented by changing the function

/// The expressions.
pub fn exprs(self) -> impl DoubleEndedIterator<Item = Expr<'a>> {
    let mut was_stmt = false;
    self.0
        .children()
        .filter(move |node| {
            // Ignore newline directly after statements without semicolons.
            let kind = node.kind();
            let keep = !was_stmt || node.kind() != SyntaxKind::Space;
            was_stmt = kind.is_stmt();
            keep
        })
        .filter_map(Expr::cast_with_space)
}

to

        let mut could_ignore = false;
        self.0
            .children()
            .filter(move |node| {
                let kind = node.kind();
                // Ignore newline if directly after…
                let keep = !could_ignore || kind != SyntaxKind::Space;
                // …statements without semicolons,
                could_ignore = kind.is_stmt()
                // or texts ended with a Han character.
                || (kind == SyntaxKind::Text
                        && node.text().ends_with(|c: char| {
                            matches!(c.script(), Script::Han)
                                || matches!(c, ',' | '。' | ':') // and other Chinese punctuation (full-width forms)
                        }));
                keep
            })
            .filter_map(Expr::cast_with_space)

and use unicode_script::{Script, UnicodeScript};
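
To make the filter logic easier to follow in isolation, here is a self-contained sketch of the same state machine over a toy node type. Everything in it is a stand-in: `Node`, `filter_nodes` and `ends_with_han_or_punct` are hypothetical, `is_han` is an approximate range check replacing the `unicode_script` crate, and the sketch assumes that in-line spaces stay inside text nodes while a newline produces a separate space node.

```rust
/// Toy stand-in for syntax nodes (not Typst's real types).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Text(String),
    Space,
    Stmt,
}

/// Approximate Han test; the change above uses `unicode_script` instead.
fn is_han(c: char) -> bool {
    matches!(c as u32,
        0x3400..=0x4DBF | 0x4E00..=0x9FFF | 0xF900..=0xFAFF | 0x20000..=0x2FA1F)
}

/// True if the text ends with a Han character or one of a few
/// full-width punctuation marks (the list is illustrative, not complete).
fn ends_with_han_or_punct(text: &str) -> bool {
    text.chars()
        .last()
        .map_or(false, |c| is_han(c) || matches!(c, ',' | '。' | ':'))
}

/// Drops a space node that directly follows a statement
/// or a text node ending in a Han character.
fn filter_nodes(nodes: Vec<Node>) -> Vec<Node> {
    let mut could_ignore = false;
    nodes
        .into_iter()
        .filter(|node| {
            let keep = !could_ignore || *node != Node::Space;
            // Update the state for the next node.
            could_ignore = match node {
                Node::Stmt => true,
                Node::Text(text) => ends_with_han_or_punct(text),
                Node::Space => false,
            };
            keep
        })
        .collect()
}
```

In this model, a space node after `Text("汉字 分词")` is dropped, while a space node after `Text("技术 English Latin")` is kept, and the literal spaces inside the text nodes are never touched.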

An example

#set page(height: auto)
#set text(lang: "zh", region: "CN", font: "Noto Serif CJK is fantastic, but spaces are more visible between tofu")

汉字 分词
技术 English Latin
Continue 汉字 
And
孔乙己

鲁镇的酒店的格局,是和别处不同的:
都是当街一个曲尺形的大柜台,
柜里面预备着热水,可以随时温酒。

鲁镇的酒店的格局,是和别处不同的: //
都是当街一个曲尺形的大柜台, //
柜里面预备着热水,可以随时温酒。

[image]

Side effects

Only one test failed, and it is expected.

❌ text-chinese-basic (tests/suite/layout/inline/cjk.typ:4)
  mismatched rendering
    live      | tests/store/render/text-chinese-basic.png
    ref       | tests/ref/text-chinese-basic.png
2225 passed, 1 failed, 0 skipped

[image]

Future work

If this is the right way to go,

@peng1999
Contributor Author

Well, it is not so difficult to remove spaces generated by a \n after a Han character… If this behaviour is accepted, maybe I can turn it into a PR.

I'm sure most CJK users will be happy to see this new behavior. Looking forward to your PR!

@YDX-2147483647
Contributor

YDX-2147483647 commented Sep 29, 2024

providing an option to disable the behaviour

It might be quite complicated, if not impossible. According to https://github.com/typst/typst/blob/022f34c43a2fb3084b93500163a601105ab582a4/docs/dev/architecture.md, my previous approach changes parsing, which happens before evaluation. In other words, we do not know the set rules while we are parsing.

I think there are two solutions:

  1. Continue to fix it during parsing, and make it an inevitable breaking change. (Same as lexer change: Allow emphasis in CJK text without spaces #2648)
  2. Find another way during layout. (Same as Automatically add spacing between CJK and Latin characters #2334)

Maybe 2 is better?

@peng1999
Contributor Author

I think there are two solutions:

  1. Continue to fix it during parsing, and make it an inevitable breaking change. (Same as lexer change: Allow emphasis in CJK text without spaces #2648)

  2. Find another way during layout. (Same as Automatically add spacing between CJK and Latin characters #2334)

Maybe 2 is better?

I think solution 1 is superior for the following reasons.

  • Typst already merges multiple whitespace characters into one in the parser. That behavior is not configurable and is based on the English convention that only one space is needed between words. In that sense, an additional non-configurable behavior that erases spaces between Chinese/Japanese characters, based on Chinese/Japanese convention, is reasonable and consistent.
  • Users can easily work around it with #[ ] if they really need whitespace between characters.
  • It is easier to reason about the final result with solution 1 than with solution 2. Consider the following code:
    #let keywords(arr) = arr.join(" ")
    
    case 1: #keywords(("数学", "语文", "物理"))
    case 2: 数学 语文 物理
    It works as expected if we choose solution 1. If we erase at layout time instead, we cannot distinguish case 1 from case 2, so all the spaces will be erased, which may surprise users.
  • Finally, solution 1 is easier to implement.
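
The difference between the two solutions for the `keywords` example can be sketched as follows. This is a simplified model, not Typst's pipeline: `Piece`, `erase_at_parse` and `erase_at_layout` are hypothetical names, and `is_han` is an approximate range check.

```rust
/// Approximate Han test using common CJK ideograph blocks (an assumption).
fn is_han(c: char) -> bool {
    matches!(c as u32,
        0x3400..=0x4DBF | 0x4E00..=0x9FFF | 0xF900..=0xFAFF | 0x20000..=0x2FA1F)
}

/// Removes every space that sits directly between two Han characters.
fn strip_spaces_between_han(s: &str) -> String {
    let chars: Vec<char> = s.chars().collect();
    let mut out = String::new();
    for (i, &c) in chars.iter().enumerate() {
        let between_han = c == ' '
            && i > 0
            && is_han(chars[i - 1])
            && chars.get(i + 1).map_or(false, |&n| is_han(n));
        if !between_han {
            out.push(c);
        }
    }
    out
}

/// Where a piece of text came from (a made-up model for this sketch).
enum Piece<'a> {
    /// Written directly as markup, like case 2.
    Markup(&'a str),
    /// Produced by code, like `arr.join(" ")` in case 1.
    Computed(&'a str),
}

/// Solution 1: only markup is touched, before evaluation.
fn erase_at_parse(pieces: &[Piece]) -> String {
    pieces
        .iter()
        .map(|piece| match piece {
            Piece::Markup(s) => strip_spaces_between_han(s),
            Piece::Computed(s) => (*s).to_string(),
        })
        .collect()
}

/// Solution 2: provenance is already lost, so all text is treated alike.
fn erase_at_layout(pieces: &[Piece]) -> String {
    let flat: String = pieces
        .iter()
        .map(|piece| match piece {
            Piece::Markup(s) | Piece::Computed(s) => *s,
        })
        .collect();
    strip_spaces_between_han(&flat)
}
```

For `Piece::Computed("数学 语文 物理")`, the parse-time variant keeps the joined spaces while the layout-time variant erases them; for `Piece::Markup` both erase them.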

I imagine at some point in the future Typst will inevitably add something like #pragma to control global options, such as whether to use system font. After that we can add options to manage white space behavior.

@YDX-2147483647
Contributor

YDX-2147483647 commented Sep 29, 2024

if we do the erase at layout time, we cannot distinguish

I am persuaded.

Typst v0.11.1:

#"a
b
c"

a
b
c

[image]


merge multiple white spaces

I am afraid we have not reached consensus on what to implement? Is the following behavior OK?

#792 (comment)

汉字 分词
技术 English Latin

(fallback): remove spaces generated by \n (after CJ), but keep other literal spaces

汉字 分词技术 English Latin

@laurmaedje
Member

I imagine at some point in the future Typst will inevitably add something like #pragma to control global options, such as whether to use system font. After that we can add options to manage white space behavior.

There aren't any plans to that effect and I would like to avoid it. Can't we retain the necessary information as a field of SpaceElem?
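
One way to read this suggestion is that each space element records where it came from, so the decision can be made after set rules are known. The sketch below is only a guess at the shape of that idea: `SpaceOrigin`, `Elem`, `realize` and the `keep_newline_spaces` flag are hypothetical and do not correspond to Typst's actual SpaceElem.

```rust
/// Hypothetical provenance stored on a space element.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SpaceOrigin {
    /// A literal space typed in the source.
    Literal,
    /// A space produced by a single newline in the source.
    Newline,
}

/// Toy content model (not Typst's real element types).
#[derive(Debug, Clone, PartialEq)]
enum Elem {
    Text(String),
    Space(SpaceOrigin),
}

/// Approximate Han test using common CJK ideograph blocks (an assumption).
fn is_han(c: char) -> bool {
    matches!(c as u32,
        0x3400..=0x4DBF | 0x4E00..=0x9FFF | 0xF900..=0xFAFF | 0x20000..=0x2FA1F)
}

/// Flattens elements to text. A newline-generated space after a Han
/// character is dropped unless `keep_newline_spaces` is set, which
/// stands in for a setting that is only known after evaluation.
fn realize(elems: &[Elem], keep_newline_spaces: bool) -> String {
    let mut out = String::new();
    for elem in elems {
        match elem {
            Elem::Text(text) => out.push_str(text),
            Elem::Space(origin) => {
                let after_han = out.chars().last().map_or(false, is_han);
                let drop = !keep_newline_spaces
                    && *origin == SpaceOrigin::Newline
                    && after_han;
                if !drop {
                    out.push(' ');
                }
            }
        }
    }
    out
}
```

Because the origin travels with the element, the parser does not need to know any set rule, and literal spaces can still be told apart from newline-generated ones later.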

@peng1999
Contributor Author

peng1999 commented Sep 30, 2024

@laurmaedje says:

There aren't any plans to that effect and I would like to avoid it. Can't we retain the necessary information as a field of SpaceElem?

Then I think making the behavior non-configurable is fine. There is no need to add additional complexity.

@YDX-2147483647 says:

I am afraid we have not reached consensus on what to implement? Is the following behavior OK?

#792 (comment)

汉字 分词
技术 English Latin

(fallback): remove spaces generated by \n (after CJ), but keep other literal spaces
汉字 分词技术 English Latin

Yes, that's what I would suggest. It matches the behavior of LuaTeX-ja and requires minimal changes to the current Typst.

But we have to decide on the behavior in some edge cases. What should happen in the following code?

= Case 1 // space is *not* desired here
中文
#footnote[分词]

= Case 2 // space is *not* desired here
*中文*
分词

= Case 3 // space is *not* desired here
#let a = [中文]
#let b = [分词]
#a
#b

= Case 4 // space is desired here
关于
#box[TeX]

= Case 5 // space is desired here, but we can safely
         // strip it because it will be added back during layout
关于
Typst

I think it's OK to produce some undesired results in edge cases, but there should be easy ways for users to work around them.


9 participants