kotlinlang #language-evolution

cketti

06/06/2025, 12:13 PM

Hi. I'd like to have a string escape sequence that makes it easier to include astral code points (value greater than 0xFFFF). So I could write e.g. "\u{1F995}" instead of "\uD83E\uDD95" or "🦕". Granted, often it's preferable to use the last variant. But there are cases where (programming) fonts are missing glyphs or the code point isn't associated with a visible character. In such cases, having to escape each surrogate separately just doesn't make for a great developer experience.

Does this sound like something that has a chance of being added to the language? Or would I be wasting my time if I wrote up a KEEP? (I already have a prototype to add support to the compiler.)

👍 1

➕ 1

cketti

06/06/2025, 12:14 PM

Somewhat related YouTrack issue: https://youtrack.jetbrains.com/issue/KT-36872/Unicode-code-point-literals

CLOVIS

06/06/2025, 12:36 PM

In my experience, these show up rarely enough that "\uD83E\uDD95" is good enough. Would you disagree?

cketti

06/06/2025, 12:37 PM

Obviously. Otherwise I wouldn't have typed the message 🙂

cketti

06/06/2025, 12:40 PM

A lot of other programming languages have added support for an escape sequence that supports non-BMP code points as well.

CLOVIS

06/06/2025, 12:41 PM

Can these characters be represented by a Char?

cketti

06/06/2025, 12:45 PM

No. That's kind of the "source" of the problem. But I'm not suggesting we figure out a way to support '\u{1F995}'. Just "\u{1F995}".
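To illustrate why a single Char is not enough: a Kotlin Char is one 16-bit UTF-16 code unit, so an astral code point takes two of them in a String. A minimal sketch using only today's escape syntax (the variable name is just for illustration):

```kotlin
fun main() {
    // U+1F995 written with the escapes that Kotlin supports today.
    val dino = "\uD83E\uDD95"

    println(dino.length)                // 2: two UTF-16 code units, one code point
    println(dino[0].isHighSurrogate())  // true
    println(dino[1].isLowSurrogate())   // true
    println(dino == "🦕")               // true: same string as the literal character
}
```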

Youssef Shoaib [MOD]

06/06/2025, 12:46 PM

What's the problem with writing the character directly in the text? It feels clearer to me, no?

cketti

06/06/2025, 12:49 PM

I believe the first message already answers that question.

cketti

06/06/2025, 12:50 PM

Imagine you want to use a code point that is currently unassigned. There is no glyph associated with that code point.

CLOVIS

06/06/2025, 12:56 PM

Is that really frequent?

cketti

06/06/2025, 1:00 PM

Does it matter? Most people don't know very much about Unicode or how strings are encoded in their programming language. That doesn't mean you shouldn't make life easier for those who do know such things. (I'm also offering to do most of the work)

CLOVIS

06/06/2025, 1:13 PM

Most of the work? Does that include the PRs to GitHub, GitLab, Bitbucket and all other editors so it displays properly?

cketti

06/06/2025, 1:16 PM

Sure, why not.

CLOVIS

06/06/2025, 1:19 PM

Alternative proposal:


val Long.astral: String get() = TODO()

Instead of:


"\u{1F995}"

you get:


"${0x1F995L.astral}"

That's slightly more verbose, sure, but it is still quite simple, and more importantly doesn't require a language change.
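For reference, the TODO() in that property could be filled in roughly as follows. This is only a sketch of one possible body, not an existing stdlib function; the range check and the decision to pass BMP values through unchanged are assumptions.

```kotlin
// Hypothetical implementation of the extension property proposed above.
val Long.astral: String
    get() {
        require(this in 0L..0x10FFFFL) { "Not a Unicode code point: $this" }
        val codePoint = this.toInt()

        // BMP code points fit into a single Char (surrogate values are not rejected here).
        if (codePoint < 0x10000) return Char(codePoint).toString()

        // Astral code points are split into a UTF-16 surrogate pair.
        val offset = codePoint - 0x10000
        val high = Char(0xD800 + (offset shr 10))
        val low = Char(0xDC00 + (offset and 0x3FF))
        return "$high$low"
    }

fun main() {
    println("${0x1F995L.astral}" == "\uD83E\uDD95") // true
}
```

Unless the compiler treated such a property as an intrinsic, the conversion would run every time the string is built.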

CLOVIS

06/06/2025, 1:20 PM

(If this gets added to the stdlib it will most likely be called .decodeUnicode or similar, I guess.)

cketti

06/06/2025, 1:23 PM

That would add runtime overhead (and deviate from what pretty much every other programming language is doing).

CLOVIS

06/06/2025, 1:25 PM

> That would add runtime overhead

Not necessarily; there are plenty of methods in the stdlib that are intrinsics.

Klitos Kyriacou

06/06/2025, 2:46 PM

> Does that include the PRs to GitHub, GitLab, Bitbucket and all other editors so it displays properly?

Why would they need to do that? If you display code in GitHub that has the line println("\u03C0"), does GitHub display it as println("π")?

cketti

06/06/2025, 2:55 PM

Escape sequences could be highlighted using a different color. But it looks like that's currently not happening. See https://gist.github.com/cketti/db967f067bfc2d227a1f57c224c77be4

JP Sugarbroad

06/06/2025, 8:56 PM

I think adding \U like Python has is pretty reasonable.

cketti

06/06/2025, 9:25 PM

I like \u{…} better. No leading zeros required.

Edoardo Luppi

06/07/2025, 7:39 PM

Being able to write code point literals for all Unicode planes would be cool. The compiler can then handle translating the code point to the UTF-16 surrogate pair, or to whatever other representation the platform uses.
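As a rough illustration of that translation step, here is a runtime sketch that expands \u{…} escapes into UTF-16. The function name and the regex are hypothetical, and it relies on the JVM-only StringBuilder.appendCodePoint; the actual proposal would do this at compile time inside the compiler.

```kotlin
// Hypothetical helper, not part of Kotlin or its standard library.
// Matches a literal backslash, 'u', and 1 to 6 hex digits in braces.
private val braceEscape = Regex("""\\u\{([0-9A-Fa-f]{1,6})\}""")

fun expandBraceEscapes(source: String): String =
    braceEscape.replace(source) { match ->
        val codePoint = match.groupValues[1].toInt(16)
        require(codePoint <= 0x10FFFF) { "Not a Unicode code point: $codePoint" }
        // JVM-only: appends one Char for BMP values, a surrogate pair otherwise.
        StringBuilder().appendCodePoint(codePoint).toString()
    }

fun main() {
    // A raw string keeps the backslash, so the escape reaches the helper unchanged.
    val expanded = expandBraceEscapes("""dino: \u{1F995}""")
    println(expanded == "dino: \uD83E\uDD95") // true
}
```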

Edoardo Luppi

06/07/2025, 7:48 PM

> these show up rarely enough that "\uD83E\uDD95" is good enough

Well, it really depends on the application or library you're developing. For example, when working with Japanese or Chinese you'll definitely go above the BMP, and very often you have to use surrogate pairs, or even composite characters (but those are yet another concept).

Alejandro Serrano.Mena

06/12/2025, 9:10 AM

We've briefly discussed this internally:
• first of all, please update the YouTrack ticket with more information if needed
• the main work here is not on the KEEP or implementation, but actually the Quality Assurance work required to ensure that these new codepoints don't break anything in the compiler

Edoardo Luppi

06/12/2025, 1:26 PM

@Alejandro Serrano.Mena I think https://youtrack.jetbrains.com/issue/KT-36872 looks good as it is. There isn't much to add anyway. The only point to clarify would be how non-BMP codepoints are to be represented, e.g., \u{1F995} or some other format.
