I have a bytearray of UTF 16 encoded bytes read from a cinte kotlinlang #kotlin-native

Skolson5903

01/19/2022, 12:42 AM

I have a bytearray of UTF-16 encoded bytes read from a cinterop library COpaquePointer, and need to convert the ByteArray to a String. But native doesn't seem to have Charsets much less Charsets.UTF_16. ByteArray.toKString() is for UTF8. Is there some simple way to do this? My searches are sucking 🙂

ephemient

01/19/2022, 12:54 AM

it kinda sucks but UTF-16 to UTF-8 conversion isn't too hard, although you'll have to decide what to do with invalid UTF-16 (ignoring BOM, improper use of surrogates)

ephemient

01/19/2022, 1:33 AM

100% untested, but just following the Unicode specifications,


fun UShortArray.utf16ToUtf8(): UByteArray {
    var i = if (this.firstOrNull() == 0xFEFF.toUShort()) 1 else 0 // skip BOM (U+FEFF)
    val bytes = UByteArray((this.size - i) * 3)
    var j = 0
    while (i < this.size) {
        val codepoint = when (val unit = this[i++].toInt()) {
            in Char.MIN_HIGH_SURROGATE.code..Char.MAX_HIGH_SURROGATE.code -> {
                if (i !in this.indices) throw CharacterCodingException() // unpaired high surrogate
                val next = this[i++].toInt()
                if (next !in Char.MIN_LOW_SURROGATE.code..Char.MAX_LOW_SURROGATE.code) {
                    throw CharacterCodingException() // unpaired high surrogate
                }
                // each surrogate carries 10 payload bits; the pair encodes (code point - 0x10000)
                val code = 0x10000 + ((unit and 0x3FF) shl 10 or (next and 0x3FF))
                if (code !in Char.MIN_SUPPLEMENTARY_CODE_POINT..Char.MAX_CODE_POINT) {
                    throw CharacterCodingException() // non-canonical encoding
                }
                code
            }
            in Char.MIN_LOW_SURROGATE.code..Char.MAX_LOW_SURROGATE.code -> {
                throw CharacterCodingException() // unpaired low surrogate
            }
            else -> unit
        }
        when (codepoint) {
            in 0x00..0x7F -> bytes[j++] = codepoint.toUByte()
            in 0x80..0x07FF -> {
                bytes[j++] = 0xC0.or(codepoint and 0x07C0 shr 6).toUByte()
                bytes[j++] = 0x80.or(codepoint and 0x003F).toUByte()
            }
            in 0x0800..0xFFFF -> {
                bytes[j++] = 0xE0.or(codepoint and 0xF000 shr 12).toUByte()
                bytes[j++] = 0x80.or(codepoint and 0x0FC0 shr 6).toUByte()
                bytes[j++] = 0x80.or(codepoint and 0x003F).toUByte()
            }
            in 0x10000..Char.MAX_CODE_POINT -> {
                bytes[j++] = 0xF0.or(codepoint and 0x3C0000 shr 18).toUByte()
                bytes[j++] = 0x80.or(codepoint and 0x03F000 shr 12).toUByte()
                bytes[j++] = 0x80.or(codepoint and 0x000FC0 shr 6).toUByte()
                bytes[j++] = 0x80.or(codepoint and 0x00003F).toUByte()
            }
            else -> throw IllegalStateException()
        }
    }
    return bytes.sliceArray(0 until j)
}

napperley

01/19/2022, 2:45 AM

Use the toKStringFromUtf8 function to convert a UTF-16 byte array to a Kotlin String.

ephemient

01/19/2022, 2:56 AM

oh, now that I check the kotlin.native package… good find, but I'm pretty sure that's a bit wrong, toKStringFromUtf16 is closer to what OP wants

napperley

01/19/2022, 3:00 AM

Looks like things have changed with UTF-16 support in the KotlinX Cinterop library. There used to be a function for doing it as a single operation. The toKStringFromUtf16 function looks like the one but doesn't exist anymore.

ephemient

01/19/2022, 3:08 AM

but come to think of it, kotlin.Char is a UTF-16 unit (maybe not the best choice on native, but it's the only way to be compatible with Java), so actually you should be able to just convert to a CharArray and concatenate
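A minimal sketch of the CharArray idea described above, assuming the 16-bit values are already in host byte order and contain no BOM. The helper name utf16UnitsToString is hypothetical, not part of any library. No validation is done, so unpaired surrogates pass straight through into the String.

```kotlin
// Hypothetical helper: treat each 16-bit value as one UTF-16 code unit.
// Assumes the units are already in host byte order and carry no BOM.
fun ShortArray.utf16UnitsToString(): String {
    // Mask to 16 bits so negative Short values map to the intended code unit.
    val chars = CharArray(size) { i -> (this[i].toInt() and 0xFFFF).toChar() }
    return chars.concatToString()
}
```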

Skolson5903

01/19/2022, 5:18 PM

Yeah, kotlinx.cinterop.toKStringFromUtf16 is exactly what I want but only seems to exist on the JVM (according to the doc I found). Thanks, I'll try the char array approach and see what happens. Surprising there isn't an easy way. They went to the effort of making a native String.utf16 value to get UTF16 bytes from a string, but didn't do the reverse. Hmm, possibly because of endian issues?

Skolson5903

01/19/2022, 7:08 PM

Hmm, did find this, but it doesn't help with endianness. I misread the doc before, found this that does exist:

fun CPointer<ShortVar>.toKStringFromUtf16(): String

which would work if I can copy my COpaquePointer to a CPointer<ShortVar> and add a null terminator. That would strain my wimpy native knowledge, but even if I figured that out it doesn't handle the big endian vs little endian issue (independent of Platform.isLittleEndian). I'm gonna try iterating the bytearray first and handle endianness myself, see how it goes.
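One way to iterate the ByteArray and handle byte order by hand might look like the sketch below. The function name decodeUtf16 and its defaultLittleEndian parameter are assumptions made for illustration. A leading BOM, if present, overrides the default byte order and is dropped from the result. Surrogates are not validated.

```kotlin
// Hypothetical helper: decode UTF-16 bytes with an explicit or BOM-signalled byte order.
fun ByteArray.decodeUtf16(defaultLittleEndian: Boolean = true): String {
    require(size % 2 == 0) { "UTF-16 data must have an even number of bytes" }

    // Combine two bytes at [index] into one 16-bit code unit.
    fun unit(index: Int, littleEndian: Boolean): Int {
        val b0 = this[index].toInt() and 0xFF
        val b1 = this[index + 1].toInt() and 0xFF
        return if (littleEndian) (b1 shl 8) or b0 else (b0 shl 8) or b1
    }

    var littleEndian = defaultLittleEndian
    var start = 0
    if (size >= 2) {
        // Read the first unit as big-endian: 0xFEFF means BE data, 0xFFFE means LE data.
        when (unit(0, littleEndian = false)) {
            0xFEFF -> { littleEndian = false; start = 2 }
            0xFFFE -> { littleEndian = true; start = 2 }
        }
    }
    val chars = CharArray((size - start) / 2) { i ->
        unit(start + 2 * i, littleEndian).toChar()
    }
    return chars.concatToString()
}
```

For example, the bytes 48 00 69 00 decode to "Hi" with the little-endian default, and FE FF 00 48 00 69 decode to "Hi" because the BOM marks them as big-endian.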

napperley

01/20/2022, 1:49 AM

Kotlin Native's charset choice does create some friction with some target platforms that use UTF-8 like Linux for example.

napperley

01/20/2022, 1:50 AM

@Skolson5903 - Which platforms are you targeting with Kotlin Native?

Skolson5903

01/20/2022, 1:53 AM

This is a sqlcipher C library built for macosX64 and iOS, using cinterop to invoke. Have a use case where the contents of the database can contain UTF-16 encoding either big endian or little endian depending on how it was created. I'm just now trying to debug my first unit tests, and am having trouble getting the debugger to honor a breakpoint 🙂

Skolson5903

01/20/2022, 1:55 AM

The same project is using JNI for Android, which works fine, but there UTF-16 is easy to deal with.

napperley

01/20/2022, 1:56 AM

Android is Linux based, which means all strings will use the UTF-8 charset.

ephemient

01/20/2022, 1:57 AM

that's not really true

napperley

01/20/2022, 1:57 AM

Does Android not use the Linux kernel?

Skolson5903

01/20/2022, 1:57 AM

Unless you tell sqlite to use explicit encoding with the pragma defined for that, then you have to deal with whatever is in there 🙂

ephemient

01/20/2022, 1:58 AM

Dalvik strings, like Java, use UTF-16 internally, with MUTF-8 (UTF-8 modified to be more UTF-16-like; similar to WTF-8 in many JS engines, but with different NUL handling) on serialization boundaries

napperley

01/20/2022, 1:59 AM

Doesn't native code on Android (written using C or C++) use the UTF-8 charset instead?

Skolson5903

01/20/2022, 1:59 AM

Yup, so even on Android if some data is encoded UTF-16, still have to decode it with the right endian. Doesn't matter what the default encoding for the platform is. But JVM supports all of that.

ephemient

01/20/2022, 2:00 AM

depends. most direct usage of strings in JNI are MUTF-8, which is not quite UTF-8

napperley

01/20/2022, 2:00 AM

If I am correct, the ART/Dalvik VMs (running apps written in Kotlin/Java) use the UTF-16 charset, and native code (written in C or C++) uses the UTF-8 charset.

ephemient

01/20/2022, 2:01 AM

https://developer.android.com/training/articles/perf-jni#utf-8-and-utf-16-strings

👍 1

ephemient

01/20/2022, 2:02 AM

modified UTF-8 != UTF-8, the MUTF-8 encoding for NUL and for characters in supplementary planes is illegal in UTF-8
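To make the difference concrete, here is a rough sketch of encoding a single code point as modified UTF-8, following the rules described on the linked JNI page. The function name encodeMutf8 is hypothetical. NUL becomes the two-byte sequence C0 80, and a supplementary code point is written as two three-byte sequences, one per surrogate, instead of one four-byte sequence.

```kotlin
// Hypothetical helper: encode one Unicode code point as modified UTF-8 (MUTF-8).
fun encodeMutf8(codePoint: Int): ByteArray {
    require(codePoint in 0..0x10FFFF) { "not a Unicode code point" }

    // Three-byte form, used for U+0800..U+FFFF and for each surrogate half.
    fun threeBytes(u: Int) = byteArrayOf(
        (0xE0 or (u shr 12)).toByte(),
        (0x80 or ((u shr 6) and 0x3F)).toByte(),
        (0x80 or (u and 0x3F)).toByte(),
    )

    return when {
        // Overlong two-byte NUL: illegal in standard UTF-8.
        codePoint == 0 -> byteArrayOf(0xC0.toByte(), 0x80.toByte())
        codePoint < 0x80 -> byteArrayOf(codePoint.toByte())
        codePoint < 0x800 -> byteArrayOf(
            (0xC0 or (codePoint shr 6)).toByte(),
            (0x80 or (codePoint and 0x3F)).toByte(),
        )
        codePoint < 0x10000 -> threeBytes(codePoint)
        else -> {
            // Split into a UTF-16 surrogate pair, then encode each half separately.
            val v = codePoint - 0x10000
            threeBytes(0xD800 or (v shr 10)) + threeBytes(0xDC00 or (v and 0x3FF))
        }
    }
}
```

For U+1F642 this yields the six bytes ED A0 BD ED B9 82, while standard UTF-8 uses the four bytes F0 9F 99 82.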

ephemient

01/20/2022, 2:04 AM

working with real UTF-8 in JNI is honestly a bit painful because Java's UTF-16 strings can contain unpaired surrogates or illegal surrogate pairs
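A small sketch of checking a String for the surrogate problem mentioned above before converting it to real UTF-8. The name hasUnpairedSurrogate is hypothetical. It relies only on the standard Char.isHighSurrogate and Char.isLowSurrogate functions.

```kotlin
// Hypothetical helper: true if the string contains a surrogate that is not
// part of a well-formed high+low pair.
fun String.hasUnpairedSurrogate(): Boolean {
    var i = 0
    while (i < length) {
        val c = this[i]
        when {
            c.isHighSurrogate() -> {
                // A high surrogate must be followed immediately by a low surrogate.
                if (i + 1 >= length || !this[i + 1].isLowSurrogate()) return true
                i += 2
            }
            // A low surrogate reached here has no preceding high surrogate.
            c.isLowSurrogate() -> return true
            else -> i++
        }
    }
    return false
}
```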

ephemient

01/20/2022, 2:06 AM

but it's not too bad if you handle all the conversion on the Java side. that's what's missing in Kotlin/Native…

ephemient

01/20/2022, 4:03 AM

Android IPC is also done in UTF-16: https://source.android.com/devices/architecture/aidl/aidl-annotations#utf8incpp "Strings are always transmitted as UTF16 over the wire."
