I have a bytearray of UTF-16 encoded bytes read fr...
# kotlin-native
s
I have a bytearray of UTF-16 encoded bytes read from a cinterop library COpaquePointer, and need to convert the ByteArray to a String. But native doesn't seem to have Charsets much less Charsets.UTF_16. ByteArray.toKString() is for UTF8. Is there some simple way to do this? My searches are sucking 🙂
e
it kinda sucks but UTF-16 to UTF-8 conversion isn't too hard, although you'll have to decide what to do with invalid UTF-16 (ignoring BOM, improper use of surrogates)
100% untested, but just following the Unicode specifications,
fun UShortArray.utf16ToUtf8(): UByteArray {
    var i = if (this.firstOrNull() == 0xFEFF.toUShort()) 1 else 0 // skip BOM (U+FEFF)
    val bytes = UByteArray((this.size - i) * 3)
    var j = 0
    while (i < this.size) {
        val codepoint = when (val unit = this[i++].toInt()) {
            in Char.MIN_HIGH_SURROGATE.code..Char.MAX_HIGH_SURROGATE.code -> {
                if (i !in this.indices) throw CharacterCodingException() // unpaired high surrogate
                val next = this[i++].toInt()
                if (next !in Char.MIN_LOW_SURROGATE.code..Char.MAX_LOW_SURROGATE.code) {
                    throw CharacterCodingException() // unpaired high surrogate
                }
                // each surrogate carries 10 payload bits; the pair encodes (code point - 0x10000)
                val code = ((unit and 0x3FF) shl 10 or (next and 0x3FF)) + 0x10000
                if (code !in Char.MIN_SUPPLEMENTARY_CODE_POINT..Char.MAX_CODE_POINT) {
                    throw CharacterCodingException() // non-canonical encoding
                }
                code
            }
            in Char.MIN_LOW_SURROGATE.code..Char.MAX_LOW_SURROGATE.code -> {
                throw CharacterCodingException() // unpaired low surrogate
            }
            else -> unit.toInt()
        }
        when (codepoint) {
            in 0x00..0x7F -> bytes[j++] = codepoint.toUByte()
            in 0x80..0x07FF -> {
                bytes[j++] = 0xC0.or(codepoint and 0x07C0 shr 6).toUByte()
                bytes[j++] = 0x80.or(codepoint and 0x003F).toUByte()
            }
            in 0x0800..0xFFFF -> {
                bytes[j++] = 0xE0.or(codepoint and 0xF000 shr 12).toUByte()
                bytes[j++] = 0x80.or(codepoint and 0x0FC0 shr 6).toUByte()
                bytes[j++] = 0x80.or(codepoint and 0x003F).toUByte()
            }
            in 0x10000..Char.MAX_CODE_POINT -> {
                bytes[j++] = 0xF0.or(codepoint and 0x3C0000 shr 18).toUByte()
                bytes[j++] = 0x80.or(codepoint and 0x03F000 shr 12).toUByte()
                bytes[j++] = 0x80.or(codepoint and 0x000FC0 shr 6).toUByte()
                bytes[j++] = 0x80.or(codepoint and 0x00003F).toUByte()
            }
            else -> throw IllegalStateException()
        }
    }
    return bytes.sliceArray(0 until j)
}
n
Use the toKStringFromUtf8 function to convert a UTF-16 byte array to a Kotlin String.
e
oh, now that I check the kotlin.native package… good find, but I'm pretty sure that's a bit wrong, toKStringFromUtf16 is closer to what OP wants
n
Looks like things have changed with UTF-16 support in the KotlinX Cinterop library. There used to be a function for doing it as a single operation. The toKStringFromUtf16 function looks like the one but doesn't exist anymore.
e
but come to think of it, kotlin.Char is a UTF-16 unit (maybe not the best choice on native, but it's the only way to be compatible with Java), so actually you should be able to just convert to a CharArray and concatenate
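A minimal sketch of that CharArray idea, assuming the 16-bit units are already in a ShortArray in the right byte order. The name utf16UnitsToString is made up for illustration, and nothing here validates surrogate pairs:

```kotlin
// Hypothetical helper: treats every Short as one UTF-16 code unit.
// Assumes the units are already in native order; no surrogate validation.
fun ShortArray.utf16UnitsToString(): String {
    // Short.toInt() sign-extends, but Int.toChar() keeps only the low 16 bits,
    // so values above 0x7FFF still map to the intended code unit.
    val chars = CharArray(size) { index -> this[index].toInt().toChar() }
    return chars.concatToString()
}
```

Since kotlin.Char is a UTF-16 unit, a surrogate pair simply becomes two Chars in the resulting String.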
s
Yeah, kotlinx.cinterop.toKStringFromUtf16 is exactly what I want but only seems to exist on the JVM (according to the doc I found). Thanks, I'll try the char array approach and see what happens. Surprising there isn't an easy way. They went to the effort of making a native String.utf16 value to get UTF16 bytes from a string, but didn't do the reverse. Hmm, possibly because of endian issues?
Hmm, did find this, but it doesn't help with endianness. I misread the doc before; this one does exist: fun CPointer<ShortVar>.toKStringFromUtf16(): String, which would work if I can copy my COpaquePointer to a CPointer<ShortVar> and add a null terminator. That would strain my wimpy native knowledge, but even if I figured that out it doesn't handle the big endian vs little endian issue (independent of Platform.isLittleEndian). I'm gonna try iterating the bytearray first and handle endianness myself, see how it goes.
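Iterating the bytes with explicit endianness could look roughly like this. It's only a sketch: decodeUtf16 and defaultBigEndian are made-up names, a leading BOM overrides the default, and unpaired surrogates are passed through unchecked:

```kotlin
// Hypothetical helper: decodes UTF-16 bytes without relying on platform endianness.
// A leading BOM (FE FF = big endian, FF FE = little endian) wins over the default.
fun decodeUtf16(bytes: ByteArray, defaultBigEndian: Boolean = true): String {
    require(bytes.size % 2 == 0) { "UTF-16 data must have an even number of bytes" }
    val b0 = if (bytes.size >= 2) bytes[0].toInt() and 0xFF else -1
    val b1 = if (bytes.size >= 2) bytes[1].toInt() and 0xFF else -1
    val hasBigEndianBom = b0 == 0xFE && b1 == 0xFF
    val hasLittleEndianBom = b0 == 0xFF && b1 == 0xFE
    val bigEndian = when {
        hasBigEndianBom -> true
        hasLittleEndianBom -> false
        else -> defaultBigEndian
    }
    val start = if (hasBigEndianBom || hasLittleEndianBom) 2 else 0 // skip the BOM bytes
    val chars = CharArray((bytes.size - start) / 2) { index ->
        val first = bytes[start + 2 * index].toInt() and 0xFF
        val second = bytes[start + 2 * index + 1].toInt() and 0xFF
        // Combine the two bytes into one 16-bit code unit in the chosen order.
        val unit = if (bigEndian) (first shl 8) or second else (second shl 8) or first
        unit.toChar()
    }
    return chars.concatToString()
}
```

For the sqlite case the default would come from whatever encoding the database reports; that wiring is not shown here.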
n
Kotlin Native's charset choice does create some friction with target platforms that use UTF-8, such as Linux.
@Skolson5903 - Which platforms are you targeting with Kotlin Native?
s
This is a sqlcipher C library built for macosX64 and iOS, using cinterop to invoke it. I have a use case where the contents of the database can contain UTF-16 encoding, either big endian or little endian depending on how it was created. I'm just as we speak trying to debug my first unit tests, and am having trouble getting the debugger to honor a breakpoint 🙂
The same project is using JNI for Android, which works fine, but there UTF-16 is easy to deal with.
n
Android is Linux based, which means all strings will use the UTF-8 charset.
e
that's not really true
n
Does Android not use the Linux kernel?
s
Unless you tell sqlite to use explicit encoding with the pragma defined for that, then you have to deal with whatever is in there 🙂
e
Dalvik strings, like Java, use UTF-16 internally, with MUTF-8 (UTF-8 modified to be more UTF-16-like; similar to WTF-8 in many JS engines, but with different NUL handling) on serialization boundaries
n
Doesn't native code on Android (written using C or C++) use the UTF-8 charset instead?
s
Yup, so even on Android, if some data is encoded as UTF-16, you still have to decode it with the right endianness. Doesn't matter what the default encoding for the platform is. But the JVM supports all of that.
e
depends. most direct usage of strings in JNI is MUTF-8, which is not quite UTF-8
n
If I am correct, the ART/Dalvik VMs (running apps written in Kotlin/Java) use the UTF-16 charset, and native code (written in C or C++) uses the UTF-8 charset.
modified UTF-8 != UTF-8, the MUTF-8 encoding for NUL and for characters in the supplementary planes is illegal in UTF-8
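To make the difference concrete, here is a rough sketch of a modified UTF-8 encoder in plain Kotlin. encodeModifiedUtf8 is a made-up name, not a stdlib or JNI function. It encodes each UTF-16 unit on its own, so NUL becomes C0 80 and a surrogate pair becomes two 3-byte sequences instead of one 4-byte sequence:

```kotlin
// Hypothetical helper illustrating modified UTF-8 (MUTF-8); not a library API.
fun String.encodeModifiedUtf8(): ByteArray {
    val out = ArrayList<Byte>(length)
    for (ch in this) {
        val unit = ch.code // one UTF-16 code unit; surrogates are NOT combined
        when {
            unit in 0x01..0x7F -> out.add(unit.toByte())
            unit <= 0x7FF -> { // also covers NUL, which becomes C0 80
                out.add((0xC0 or (unit shr 6)).toByte())
                out.add((0x80 or (unit and 0x3F)).toByte())
            }
            else -> { // includes each half of a surrogate pair
                out.add((0xE0 or (unit shr 12)).toByte())
                out.add((0x80 or ((unit shr 6) and 0x3F)).toByte())
                out.add((0x80 or (unit and 0x3F)).toByte())
            }
        }
    }
    return out.toByteArray()
}
```

A strict UTF-8 decoder rejects both the overlong C0 80 and the encoded surrogates, which is why the two formats can't be swapped freely.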
working with real UTF-8 in JNI is honestly a bit painful because Java's UTF-16 strings can contain unpaired surrogates or illegal surrogate pairs
but it's not too bad if you handle all the conversion on the Java side. that's what's missing in Kotlin/Native…
Android IPC is also done in UTF-16: https://source.android.com/devices/architecture/aidl/aidl-annotations#utf8incpp "Strings are always transmitted as UTF16 over the wire."