# serialization
z
For a custom format, is there a convention for encoding/decoding something like a null-terminated string? Specifically, I'm struggling with the serial descriptor and deserialization. Is this where `decodeSequentially` comes in? I'm struggling a bit with the documentation.
e
`decodeSequentially() = true` only works with `decodeCollectionSize() != -1`; it's used for collections, not scalars like strings.
what is your format and what are you trying to do?
z
The format is the Skyrim Mod Format, and one of the datatypes in there is a 0-terminated, windows-1252 encoded string. My work in progress is here https://github.com/Zymus/cultoftheancestormoth/blob/master/serialization/src/commo[…]rd/cultoftheancestormoth/serialization/BethesdaBufferEncoder.kt
The format also supports non-0-terminated strings, up to some predetermined maximum, not included in the format.
d
Ah, Bethesda modding. I’ve had some fun with that.
b
You want the encoder/decoder to know whether to serialize a null-terminated string based on the serial descriptor, right? In that case I would use the serial descriptor's annotations: have a `@SerialInfo`-marked `BathesdaNullTerminatedString` annotation, and maybe a `SerialDescriptor.isBathesdaNullTerminatedString` extension property that the encoder/decoder use. (And definitely pick better names for these. I haven't played much of Elder Scrolls 😅)
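A minimal sketch of that idea (the names here are placeholders, matching the `@NullTerminated` shorthand used later in the thread):

```kotlin
import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.SerialInfo
import kotlinx.serialization.descriptors.SerialDescriptor

// Marks a String element as null-terminated in the serial format.
@OptIn(ExperimentalSerializationApi::class)
@SerialInfo
@Target(AnnotationTarget.PROPERTY) // binds to the property, not the constructor parameter (see below)
annotation class NullTerminated

// Per-element lookup the encoder/decoder can call.
@OptIn(ExperimentalSerializationApi::class)
fun SerialDescriptor.isNullTerminated(index: Int): Boolean =
    getElementAnnotations(index).any { it is NullTerminated }
```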
😲 1
z
I haven't looked much into the annotations, but that sounds like the next thing to try. I'll give it a shot tonight.
While experimenting with this, it looks like the annotations don't show up if I don't explicitly add an `@Target` annotation. The documentation mentions it's recommended, but is it actually required?
b
Good observation! I think so, because e.g. in a primary constructor, I'm pretty sure the language doesn't know if the annotation should apply to the constructor parameter or to the class property. The `@SerialInfo` annotation won't get picked up for the generated serializer unless it's applied to a property. And for annotations that can apply to both function parameters and properties, I'm pretty sure parameters take priority, which is a problem in this case. (This is going off memory, but I'm pretty sure it's correct.)
👍 1
and related, but there's a proposal to change this behavior to apply to both by default: KEEP-402
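For illustration, the failure mode in question (the class and property names are hypothetical):

```kotlin
import kotlinx.serialization.Serializable

@Serializable
data class RecordHeader(
    // Without @Target(AnnotationTarget.PROPERTY) on the annotation class,
    // this annotation would bind to the constructor *parameter*, and the
    // plugin-generated serializer would not surface it in the descriptor.
    @NullTerminated val editorId: String,
)
```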
z
So I've got the annotations, and I've got an Encoder able to use and see the annotation in `encodeElement`, would I need to implement `CompositeEncoder` instead of `AbstractEncoder()`, because the abstract one has a `final` implementation of `encodeStringElement`? Otherwise, which component has the context to be able to encode an `@NullTerminated val someString: String`?
b
Another option is overriding `encodeElement`, and storing context in the class's state before `encodeString` is eventually called. Either way, it sounds like you have the right idea.
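A rough sketch of that approach, building on the `NullTerminated` annotation above (the `ByteSink` writer here is an assumed stand-in for whatever output type the format actually uses):

```kotlin
import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.descriptors.SerialDescriptor
import kotlinx.serialization.encoding.AbstractEncoder
import kotlinx.serialization.modules.EmptySerializersModule
import kotlinx.serialization.modules.SerializersModule

// Hypothetical output abstraction; windows-1252 encoding lives inside it.
interface ByteSink {
    fun writeByte(b: Byte)
    fun writeString(s: String)
}

@OptIn(ExperimentalSerializationApi::class)
class BethesdaEncoderSketch(private val output: ByteSink) : AbstractEncoder() {
    override val serializersModule: SerializersModule = EmptySerializersModule()

    // Context captured in encodeElement, consumed by the next encode* call.
    private var nullTerminated = false

    override fun encodeElement(descriptor: SerialDescriptor, index: Int): Boolean {
        nullTerminated = descriptor.isNullTerminated(index)
        return true // true = do encode this element
    }

    override fun encodeString(value: String) {
        output.writeString(value)
        if (nullTerminated) output.writeByte(0)
        nullTerminated = false
    }
}
```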
z
Awesome, thanks again for the guidance, I'll test it out some more
Yup, it looks like that works just fine. It would be more convenient if the element-encoding methods weren't final, but it's not too bad reimplementing the same logic.
The last structure I need feels awkward but might be obvious. I need to be able to prepend the size of data before the data itself. This would be for more complex types, so I'm thinking in the realm of `encodeSerializableValue`. Two encoded ints would have a length of 8, so it would need to encode `[8, [4 bytes], [4 bytes]]`. My current implementation is to serialize this struct independently of the "main" Encoding. Some structs have a maximum size, but I need to encode the actual size before the actual data. The second Encoder would receive `repeat(2) { encodeInt }`, the initial Encoder would receive `repeat(9) { encodeByte }`.
Are there common pitfalls to doing it this way? Mostly in the realm of newing up serializers with separate Encoders mid-encoding. Or is there a more conventional way of prepending the expected payload size?
b
I wouldn't know about pitfalls. I'm also not sure if/how CBOR or Protobuf use length prefixes in their binary formats, but you could maybe look at those implementations in kotlinx.serialization to see if they have anything inspiring.
Just some thoughts on definite-length encoding: I know Java's DataOutputStream's writeUTF calculates the byte length before writing anything, so I'm inclined to think that's the performant way to do it. And since you need more than just string lengths, having a separate lightweight Encoder that just counts the bytes that would be written could be another approach.
Something I do in a binary format I maintain is having reader/writer interfaces that the encoder/decoder use to interact with the data. That keeps the format rules and the data layout separate, and also makes it easy to support different IO types. Or, for you, having a lightweight writer that just counts bytes that would be written. Here's what that looks like for me
🙌 1
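(Not the snippet referenced above, but a minimal sketch of the byte-counting idea, reusing the hypothetical `ByteSink` from earlier:)

```kotlin
// Same interface as the real writer, but it only tallies bytes.
class CountingSink : ByteSink {
    var count: Int = 0
        private set

    override fun writeByte(b: Byte) { count += 1 }

    override fun writeString(s: String) {
        count += s.length // windows-1252 is single-byte, so chars == bytes
    }
}
```

Encode the value once against a `CountingSink`, write the resulting count as the prefix, then encode again against the real sink (or buffer the payload and copy it).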
e
note, you don't actually want to use DataOutputStream because its "UTF" is actually non-standard https://docs.oracle.com/javase/8/docs/api/java/io/DataInput.html#modified-utf-8
Yeah, the protobuf format would probably be a reasonable model to follow. You want a layer that handles the "tokens" of the serial format: e.g. a whole null-terminated string, or a whole length-prefixed string, or a sized numeric value, etc. The serialization format's encoder and decoder use the appropriate methods based on the serial descriptor of the value they are working on (which may include annotations).
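One possible shape for that token layer (all names illustrative):

```kotlin
// The encoder/decoder never touch raw bytes, only whole format tokens.
interface BethesdaTokenWriter {
    fun writeNullTerminatedString(value: String)
    fun writeLengthPrefixedString(value: String, maxLength: Int)
    fun writeInt32(value: Int)
    fun writeByte(value: Byte)
}
```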
z
Hmm, with some of the existing serialization formats, I think I'm losing the boundary between the serializer and the format-specific reader/writer pattern. The existing formats seem to follow a pattern of:
• Format implements either a binary or string format
• Format creates the ultimate data response used by the client (`ByteArrayOutput`, `.toString`)
• Format uses the `serializersModule` to determine which serializers to use
• Format creates the `(En|De)coder` instance, which uses a `(Read|Writ)er` wrapping/returning a string/byte array
• Encoder uses the format's `serializersModule`
• Encoder primarily uses the `SerialDescriptor` and `SerialInfo` annotations for decision making (like the one mentioned earlier in the thread)
• Encoder overrides primitive and serializable encoding methods
• Encoder manipulates the state of the `Writer`, using format-specific types and constructs in the `Writer`
• Writer will, based on the request, manipulate the underlying data structure (string builder, byte array output)
• Depending on the format-specific type, additional data may be written before or after the call-specific data.
  ◦ For example, in Protobuf, `encodeInt` calls `encodeTaggedInt`, which actually writes two ints to the underlying data structure; so a call to `encodeInt(26)` would instead be translated to `encodeInt(some tag) + encodeInt(26)`
I understand the Reader/Writer as an Anti-Corruption Layer between Kotlin Serialization and whatever the underlying structure is. But isn't this last step also a serialization? In order to encode this higher-level construct (tagged int, rather than primitive int), it still needs to determine the order of components (first Tag, then Value), and encode them in the data structure (int, int). As an example, Protobuf could decide tomorrow that Tags are now Shorts, and Values are now little-endian, but technically the outermost call from the Format doesn't know about that. It's still just going to boil down to `encodeInt` calls from the Serializer's perspective, and the Encoder is going to translate those to `encodeTaggedInt` calls, which get translated to whatever the format needs.
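Sketched out, that layering might look like this (purely illustrative, reusing the hypothetical token writer from earlier in the thread):

```kotlin
// The serializer only ever sees encodeInt(26); the tag-then-value ordering
// lives one layer down, and the wire details (tag width, endianness) live
// in the token writer. Either can change without touching the serializer.
class TaggedIntEncoderSketch(private val tokens: BethesdaTokenWriter) {
    var currentTag: Int = 0

    fun encodeInt(value: Int) {
        tokens.writeInt32(currentTag) // tag first
        tokens.writeInt32(value)      // then the value
    }
}
```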
today i learned 1
As an additional example, we can look at the Json Encoder. By default, the Json Encoder encodes ints as signed values. If it encounters a descriptor for an unsigned integer, a new, different Encoder is used: one whose `encodeInt` implementation doesn't write a signed int, but instead an unsigned int as an unquoted string. Here, it seems like the Encoder is the correct place to put format-specific primitive encoding decisions, going as far as having a separate encoder for format-specific representations of primitive data types.
However, the Serializable value itself can be represented differently with different Serializers. Some of the examples for a `Color` type are `"#fff"`, `{"r":0.1,"g":1.0,"b":0.5}`, and `red`. These Serializers would vary in their calls to the Encoder, depending on the serial representation they provide. The Json example above would probably start with a `beginStructure` call; the string could be `encodeString`, or `encodeInline`, or repeated calls to `encodeByte`, or a custom `encodeHex`, or an `encodeInt` that actually calls an `encodeQuotedHexString`.
So my question is: if format-specific data types require format-specific representations, where do we draw the line between the responsibility of format-specific decisions? How would I know to choose
```kotlin
override fun serialize(encoder: Encoder, value: NullTerminatedString) {
    encoder.encodeString(value.string)
}
// ...
override fun encodeString(value: String) {
    output.writeString(value)
    output.writeByte(0)
}
```
over
```kotlin
override fun serialize(encoder: Encoder, value: NullTerminatedString) {
    encoder.encodeString(value.string)
    encoder.encodeByte(0)
}
// ...
override fun encodeString(value: String) {
    output.writeString(value)
}

override fun encodeByte(value: Byte) {
    output.writeByte(value)
}
```
Or another option: if encoding the value correctly would imply some order or structure on the format-specific representation, is there anything wrong with repeatedly bouncing between Serializer and Encoder? As an extreme example: Serializer => Encoder => Serializer => Encoder => Serializer => Encoder => Writer => StringBuilder.
I think I found the answer to this last part; it was right in front of my face. The `Encoder` interface actually has a method `encodeSerializableValue`, which delegates encoding of that value to a different Serializer.
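For example (with a hypothetical wrapper type, delegating to the built-in String serializer):

```kotlin
import kotlinx.serialization.KSerializer
import kotlinx.serialization.builtins.serializer
import kotlinx.serialization.descriptors.PrimitiveKind
import kotlinx.serialization.descriptors.PrimitiveSerialDescriptor
import kotlinx.serialization.descriptors.SerialDescriptor
import kotlinx.serialization.encoding.Decoder
import kotlinx.serialization.encoding.Encoder

// Hypothetical wrapper type whose serializer hands the real work to
// another serializer via encodeSerializableValue/decodeSerializableValue.
data class EditorId(val value: String)

object EditorIdSerializer : KSerializer<EditorId> {
    override val descriptor: SerialDescriptor =
        PrimitiveSerialDescriptor("EditorId", PrimitiveKind.STRING)

    override fun serialize(encoder: Encoder, value: EditorId) =
        encoder.encodeSerializableValue(String.serializer(), value.value)

    override fun deserialize(decoder: Decoder): EditorId =
        EditorId(decoder.decodeSerializableValue(String.serializer()))
}
```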
p
@Zyle Moore Sorry for replying in an old thread, but I have the exact same questions about format-specific types. Maybe you have some advice, from your perspective, on whether it's better to lean into logic-heavy encoders, format-aware serializers, or maybe doing both is necessary?
z
I don't have a great answer unfortunately. A mix of both is what I went with. I took a few stabs at modeling what the file format would look like. Drawing lines around the borders between elements helped.
In my case, eventually I got to wanting to get Descriptor info, but because I subclassed from `AbstractDecoder`, I didn't get access to `decodeStringElement`, just `decodeString`. But when it came to polymorphic serialization, I also couldn't control how it determined polymorphism. The `PolymorphicSerializer` will read element `0` using `decodeStringElement`, and then use that string value to look up the actual serializer to use.
So my Decoder had to either decide if it should return a new CompositeDecoder that could only read a fixed-length string, or stop subclassing `AbstractDecoder` and implement both `Decoder` and `CompositeDecoder`.
I went with dropping the AbstractDecoder. But I still have to keep and manage a reference to the last SerialDescriptor that was used in the `beginStructure` call. You can see some of the initial code here: https://github.com/Zymus/cultoftheancestormoth/pull/10/files#diff-7dfffa95a85d039e2877170679b0b4c8626f1a545e7fbf8c1b06a6028ecad46cR118
That way was pretty slow though, partly because of the fixed-length aspect. So I improved it a bit, but it's still kind of bare bones. https://github.com/Zymus/cultoftheancestormoth/pull/10/files#diff-e1be5745f90a2b57bbfce814a321201345e978d123b361519da8c9f2d2739548R35
I would say that the descriptor is the API between Encoder and Serializer. The serializer should take care of the ordering/elements. That means, though, that any branching in the serializer has to translate to some expected `encode*` call.
An awkward and unclear part for me was what should be encodable, and what should be serializable. The encoder knows about structure and elements. The serializer knows about types, and how to turn them into structures and elements. I initially fell down a hole of "codec elements". I tried to encapsulate what could be encoded (int, boolean, char, inline, string, etc.). They were mostly primitives, so I thought that if an 8-byte value could be a long, then an 8-byte structure wouldn't be too bad to have as a separate `encode*` method. This was similar to some of the other known encoders, like Json, since they have `encodeJsonElement` methods.
But I needed so many of those elements, my encoder became huge: `encodeTypeTag`, `encodeFieldValue`, `encodeRecordSize`, `encodeGroupFlags`, etc.
I realized though that the type of encoder was actually more important than I thought initially. Specifically, how the encoder reacts to those various method calls. The encoder is going to manage effects in some external-to-serialization kind of way. For instance, a `ConsoleEncoder` or `PrintEncoder` could react to `encodeInt` by calling `println(intValue)`. There was some kind of natural way of dealing with the requests. With the Kotlinx IO Buffer I used, there were `writeIntLe` and `readDouble` I could use to react to those encode methods. There was external state that persisted between calls.
Another example someone gave me was a `ListEncoder`, which would go `encodeInt => list.add(intValue)`. This became more apparent when I saw that the format classes in the existing serializers created state objects (`JsonStringWriter`) which were used to create the encoders (like `JsonEncoder(JsonStringWriter())`).
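That `ListEncoder` is essentially the example from the kotlinx.serialization formats guide: the whole encoder is external state plus one funnel method.

```kotlin
import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.encoding.AbstractEncoder
import kotlinx.serialization.modules.EmptySerializersModule
import kotlinx.serialization.modules.SerializersModule

// Every primitive encode* call funnels into encodeValue by default,
// which here just appends to the externally visible list.
@OptIn(ExperimentalSerializationApi::class)
class ListEncoder : AbstractEncoder() {
    val list = mutableListOf<Any>()
    override val serializersModule: SerializersModule = EmptySerializersModule()
    override fun encodeValue(value: Any) { list.add(value) }
}
```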
Knowing I had to split it up into (Serializer) <> (Encoder) <> (Encoded Element Consumer) helped a lot.
Technically, the serializer just has to know about types. But format-specific types, I think, should be indistinguishable to the serializer, since the calls end up the same: `encodeStructure, encodeIntElement, encodeStringElement, endStructure`. Or rather, the serializer will treat `MyFormatType` and `Pair<String, String>` the same. It doesn't know that one is for your format and one is a standard type. They're both just serializable things.
Another thing that was important for me: I came from a tech stack where things like `JsonSerializer.Serialize` were something I could call statically, and there was only ever really the one representation, the one derived automatically from the class. But Kotlin Serializers (and descriptors) are kind of more like HTTP Resources. What I mean is, HTTP Resources can have lots of different representations. `POST /mypage` could `Accept` HTML, or CSS, or Javascript, or JSON, or XML, or CSV, or PNG, or JPG. That's 7 or more representations of the `/mypage` resource. These are kind of the same, in that, if you don't use the plugin-generated Serializer, you have to manage that specific representation yourself.
If you had a `Point2D(x, y)` class, and you're totally fine with the encoder using everything as-is (types, order of elements), you get that out of the box. If you wanted to normalize the x and y, or swap the order of the fields, or include additional metadata, or map it to a different type and provide a default, you'd have to do all of that in the serializer, because it's concerned with the number of elements, the types, and the order of elements. You can also have multiple Serializers for the same type, like `BackwardsStringSerializer : KSerializer<String>` and `UppercaseStringSerializer : KSerializer<String>`.
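As a sketch, the first of those might look like this (the uppercase one is analogous):

```kotlin
import kotlinx.serialization.KSerializer
import kotlinx.serialization.descriptors.PrimitiveKind
import kotlinx.serialization.descriptors.PrimitiveSerialDescriptor
import kotlinx.serialization.descriptors.SerialDescriptor
import kotlinx.serialization.encoding.Decoder
import kotlinx.serialization.encoding.Encoder

// An alternative representation of a plain String: stored reversed.
object BackwardsStringSerializer : KSerializer<String> {
    override val descriptor: SerialDescriptor =
        PrimitiveSerialDescriptor("BackwardsString", PrimitiveKind.STRING)

    override fun serialize(encoder: Encoder, value: String) =
        encoder.encodeString(value.reversed())

    override fun deserialize(decoder: Decoder): String =
        decoder.decodeString().reversed()
}
```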
Finally, for some of the existing formats like JSON, you actually don't need to implement that much of the Encoder. Every Json element basically boils down to a quoted or unquoted string, and so `encodeInt` in the JsonEncoder effectively just calls `encodeString`. There's nothing more specific you need to do, because `someInt.toString()` is going to be the correct string for that value.
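A toy sketch of that funneling (not the real JsonEncoder, just the shape of the idea):

```kotlin
import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.encoding.AbstractEncoder
import kotlinx.serialization.modules.EmptySerializersModule
import kotlinx.serialization.modules.SerializersModule

// In a text format, most primitives can route through encodeString.
@OptIn(ExperimentalSerializationApi::class)
class TextishEncoder(private val sb: StringBuilder) : AbstractEncoder() {
    override val serializersModule: SerializersModule = EmptySerializersModule()

    override fun encodeString(value: String) { sb.append(value) }
    override fun encodeInt(value: Int) = encodeString(value.toString())
    override fun encodeLong(value: Long) = encodeString(value.toString())
    override fun encodeBoolean(value: Boolean) = encodeString(value.toString())
}
```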
p
Thank you very much for such a detailed answer! It corresponds with some of my initial ideas and I've definitely got a lot more food for thought. And BTW the tool you're making is pretty cool, thanks for supporting modding 🫡