# serialization
z
For a custom format, is there a convention for encoding/decoding something like a null-terminated string? Specifically, I'm struggling with the serial descriptor and deserialization. Is this where decodeSequentially comes in? I'm struggling a bit with the documentation.
e
`decodeSequentially() = true` only works with `decodeCollectionSize() != -1`; it's used for collections, not scalars like strings.
What is your format, and what are you trying to do?
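A rough sketch of that collection case (class and field names are invented, and it assumes an int length prefix in the stream; `AbstractDecoder` is the experimental base class from kotlinx.serialization):

```kotlin
import java.io.DataInput
import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.descriptors.SerialDescriptor
import kotlinx.serialization.encoding.AbstractDecoder
import kotlinx.serialization.encoding.CompositeDecoder
import kotlinx.serialization.modules.EmptySerializersModule
import kotlinx.serialization.modules.SerializersModule

@OptIn(ExperimentalSerializationApi::class)
class SketchListDecoder(private val input: DataInput) : AbstractDecoder() {
    override val serializersModule: SerializersModule = EmptySerializersModule()
    private var size = 0
    private var elementIndex = 0

    // decodeSequentially() is only honoured when the collection size is known
    // up front, i.e. decodeCollectionSize() returns something other than -1.
    override fun decodeSequentially(): Boolean = true

    override fun decodeCollectionSize(descriptor: SerialDescriptor): Int {
        size = input.readInt() // assumed layout: the size is a length prefix in the stream
        return size
    }

    // Fallback path, only used when decodeSequentially() returns false.
    override fun decodeElementIndex(descriptor: SerialDescriptor): Int =
        if (elementIndex == size) CompositeDecoder.DECODE_DONE else elementIndex++

    override fun decodeInt(): Int = input.readInt()
}
```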
z
The format is the Skyrim Mod Format, and one of the datatypes in there is a 0-terminated, windows-1252 encoded string. My work in progress is here https://github.com/Zymus/cultoftheancestormoth/blob/master/serialization/src/commo[…]rd/cultoftheancestormoth/serialization/BethesdaBufferEncoder.kt
The format also supports non-0-terminated strings, up to some predetermined maximum length that isn't itself encoded in the format.
d
Ah, Bethesda modding. I’ve had some fun with that.
b
You want the encoder/decoder to know whether to serialize a null-terminated string based on the serial descriptor, right? In that case I would use the serial descriptor's `annotations`: have a `@SerialInfo`-marked `BethesdaNullTerminatedString` annotation, and maybe a `SerialDescriptor.isBethesdaNullTerminatedString` extension property that the encoder/decoder use. (And definitely pick better names for these. I haven't played much of Elder Scrolls 😅)
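A minimal sketch of that idea, using the placeholder names from the message above. It uses an index-based extension function rather than a property, since element-level `@SerialInfo` annotations are exposed through the owning class's descriptor:

```kotlin
import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.SerialInfo
import kotlinx.serialization.descriptors.SerialDescriptor

// Placeholder name; marks a String property as null-terminated in the output.
@OptIn(ExperimentalSerializationApi::class)
@SerialInfo
@Target(AnnotationTarget.PROPERTY)
annotation class BethesdaNullTerminatedString

// The encoder/decoder can check the annotation per element index of the
// owning descriptor.
@OptIn(ExperimentalSerializationApi::class)
fun SerialDescriptor.isBethesdaNullTerminatedString(index: Int): Boolean =
    getElementAnnotations(index).any { it is BethesdaNullTerminatedString }
```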
z
I haven't looked much into the annotations, but that sounds like the next thing to try. I'll give it a shot tonight.
While experimenting with this, it looks like the annotations don't show up if I don't explicitly add an `@Target` annotation. The documentation mentions it's recommended, but is it actually required?
b
Good observation! I think so, because in a primary constructor, for example, the language doesn't know whether the annotation should apply to the constructor parameter or to the class property. The `@SerialInfo` annotation won't get picked up by the generated serializer unless it's applied to the property. And for annotations that can apply to both function parameters and properties, I'm pretty sure parameters take priority, which is a problem in this case. (This is going off memory, but I'm pretty sure it's correct.)
And related: there's a proposal to change this behavior so the annotation applies to both by default: KEEP-402
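For example (hypothetical record type): with `@Target(AnnotationTarget.PROPERTY)` on the annotation there's no parameter/property ambiguity, so the generated serializer picks it up from the constructor declaration:

```kotlin
import kotlinx.serialization.Serializable

// Hypothetical type, just to show the annotation applied to a constructor property.
@Serializable
data class RecordHeader(
    @BethesdaNullTerminatedString val editorId: String,
    val formId: Int,
)
```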
z
So I've got the annotations, and I've got an Encoder able to use and see the annotation in `encodeElement`. Would I need to implement `CompositeEncoder` instead of `AbstractEncoder()`, because the abstract one has a `final` implementation of `encodeStringElement`? Otherwise, which component has the context to be able to encode an `@NullTerminated val someString: String`?
b
Another option is overriding `encodeElement` and storing context in the class's state before `encodeString` is eventually called. Either way, it sounds like you have the right idea.
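A sketch of that "store context in `encodeElement`" approach, reusing the `BethesdaNullTerminatedString` annotation sketched earlier; the class name, charset handling, and int layout are assumptions, not the real format rules:

```kotlin
import java.nio.charset.Charset
import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.descriptors.SerialDescriptor
import kotlinx.serialization.encoding.AbstractEncoder
import kotlinx.serialization.modules.EmptySerializersModule
import kotlinx.serialization.modules.SerializersModule

@OptIn(ExperimentalSerializationApi::class)
class SketchEncoder(private val output: MutableList<Byte>) : AbstractEncoder() {
    override val serializersModule: SerializersModule = EmptySerializersModule()
    private val windows1252 = Charset.forName("windows-1252")

    // Set per element here, consumed later when encodeString() is eventually called.
    private var nextStringIsNullTerminated = false

    override fun encodeElement(descriptor: SerialDescriptor, index: Int): Boolean {
        nextStringIsNullTerminated = descriptor.getElementAnnotations(index)
            .any { it is BethesdaNullTerminatedString }
        return true
    }

    override fun encodeString(value: String) {
        value.toByteArray(windows1252).forEach { output.add(it) }
        if (nextStringIsNullTerminated) output.add(0)
    }

    override fun encodeByte(value: Byte) {
        output.add(value)
    }

    override fun encodeInt(value: Int) {
        // Assumed layout: 4 bytes, little-endian.
        repeat(4) { output.add(((value shr (8 * it)) and 0xFF).toByte()) }
    }
}
```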
z
Awesome, thanks again for the guidance, I'll test it out some more
Yup, it looks like that works just fine. It would be more convenient if the element-level encode method weren't final, but it's not too bad reimplementing the same logic.
The last structure I need feels awkward but might be obvious. I need to be able to prepend the size of the data before the data itself. This would be for more complex types, so I'm thinking in the realm of `encodeSerializableValue`. Two encoded ints would have a length of 8, so it would need to encode `[8, [4 bytes], [4 bytes]]`. My current implementation is to serialize this struct independently of the "main" Encoding. Some structs have a maximum size, but I need to encode the actual size before the actual data. The second Encoder would receive `repeat(2) { encodeInt }`, the initial Encoder would receive `repeat(9) { encodeByte }`.
Are there common pitfalls to doing it this way? Mostly in the realm of newing up serializers with separate Encoders mid-encoding. Or is there a more conventional way of prepending the expected payload size?
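For reference, the "serialize this struct independently" approach described here could look roughly like the following, building on the `SketchEncoder` above. The one-byte size prefix is an assumption, mirroring the `[8, [4 bytes], [4 bytes]]` example:

```kotlin
import kotlinx.serialization.SerializationStrategy

// Sketch only: encode the nested struct with its own short-lived encoder first,
// then write the actual size, followed by the already-encoded payload.
fun <T> encodeSizePrefixed(main: SketchEncoder, strategy: SerializationStrategy<T>, value: T) {
    val scratch = mutableListOf<Byte>()
    SketchEncoder(scratch).encodeSerializableValue(strategy, value)

    main.encodeByte(scratch.size.toByte())
    scratch.forEach { main.encodeByte(it) }
}
```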
b
I wouldn't know about pitfalls. I'm also not sure if/how CBOR or Protobuf use length prefixes in their binary formats, but you could look at those implementations in kotlinx.serialization to see if they have anything inspiring.
Just some thoughts on definite-length encoding: I know Java's DataOutputStream's writeUTF calculates the byte length before writing anything, so I'm inclined to think that's the performant way to do it. And since you need more than just string lengths, having a separate lightweight Encoder that just counts the bytes that would be written could be another approach.
Something I do in a binary format I maintain is having reader/writer interfaces that the encoder/decoder use to interact with the data. That keeps the format rules and the data layout separate, and also makes it easy to support different IO types. Or, for you, having a lightweight writer that just counts the bytes that would be written. Here's what that looks like for me
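The snippet referenced at the end of that message isn't shown in this log; purely as an illustration of the "writer that just counts bytes" idea, with an invented interface and names:

```kotlin
// Invented writer abstraction: the encoder only talks to BinaryWriter, so the
// same encoding logic can either produce bytes or merely measure them.
interface BinaryWriter {
    fun writeByte(value: Byte)
    fun writeInt(value: Int)
    fun writeBytes(value: ByteArray)
}

class CountingWriter : BinaryWriter {
    var bytesWritten: Int = 0
        private set

    override fun writeByte(value: Byte) { bytesWritten += 1 }
    override fun writeInt(value: Int) { bytesWritten += 4 }
    override fun writeBytes(value: ByteArray) { bytesWritten += value.size }
}
```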
e
Note: you don't actually want to use DataOutputStream, because its "UTF" is actually non-standard https://docs.oracle.com/javase/8/docs/api/java/io/DataInput.html#modified-utf-8
Yeah, the Protobuf format would probably be a reasonable model to follow. You want a layer that handles the "tokens" of the serial format, e.g. a whole null-terminated string, or a whole length-prefixed string, or a sized numeric value, etc. The serialization format's encoder and decoder then use the appropriate methods based on the serial descriptor of the value they're working on (which may include annotations).
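A sketch of such a token layer; the method names and the length-prefix width are invented, and windows-1252 follows the format description earlier in the thread:

```kotlin
import java.io.DataOutput
import java.nio.charset.Charset

// Invented token-level writer: each method writes one whole token of the format,
// and the encoder picks a method based on the serial descriptor and its annotations.
// (Endianness is ignored here; DataOutput writes big-endian.)
class TokenWriter(
    private val out: DataOutput,
    private val charset: Charset = Charset.forName("windows-1252"),
) {
    fun writeNullTerminatedString(value: String) {
        out.write(value.toByteArray(charset))
        out.writeByte(0)
    }

    fun writeLengthPrefixedString(value: String) {
        val bytes = value.toByteArray(charset)
        out.writeShort(bytes.size) // prefix width is an assumption
        out.write(bytes)
    }

    fun writeInt32(value: Int) = out.writeInt(value)
}
```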
z
Hmm, with some of the existing serialization formats, I think I'm losing the boundary between the serializer and the format-specific reader/writer pattern. The existing formats seem to follow a pattern of:
• Format implements either a binary or a string format
• Format creates the ultimate data response used by the client (ByteArrayOutput, .toString)
• Format uses the `serializersModule` to determine which serializers to use
• Format creates the `(En|De)coder` instance, which uses a `(Read|Writ)er` wrapping/returning a string/byte array
• Encoder uses the format's `serializersModule`
• Encoder primarily uses the `SerialDescriptor` and `SerialInfo` annotations for decision making (like the one mentioned earlier in the thread)
• Encoder overrides the primitive and serializable encoding methods
• Encoder manipulates the state of the `Writer`, using format-specific types and constructs in the `Writer`
• Writer will, based on the request, manipulate the underlying data structure (string builder, byte array output)
• Depending on the format-specific type, additional data may be written before or after the call-specific data.
    ◦ For example, in Protobuf, `encodeInt` calls `encodeTaggedInt`, which actually writes two ints to the underlying data structure; so a call to `encodeInt(26)` would instead be translated to `encodeInt(some tag) + encodeInt(26)`
I understand the Reader/Writer as an Anti-Corruption Layer between Kotlin Serialization and whatever the underlying structure is. But isn't this last step also a serialization? In order to encode this higher-level construct (a tagged int, rather than a primitive int), it still needs to determine the order of the components (first Tag, then Value) and encode them into the data structure (int, int). As an example, Protobuf could decide tomorrow that Tags are now Shorts and Values are now little-endian, but technically the outermost call from the Format doesn't know about that. It's still just going to boil down to `encodeInt` calls from the Serializer's perspective, and the Encoder is going to translate those to `encodeTaggedInt` calls, which get translated to whatever the format needs.
As an additional example, we can look at the Json Encoder. By default, the Json Encoder encodes ints as signed values. If it encounters a descriptor for an unsigned integer, a new, different Encoder is used, one whose `encodeInt` implementation doesn't write a signed int but instead writes the unsigned int as an unquoted string. Here, it seems like the Encoder is the correct place to put format-specific primitive encoding decisions, going as far as having a separate encoder for format-specific representations of primitive data types.
However, the Serializable value itself can be represented differently with different Serializers. Some of the examples for a `Color` type are `"#fff"`, `{"r":0.1,"g":1.0,"b":0.5}`, and `red`. These Serializers would vary in their calls to the Encoder, depending on the serial representation they provide. The Json example above would probably start with a `beginStructure` call; the string could be `encodeString`, or `encodeInline`, or repeated calls to `encodeByte`, or a custom `encodeHex`, or an `encodeInt` that actually calls an `encodeQuotedHexString`.
So my question is: if format-specific data types require format-specific representations, where do we draw the line between the responsibilities for format-specific decisions? How would I know to choose
```kotlin
override fun serialize(encoder: Encoder, value: NullTerminatedString) {
    encoder.encodeString(value.string)
}
// ...
override fun encodeString(value: String) {
    output.writeString(value)
    output.writeByte(0)
}
```
over
```kotlin
override fun serialize(encoder: Encoder, value: NullTerminatedString) {
    encoder.encodeString(value.string)
    encoder.encodeByte(0)
}
// ...
override fun encodeString(value: String) {
    output.writeString(value)
}

override fun encodeByte(value: Byte) {
    output.writeByte(value)
}
```
Or another option: if encoding the value correctly would imply some order or structure on the format-specific representation, is there anything wrong with repeatedly bouncing between Serializer and Encoder? As an extreme example: Serializer => Encoder => Serializer => Encoder => Serializer => Encoder => Writer => StringBuilder.
I think I found the answer to this last part; it was right in front of my face. The `Encoder` interface actually has a method, `encodeSerializableValue`, which delegates encoding of that value to a different Serializer.
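For example (an invented wrapper type, just to show the delegation; the descriptor is borrowed from the built-in String serializer for brevity):

```kotlin
import kotlinx.serialization.KSerializer
import kotlinx.serialization.builtins.serializer
import kotlinx.serialization.descriptors.SerialDescriptor
import kotlinx.serialization.encoding.Decoder
import kotlinx.serialization.encoding.Encoder

data class NullTerminatedString(val string: String)

object NullTerminatedStringSerializer : KSerializer<NullTerminatedString> {
    private val delegate = String.serializer()
    override val descriptor: SerialDescriptor = delegate.descriptor

    override fun serialize(encoder: Encoder, value: NullTerminatedString) {
        // Serializer -> Encoder -> (delegated) Serializer -> Encoder -> Writer ...
        encoder.encodeSerializableValue(delegate, value.string)
    }

    override fun deserialize(decoder: Decoder): NullTerminatedString =
        NullTerminatedString(decoder.decodeSerializableValue(delegate))
}
```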