# serialization
z
For a custom format, is there a convention for encoding/decoding something like a null-terminated string? Specifically, I'm struggling with the serial descriptor and deserialization. Is this where `decodeSequentially` comes in? I'm struggling a bit with the documentation.
e
`decodeSequentially() = true` only works with `decodeCollectionSize() != -1`; it's used for collections, not scalars like strings.
what is your format and what are you trying to do?
z
The format is the Skyrim Mod Format, and one of the datatypes in there is a 0-terminated, windows-1252 encoded string. My work in progress is here https://github.com/Zymus/cultoftheancestormoth/blob/master/serialization/src/commo[…]rd/cultoftheancestormoth/serialization/BethesdaBufferEncoder.kt
The format also supports non-0-terminated strings, up to some predetermined maximum, not included in the format.
d
Ah, Bethesda modding. I’ve had some fun with that.
b
You want the encoder/decoder to know whether to serialize a null-terminated string based on the serial descriptor, right? In that case I would use the serial descriptor's annotations: have a `@SerialInfo`-marked `BathesdaNullTerminatedString` annotation, and maybe a `SerialDescriptor.isBathesdaNullTerminatedString` extension property that the encoder/decoder use. (And definitely pick better names for these. I haven't played much of Elder Scrolls 😅)
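A minimal sketch of that idea (the names here are placeholders, matching the `@NullTerminated` shorthand used later in the thread):

```kotlin
import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.SerialInfo
import kotlinx.serialization.descriptors.SerialDescriptor

// Marks a String element as null-terminated in the serial format.
@OptIn(ExperimentalSerializationApi::class)
@SerialInfo
@Target(AnnotationTarget.PROPERTY) // binds to the property, not the constructor parameter (see below)
annotation class NullTerminated

// Per-element lookup the encoder/decoder can call.
@OptIn(ExperimentalSerializationApi::class)
fun SerialDescriptor.isNullTerminated(index: Int): Boolean =
    getElementAnnotations(index).any { it is NullTerminated }
```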
😲 1
z
I haven't looked much into the annotations, but that sounds like the next thing to try. I'll give it a shot tonight.
While experimenting with this, it looks like the annotations don't show up if I don't explicitly add an `@Target` annotation. The documentation mentions it's recommended, but is it actually required?
b
Good observation! I think so, because e.g. in a primary constructor, I'm pretty sure the language doesn't know if the annotation should apply to the constructor parameter or to the class property. The `@SerialInfo` annotation won't get picked up for the generated serializer unless it's applied to a property. And for annotations that can apply to both function parameters and properties, I'm pretty sure parameters take priority, which is a problem in this case. (This is going off memory, but I'm pretty sure it's correct.)
👍 1
and related, but there's a proposal to change this behavior to apply to both by default: KEEP-402
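For illustration, the failure mode in question (the class and property names are hypothetical):

```kotlin
import kotlinx.serialization.Serializable

@Serializable
data class RecordHeader(
    // Without @Target(AnnotationTarget.PROPERTY) on the annotation class,
    // this annotation would bind to the constructor *parameter*, and the
    // plugin-generated serializer would not surface it in the descriptor.
    @NullTerminated val editorId: String,
)
```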
z
So I've got the annotations, and I've got an Encoder able to use and see the annotation in `encodeElement`, would I need to implement `CompositeEncoder` instead of `AbstractEncoder()`, because the abstract one has a `final` implementation of `encodeStringElement`? Otherwise, which component has the context to be able to encode an `@NullTerminated val someString: String`?
b
Another option is overriding `encodeElement`, and storing context in the class's state before `encodeString` is eventually called. Either way, it sounds like you have the right idea.
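A rough sketch of that approach, building on the `NullTerminated` annotation above (the `ByteSink` writer here is an assumed stand-in for whatever output type the format actually uses):

```kotlin
import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.descriptors.SerialDescriptor
import kotlinx.serialization.encoding.AbstractEncoder
import kotlinx.serialization.modules.EmptySerializersModule
import kotlinx.serialization.modules.SerializersModule

// Hypothetical output abstraction; windows-1252 encoding lives inside it.
interface ByteSink {
    fun writeByte(b: Byte)
    fun writeString(s: String)
}

@OptIn(ExperimentalSerializationApi::class)
class BethesdaEncoderSketch(private val output: ByteSink) : AbstractEncoder() {
    override val serializersModule: SerializersModule = EmptySerializersModule()

    // Context captured in encodeElement, consumed by the next encode* call.
    private var nullTerminated = false

    override fun encodeElement(descriptor: SerialDescriptor, index: Int): Boolean {
        nullTerminated = descriptor.isNullTerminated(index)
        return true // true = do encode this element
    }

    override fun encodeString(value: String) {
        output.writeString(value)
        if (nullTerminated) output.writeByte(0)
        nullTerminated = false
    }
}
```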
z
Awesome, thanks again for the guidance, I'll test it out some more
Yup, it looks like that works just fine. It would be more convenient if the element-encoding methods weren't final, but it's not too bad reimplementing the same logic.
The last structure I need feels awkward but might be obvious. I need to be able to prepend the size of data before the data itself. This would be for more complex types, so I'm thinking in the realm of `encodeSerializableValue`. Two encoded ints would have a length of 8, so it would need to encode `[8, [4 bytes], [4 bytes]]`. My current implementation is to serialize this struct independently of the "main" Encoding. Some structs have a maximum size, but I need to encode the actual size before the actual data. The second Encoder would receive `repeat(2) { encodeInt }`, the initial Encoder would receive `repeat(9) { encodeByte }`.
Are there common pitfalls to doing it this way? Mostly in the realm of newing up serializers with separate Encoders mid-encoding. Or is there a more conventional way of prepending the expected payload size?
b
I wouldn't know about pitfalls. I'm also not sure if/how CBOR or Protobuf use length prefixes in their binary formats, but you could maybe look at those implementations in kotlinx.serialization to see if they have anything inspiring.
Just some thoughts on definite-length encoding: I know Java's DataOutputStream's writeUTF calculates the byte length before writing anything, so I'm inclined to think that's the performant way to do it. And since you need more than just string lengths, having a separate lightweight Encoder that just counts the bytes that would be written could be another approach.
Something I do in a binary format I maintain is having reader/writer interfaces that the encoder/decoder use to interact with the data. That keeps the format rules and the data layout separate, and also makes it easy to support different IO types. Or, for you, having a lightweight writer that just counts bytes that would be written. Here's what that looks like for me
🙌 1
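(Not the snippet referenced above, but a minimal sketch of the byte-counting idea, reusing the hypothetical `ByteSink` from earlier:)

```kotlin
// Same interface as the real writer, but it only tallies bytes.
class CountingSink : ByteSink {
    var count: Int = 0
        private set

    override fun writeByte(b: Byte) { count += 1 }

    override fun writeString(s: String) {
        count += s.length // windows-1252 is single-byte, so chars == bytes
    }
}
```

Encode the value once against a `CountingSink`, write the resulting count as the prefix, then encode again against the real sink (or buffer the payload and copy it).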
e
note, you don't actually want to use DataOutputStream because its "UTF" is actually non-standard https://docs.oracle.com/javase/8/docs/api/java/io/DataInput.html#modified-utf-8
Yeah, the protobuf format would probably be a reasonable model to follow. You want a layer that handles the "tokens" of the serial format: e.g. a whole null-terminated string, or a whole length-prefixed string, or a sized numeric value, etc. The serialization format's encoder and decoder use the appropriate methods based on the serial descriptor of the value they are working on (which may include annotations).
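One possible shape for that token layer (all names illustrative):

```kotlin
// The encoder/decoder never touch raw bytes, only whole format tokens.
interface BethesdaTokenWriter {
    fun writeNullTerminatedString(value: String)
    fun writeLengthPrefixedString(value: String, maxLength: Int)
    fun writeInt32(value: Int)
    fun writeByte(value: Byte)
}
```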
z
Hmm, with some of the existing serialization formats, I think I'm losing the boundary between the serializer and the format-specific reader/writer pattern. The existing formats seem to follow a pattern of:
• Format implements either a binary or string format
• Format creates the ultimate data response used by the client (`ByteArrayOutput`, `.toString`)
• Format uses the `serializersModule` to determine which serializers to use
• Format creates the `(En|De)coder` instance, which uses a `(Read|Writ)er` wrapping/returning a string/byte array
• Encoder uses the format's `serializersModule`
• Encoder primarily uses the `SerialDescriptor` and `SerialInfo` annotations for decision making (like the one mentioned earlier in the thread)
• Encoder overrides primitive and serializable encoding methods
• Encoder manipulates the state of the `Writer`, using format-specific types and constructs in the `Writer`
• Writer will, based on the request, manipulate the underlying data structure (string builder, byte array output)
• Depending on the format-specific type, additional data may be written before or after the call-specific data.
  ◦ For example, in Protobuf, `encodeInt` calls `encodeTaggedInt`, which actually writes two ints to the underlying data structure; so a call to `encodeInt(26)` would instead be translated to `encodeInt(some tag) + encodeInt(26)`
I understand the Reader/Writer as an Anti-Corruption Layer between Kotlin Serialization and whatever the underlying structure is. But isn't this last step also a serialization? In order to encode this higher-level construct (tagged int, rather than primitive int), it still needs to determine the order of components (first Tag, then Value), and encode them in the data structure (int, int). As an example, Protobuf could decide tomorrow that Tags are now Shorts, and Values are now little-endian, but technically the outermost call from the Format doesn't know about that. It's still just going to boil down to `encodeInt` calls from the Serializer's perspective, and the Encoder is going to translate those to `encodeTaggedInt` calls, which get translated to whatever the format needs.
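Sketched out, that layering might look like this (purely illustrative, reusing the hypothetical token writer from earlier in the thread):

```kotlin
// The serializer only ever sees encodeInt(26); the tag-then-value ordering
// lives one layer down, and the wire details (tag width, endianness) live
// in the token writer. Either can change without touching the serializer.
class TaggedIntEncoderSketch(private val tokens: BethesdaTokenWriter) {
    var currentTag: Int = 0

    fun encodeInt(value: Int) {
        tokens.writeInt32(currentTag) // tag first
        tokens.writeInt32(value)      // then the value
    }
}
```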
today i learned 1
As an additional example, we can look at the Json Encoder. By default, the Json Encoder encodes ints as signed values. If it encounters a descriptor for an unsigned integer, a new, different Encoder is used: one whose `encodeInt` implementation doesn't write a signed int, but instead an unsigned int as an unquoted string. Here, it seems like the Encoder is the correct place to put format-specific primitive encoding decisions, going as far as having a separate encoder for format-specific representations of primitive data types.
However, the Serializable value itself can be represented differently with different Serializers. Some of the examples for a `Color` type are `"#fff"`, `{"r":0.1,"g":1.0,"b":0.5}`, and `red`. These Serializers would vary in their calls to the Encoder, depending on the serial representation they provide. The Json example above would probably start with a `beginStructure` call; the string could be `encodeString`, or `encodeInline`, or repeated calls to `encodeByte`, or a custom `encodeHex`, or an `encodeInt` that actually calls an `encodeQuotedHexString`.
So my question is: if format-specific data types require format-specific representations, where do we draw the line between the responsibility of format-specific decisions? How would I know to choose
```kotlin
override fun serialize(encoder: Encoder, value: NullTerminatedString) {
    encoder.encodeString(value.string)
}
// ...
override fun encodeString(value: String) {
    output.writeString(value)
    output.writeByte(0)
}
```
over
```kotlin
override fun serialize(encoder: Encoder, value: NullTerminatedString) {
    encoder.encodeString(value.string)
    encoder.encodeByte(0)
}
// ...
override fun encodeString(value: String) {
    output.writeString(value)
}

override fun encodeByte(value: Byte) {
    output.writeByte(value)
}
```
Or another option: if encoding the value correctly would imply some order or structure on the format-specific representation, is there anything wrong with repeatedly bouncing between Serializer and Encoder? As an extreme example: Serializer => Encoder => Serializer => Encoder => Serializer => Encoder => Writer => StringBuilder.
I think I found the answer to this last part; it was right in front of my face. The `Encoder` interface actually has a method `encodeSerializableValue`, which delegates encoding of that value to a different Serializer.
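For example (with a hypothetical wrapper type, delegating to the built-in String serializer):

```kotlin
import kotlinx.serialization.KSerializer
import kotlinx.serialization.builtins.serializer
import kotlinx.serialization.descriptors.PrimitiveKind
import kotlinx.serialization.descriptors.PrimitiveSerialDescriptor
import kotlinx.serialization.descriptors.SerialDescriptor
import kotlinx.serialization.encoding.Decoder
import kotlinx.serialization.encoding.Encoder

// Hypothetical wrapper type whose serializer hands the real work to
// another serializer via encodeSerializableValue/decodeSerializableValue.
data class EditorId(val value: String)

object EditorIdSerializer : KSerializer<EditorId> {
    override val descriptor: SerialDescriptor =
        PrimitiveSerialDescriptor("EditorId", PrimitiveKind.STRING)

    override fun serialize(encoder: Encoder, value: EditorId) =
        encoder.encodeSerializableValue(String.serializer(), value.value)

    override fun deserialize(decoder: Decoder): EditorId =
        EditorId(decoder.decodeSerializableValue(String.serializer()))
}
```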
p
@Zyle Moore Sorry for replying in an old thread, but I have the exact same questions about format-specific types. Maybe you have some advice, from your perspective, on whether it's better to lean into logic-heavy encoders, format-aware serializers, or maybe doing both is necessary?
z
I don't have a great answer unfortunately. A mix of both is what I went with. I took a few stabs at modeling what the file format would look like. Drawing lines around the borders between elements helped.
In my case, eventually I got to wanting to get Descriptor info, but because I subclassed from `AbstractDecoder`, I didn't get access to `decodeStringElement`, just `decodeString`. But when it came to polymorphic serialization, I also couldn't control how it determined polymorphism. The `PolymorphicSerializer` will read element `0` using `decodeStringElement`, and then use that string value to look up the actual serializer to use.
So my Decoder had to either decide if it should return a new CompositeDecoder that could only read a fixed-length string, or stop subclassing `AbstractDecoder` and implement both `Decoder` and `CompositeDecoder`.
I went with dropping the AbstractDecoder. But I still have to keep and manage a reference to the last SerialDescriptor that was used in the `beginStructure` call. You can see some of the initial code here: https://github.com/Zymus/cultoftheancestormoth/pull/10/files#diff-7dfffa95a85d039e2877170679b0b4c8626f1a545e7fbf8c1b06a6028ecad46cR118
That way was pretty slow though, partly because of the fixed-length aspect. So I improved it a bit, but it's still kind of bare bones. https://github.com/Zymus/cultoftheancestormoth/pull/10/files#diff-e1be5745f90a2b57bbfce814a321201345e978d123b361519da8c9f2d2739548R35
I would say that the descriptor is the API between Encoder and Serializer. The serializer should take care of the ordering/elements. That means, though, that any branching in the serializer has to translate to some expected `encode*` call.
An awkward and unclear part for me was what should be encodable, and what should be serializable. The encoder knows about structure and elements. The serializer knows about types, and how to turn them into structures and elements. I initially fell down a hole of "codec elements". I tried to encapsulate what could be encoded (int, boolean, char, inline, string, etc.). They were mostly primitives, so I thought that if an 8-byte value could be a long, then an 8-byte structure wouldn't be too bad to have as a separate `encode*` method. This was similar to some of the other known encoders, like Json, since they have `encodeJsonElement` methods.
But I needed so many of those elements, my encoder became huge: `encodeTypeTag`, `encodeFieldValue`, `encodeRecordSize`, `encodeGroupFlags`, etc.
I realized though that the type of encoder was actually more important than I thought initially. Specifically, how the encoder reacts to those various method calls. The encoder is going to manage effects in some external-to-serialization kind of way. For instance, a `ConsoleEncoder` or `PrintEncoder` could react to `encodeInt` by calling `println(intValue)`. There was some kind of natural way of dealing with the requests. With the Kotlinx IO Buffer I used, there were `writeIntLe` and `readDouble` I could use to react to those encode methods. There was external state that persisted between calls.
Another example someone gave me was a `ListEncoder`, which would go `encodeInt => list.add(intValue)`. This became more apparent when I saw that the format classes in the existing serializers created state objects (`JsonStringWriter`) which were used to create the encoders (like `JsonEncoder(JsonStringWriter())`).
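That `ListEncoder` is essentially the example from the kotlinx.serialization formats guide: the whole encoder is external state plus one funnel method.

```kotlin
import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.encoding.AbstractEncoder
import kotlinx.serialization.modules.EmptySerializersModule
import kotlinx.serialization.modules.SerializersModule

// Every primitive encode* call funnels into encodeValue by default,
// which here just appends to the externally visible list.
@OptIn(ExperimentalSerializationApi::class)
class ListEncoder : AbstractEncoder() {
    val list = mutableListOf<Any>()
    override val serializersModule: SerializersModule = EmptySerializersModule()
    override fun encodeValue(value: Any) { list.add(value) }
}
```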
Knowing I had to split it up into (Serializer) <> (Encoder) <> (Encoded Element Consumer) helped a lot.
Technically, the serializer just has to know about types. But format-specific types, I think, should be indistinguishable to the serializer, since the calls end up the same: `encodeStructure, encodeIntElement, encodeStringElement, endStructure`. Or rather, the serializer will treat `MyFormatType` and `Pair<String, String>` the same. It doesn't know that one is for your format and one is a standard type. They're both just serializable things.
Another thing that was important for me: I came from a tech stack where things like `JsonSerializer.Serialize` were something I could call statically, and there was only ever really the one representation, the one derived automatically from the class. But Kotlin Serializers (and descriptors) are kind of more like HTTP Resources. What I mean is, HTTP Resources can have lots of different representations. `POST /mypage` could `Accept` HTML, or CSS, or Javascript, or JSON, or XML, or CSV, or PNG, or JPG. That's 7 or more representations of the `/mypage` resource. These are kind of the same, in that, if you don't use the plugin-generated Serializer, you have to manage that specific representation yourself.
If you had a `Point2D(x, y)` class, and you're totally fine with the encoder using everything as-is (types, order of elements), you get that out of the box. If you wanted to normalize the x and y, or swap the order of the fields, or include additional metadata, or map it to a different type and provide a default, you'd have to do all of that in the serializer, because it's concerned with the number of elements, the types, and the order of elements. You can also have multiple Serializers for the same type, like `BackwardsStringSerializer : KSerializer<String>` and `UppercaseStringSerializer : KSerializer<String>`.
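As a sketch, the first of those might look like this (the uppercase one is analogous):

```kotlin
import kotlinx.serialization.KSerializer
import kotlinx.serialization.descriptors.PrimitiveKind
import kotlinx.serialization.descriptors.PrimitiveSerialDescriptor
import kotlinx.serialization.descriptors.SerialDescriptor
import kotlinx.serialization.encoding.Decoder
import kotlinx.serialization.encoding.Encoder

// An alternative representation of a plain String: stored reversed.
object BackwardsStringSerializer : KSerializer<String> {
    override val descriptor: SerialDescriptor =
        PrimitiveSerialDescriptor("BackwardsString", PrimitiveKind.STRING)

    override fun serialize(encoder: Encoder, value: String) =
        encoder.encodeString(value.reversed())

    override fun deserialize(decoder: Decoder): String =
        decoder.decodeString().reversed()
}
```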
Finally, for some of the existing formats like JSON, you actually don't need to implement that much of the Encoder. Every Json element basically boils down to a quoted or unquoted string, and so `encodeInt` in the JsonEncoder effectively just calls `encodeString`. There's nothing more specific you need to do, because `someInt.toString()` is going to be the correct string for that value.
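A toy sketch of that funneling (not the real JsonEncoder, just the shape of the idea):

```kotlin
import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.encoding.AbstractEncoder
import kotlinx.serialization.modules.EmptySerializersModule
import kotlinx.serialization.modules.SerializersModule

// In a text format, most primitives can route through encodeString.
@OptIn(ExperimentalSerializationApi::class)
class TextishEncoder(private val sb: StringBuilder) : AbstractEncoder() {
    override val serializersModule: SerializersModule = EmptySerializersModule()

    override fun encodeString(value: String) { sb.append(value) }
    override fun encodeInt(value: Int) = encodeString(value.toString())
    override fun encodeLong(value: Long) = encodeString(value.toString())
    override fun encodeBoolean(value: Boolean) = encodeString(value.toString())
}
```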
p
Thank you very much for such a detailed answer! It corresponds with some of my initial ideas and I've definitely got a lot more food for thought. And BTW the tool you're making is pretty cool, thanks for supporting modding 🫡