Haven’t been able to find a good example of using ...
# io
c
Haven’t been able to find a good example of using kotlinx.io (or okio) to parse data. As a trivial/contrived example (real-world scenarios have more layers/complexity), if you have a stream of records that are line-delimited: a) you need to demarcate a record, i.e. find the end of the line; b) you need to parse within the line, where various pieces may or may not be required, may be further transformed, etc. Yes, you could use readLine() - but that incurs copies into Strings that need to be further parsed. How to properly do the framing / parsing?
f
The kotlinx-io API is primarily focused on reading binary data, so parsing data from text is indeed not so handy. One can employ Source.indexOf to seek a delimiter and then read a field, but the function works with a single delimiter at a time, so searching for either a field or a record delimiter would be challenging. There was an idea to provide functions that read until a certain condition is met (#179). Perhaps that would be a solution?
"Yes, you could use readLine() - but that incurs copies into Strings that need to be further parsed."
Is there a reason not to read a whole record first (like memory or performance constraints)? The Kotlin stdlib comes with a large number of functions for string processing, and kotlinx-io is unlikely to be able to provide the same rich set of functionality. Anyway, if you have a particular format at hand, don't hesitate to share it to make the discussion less abstract. And feel free to file an issue if the library cannot provide a solution for your particular case: https://github.com/Kotlin/kotlinx-io/issues
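For reference, the Source.indexOf approach mentioned above might look roughly like the sketch below. It assumes a recent kotlinx-io-core release in which the text functions are named readString / writeString; readRecord and RECORD_END are hypothetical names made up for illustration. Note that it still decodes each record into a String, so it only shows the framing step, not a copy-free parse.

```kotlin
import kotlinx.io.Buffer
import kotlinx.io.Source
import kotlinx.io.indexOf
import kotlinx.io.readString
import kotlinx.io.writeString

// Record delimiter (carriage return); hypothetical constant for this sketch.
private const val RECORD_END: Byte = 0x0D

/** Reads one CR-terminated record, or returns null once the source is exhausted. */
fun readRecord(source: Source): String? {
    if (source.exhausted()) return null
    // indexOf searches for a single byte and returns -1 when it is not found.
    val end = source.indexOf(RECORD_END)
    if (end == -1L) return source.readString() // last record without a terminator
    val record = source.readString(end)        // decode the bytes before the delimiter
    source.skip(1)                             // drop the delimiter itself
    return record
}
```

Calling readRecord repeatedly on a Buffer filled with writeString("MSH|a\rPID|b\r") would yield the two records in order and then null.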
c
Thanks for the feedback. While the use case below is a text byte stream, binary protocols would still be challenging where they aren’t fixed. readShort etc. is easy; readUntilDelimiter or such are more challenging for binary / text formats. In terms of a concrete use case: This is an HL7 v2 message, represented as a Java String to show the line (segment) termination char:
"MSH|^~\\&|RMS|***|POWERSCRIBE|***|13/02/2008 14:17||ORU^R01|00468020SWEETTEN|P|2.3|2.3|\r"+ 
"PID|||292552||SWEETTEN^JOHN^||19340627|M|||^^^^^^|\r"+ 
"PV1||#|||||^^|BMI^UNKNOWN^UNKNOWN|^^||||||||||G080212080612721|\r"+ 
"ORC|RE|00468020|||ZZ|\r"+ 
"OBR||00468020|00468020|TGHSYNSCP^X-RAY CHEST (MOBILE)||20080210003842|20080212080612|20080212080612||||^^20080213141410||||^^||00468020|1|1||20080213141410090|||F|||||||^Dr. Mineesh^Datta|mdatta^Datta^Dr. Mineesh|^^|n/a^<None>^|\r"+ 
"OBX|1|FT||1|CHEST X-RAY.||||||F|||||mdatta^Datta^Dr. Mineesh|\r"+ 
"OBX|2|FT||1|||||||F|||||mdatta^Datta^Dr. Mineesh|\r"+ 
"OBX|3|FT||1|||||||F|||||mdatta^Datta^Dr. Mineesh|\r"+ 
"OBX|4|FT||1|CLINICAL NOTES:||||||F|||||mdatta^Datta^Dr. Mineesh|\r"+ 
"OBX|5|FT||1|? Bowel obstruction.||||||F|||||mdatta^Datta^Dr. Mineesh|\r"+ 
"OBX|6|FT||1|||||||F|||||mdatta^Datta^Dr. Mineesh|\r"+ 
"OBX|7|FT||1|||||||F|||||mdatta^Datta^Dr. Mineesh|\r"+ 
"OBX|8|FT||1|REPORT:||||||F|||||mdatta^Datta^Dr. Mineesh|\r"+ 
"OBX|9|FT||1|AP rotated expiratory projection, degraded by breathing artefact. Sternotomy wires are||||||F|||||mdatta^Datta^Dr. Mineesh|\r"+ 
"OBX|10|FT||1|noted. The lungs are underinflated. Nonetheless bilateral opacities are seen, on the||||||F|||||mdatta^Datta^Dr. Mineesh|\r"+ 
"OBX|11|FT||1|left probably pleural in nature, on the right possibly intrapulmonary.||||||F|||||mdatta^Datta^Dr. Mineesh|\r"+ 
"OBX|12|FT||1|||||||F|||||mdatta^Datta^Dr. Mineesh|\r"+ 
"OBX|13|FT||1|No free subphrenic gas within technical limit is appreciated. However, a formal repeat||||||F|||||mdatta^Datta^Dr. Mineesh|\r"+ 
"OBX|14|FT||1|chest x-ray is strongly recommended.||||||F|||||mdatta^Datta^Dr. Mineesh|\r"+ 
"OBX|15|FT||1|||||||F|||||mdatta^Datta^Dr. Mineesh|\r"+ 
"OBX|16|FT||1|||||||F|||||mdatta^Datta^Dr. Mineesh|\r"+ 
"OBX|17|FT||1|||||||F|||||mdatta^Datta^Dr. Mineesh|\r"+ 
"OBX|18|FT||1|Dr. Mineesh Datta||||||F|||||mdatta^Datta^Dr. Mineesh|\r"+ 
"OBX|19|FT||1|MBBS||||||F|||||mdatta^Datta^Dr. Mineesh|\r"+ 
"OBX|20|FT||1|FRANZCR||||||F|||||mdatta^Datta^Dr. Mineesh|\r"
Key points:
• each line (segment) is terminated by \r
• the whole message can be read from a socket (MLLP: single-byte header, 2-byte footer) or delivered via a REST API or similar (as an entire message w/o the header / footer framing bytes);
• segments contain fields delimited by |
• not shown here: field data can contain escape sequences, e.g. \F\ is the field separator ( | ), \E\ is the escape char, etc. These need to be transformed to the correct character on decoding;
• segments can repeat (as shown above);
• the number and type of segments is variable;
• the number of fields for a segment type tends to be fixed (some sending systems may use older spec versions and send fewer fields), and can be 30-50 fields;
• encoding is typically UTF-8, but could be set to ISO-8859-1 (or other) based on the sending system;
• interesting usage patterns:
  * not all segments are used by the application;
  * not all fields within any used segment are used by the application;

Goal: avoid copying / allocating to the extent possible:
- reference the underlying message bytes;
- lazy parse segments (after the 3 char segment type);
- lazy parse fields (avoid checking for escape chars or decoding into a String unless needed);

Have a sample working w/ Source, using indexOf as suggested; it has a few wrinkles:
• each segment will almost always be smaller than the split threshold (1,024 bytes), resulting in the underlying byte array being split/copied anyhow, for each segment;
• requires checking each byte all the way to the end of a segment via indexOf, which was potentially already checked for the socket framing, and will be checked again for field parsing;

Toying with implementing some form of ByteStringSlice to maintain offsets, sharing a single underlying ByteString. What’s missing are patterns to efficiently parse the multiple layers in as few passes as possible. This could perhaps be simplified with the noted “read until predicate” feature, provided it has some means of indicating what was matched.
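To make the slice idea concrete, here is a minimal sketch using only the Kotlin/JVM stdlib over a plain ByteArray rather than a kotlinx-io ByteString. ByteSlice, segments, field, text and unescape are all hypothetical names, not part of any library. It assumes the whole message is already in memory without MLLP framing bytes, and that the message uses the default HL7 delimiters (| ^ & ~ with \ as the escape character). Segments and fields are located by offset only; bytes are copied only when a field is actually decoded.

```kotlin
import java.nio.charset.Charset

private const val CR: Byte = 0x0D        // segment terminator
private const val FIELD_SEP: Byte = 0x7C // '|'
private const val ESCAPE: Byte = 0x5C    // '\'

/** Hypothetical view over a shared byte array: holds offsets only, copies nothing. */
class ByteSlice(val bytes: ByteArray, val start: Int, val end: Int) {
    val size: Int get() = end - start

    /** Absolute index of [b] at or after [from] within this slice, or -1. */
    fun indexOf(b: Byte, from: Int = start): Int {
        for (i in from until end) if (bytes[i] == b) return i
        return -1
    }

    /** True when the slice begins with the given ASCII text, e.g. a segment type. */
    fun startsWith(ascii: String): Boolean {
        if (ascii.length > size) return false
        for (i in ascii.indices) if (bytes[start + i] != ascii[i].code.toByte()) return false
        return true
    }

    /** Decodes to a String; this is the only place bytes are copied. */
    fun decode(charset: Charset): String = String(bytes, start, size, charset)
}

/** Lazily yields one slice per CR-terminated segment (a trailing unterminated one too). */
fun segments(message: ByteArray): Sequence<ByteSlice> = sequence {
    var pos = 0
    while (pos < message.size) {
        var end = pos
        while (end < message.size && message[end] != CR) end++
        if (end > pos) yield(ByteSlice(message, pos, end))
        pos = end + 1
    }
}

/** Returns the n-th '|'-separated piece (0 = segment type), scanning only as far as needed. */
fun ByteSlice.field(n: Int): ByteSlice? {
    var fieldStart = start
    var index = 0
    while (true) {
        val sep = indexOf(FIELD_SEP, fieldStart)
        val fieldEnd = if (sep == -1) end else sep
        if (index == n) return ByteSlice(bytes, fieldStart, fieldEnd)
        if (sep == -1) return null // segment has fewer fields than requested
        fieldStart = sep + 1
        index++
    }
}

/** Single-pass replacement of \F\ \S\ \T\ \R\ \E\ (default delimiters assumed). */
fun unescape(raw: String): String {
    val out = StringBuilder(raw.length)
    var i = 0
    while (i < raw.length) {
        val c = raw[i]
        if (c == '\\' && i + 2 < raw.length && raw[i + 2] == '\\') {
            val replacement = when (raw[i + 1]) {
                'F' -> '|'
                'S' -> '^'
                'T' -> '&'
                'R' -> '~'
                'E' -> '\\'
                else -> null // unknown code: leave the text untouched
            }
            if (replacement != null) {
                out.append(replacement)
                i += 3
                continue
            }
        }
        out.append(c)
        i++
    }
    return out.toString()
}

/** Decodes a field on demand, un-escaping only when an escape byte is present. */
fun ByteSlice.text(charset: Charset = Charsets.UTF_8): String {
    val raw = decode(charset)
    return if (indexOf(ESCAPE) == -1) raw else unescape(raw)
}
```

With this, segments(bytes).first { it.startsWith("PID") }.field(3)?.text() would decode only that one field. The index passed to field is a positional count of |-separated pieces, which for MSH does not line up with HL7 field numbers. This does not solve the multi-pass concern: segment bytes are still scanned once for CR and again for | when a field is requested.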