# arrow
c
Hi! Is there a pattern/best practice to parse arbitrary text into a Kotlin object? I currently have a `fun parse(lines: List<String>): SomeDomainObject` that consists of many, many, many calls to `startsWith` and similar in an enormous `when`. The data I'm parsing was not designed to be parsed by a program, but to be read by humans, so there's no automated way (like kotlinx.serialization)... Is there anything in Arrow, or any programming pattern, that could help clean up this code so it's easier to maintain?
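For illustration, the kind of code being described might look roughly like this (a sketch only; `SomeDomainObject`'s fields and the line prefixes are invented):

```kotlin
// Hypothetical sketch of the pattern described above: one big `when`
// dispatching on line prefixes. All names here are invented.
data class SomeDomainObject(val version: String? = null, val files: Int = 0)

fun parse(lines: List<String>): SomeDomainObject {
    var result = SomeDomainObject()
    for (line in lines) {
        when {
            line.startsWith("Version:") ->
                result = result.copy(version = line.removePrefix("Version:").trim())
            line.startsWith("Files:") ->
                result = result.copy(files = line.removePrefix("Files:").trim().toInt())
            // ...in the real code, many, many more branches
        }
    }
    return result
}
```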
j
I don't know if there are any good parser combinator libraries in Kotlin. I have one, but it is undocumented and likely not that great: https://github.com/1Jajen1/Kparsec Either way, you are definitely looking for parser combinators. I plan on rewriting my lib over the holidays, together with docs etc., but for now I would not recommend using it.
👌🏼 1
If it's not for any important project, I can walk you through how to use it though^^
c
If I understand correctly, parser combinators are meant for structured textual information? What I'm trying to parse is not structured at all, so I don't know if it fits 🤔 The project I'm writing will be fully open source, and so far only a dozen people know about it, so in that sense it's not very important; but I'm also doing root operations from a Java program, so in that sense... 😅
j
So you are doing text analysis rather than parsing, right? As in, you are looking for specific things like keywords and such, plus maybe context around them? That would be a bit harder.
c
Sorry, I don't know the correct words. Essentially, I have a few command-line utilities that I'm creating a UI around: I'm reading what they're writing to the terminal, and finding information there that I can expose in Kotlin (so it can be displayed later, etc). Because the different programs I can run are nothing alike, and even tasks within a program can be completely different, my current code looks more or less like this:
• each task is a function that returns a Flow of information,
• these functions work by calling the program they correspond to, then forwarding the standard output to a state machine that corresponds to that task, with states like INIT, DOWNLOAD, or EXECUTE, where each state is a big `when` that catches all possible lines the program could output.
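A rough sketch of that structure, for readers following along (the states, `Info` types, and line prefixes are all invented; only the shape, a task returning a Flow driven by a per-task state machine, comes from the description above):

```kotlin
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow

// Hypothetical sketch: a task feeds each stdout line through its own
// state machine and emits the information it recognises.
enum class State { INIT, DOWNLOAD, EXECUTE }

sealed class Info {
    data class Downloading(val file: String) : Info()
    data class Executing(val line: String) : Info()
}

fun downloadTask(stdout: Sequence<String>): Flow<Info> = flow {
    var state = State.INIT
    for (line in stdout) {
        when (state) {
            State.INIT ->
                if (line.startsWith("starting")) state = State.DOWNLOAD
            State.DOWNLOAD -> when {
                line.startsWith("downloading ") ->
                    emit(Info.Downloading(line.removePrefix("downloading ")))
                line.startsWith("executing") -> state = State.EXECUTE
            }
            State.EXECUTE -> emit(Info.Executing(line))
        }
    }
}
```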
Here is an example of one such `when`. (The current version uses data classes' `copy`, which I guess is not optimal, but it has the benefit that the code is really easy to maintain: a builder would require writing all data points once in the data class constructor, once in the builder constructor, and once in the `build` method. So I'm betting that the IO operations I'm parsing are slower than my low-quality memory management 😅)
This example has lines that look like `<header> : <value>`, but I have other places where the data is

```
downloading <file>
installing <file> in <directory>
```
And many, many other patterns
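One hedged alternative to long `startsWith` chains for exactly these examples: a table of regexes, each paired with a constructor for the extracted data. All type names and patterns here are invented to match the examples above:

```kotlin
// A sketch: classify a line by trying each pattern in order.
sealed class Line {
    data class Header(val key: String, val value: String) : Line()
    data class Downloading(val file: String) : Line()
    data class Installing(val file: String, val dir: String) : Line()
    data class Unknown(val raw: String) : Line()
}

val patterns: List<Pair<Regex, (MatchResult) -> Line>> = listOf(
    Regex("""downloading (\S+)""") to { m -> Line.Downloading(m.groupValues[1]) },
    Regex("""installing (\S+) in (\S+)""") to
        { m -> Line.Installing(m.groupValues[1], m.groupValues[2]) },
    Regex("""(\S+)\s*:\s*(.+)""") to { m -> Line.Header(m.groupValues[1], m.groupValues[2]) },
)

fun classify(line: String): Line =
    patterns.firstNotNullOfOrNull { (re, make) -> re.matchEntire(line)?.let(make) }
        ?: Line.Unknown(line)
```

Adding a new line shape then means adding one entry to the table instead of one more branch in a giant `when`.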
j
> I'm betting that the IO operations I'm parsing are slower than my low quality memory management 😅
You'd be surprised by just how fast recent IO, especially file IO, has become^^ But either way it probably does not matter here 😅
Imo parser combinators, especially ones that backtrack automatically, are your best choice. But then again, I am not aware of Kotlin parser combinators that are flexible enough to do this. Not knowing if this fits your problem, but I'd probably structure this by creating a triple of a command + state machine + parser (specific for that command). Then have global start and finish states. Every parser returns `Transition`s which move the state machine into the next state. This may allow easy reuse of functions while still keeping parsing and the state machine separated enough to deal with this level of difference between them. But yeah, no idea if this fits^^
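One possible shape for that "command + state machine + parser" triple, purely as a sketch of the idea (states, transition kinds, and all names are invented; nothing here is a confirmed design):

```kotlin
// A parser inspects one line and returns a Transition for the machine.
sealed class Transition {
    data class Emit(val info: String, val next: String) : Transition()
    data class Move(val next: String) : Transition()
    object Stay : Transition()
}

typealias LineParser = (String) -> Transition

data class Task(
    val command: List<String>,            // the CLI command to run
    val start: String,                    // global start state
    val finish: String,                   // global finish state
    val parsers: Map<String, LineParser>, // per-state line parser
)

fun runTask(task: Task, stdout: Sequence<String>): List<String> {
    val out = mutableListOf<String>()
    var state = task.start
    for (line in stdout) {
        if (state == task.finish) break
        when (val t = task.parsers.getValue(state)(line)) {
            is Transition.Emit -> { out += t.info; state = t.next }
            is Transition.Move -> state = t.next
            Transition.Stay -> {}
        }
    }
    return out
}
```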
c
That seems somewhat similar to what I'm doing, but cleverer ^^ Do you have any blog article or accessible introduction to what you're talking about? My full knowledge of parsers is a small class on Bison/Yacc, and I'm afraid this goes a bit beyond it 😅
j
Ah, parser combinators are quite different from parser generators; they usually allow much more freedom, at the cost of performance. Combinators are quite literally code, and are usually at least applicative/selective and sometimes monadic, which allows arbitrary computation while parsing. Are you familiar with Haskell code? Not in-depth, just good enough syntax knowledge to read it. If so, check out this tutorial to see just how flexible parser combinators can be: https://markkarpov.com/tutorial/megaparsec.html That is also the Haskell library I ported to Kotlin and linked above, so this tutorial actually somewhat covers mine as well 🙈 As for the architecture pattern here, I have no clue how this would look in detail. I can try to whip something up tomorrow, which may better explain my thoughts, but it's too late now 😅
s
The main idea behind parser combinators is that you start with a small set of primitive parsers and a way of combining them together, then you build larger and larger parsers from those.
I was playing around with a toy example in Kotlin that looks like:
```kotlin
import arrow.core.NonEmptyList
import java.util.UUID

typealias Parser<A> = (String) -> Pair<A, String>?

fun <A> of(a: A): Parser<A> = { rest -> a to rest }

fun <A, B> Parser<A>.map(f: (A) -> B): Parser<B> = {
    this(it)?.let { (a, rest) -> f(a) to rest }
}

fun <A, B> Parser<(A) -> B>.ap(other: Parser<A>): Parser<B> = {
    this(it)?.let { (f, restF) -> other(restF)?.let { (a, restA) -> f(a) to restA } }
}

fun <A, B, C> Parser<A>.map2(other: Parser<B>, f: (A, B) -> C): Parser<C> = {
    this(it)?.let { (a, restA) -> other(restA)?.let { (b, restB) -> f(a, b) to restB } }
}

fun <A, B, C> Parser<A>.map2LZ(other: () -> Parser<B>, f: (A, B) -> C): Parser<C> = {
    this(it)?.let { (a, restA) -> other()(restA)?.let { (b, restB) -> f(a, b) to restB } }
}

fun <A, B> Parser<A>.apL(other: Parser<B>): Parser<A> = this.map2(other) { a, _ -> a }

fun <A, B> Parser<A>.apR(other: Parser<B>): Parser<B> = this.map2(other) { _, b -> b }

fun <A, B> Parser<A>.flatMap(f: (A) -> Parser<B>): Parser<B> = {
    this(it)?.let { (a, rest) -> f(a).invoke(rest) }
}

fun <A> Parser<A>.alt(other: Parser<A>): Parser<A> = {
    this(it) ?: other(it)
}

fun satisfy(f: (Char) -> Boolean): Parser<Char> = {
    // guard against empty input before indexing
    if (it.isNotEmpty() && f(it[0])) it[0] to it.substring(1) else null
}

fun <A> Parser<A>.some(): Parser<NonEmptyList<A>> = this.map2LZ({ this.many() }, ::NonEmptyList)

fun <A> Parser<A>.many(): Parser<List<A>> = this.some().alt(of(emptyList()))

fun char(c: Char): Parser<Char> = satisfy { it == c }

val digit: Parser<Char> = satisfy { it.isDigit() }

// digitToInt() converts '7' to 7 (Char.toInt() would give the char code)
val int: Parser<Int> = digit.some().map { it.foldLeft(0) { i, c -> (i * 10) + c.digitToInt() } }

fun string(str: String): Parser<String> = {
    if (it.startsWith(str)) str to it.removePrefix(str)
    else null
}

fun oneOf(chars: String): Parser<Char> = satisfy { chars.contains(it) }

val hex: Parser<Char> = digit.alt(oneOf("abcdef"))

fun <A> Parser<A>.optional(): Parser<A?> = {
    this(it) ?: null to it
}

val eof: Parser<Unit> = { if (it == "") Unit to it else null }

fun <A> Parser<A>.repeat(n: Int): Parser<List<A>> = {
    if (n <= 0) emptyList<A>() to it
    else this(it)?.let { (a, restA) ->
        this.repeat(n - 1).invoke(restA)?.let { (aa, restAA) -> (listOf(a) + aa) to restAA } }
}

fun Parser<List<Char>>.asString(): Parser<String> = this.map { it.joinToString("") }

val uuid: Parser<UUID> =
    of { a: String -> { b: String -> { c: String -> { d: String -> { e: String ->
        UUID.fromString("$a-$b-$c-$d-$e")
    } } } } }
        .ap(hex.repeat(8).apL(char('-')).asString())
        .ap(hex.repeat(4).apL(char('-')).asString())
        .ap(hex.repeat(4).apL(char('-')).asString())
        .ap(hex.repeat(4).apL(char('-')).asString())
        .ap(hex.repeat(12).asString())

fun main() {
    println(uuid.apL(eof)("12345678-abcd-1234-abcd-1234567890ef")?.first)
}
```
This `typealias Parser<A> = (String) -> Pair<A, String>?` follows the rhyme of "A parser of things is a function from strings to maybe a pair of things and strings".
The nice thing about it is that you can get something up and running with the simple constructions like above, and then if necessary optimise the primitive parsers.
j
That is a nice description of parser combinators 👍 The example is also nice; however, the problem with those constructs is sadly no stack-safety, backtracking isn't as simple, and error messages and performance aren't great. You can deal with stack-safety by using `Eval`, but that kills performance even further; `AndThen` is also a good idea, but can't get you stack-safety on `flatMap`. Error messages require more parameters and a lot more work, and performance requires different primitives 😕 I thought about combinators in Kotlin quite a bit, and reaching a megaparsec level of performance, error messages and power feels pretty impossible on the JVM.
Also a bit unrelated atm, but do you have a good use case for `Parser.flatMap`? Not having it allows for stack-safety and, more importantly, for static analysis. In the few parsers I wrote with combinators (Haskell and Kotlin), I always got away without it, though it was very convenient to use things like do-syntax. It also allows effects while parsing, which no other operation can, so it's a tradeoff^^
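For context, one classic `flatMap` use case is context-sensitive input, where an earlier result decides what to parse next, e.g. a length-prefixed field. A minimal sketch using the same `(String) -> Pair<A, String>?` encoding discussed in this thread (the `P` alias and helpers are defined locally here to keep the example self-contained):

```kotlin
typealias P<A> = (String) -> Pair<A, String>?

// Monadic bind: the next parser depends on the previous result.
fun <A, B> P<A>.flatMap(f: (A) -> P<B>): P<B> = {
    this(it)?.let { (a, rest) -> f(a)(rest) }
}

// Consume exactly n characters, failing if the input is too short.
fun take(n: Int): P<String> = {
    if (it.length >= n) it.take(n) to it.drop(n) else null
}

// Read a decimal number from the front of the input.
val int: P<Int> = {
    val digits = it.takeWhile(Char::isDigit)
    if (digits.isEmpty()) null else digits.toInt() to it.drop(digits.length)
}

// "3:abc" -> read the length, skip the ':', then read exactly that many chars.
// A purely applicative parser can't do this: the second parser's shape
// depends on the first parser's *value*.
val lengthPrefixed: P<String> =
    int.flatMap { n -> take(1).flatMap { take(n) } }
```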
s
I don't know whether you'll be able to reach "good enough" performance or not; that somewhat depends on the use case. The main appeal that I still see is that you can change the runtime behaviour behind the main type alias by updating a handful of primitives, and your parsing logic remains largely the same. I'd be interested to see if the Arrow Streams work could play nicely here to recover some performance.
j
> I don't know whether you'll be able to reach "good enough" performance or not, that somewhat depends on the use case.
Ah yes, good enough performance is certainly reachable, though it'll fall short on larger workloads. My https://github.com/1Jajen1/Kparsec is fast enough for my use case of parsing data class `toString` output (and later pretty-printing it) and has the API of megaparsec, but it needs a rewrite and docs^^
> I'd be interested to see if the Arrow Streams work could play nicely here to recover some performance.
Yes, a streaming parser interface is the next step; megaparsec is also just that, and so is my Kotlin impl of it. I don't think arrow-streams or Flow will help implementation-wise though.
> The main appeal that I still see is that you have the ability to change the runtime behaviour behind the main type alias, by updating a handful of primitives and your parsing logic remains largely the same.
That falls flat with just a type alias; however, an interface, or even a composition of multiple interfaces, might do that. But I am not too worried about it, since most parser libs follow similar if not the same naming conventions.
Though on the topic of fast enough: you still need stack-safety, even in smaller parsers, which is quite hard, and maybe even impossible on the JVM with the `flatMap` operation; everything else can be stack-safe tho.
c
Why is stack safety such a big deal? Is it just for performance, or am I missing something?
s
Stack safety refers to not using unbounded nested function calls. One of the more common places you'll see this appear is with recursive function calls. This can be avoided where it is possible to mark the recursive function as `tailrec`, which allows the compiler to replace the nested calls with something equivalent to a `while` loop that doesn't generate a new stack entry for each call. So if you're working with functions that can't be written to make use of tail-call optimisation, and you don't control the depth of the potential recursion, then they aren't considered stack-safe. For simple applications this may still be okay depending on the use, though if the maximum stack depth/size is exceeded then the runtime will throw a stack overflow error.
e.g. if I'm deploying something to run in a production system, I'll want to make sure the functions I'm calling are either stack-safe or that the recursion is somehow bound to a finite depth. If I'm just creating scripts for data mangling that I run locally, I'm less concerned about it.
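A minimal illustration of the `tailrec` point (example functions invented here; the first version builds a stack frame per call, the second compiles to a loop):

```kotlin
// Not stack-safe: `n + sumTo(n - 1)` is not a tail call, so each level
// of recursion keeps a frame alive until the innermost call returns.
fun sumTo(n: Long): Long =
    if (n == 0L) 0 else n + sumTo(n - 1)

// Stack-safe: the recursive call is the last thing the function does,
// so `tailrec` lets the compiler rewrite it as a while loop.
tailrec fun sumToSafe(n: Long, acc: Long = 0): Long =
    if (n == 0L) acc else sumToSafe(n - 1, acc + n)
```

`sumTo` overflows the stack for large `n`, while `sumToSafe` handles the same inputs in constant stack space.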
j
For parsers specifically, stack-safety becomes a problem in two ways:
• Parsers are sometimes stack-unsafe on construction. For example, a popular encoding of `many` is similar to `some().map2(many()) { a, xs -> a + xs }`. Like in @Scott Christopher's example, `repeat` is recursive and thus adds many, many layers to the stack if `many` parses a larger dataset. This type of stack-safety issue is best avoided by laziness and later trampolining*, as it is usually not tail-recursive in a sense which `tailrec` accepts.
• Parsers that are lazy, and thus stack-safe on construction, can still lead to a deep stack on evaluation/parsing. To avoid this we have a few options: trampoline every operator we have, or trampoline at certain points. The former isn't great but works, and the latter can grow quite complex.
*Trampolining refers to a way of executing functions in a loop. Instead of returning a value `A`, we return a continuation `A -> R` which continues using the result. But we don't just invoke it; we put it on a stack and loop over the stack with `while`, and then invoke the continuations. This effectively gives each execution of `((A) -> R).invoke()` its own flat call stack.
Back to your use case @CLOVIS: I actually think you could very well get away with a stack-unsafe, sub-par-perf parser as long as your input does not get too large, which, as it's designed for human interaction, it likely won't. The encoding using `(String) -> Pair<A, String>?` falls flat very quickly on larger parsers and/or larger inputs, both of which I have to deal with, which is why I am so focused on stack-safety, as that is by far the hardest problem to solve on the JVM without losing too much perf.
👍 1
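A small sketch of the trampolining idea described above (the `Trampoline` type and `countDown` example are invented for illustration; libraries like Arrow's `Eval` implement the same idea more thoroughly):

```kotlin
// Each step either finishes with a value or returns a thunk for the
// next step; a plain loop drives evaluation, so the call stack stays flat.
sealed class Trampoline<out A> {
    data class Done<A>(val value: A) : Trampoline<A>()
    data class More<A>(val next: () -> Trampoline<A>) : Trampoline<A>()
}

fun <A> Trampoline<A>.run(): A {
    var step: Trampoline<A> = this
    while (true) {
        when (step) {
            is Trampoline.Done -> return step.value
            is Trampoline.More -> step = step.next()  // one loop iteration per "call"
        }
    }
}

// Would overflow the stack if written as direct recursion; here each
// recursive step is deferred into a thunk instead.
fun countDown(n: Int): Trampoline<Int> =
    if (n <= 0) Trampoline.Done(0)
    else Trampoline.More { countDown(n - 1) }
```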
Also important to note: afaik, stack depth is not a constant across all JVMs. With `arrow.core.AndThen`, which implements trampolining for `(A) -> R`, we usually go for `127` as a max stack depth, which is very conservative but allows other functions to add onto that slightly. Our stack-safety tests run at 5k-20k depth in Arrow, and we still have false positives sometimes 🙈
c
I'm expecting output in the order of hundreds of lines, not much more. And since the information is very simple (there's almost no nesting in the data; most of it is "one line is one piece of information"), I guess it shouldn't be too problematic. I'll try reading the Haskell doc, but I really don't know much about it 😅
j
The Haskell tutorial is an amazing introduction to parser combinators. It is obviously tailored for megaparsec, but many of its common operators work just the same in other parser combinator libs. For a start, using @Scott Christopher's snippet above might be a good idea; if you run into overflows or the performance is too bad (although it should not be for your input), you'll have to search for a better lib (and who knows, maybe at that point I'll have one^^). For any questions, feel free to ask in this channel, or in this thread^^
c
Thanks a lot ^^