CLOVIS
12/17/2020, 5:10 PMfun parse(List<String>): SomeDomainObject
That consists of many, many, many calls to startsWith
and similar in an enormous when
.
The data I'm parsing was not designed to be parsed by a program, but to be read by humans, so there's no automated way (like Kotlinx.Serialization)... Is there anything in Arrow or any programming pattern that could help clean up this code so it's easier to maintain?Jannis
12/17/2020, 5:25 PMJannis
12/17/2020, 5:25 PMCLOVIS
12/17/2020, 6:37 PMJannis
12/17/2020, 7:36 PMCLOVIS
12/17/2020, 8:33 PMCLOVIS
12/17/2020, 8:38 PMwhen
. (The current version uses data classes' copy, which I guess is not optimal, but it has the benefits that the code is really easy to maintain: a builder would require to write all data points once in a data class constructor, once in the builder constructor, and once in the build
method. So I'm betting that the IO operations I'm parsing are slower than my low quality memory management 😅)CLOVIS
12/17/2020, 8:40 PM<header> : <value>
but I have other places where the data is
downloading <file>
installing <file> in <directory>
And many, many other patternsJannis
12/17/2020, 8:42 PMI'm betting that the IO operations I'm parsing are slower than my low quality memory management 😅You'd be surprised by just how fast recent IO, especially file IO, has become^^ But either way it probably does not matter here 😅
Jannis
12/17/2020, 8:53 PMTransitions
which modify the state machine into the next state. This may allow easy reuse of functions while still keeping parsing and the state machine separated enough to deal with this level of difference between them. But yeah, no idea if this fits^^CLOVIS
12/17/2020, 9:59 PMJannis
12/17/2020, 10:18 PMScott Christopher
12/18/2020, 10:09 AMScott Christopher
12/18/2020, 10:09 AMtypealias Parser<A> = (String) -> Pair<A, String>?
fun <A> of(a: A): Parser<A> = { rest -> a to rest }
fun <A, B> Parser<A>.map(f: (A) -> B): Parser<B> = {
this(it)?.let { (a, rest) -> f(a) to rest }
}
fun <A, B> Parser<(A) -> B>.ap(other: Parser<A>): Parser<B> = {
this(it)?.let { (f, restF) -> other(restF)?.let { (a, restA) -> f(a) to restA } }
}
fun <A, B, C> Parser<A>.map2(other: Parser<B>, f: (A, B) -> C): Parser<C> = {
this(it)?.let { (a, restA) -> other(restA)?.let { (b, restB) -> f(a, b) to restB } }
}
fun <A, B, C> Parser<A>.map2LZ(other: () -> Parser<B>, f: (A, B) -> C): Parser<C> = {
this(it)?.let { (a, restA) -> other()(restA)?.let { (b, restB) -> f(a, b) to restB } }
}
fun <A, B> Parser<A>.apL(other: Parser<B>): Parser<A> = this.map2(other) { a, _ -> a }
fun <A, B> Parser<A>.apR(other: Parser<B>): Parser<B> = this.map2(other) { _, b -> b }
fun <A, B> Parser<A>.flatMap(f: (A) -> Parser<B>): Parser<B> = {
this(it)?.let { (a, rest) -> f(a).invoke(rest) }
}
fun <A> Parser<A>.alt(other: Parser<A>): Parser<A> = {
this(it) ?: other(it)
}
fun satisfy(f: (Char) -> Boolean): Parser<Char> = {
if (f(it[0])) it[0] to it.substring(1) else null
}
fun <A> Parser<A>.some(): Parser<NonEmptyList<A>> = this.map2LZ({ this.many() }, ::NonEmptyList)
fun <A> Parser<A>.many(): Parser<List<A>> = this.some().alt(of(emptyList()))
fun char(c: Char): Parser<Char> = satisfy { it == c }
val digit: Parser<Char> = satisfy { it.isDigit() }
val int: Parser<Int> = digit.some().map { it.foldLeft(0) { i, c -> (i * 10) + c.toInt() } }
fun string(str: String): Parser<String> = {
if (it.startsWith(str)) str to it.removePrefix(str)
else null
}
fun oneOf(chars: String): Parser<Char> = satisfy { chars.contains(it) }
val hex: Parser<Char> = digit.alt(oneOf("abcdef"))
fun <A> Parser<A>.optional(): Parser<A?> = {
this(it) ?: null to it
}
val eof: Parser<Unit> = { if (it == "") Unit to it else null }
fun <A> Parser<A>.repeat(n: Int): Parser<List<A>> = {
if (n <= 0) emptyList<A>() to it
else this(it)?.let { (a, restA) ->
this.repeat(n - 1).invoke(restA)?.let { (aa, restAA) -> (listOf(a) + aa) to restAA } }
}
fun Parser<List<Char>>.asString(): Parser<String> = this.map { it.joinToString("") }
val uuid: Parser<UUID> =
of { a: String -> { b: String -> { c: String -> { d: String -> { e: String ->
UUID.fromString("$a-$b-$c-$d-$e")
} } } } }
.ap(hex.repeat(8).apL(char('-')).asString())
.ap(hex.repeat(4).apL(char('-')).asString())
.ap(hex.repeat(4).apL(char('-')).asString())
.ap(hex.repeat(4).apL(char('-')).asString())
.ap(hex.repeat(12).asString())
fun main() {
println(uuid.apL(eof)("12345678-abcd-1234-abcd-1234567890ef")?.first)
}
Scott Christopher
12/18/2020, 10:11 AMtypealias Parser<A> = (String) -> Pair<A, String>
follows the rhyme of "A parser of things is a function from strings to maybe a pair of things and strings".Scott Christopher
12/18/2020, 10:12 AMJannis
12/18/2020, 10:54 AMEval
but that kills performance even further,AndThen
is also a good idea, but can't get you stacksafety on flatMap
.
Error messages require more parameters and a lot more work and performance requires different primitives 😕 I thought about combinators in kotlin quite a bit and to reach a megaparsec level of performance, error messages and power is something that feels pretty impossible on the jvm.Jannis
12/18/2020, 11:07 AMParser.flatMap
? Not having it allows for stacksafety and more importantly for static analysis. In the few parsers I wrote with combinators (haskell and kotlin) I always got away without it, though it was very convenient to use things like do-syntax. It also allows effects while parsing, which no other op can, so its a tradeoff^^Scott Christopher
12/18/2020, 11:08 AMJannis
12/18/2020, 11:17 AMI don't know whether you'll be able to reach "good enough" performance or not, that somewhat depends on the use case.Ah yes, good enough performance is certainly reachable, though it it'll fall short of larger workloads. My https://github.com/1Jajen1/Kparsec is fast enough for my use case of parsing data class toString output (and later prettyprinting it) and has the api of megaparsec, but it needs a rewrite and docs^^
I'd be interested to see if the Arrow Streams work could play nicely here to recover some performance.Yes a streaming parser interface is the next step, megaparsec is also just that, and so is my kotlin impl of it. I don't think arrow-streams or flow will help implementation wise though
Jannis
12/18/2020, 11:19 AMThe main appeal that I still see is that you have the ability to change the runtime behaviour behind the main type alias, by updating a handful of primitives and your parsing logic remains largely the same.That falls flat with just a type alias however an interface, or even composition of multiple interfaces might do that. But I am not too worried about it since most parser libs follow similar if not same naming conventions
Jannis
12/18/2020, 11:20 AMflatMap
operation, everything else can be stacksafe thoCLOVIS
12/19/2020, 9:06 AMScott Christopher
12/19/2020, 11:00 AMtailrec
, which allows the compiler to replace the nested calls with something equivalent to a while
loop that doesn't generate a new stack entry for each call. So if you're working with functions that aren't able to be written to make use of tail-call recursive optimisation and you don't control the depth of the potential recursion then they aren't considered stack-safe. For simple applications, this may still be okay depending on the use, though if the maximum stack-depth/size is exceeded then the runtime will throw a stack overflow exception.Scott Christopher
12/19/2020, 11:04 AMJannis
12/19/2020, 11:29 AMmany
is similar to some().map2(many()) { a, xs -> a + xs }
. Like in @Scott Christopher's example repeat
is mutually recursive and thus adds many many layers for the stack if many
parses a larger dataset. This type of stacksafety is best avoided by laziness and later trampolining* as it is usually not tail-recursive in a sense which tailrec
accepts.
• parsers that are lazy and thus stacksafe on construction can still lead to a deep stack on evaluation/parsing. To avoid this we have a few options: Trampoline every operator we have or trampoline at certain points. The former isn't great but works, and the later can grow quite complex.
*trampoling refers to a way of execution functions in a loop. Instead of returning a value A
, we return a continuation A -> R
which continues using the result. But we don't just invoke it, we put it on a stack and loop over the stack with while and then invoke the continuations. This effectively gives each execution of ((A)->R).invoke()
its own flat callstack.
back to your use case @CLOVIS: I actually think that you could very well get away with a stack-unsafe, sub-par perf parser as long as your input does not get too large, which as its designed for human interaction likely won't. The encoding using (String) -> Pair<a, String>?
falls flat very quickly on larger parsers and/or larger inputs, both of which I have to deal with, which is why I am so focused on stacksafety, as that is by far the hardest problem to solve on the jvm without loosing too much perf.Jannis
12/19/2020, 11:31 AMarrow.core.AndThen
which implements trampolining for (A) -> R
we usually go for 127
as a max stackdepth which is very conservative but allows other functions to add onto that slighty. Our stacksafety tests run at 5k-20k depth in arrow, and we still have false-positives sometimes 🙈CLOVIS
12/19/2020, 11:49 AMJannis
12/19/2020, 11:56 AMCLOVIS
12/20/2020, 9:46 AM