# arrow
k
I am building an app maximizing my Arrow usage to try and learn how to use it. I start with an
IO { File("/path/to…").walk() }
and apply IO transforms to the sequence to end up with an
IO<Sequence<IO<Foo>>>
and then use
flatMap
and
sequence
to transform it to
IO<Sequence<Foo>>
. When I
unsafeRunSync
the result, I get a
StackOverflowError
, presumably from the part where I “lift” the nested IO out of the Sequence (has to iterate over all the elements in the sequence I think). Given that you’re ideally going to have a single place at the “edge” to run the IO, is there a better way I should structure it? Assume I’m starting with
IO<Sequence<IO<Foo>>>
and can do whatever to it. Is
myIO.unsafeRunSync().forEach { it.unsafeRunSync() }
the way I should go about doing it at the edge? Basically run the top-level IO to get the
Sequence
and then start taking values, which are themselves
IO
, and run them? Here is a simplified version of what I have now:
fun buildFoo(file: File): IO<String> = IO { "test" }
val program: IO<Sequence<File>> = IO { File("/Users/kylegoetz/Desktop").walk() }
val transformedProgram: IO<Sequence<IO<String>>> = program.map { files ->
    files.map { file -> buildFoo(file) }
}
val noNestedIO: IO<Sequence<String>> = transformedProgram.flatMap {
    it.filterNotNull().k().sequence(IO.applicative()).fix()
}

noNestedIO.unsafeRunSync() // StackOverflowError
What I’m really doing in that inner IO is applying custom transforms to the file data based on filetype to get a fingerprint for dupe detection.
j
Can you try forcing the entire sequence the unsafe way with
filterNotNull().k().map { it.unsafeRunSync() }.toList()
inside
transformedProgram.flatMap { ... }
(just wrap it in
IO { ... }
to get the types right)? Just as a quick check to see where the problem actually is: if that does not overflow, it's likely
sequence
; if it does, it's somewhere else... Sequences behave pretty badly with
traverse
in terms of laziness, so it might be there 😕
k
OK, I’ll give that a try. Just a heads up, my program ran all night for who knows how long while I was asleep before finally overflowing, so I probably can’t get back to you soon with a result. Will have to let it run for a long time.
j
Ah ok, so how long is the sequence? I recently had trouble with folding/rebuilding a really long sequence, which is what traverse/sequence does
k
It’s insanely long. I’m not sure how long. The sequence length is however many files there are, recursively, in my Documents folder, which includes a lot of Node.js projects with an outlandish number of nested files in the dependencies. Tens of thousands, possibly more?
My goal is to learn how to use FP appropriately. This problem is just my first attempt at being strict on FP stuff. Trying to figure out how to handle
IO<Sequence<IO<CostlyOperation>>>
at the edge, like
top.unsafeRunSync().forEach { it.unsafeRunSync() }
j
Basically
sequence() = traverse(::identity) ≈ foldRight(Applicative.just(emptySequence())) { v, acc -> v.ap(acc.map { seq -> { el -> sequenceOf(el) + seq } }) }
. The key point here is that in
foldRight
the sequence is rebuilt and the + operator on sequences is not stacksafe. I'll look through the sources to see if that is actually correct, but I think that may be it 😕
traverse
and
sequence
are the right choice, they just don't play super nicely with
Sequence
k
I’m not so experienced with sequences and lazily evaluated stuff, but my mental model of how sequence must work in order to “pull” the nested IO out of the Sequence is that the entire Sequence would have to be evaluated and then re-wrapped in an IO
j
^^ That is correct. However it is done step by step using
IO.ap
sequence
is an alias for
traverse(::identity)
which for sequence you can see here: https://github.com/arrow-kt/arrow/blob/master/modules/core/arrow-core-data/src/main/kotlin/arrow/core/SequenceK.kt#L32 It is implemented as a fold which folds over the content, combines it with
ap
and puts the sequence back together after combining
And it does use
Sequence.plus
which is not stacksafe, which is a fault on kotlin's side I think. Basically
superLongSequence().fold(emptySequence()) { acc, v -> sequenceOf(v) + acc }.first() == Stackoverflow
. At least when I last tested it
Here is a small way to reproduce it:
generateSequence(0) { it + 1 }.take(100000).fold(emptySequence<Int>()) { acc, v ->
    acc + v // uses the Sequence<T>.plus(element: T) overload
}.first()
This sucks and should not happen; btw this is using Kotlin built-ins only
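For contrast, a stack-safe way to do the same accumulation (a stdlib-only sketch, not what Arrow does internally) is to build a strict collection first and only view it as a sequence at the end:

```kotlin
// Accumulate into a strict list instead of nesting wrapper sequences,
// then expose it as a Sequence. No deep iterator chain, so .first() is cheap.
val first = generateSequence(0) { it + 1 }
    .take(100_000)
    .fold(mutableListOf<Int>()) { acc, v -> acc.apply { add(v) } }
    .asSequence()
    .first()
// first == 0, no StackOverflowError
```

This of course gives up laziness of the intermediate result, which is exactly the trade-off being discussed here.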
k
This will be cooler once I get smart enough to understand this deep level 🙂
j
Btw I am not even sure this is the cause. The way
foldRight
and
traverse
are implemented for
Sequences
is just not great. There is also no easy way to improve it afaik. So tbh I think going for
transformedProgram.flatMap { seq -> IO { seq.filterNotNull().map { io -> io.unsafeRunSync() } } }
isn't too bad; you just lose all the concurrency guarantees and benefits Arrow gives when actually using
IO
the intended way
k
So you’re saying JB’s implementation of the Sequence.plus operator fun is not stack safe but there’s some other option out there that would be, and it’s a fixable issue that should have a bug filed, or it’s something unfixable and inherent to how one would do Sequence.plus ?
j
JB's plus operator is definitely not stack-safe (at least not this overload:
operator fun <T> Sequence<T>.plus(element: T): Sequence<T>
). The same code works fine with
operator fun <T> Sequence<T>.plus(elements: Sequence<T>): Sequence<T>
btw 😄 Fun times. We are using the latter in
traverse
which is why I think there might be something else going on. This is all very weird, you are not the only one lost in this 🙈
🙂 1
k
OK. But if I understand FP theory from Arrow docs correctly, all the `run`s should take place at the “edge” in one place, so I should really be doing
fun main() {
    val program: IO<Sequence<IO<Any>>> = TODO()
    program.unsafeRunSync().forEach { it.unsafeRunSync() }
}

right?
so all my unsafe runs are in the same place?
rather than trying to move that IO out of the Seq and then flatMapping it away wherever the top-level IO is constructed in my code
j
Well, if you can, sure. However the best way is to only have one call to any of the
unsafe
methods. But this can't really be avoided when working with
IO
in sequences because of these issues. If only stackoverflows weren't so cryptic
k
The way my program is structured, for unit testing everything, is that I have an App class that takes a bunch of use cases returning IOs. App.execute returns all the calculations wrapped in a single IO. This is the
IO<Sequence<IO<Whatever>>>
. So I can unit test the App.execute by injecting mocks. In my main() that calls App(…).execute() I have full control. Regarding `unsafe`: is it called that because the contents should be considered impure/have side effects, or is it called that because it is not stack-safe inside?
j
The former. Calling an unsafe method performs side-effects.
IO<A>
is always pure because it does not perform side-effects, but when calling any of the
unsafe
methods that changes.
k
OK, I understand that part. IO<A> is basically a value representing a function that will be impure, but IO<A> itself is not because it’s a value not an actual function-being-run. To use non-technical language.
💯 3
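That mental model can be sketched in a few lines of plain Kotlin (MiniIO is a hypothetical stand-in for illustration, not Arrow's actual IO implementation):

```kotlin
// A value that *describes* an effect; nothing runs until unsafeRunSync().
class MiniIO<A>(private val thunk: () -> A) {
    fun <B> map(f: (A) -> B): MiniIO<B> = MiniIO { f(thunk()) }
    fun <B> flatMap(f: (A) -> MiniIO<B>): MiniIO<B> = MiniIO { f(thunk()).unsafeRunSync() }
    fun unsafeRunSync(): A = thunk() // the only place effects actually happen
}

var effects = 0
val io = MiniIO { effects++; "hello" }.map { it.uppercase() }
// effects == 0 here: building and mapping the IO performed nothing
val result = io.unsafeRunSync()
// result == "HELLO", effects == 1
```

Constructing and transforming the value is pure; only the single run at the edge performs the side effect.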
j
IO
is stacksafe; functional programs cannot really afford not being stacksafe 😉 We have tests that explicitly check that. However for sequences that's a different story, because the issues are likely somewhere else
k
So theoretically
assertEquals(expectedIO, actualIO)
could be done in a unit test instead of
assertEquals(expectedValue, actualIO.unsafeRunSync())
unless IO has some eq operator that would prohibit it
r
If you encapsulate running it, then yes, but IO values are not comparable because they are lazy
I mean you can achieve that DSL but you have to run IO to compare it
j
Eq
for
IO
is undefined, we have a custom
Eq
instance for
IO
only for tests which uses
unsafeRunTimed
internally
k
I guess I was thinking of IO { list of instructions } as something like an array of instructions, and you could compare two IOs as equal if they have the same set of instructions, but the results of those IOs’ unsafeRunSync() could still be different because the contents of the effect { … } would be different in each one.
like
IO { !effect { repository.getAll() } }
would be equal to
IO { !effect { repository.getAll() } }
because they’re the same instruction sets, but that doesn’t mean the
unsafeRunSync
results will be the same, because then the effects are applied. But that's my flawed conception of what laziness actually is 🙂
j
Btw I am trying to reproduce something,
traverse
on sequences is really slow. Like the actual call to
traverse
not even running the resulting datatype. Not entirely sure what is causing it, but imo that is unacceptable levels of slowness. There is no real alternative, but this just sucks. @kyleg the best solution is to simply not use
traverse
on sequences and just straight up run whatever code you need to run. Just wrap the entire thing in
IO
so you get some control over your side-effects back 😞
k
Cool beans. Thanks.
j
You are not wrong with
IO { list of instructions }
. In the end
IO
is just a powerful wrapper around a function. That is also where the problem is: there is no practical way to implement function equality, especially not without running it, and once you run a function from
IO
you have side-effects and those are even harder to catch and test
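The "no practical function equality" point is easy to see with plain Kotlin lambdas (a small illustration, independent of Arrow):

```kotlin
val f: () -> Int = { 1 }
val g: () -> Int = { 1 }

// Structurally identical lambda literals are still distinct objects,
// so equality on the unevaluated functions is useless:
val sameFunction = f == g   // false
// You can only compare them by running them:
val sameResult = f() == g() // true, but running is exactly what IO defers
```

This is why the test-only Eq instance for IO has to run the effects (via unsafeRunTimed) to compare results.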
So after the pr with
lazyAp
is merged you can actually call
traverse
on sequences again, it will still be very very slow, but it will run your code. If you don't need the resulting sequence it's usually better to call
traverse_
which throws away the sequence and makes it just as fast as
traverse
on any other datatype. (it also has a
sequence_
version). I'll link the PR later, but the earliest this will be in Arrow is the next snapshot, or 0.10.5 as the next release. In short: use `traverse_`/`sequence_` whenever possible with sequences. With that change,
traverse
on sequences no longer hangs or throws without ever running any effect; it also only evaluates sequences as far as needed, which means it can even work on infinite sequences. Will link the PR later
With that PR merged there are no more stack overflows, traverse on sequences is now fast, and unless you care about memory usage the choice between `traverse_` and `traverse` does not matter. Basically all problems fixed
arrow 1
🔝 1
🎉 1
r
Thanks Jannis!