# arrow
j
but i didn't see a gotchas section in the Either casting documentation, if there is one
r
Hi Jim, in general there should be no need to cast on Either, since Either can only be Left or Right and Nothing is a bottom type that is there for inference. It’s a subtype of all types, so it should be valid if your program is properly typed. Do you have a small example that shows why you have to cast it?
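A minimal sketch of why no cast is needed. The `Either` below is a simplified stand-in written only for this illustration, not Arrow's actual class, and `parsePort` and `describe` are hypothetical names. Because both type parameters are covariant and `Nothing` is a subtype of every type, a `Left<String>` is already a valid `Either<String, Int>`:

```kotlin
// Simplified stand-in for an Either type, only to illustrate the typing.
sealed class Either<out A, out B> {
    data class Left<out A>(val value: A) : Either<A, Nothing>()
    data class Right<out B>(val value: B) : Either<Nothing, B>()
}

// Hypothetical parser: each branch yields a Left or a Right, with no casts.
fun parsePort(text: String): Either<String, Int> {
    val port = text.toIntOrNull()
    return if (port == null) Either.Left("not a number: $text") else Either.Right(port)
}

// The sealed hierarchy makes this `when` exhaustive, and smart casts expose `value`.
fun describe(result: Either<String, Int>): String = when (result) {
    is Either.Left -> "error: ${result.value}"
    is Either.Right -> "port ${result.value}"
}
```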
j
i will take a step back up because i'm attempting something gnarly. this RowHandle is a working abstraction as an Array<Any?> that only gets me as far as 400k rows before it exceeds the jvm heap. I need to keep the parts that work and interject a lazier array/indexable that is cold all the way back to the mmap source. I chose Either<Flow<Any?>,Array<Any?>> as a refactoring attempt. https://github.com/jnorthrup/columnar/blob/81b440d93ca70907657d6b54ebff1679823a531d/src/main/java/com/fnreport/mapper/Columnar.kt#L48
if all RowHandles are lazy driver calls to mmap, the disk will thrash on group/pivot operations. however, if all are resident the heap will explode. i would prefer to use Flow<Any?> rather than insert softrefs at this juncture
typically the access patterns will tablescan for keys, and can encode lambdas for later access side effects
the idea i have is that RowHandle could use Flows in the non-key operations and the tablescans will remain Array
the leak happens from within this flow, re-flowing, and similar other places.
@raulraja i was using Either as a makeshift Union class here; sounds like this is not the intent.
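One way to express that union without Either is a plain sealed class. This is only a sketch under assumptions: `RowData`, `Resident`, `Deferred` and `reify` are hypothetical names, not taken from the linked repository, and the decoder lambda stands in for whatever reads from the mmap source.

```kotlin
// Hypothetical union of a resident row and a lazily decoded row.
sealed class RowData {
    abstract val size: Int
    abstract operator fun get(index: Int): Any?

    // Hot: values are already decoded and held on the heap.
    class Resident(private val values: Array<Any?>) : RowData() {
        override val size: Int get() = values.size
        override fun get(index: Int): Any? = values[index]
    }

    // Cold: every access calls back into a decoder (for example over an mmap source).
    class Deferred(override val size: Int, private val decode: (Int) -> Any?) : RowData() {
        override fun get(index: Int): Any? = decode(index)
    }
}

// Reify a row only when an operation (such as a key scan) needs it resident.
fun RowData.reify(): RowData.Resident = when (this) {
    is RowData.Resident -> this
    is RowData.Deferred -> RowData.Resident(Array(size) { get(it) })
}
```

Callers index both variants the same way, so the choice between hot and cold stays local to where the row is created.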
r
I think the underlying issue here is that you need to perform transformations and move data around, but you are constrained by memory because all these data types are strict and therefore eagerly compute and hold their values. My recommendation would be to use just streams, and I see you are already using Flow. Flow should be the wrapper of it all, assuming the Flow impl operates in constant space across transformations.
If Flow is not constant space in its transformations (I’m unsure, since I don’t use it), there are other implementations of streams on the JVM in several languages that are. @simon.vergauwen is currently working on bringing Streams to Arrow Fx, but the impl isn’t finished yet
Even IO operates in unbounded memory, so if memory is the issue I can’t think of any solution other than streaming with something that guarantees composition and runs all your ops in constant space.
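A rough illustration of the constant-space idea, using the standard library's `Sequence` as a stand-in for Flow (an assumption made so the sketch needs no coroutine dependency; whether Flow gives the same guarantee is left open above). `rows` and `sumOfSecondColumn` are hypothetical names.

```kotlin
// Hypothetical row source: yields one decoded row at a time instead of a full table.
fun rows(count: Int): Sequence<IntArray> = sequence {
    for (i in 0 until count) yield(intArrayOf(i, i * 2))
}

// Lazy transform and fold: each row passes through the whole pipeline
// before the next one is produced, so no intermediate collection is built.
fun sumOfSecondColumn(count: Int): Long =
    rows(count)
        .map { it[1] }
        .filter { it % 4 == 0 }
        .fold(0L) { acc, v -> acc + v }
```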
An alternative is turning all those functional combinators into foreach and using mutable state all around
The code will be much more optimal than allocating all those Flows in map, which captures its outer scope. Removing functional combinators like map from there will probably reduce the allocation rate, and will also help by avoiding so much dynamic dispatch, since some of those functions like map may declare their lambda arguments as noinline in order to be able to capture the surrounding context
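To make the foreach suggestion concrete, here is a small sketch with hypothetical function names and made-up arithmetic. Both functions compute the same total; the first builds an intermediate list per stage, the second keeps a single mutable accumulator.

```kotlin
// Combinator version: map and filter each allocate an intermediate list.
fun totalWithCombinators(values: IntArray): Long =
    values.map { it.toLong() * 2 }.filter { it > 10 }.sum()

// Loop version: one mutable accumulator and no intermediate collections.
fun totalWithLoop(values: IntArray): Long {
    var total = 0L
    for (v in values) {
        val doubled = v.toLong() * 2
        if (doubled > 10) total += doubled
    }
    return total
}
```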
j
This current incarnation meets the goals of performing the operations that lag pandas the hardest, but with kotlin and the JVM this is the bottom of the performance curve; the python architecture is already approaching its limit without a ton of brittle imports from Arrow (Apache). The presence of Array in the code is the result of eliminating almost all the spread operators and map{} operations, which saved 50% heap over any stray arraylists. likewise, for loops are boss here. I see the potential tradeoffs as follows, without changing too much composable structure, though perhaps iterating to cleaner editions:
• AbstractList would trade hard refs for page fault and context switch bursts.
• Pair<(Int)->Any?,Int> would behave similarly.
• Flow<Any?> would compose well but has some overheads like you cited. current flows occupy 400 mb, the arrays allocated occupy 17 gigs. would arrive in the middle of stateless access and array access.
• Array<Any?> wins when there is sufficient heap, and risks everything betting on adequate RAM.
keys and keyed clusters work well with Array reification, and Flows or flows of Indexable lambdas seem like the favorable data reification strategy.
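The Pair<(Int)->Any?,Int> option could look roughly like the sketch below. The alias and the helper names (`Indexable`, `asIndexable`, `mapLazy`, `toResidentArray`) are assumptions for illustration and are not claimed to match the repository.

```kotlin
// Hypothetical alias: an accessor function plus the element count.
typealias Indexable = Pair<(Int) -> Any?, Int>

// Wrap a resident array without copying it.
fun Array<Any?>.asIndexable(): Indexable = Pair({ i: Int -> this[i] }, size)

// Compose a per-element transform lazily; nothing is evaluated until an index is read.
fun Indexable.mapLazy(transform: (Any?) -> Any?): Indexable {
    val (at, count) = this
    return Pair({ i: Int -> transform(at(i)) }, count)
}

// Reify into an array only where a resident copy is required.
fun Indexable.toResidentArray(): Array<Any?> {
    val (at, count) = this
    return Array(count) { at(it) }
}
```

The tradeoff is the one listed above: each read re-runs the accessor chain, so repeated access costs CPU (and possibly page faults) instead of heap.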
I have reviewed a bunch of kotlin and java ndframe architectures and they're all vastly bigger than 500 lines of code. i did start out with composable demand-decoded access functors, but there needs to be a driver layer and a router layer, like you are saying, that have different composable shapes. the decoder functors are only necessary one level deep, and do not resemble the terminals. i couldn't satisfy both goals with one layer, so the design is slightly deeper than i hoped. keys cannot drive deterministic pivot and group/clustering outcomes by any other means than at least one reification. this seems to imply that two data models are optimal: hot and cold pluggable strategies. this has the potential to definitively compare and contrast FP code metrics versus many existing OO approaches using the same language and vm.
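A sketch of what "at least one reification" for keys might mean in practice, assuming a hypothetical `keyAt` accessor into the cold source: only the key column is scanned and held, as row indices per key, while the remaining columns can stay lazy.

```kotlin
// Scan the key column once and record which row indices belong to each key.
// `keyAt` is a hypothetical accessor into the cold (lazily decoded) source.
fun groupRowIndices(rowCount: Int, keyAt: (Int) -> Any?): Map<Any?, IntArray> {
    val buckets = LinkedHashMap<Any?, MutableList<Int>>()
    for (row in 0 until rowCount) {
        buckets.getOrPut(keyAt(row)) { mutableListOf() }.add(row)
    }
    // Compact the index lists into primitive arrays, keeping first-seen key order.
    return buckets.mapValues { it.value.toIntArray() }
}
```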
i'd like to know what kind of scale arrow-aql has been tested with
r
AQL is just a wrapper around MTL type classes, which are type classes that expose the operations of data types. The scale factor is that of the data type plus a dynamic dispatch to go from the interface to it
You can plug in your own data type if you can provide the type classes aql depends on
The level of power is directly related to the number of type classes a data structure can implement
For example, the filtering option is only available for things like List or others that can have an empty identity
It has not been formally benchmarked because there isn't anything like it that I'm aware of for Kotlin
Also it's an incubating and experimental project that does not have even all instances it can target
j
i didn't see pivot in the docs. when i pivot my dayjob dataset i am going from 2-column source rows to 18000 empty columns + 2 values as a straw-man. in the opposite scenario, groupby for 3 years produces a list in each of those columns, combining 2.5 million values into 1100. the kotlin/jvm memory model doesn't have pointers and references like c++ casts to * and && and *& which could chain together index operator overloads and curry each column's mapper+transform as an stl iterator chain. It is not hard to "tolerate" these kinds of language limitations until you hit scale at these volumes.
this is a good benchmark for different coordination translation strategies to be sure