but i didnt see the gotchas section of Either casting docume kotlinlang #arrow

# arrow

jimn

12/08/2019, 12:44 PM

but i didn't see the gotchas section of the Either casting documentation, if there is one

raulraja

12/08/2019, 12:58 PM

Hi Jim, In general there should be no need to cast on Either since Either can only be Left or Right and `Nothing` is a bottom type that is there for inference. It’s a subtype of all types so it should be valid if your program is properly typed. Do you have a small example that shows why you have to cast it?
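A minimal sketch of the point about `Nothing`, using a hand-rolled `Either` rather than Arrow's own class so that it stays self-contained. The names `parsePort` and `describe` are hypothetical and only serve the illustration. Because both type parameters are declared `out` and `Nothing` is a subtype of every type, a `Left<String>` is accepted wherever an `Either<String, Int>` is expected, with no cast.

```kotlin
// Minimal stand-in for Arrow's Either, written here only for illustration.
sealed class Either<out A, out B> {
    data class Left<out A>(val value: A) : Either<A, Nothing>()
    data class Right<out B>(val value: B) : Either<Nothing, B>()
}

// Hypothetical parser: a Left carries an error message, a Right carries the parsed value.
fun parsePort(raw: String): Either<String, Int> {
    val port = raw.toIntOrNull()
    return if (port == null) {
        Either.Left("not a number: $raw") // Either<String, Nothing>, accepted without a cast
    } else {
        Either.Right(port)                // Either<Nothing, Int>, accepted without a cast
    }
}

// An exhaustive `when` smart-casts each branch, so reading the value needs no cast either.
fun describe(result: Either<String, Int>): String = when (result) {
    is Either.Left -> "error: ${result.value}"
    is Either.Right -> "port ${result.value}"
}
```

If a cast seems necessary in code shaped like this, it usually points at a declared type that is narrower or wider than intended.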

jimn

12/08/2019, 1:04 PM

i will take a step back up because I'm attempting something gnarly. this RowHandle is a working abstraction as an Array<Any?> that only gets me as far as 400k rows before it exceeds the JVM heap. I need to keep the parts that work and interject a lazier array/indexable that is cold all the way back to the mmap source. I chose Either<Flow<Any?>,Array<Any?>> as a refactoring attempt. https://github.com/jnorthrup/columnar/blob/81b440d93ca70907657d6b54ebff1679823a531d/src/main/java/com/fnreport/mapper/Columnar.kt#L48

jimn

12/08/2019, 1:06 PM

if all RowHandles are lazy driver calls to mmap, the disk will thrash on group/pivot operations. however, if all are resident the heap will explode. i would rather use Flow<Any?> than insert soft references at this juncture

jimn

12/08/2019, 1:07 PM

typically the access patterns will tablescan for keys, and can encode lambdas for later access side effects

jimn

12/08/2019, 1:09 PM

the idea i have is that RowHandle could use Flows in the non-key operations and the tablescans will remain Array

jimn

12/08/2019, 1:16 PM

the leak happens from within this flow, re-flowing, and similar other places.

jimn

12/08/2019, 3:05 PM

@raulraja i was using Either as a ghetto Union class here, sounds like this is not the intent.
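For a plain two-case union, a Kotlin sealed class is one common alternative to `Either`. The sketch below is hypothetical: `Resident`, `Deferred`, and `reify` are illustrative names, not taken from the linked repository, and the decoder lambda stands in for whatever reads a cell back from the mmap source.

```kotlin
// Hypothetical two-case union: a row is either resident in memory or decoded on demand.
sealed class RowHandle {
    abstract val size: Int
    abstract operator fun get(index: Int): Any?

    // Hot rows keep every cell on the heap.
    class Resident(private val cells: Array<Any?>) : RowHandle() {
        override val size: Int get() = cells.size
        override operator fun get(index: Int): Any? = cells[index]
    }

    // Cold rows keep only a decoder; each read goes back to the source.
    class Deferred(override val size: Int, private val decode: (Int) -> Any?) : RowHandle() {
        override operator fun get(index: Int): Any? = decode(index)
    }
}

// Reify a row only when an operation (for example a key scan) needs it resident.
fun RowHandle.reify(): RowHandle.Resident = when (this) {
    is RowHandle.Resident -> this
    is RowHandle.Deferred -> RowHandle.Resident(Array(size) { i -> this[i] })
}
```

Callers index both cases the same way, and the `when` stays exhaustive if a third storage strategy is added later.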

raulraja

12/08/2019, 4:43 PM

I think the underlying issue here is that you need to perform transformations and move data around, but you are constrained by memory because all these data types are strict and therefore eagerly hold fully computed values. In your case my recommendation would be to use just Streams, and I see you are already using Flow. Flow should be the wrapper of it all, assuming the Flow impl operates in constant space in memory across transformations.

raulraja

12/08/2019, 4:44 PM

If Flow is not constant-space in its transformations, which I’m unsure about since I don’t use it, there are other implementations of Streams on the JVM in several languages that are. @simon.vergauwen is currently working on bringing Streams to Arrow Fx, but the impl isn’t finished yet

raulraja

12/08/2019, 4:45 PM

Even IO operates in unbounded memory, so if you have a memory issue I can’t think of any solution other than streaming with something that guarantees composition and that all your ops run in constant space.
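To make "constant space across transformations" concrete, here is a small sketch that uses the standard library's `Sequence` as a stand-in for a streaming type. That substitution is an assumption made to keep the example dependency-free; it says nothing about how Flow itself behaves. `CountingSource` and `firstEvenSquares` are hypothetical names.

```kotlin
// Counts how many source elements were actually produced, to make laziness observable.
class CountingSource(private val limit: Int) {
    var produced = 0
        private set

    fun rows(): Sequence<Int> = sequence {
        for (i in 0 until limit) {
            produced++
            yield(i)
        }
    }
}

// Each stage handles one element at a time; no intermediate list is materialized.
fun firstEvenSquares(source: CountingSource, count: Int): List<Int> =
    source.rows()
        .map { it * it }
        .filter { it % 2 == 0 }
        .take(count)
        .toList()
```

With a source of a million elements and `count = 3`, the pipeline stops pulling once the third match is found, so only the small result list is ever held in memory.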

raulraja

12/08/2019, 4:47 PM

An alternative is turning all those functional combinators into foreach and using mutable state all around

raulraja

12/08/2019, 4:49 PM

The code will be much more efficient than allocating all those Flows in map, which captures its outer scope. Removing the functional combinators like map from there will probably reduce the allocation rate, and will also help by avoiding so much dynamic dispatch, since some of those functions like map may declare their lambda arguments as `noinline` in order to be able to capture the surrounding context
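A small sketch of the trade being described, on a plain `List` rather than on Flow. The function names are hypothetical. In this standard-library case the cost of the combinator version is one intermediate list per `map` stage, while the loop version writes into a single preallocated array.

```kotlin
// Combinator version: each `map` allocates a new intermediate list.
fun scaledLengthsWithMap(cells: List<String>, factor: Int): List<Int> =
    cells.map { it.trim() }.map { it.length }.map { it * factor }

// Loop version: one preallocated IntArray, mutated in place, no intermediate collections.
fun scaledLengthsWithLoop(cells: List<String>, factor: Int): IntArray {
    val out = IntArray(cells.size)
    for (i in cells.indices) {
        out[i] = cells[i].trim().length * factor
    }
    return out
}
```

Both produce the same values; the loop version also avoids boxing by returning an `IntArray`.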

jimn

12/08/2019, 6:36 PM

This current incarnation meets the goals of performing the operations that lag pandas the hardest, but with kotlin and JVM this is the bottom of the performance curve; the python architecture is already approaching the limit without a ton of brittle imports from Arrow (Apache). The presence of Array in the code is the result of eliminating almost all the spread operators and map{} operations - saved 50% heap this way over any stray arraylists. likewise, for loops are boss here. I see the potential tradeoffs as follows without changing too much composable structure, though perhaps iterating to cleaner editions:

• AbstractList would trade hard refs for page fault and context switch bursts.
• Pair<(Int)->Any?,Int> would behave similarly.
• Flow<Any?> would compose well but has some overheads like you cited. current flows occupy 400 MB, the arrays allocated occupy 17 GB. would arrive in the middle of stateless access and array access.
• Array<Any?> wins when there is sufficient heap, and risks everything betting on adequate RAM.

keys and keyed clusters work well with Array reification, and Flows or flows of Indexable lambdas seem like the favorable data reification strategy.
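The AbstractList and `Pair<(Int)->Any?,Int>` options from the list above can be sketched together. `LazyRow` and `asRow` are hypothetical names. The view stores only an indexer and a size, so every read goes through the lambda instead of a heap-resident cell.

```kotlin
// Hypothetical read-only row view: stores an indexer and a size instead of the cells themselves.
class LazyRow(
    override val size: Int,
    private val cellAt: (Int) -> Any?
) : AbstractList<Any?>() {
    override fun get(index: Int): Any? {
        if (index !in 0 until size) throw IndexOutOfBoundsException("index $index, size $size")
        return cellAt(index)
    }
}

// Same idea expressed as the Pair<(Int) -> Any?, Int> mentioned above.
fun Pair<(Int) -> Any?, Int>.asRow(): List<Any?> = LazyRow(second, first)
```

Because `LazyRow` is a `List<Any?>`, existing code that iterates or indexes rows keeps working, at the cost of calling the indexer on every access.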

jimn

12/08/2019, 7:01 PM

I have reviewed a bunch of kotlin and java ndframe architectures and they're all vastly bigger than 500 lines of code. i did start out with composable demand-decoded access functors but there needs to be a driver layer and a router layer, like you are saying, that have different composable shapes. the decoder functors are only necessary one level deep, and do not resemble the terminals. i couldn't satisfy both goals with one layer so the design is slightly deeper than i hoped. keys cannot drive deterministic pivot and group/clustering outcomes by any other means than at least one reification. this seems to imply that two data models are optimal: hot and cold pluggable strategies. this definitely has the potential to definitively compare and contrast FP code metrics versus many existing OO approaches using the same language and vm.

jimn

12/08/2019, 8:38 PM

i'd like to know what kind of scale arrow-aql has been tested with

raulraja

12/08/2019, 8:41 PM

AQL is just a wrapper around MTL typeclasses, which are typeclasses that expose the operations of data types. The scale factor is that of the data type plus a dynamic dispatch to go from the interface to it

raulraja

12/08/2019, 8:42 PM

You can plug in your own data type if you can provide the type classes aql depends on

raulraja

12/08/2019, 8:42 PM

The level of power is directly related to the amount of type classes a data structure can implement

raulraja

12/08/2019, 8:43 PM

Filtering, for example, is only an option for things like List or others that can have an empty identity

raulraja

12/08/2019, 8:43 PM

It has not been formally benchmarked because there isn't anything like it that I'm aware of for Kotlin

raulraja

12/08/2019, 8:44 PM

Also it's an incubating and experimental project that does not even have all the instances it can target

jimn

12/09/2019, 3:39 AM

i didn't see pivot in the docs. when i pivot my dayjob dataset i am going from 2-column source rows to 18000 empty columns + 2 values as a straw-man. in the opposite scenario, groupby for 3 years produces a list in each of those columns, combining 2.5 million values into 1100. the kotlin/jvm memory model doesn't have pointers and references like c++ casts to * and && and *& which could chain together index operator overloads and curry each column's mapper+transform as an stl iterator chain. It is not hard to "tolerate" these kinds of language limitations until you hit scale at these volumes.
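One way to picture the pivot described here, where each output row is almost entirely empty, is a sparse row that records only the column it populates. This is a hypothetical sketch, not the approach used in the linked repository: `Cell`, `SparseRow`, and `pivot` are illustrative names, and the value type is assumed to be `Double`.

```kotlin
// Hypothetical 2-column source row: a category that becomes a column, plus its value.
data class Cell(val column: String, val value: Double)

// Sparse pivot row: records only the column it populates,
// instead of allocating one slot per distinct column.
class SparseRow(val columnIndex: Int, val value: Double, val width: Int) {
    operator fun get(index: Int): Double? {
        require(index in 0 until width) { "index $index outside 0 until $width" }
        return if (index == columnIndex) value else null
    }
}

fun pivot(source: List<Cell>): Pair<List<String>, List<SparseRow>> {
    // One scan over the keys fixes the column order; this is the one reification the pivot needs.
    val columns = source.map { it.column }.distinct().sorted()
    val indexOf = columns.withIndex().associate { (i, name) -> name to i }
    val rows = source.map { SparseRow(indexOf.getValue(it.column), it.value, columns.size) }
    return columns to rows
}
```

Only the column names are held in full; each pivoted row costs a fixed few fields regardless of how many columns the pivot produces.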

jimn

12/09/2019, 3:41 AM

this is a good benchmark for different coordination translation strategies to be sure
