# datascience
a
I've finished the first iteration of the Table API. The current version is here: https://github.com/mipt-npm/dataforge-core/tree/dev/dataforge-tables/src/commonMain/kotlin/hep/dataforge/tables. It features reasonable (though not full) type safety for table building and reading. It is also basically an interface, so it can be bound to different table implementations. For now there is no documentation, and only a single IO test here: https://github.com/mipt-npm/dataforge-core/blob/dev/dataforge-tables/src/jvmTest/kotlin/hep/dataforge/tables/io/TextRowsTest.kt
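Not the actual API (see the repository above for that), but a minimal sketch of the shape of the idea, with purely illustrative names: columns as typed interfaces, a table as a collection of columns, and concrete backends supplying the storage.

```kotlin
// Hypothetical sketch only; the real API lives in the linked dataforge-tables sources.
interface Column<out T : Any> {
    val name: String
    val size: Int
    operator fun get(index: Int): T?
}

interface Table {
    val columns: List<Column<*>>
    val rowCount: Int
    fun getColumn(name: String): Column<*>? = columns.find { it.name == name }
}

// A trivial in-memory column; another backend could map the same interface
// onto primitive arrays, Arrow buffers, memory-mapped files, etc.
class ListColumn<out T : Any>(
    override val name: String,
    private val values: List<T?>
) : Column<T> {
    override val size: Int get() = values.size
    override fun get(index: Int): T? = values[index]
}
```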
👍 3
j
If this is a JVM-specific codebase, Kotlin concurrency is probably going to underperform at transaction-granularity dispatch. I started along similar lines of table manipulation using thread-locals and CoroutineContext composition, and had to back out any suspend code and treat the CoroutineContexts as non-concurrent set objects to get sensible profiler results.
My specific metric was to get function sizes down below the JIT's escape-analysis limits and to identify any potential concurrency wins for thread-based swim lanes. Suspension appears to induce excessive context-protection overhead in the form of large, inlining-hostile functions, even in explicitly single-threaded code.
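A minimal sketch of that approach, with made-up element names (TableScope, CursorScope): CoroutineContext composition used as a plain, non-concurrent keyed element set, with no suspend code involved anywhere.

```kotlin
import kotlin.coroutines.AbstractCoroutineContextElement
import kotlin.coroutines.CoroutineContext

class TableScope(val label: String) : AbstractCoroutineContextElement(TableScope) {
    companion object Key : CoroutineContext.Key<TableScope>
}

class CursorScope(val position: Int) : AbstractCoroutineContextElement(CursorScope) {
    companion object Key : CoroutineContext.Key<CursorScope>
}

fun main() {
    // `+` merges elements by key; no dispatcher, no suspension, just composition.
    val ctx: CoroutineContext = TableScope("prices") + CursorScope(42)
    println(ctx[TableScope]?.label)     // prices
    println(ctx[CursorScope]?.position) // 42
}
```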
a
Sorry, I missed your comments. I have not added manipulation yet, and I am still thinking about how best to add it, if at all.
Yesterday we had a discussion with @Zelenyi about that and came to some decisions. The key difference between what I am doing and, say, Python is that tables are effectively immutable, so when you do any operation on columns, you are actually constructing a new table, which can reuse columns from the old table. I am not sure that we actually have a valid case where we change only some values in a table. I would welcome any examples.
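A minimal sketch of that, reusing the illustrative Column/ListColumn/Table names from the earlier sketch (again, not the real dataforge-tables API): replacing one column builds a new table that shares every other column object with the original.

```kotlin
// Copy-on-write column replacement: unchanged columns are shared, not copied.
class ColumnTable(override val columns: List<Column<*>>) : Table {
    override val rowCount: Int get() = columns.firstOrNull()?.size ?: 0
}

// Replace (or add) a single column; all other column objects are reused as-is.
fun Table.withColumn(column: Column<*>): Table =
    ColumnTable(columns.filter { it.name != column.name } + column)

// Usage: `table` is untouched, and the "age" column object is shared by both tables.
val ages = ListColumn("age", listOf(21, 34, 55))
val names = ListColumn("name", listOf("a", "b", "c"))
val table: Table = ColumnTable(listOf(names, ages))
val upper: Table = table.withColumn(ListColumn("name", listOf("A", "B", "C")))
```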
j
Mutable tables seem unimportant in the use cases I'm familiar with, IIUC.
a
@jimn Could you elaborate on what you mean by mutable tables? Do you mutate single values in columns, or whole columns?
j
I'm working on a pandas replacement personally. I have in the past used pandas to mutate cell values in a dataframe, perhaps a column at a time, or to do search-and-replace, but authoring such a facility would open the door to heap objects, which is one of the main reasons for replacing pandas in the first place.
a
I am not concerned about the heap for now, since I am developing an interface that could later wrap other implementations. Implementations are responsible for memory, not the API.
Column mutation is possible at the moment without violating immutability: you just create a new table, reusing the unchanged columns. We can also add implementation-specific mutation methods later.
I would welcome any suggestions about the API.
j
I'm relatively happy with what I've got for solving Python problems with Kotlin, and I arrived at an immutable view-manipulation framework. I'm more disappointed with the Kotlin libraries as a whole: the suspension overhead is around 3x the cost of an atomic data access, and the suspension/flow capture overheads are comparable. The base language is not bad, but carrying the JDK collections is surely just to appease the JVM/Android momentum.
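A crude way to eyeball that kind of comparison (illustrative only; it needs kotlinx-coroutines-core on the classpath, and real numbers should come from JMH rather than a hand-rolled loop): run the same atomic increment through a plain function and through a suspend function and compare per-operation times.

```kotlin
import java.util.concurrent.atomic.AtomicLong
import kotlin.system.measureNanoTime
import kotlinx.coroutines.runBlocking

val counter = AtomicLong()

fun plainAdd(): Long = counter.incrementAndGet()

// Never actually suspends; the measured difference is the suspend call-site plumbing.
suspend fun suspendingAdd(): Long = counter.incrementAndGet()

fun main() = runBlocking {
    val n = 10_000_000
    val direct = measureNanoTime { repeat(n) { plainAdd() } }
    val suspended = measureNanoTime { repeat(n) { suspendingAdd() } }
    println("plain atomic: ${direct / n} ns/op, via suspend fun: ${suspended / n} ns/op")
}
```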
a
If you have something ready, I can try to integrate it with my API and see how it works. Suspension is a very powerful tool, but it should be used with care and understanding.
j
While it's one thing to build a bulletproof memory model and give every possible data use case some attention at the cost of technical debt, I think the reality is that a dataframe application should rely on doing a transform and living a short, brutal life, guaranteed to be cleaned up at the process level.
The input structuring is basic, as is to be expected for a first cut. I don't know if this is anything more than a technology demo for the Kotlin typealiasing capabilities, to be honest. I have something that can scale up beyond where pandas dies; the premise is simplistic and portable to Rust or C++, which makes sense for any significant expansion of scope using the lessons learned here.
a
I am not sure that I can elaborate on that without an actual problem background. It is obvious that you can't find a universally good solution. Thanks for the reference; I will look into it later.
j
I would say this use case is most applicable to Spark datasets on a diet.
I intended to test out y,x storage alongside x,y as an orthogonal layout, to choose IO models appropriately; however, Kotlin and IO do not mix.
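For concreteness, here is how I read the x,y versus y,x distinction for a flat buffer (an illustrative layout, not anyone's actual code): the same logical cell addressed row-major versus column-major.

```kotlin
// Same data, two orthogonal layouts of a rows x cols table in one flat array.
class FlatTable(val rows: Int, val cols: Int, val data: DoubleArray) {
    init { require(data.size == rows * cols) }

    // Row-major ("x,y"): cells of one row are contiguous; friendly to record-at-a-time IO.
    fun rowMajor(row: Int, col: Int): Double = data[row * cols + col]

    // Column-major ("y,x"): cells of one column are contiguous; friendly to columnar scans.
    fun colMajor(row: Int, col: Int): Double = data[col * rows + row]
}
```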
a
> kotlin and IO do not mix

I do not agree, but we discussed it already. There should be some problem definition to discuss it further.
j
The key points are: a great functional programming language, generous JVM optimization potential to smooth over fundamentally awkward libraries, and bijection-centric design principles that can live in typealias conveniences absent from Java proper. Using NIO.
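One illustrative reading of the typealias point, with hypothetical names: aliases over function types give readable names to NIO field decoders without introducing wrapper classes, which plain Java cannot express.

```kotlin
import java.nio.ByteBuffer

typealias RowIndex = Int
typealias FieldDecoder<T> = (ByteBuffer, RowIndex) -> T

// Fixed-width binary decoders as named function values (8-byte fields assumed).
val doubleField: FieldDecoder<Double> = { buf, row -> buf.getDouble(row * 8) }
val longField: FieldDecoder<Long> = { buf, row -> buf.getLong(row * 8) }
```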
a
We already discussed it, and I can repeat that I believe in many cases you need to read the data differently, not use a different tool. But, again, we need to discuss specific use cases.
j
The codebase is foundational in orthogonal access patterns for tabular data. As a first cut, the parts that were most easily improved got there by removing Kotlin libraries and coroutines.
I eliminated capture and suspension overhead. It looks like the simplest possible parallelization to utilize under-used cores is going to be Java streams with threads. Simpler still would be to port to C++, since it is now single-threaded code, and insert OpenMP pragmas on a few inner loops.
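As a rough sketch of what "Java streams with threads" could look like from Kotlin (no coroutines involved; names are illustrative):

```kotlin
import java.util.stream.IntStream

// Parallel reduction over one column using the common ForkJoin pool.
fun columnSum(column: DoubleArray): Double =
    IntStream.range(0, column.size)
        .parallel()
        .mapToDouble { i -> column[i] }
        .sum()
```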
I have considered that, now that the tabular y,x code is solid, I can create a file per column and unify these across combined cursors. One of my benchmarks converts 2.8 million x 7 cells into roughly 1100 x 9800, so the intermediary would mean generating, and holding mmap handles for, at least 9,800 additional small files. At that point those files no longer need the mmap code, and I don't have small-file access driver code at this time, so until I see a really burning issue I'm OK with the existing up-front costs. Adding a separate writable device is the cheapest IO upgrade.
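A sketch of the file-per-column idea under assumed names and layout (one plain binary file of doubles per column, memory-mapped read-only); the paths and file format are assumptions, not the actual project layout.

```kotlin
import java.nio.DoubleBuffer
import java.nio.channels.FileChannel
import java.nio.file.Path
import java.nio.file.StandardOpenOption

// Map a column file of `rows` doubles and expose it as a read-only DoubleBuffer view.
fun mapDoubleColumn(path: Path, rows: Int): DoubleBuffer =
    FileChannel.open(path, StandardOpenOption.READ).use { channel ->
        channel.map(FileChannel.MapMode.READ_ONLY, 0, rows.toLong() * 8)
            .asDoubleBuffer()
    }

// Usage: reading one cell of a mapped column.
// val price = mapDoubleColumn(Path.of("columns/price.bin"), rowCount).get(42)
```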
When I compare fixed-width (FWF) textual input IO to binary NIO field input of the same data, a single noinline keyword on the reducer eliminates any gains from hitting a smaller binary file, I believe.
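My reading of the noinline remark, as an illustrative shape rather than the actual code: if the reducer parameter has to be marked noinline (for example because it is stored or handed to a non-inline function), every cell pays an indirect lambda call, which can erase the win from reading a smaller binary file.

```kotlin
// The cell reader is inlined at the call site, but the noinline reducer stays
// a real function object, so each iteration goes through an indirect call.
inline fun readAndReduce(
    rows: Int,
    readCell: (Int) -> Double,
    noinline reducer: (Double, Double) -> Double
): Double {
    var acc = 0.0
    for (row in 0 until rows) acc = reducer(acc, readCell(row))
    return acc
}
```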
This code is likely as optimal as JVM Kotlin is going to get without elaborate queuing of IO zones for batching up partial clusters.