Wait, is that a way to read Python dataframes in Kotlin? (kotlinlang #datascience)


# datascience

bjonnh

11/08/2019, 11:16 PM

@jimn Wait is that a way to read python dataframes in kotlin?

altavir

11/09/2019, 5:33 AM

What is the problem with reading dataframes in Kotlin? There are libraries for all possible formats.

jimn

11/09/2019, 5:40 AM

@altavir maybe there is an FAQ for such things that I missed; I'm not seeing Kotlin on https://github.com/topics/fixed-width or really anywhere else. off-heap mmap on fixed-width files (fwf) is a missed opportunity in pandas for random access to sparse data. looks like the same goes for Kotlin.

altavir

11/09/2019, 5:41 AM

Pandas has multiple ways to serialize data. Which one are you talking about?

jimn

11/09/2019, 5:42 AM

serialization is one thing. performing well on time series pivot and resequencing is quite another

jimn

11/09/2019, 5:42 AM

these are operations where pandas incurs excessive overhead

altavir

11/09/2019, 5:43 AM

The question was about access, not performance.

altavir

11/09/2019, 5:44 AM

If you are looking into in-memory formats, you should try Apache arrow. It looks really interesting.

jimn

11/09/2019, 5:45 AM

in pandas, csv is optimized, but fwf is not. that's actually the less intuitive outcome, since fwf is basically free ISAM to memory map, while csv requires extra state to accommodate variable-length fields
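The "free ISAM" point can be illustrated with a minimal Kotlin sketch, assuming a hypothetical layout of 16-byte ASCII records (a 5-character id, a 10-character name, and a newline). Because every record has the same length, row `n` starts at byte `n * RECORD_LEN`, so a memory-mapped file can be read at any row without scanning the rows before it. The names `RECORD_LEN`, `writeDemoFile`, `mapFile`, and `readRecord` are made up for this example and are not from any library discussed here.

```kotlin
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.charset.StandardCharsets.US_ASCII
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.StandardOpenOption
import java.util.Locale

// Hypothetical layout: 5-char id + 10-char name + '\n' = 16 bytes per record.
const val RECORD_LEN = 16

// Writes a three-record demo file so the example is self-contained.
fun writeDemoFile(): Path {
    val path = Files.createTempFile("demo", ".fwf")
    path.toFile().deleteOnExit()
    val rows = listOf(1 to "alpha", 2 to "beta", 3 to "gamma")
    val text = rows.joinToString("") { (id, name) ->
        "%05d%-10s\n".format(Locale.ROOT, id, name)
    }
    Files.write(path, text.toByteArray(US_ASCII))
    return path
}

// Maps the whole file read-only; the mapping stays valid after the channel closes.
fun mapFile(path: Path): ByteBuffer =
    FileChannel.open(path, StandardOpenOption.READ).use { ch ->
        ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size())
    }

// Random access: jump straight to row * RECORD_LEN, no scan of earlier rows.
fun readRecord(buf: ByteBuffer, row: Int): String {
    val bytes = ByteArray(RECORD_LEN - 1)   // drop the trailing newline
    val view = buf.duplicate()              // independent position, same memory
    view.position(row * RECORD_LEN)
    view.get(bytes)
    return String(bytes, US_ASCII)
}

fun main() {
    val buf = mapFile(writeDemoFile())
    val rec = readRecord(buf, 2)
    println(rec.substring(0, 5).toInt())    // prints 3
    println(rec.substring(5).trim())        // prints gamma
}
```

A csv reader cannot compute that offset, because field lengths vary; it has to scan for delimiters or keep a separate index of row positions.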

jimn

11/09/2019, 5:45 AM

i know apache arrow. i broke it on the first try in kotlin. there are a few tickets where i helped them experience the netty NIO bugs first-hand

jimn

11/09/2019, 5:46 AM

yes arrow is very fast.

altavir

11/09/2019, 5:46 AM

Well, talking about performance of text-based formats seems strange to me. They are not made for performance.

jimn

11/09/2019, 5:47 AM

jdbc->arrow is also a neighbor of the jdbc->fwf code I'm writing now. the arrow team admits that when you work with tens of thousands of columns it's not ideal or really commonly tested

altavir

11/09/2019, 5:47 AM

And still @bjonnh question was about reading, not performance.

jimn

11/09/2019, 5:47 AM

what is uncommon about text formats in data science?

jimn

11/09/2019, 5:48 AM

@bjonnh this project started as couchdb conversion utilities but lately it's been a catch-all for pandas dataloading headaches as well, chiefly getting jdbc data.

jimn

11/09/2019, 5:50 AM

@altavir if you can use the hardware to map a file for random access, and you choose not to, that's not on me.

jimn

11/09/2019, 5:57 AM

Alex, as a point of reference: with pandas, I need a 350 GB swapfile to map 4 years of real-world sales data across 70000 products. the source data is only a few gigs of csv, and even "a few gigs" is enough to rule out most of the pandas data-loading formats

jimn

11/09/2019, 5:59 AM

fwf is an ISAM-compatible solution, and its on-disk size is within an order of magnitude of csv. pandas should not be inflating the data 100 times on pivot, but it does. maybe there's user error, but the pool of examples, and particularly pandas examples at scale, is nearly zero. datascience has a lot of tarpit example cases: a 100k data file works well in tensorflow, where numpy overheads for even those could get into the gigabytes

jimn

11/09/2019, 6:01 AM

@altavir also, i can sort fwf files using GNU sort and similar tools. there is no better option; even relational is no match for unix on text files

bjonnh

11/11/2019, 6:14 PM

"What is the problem with reading dataframes in Kotlin? There are libraries for all possible formats." I didn't know that

altavir

11/11/2019, 6:16 PM

What format do you have in mind? Pickle?

bjonnh

11/11/2019, 6:16 PM

none in particular, I was just interested in being able to exchange data between kotlin and pandas

👍 1

bjonnh

11/11/2019, 6:17 PM

being able to get my df out and have python plot it, for example

altavir

11/11/2019, 6:18 PM

You just take your favorite serialization format and find a Java library to read it. By the way, beakerx has an autotranslate feature that lets you load data in Java/Groovy/Kotlin and open it in Python in the same notebook

altavir

11/11/2019, 6:19 PM

https://github.com/twosigma/beakerx/blob/c434e0c0619c6eea4f3dcf28530260da5041016e/doc/groovy/GeneralAutotranslation.ipynb

altavir

11/11/2019, 6:22 PM

The favorite format for pandas serialization is csv/tsv. There are a lot of libraries for that in Java. There are also pandas-like libraries such as https://github.com/jtablesaw/tablesaw.
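As a rough illustration of the csv/tsv route, here is a minimal Kotlin sketch that reads tab-separated text such as pandas writes with `df.to_csv(path, sep="\t", index=False)`. It is deliberately naive: it assumes a single header row and no quoted fields, embedded tabs, or embedded newlines, and it returns every value as a string, so column types and multi-level indexes are lost. `parseTsv` is a made-up helper, not part of any library mentioned here; a real csv library would be the safer choice for anything with quoting.

```kotlin
// Naive TSV parser: first line is the header, each later line becomes a
// column-name -> value map. Assumes no quoting and no embedded tabs/newlines.
fun parseTsv(text: String): List<Map<String, String>> {
    val lines = text.lineSequence().filter { it.isNotEmpty() }.toList()
    if (lines.isEmpty()) return emptyList()
    val header = lines.first().split('\t')
    return lines.drop(1).map { line -> header.zip(line.split('\t')).toMap() }
}

fun main() {
    // Stand-in for the contents of a file exported from pandas.
    val text = "product\tunits\nwidget\t3\ngadget\t5\n"
    val rows = parseTsv(text)
    // Types are not preserved, so numeric columns must be converted by hand.
    println(rows.sumOf { it.getValue("units").toInt() })   // prints 8
}
```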

bjonnh

11/11/2019, 6:28 PM

I wanted to keep multi-levels, types etc

bjonnh

11/11/2019, 6:28 PM

maybe I'll just use HDF

bjonnh

11/11/2019, 6:29 PM

much faster to query than csv anyway

jimn

11/12/2019, 3:37 AM

hdf and avro and msgpack have been deprecated for arrow

jimn

11/12/2019, 3:39 AM

the jdk arrow libs need some babysitting - the netty NIO libraries require some jvm switches to behave, last i checked. tens of thousands of columns is outside the goals of the format
