<@U65FK6QNB> Wait is that a way to read python dat...
# datascience
b
@jimn Wait is that a way to read python dataframes in kotlin?
a
What is the problem with reading dataframes in Kotlin? There are libraries for all possible formats.
j
@altavir maybe there is a faq for such things that i missed, im not seeing kotlin on https://github.com/topics/fixed-width or really anywhere else. having off-heap mmap on fwf is somehow a missed opportunity in pandas for random access sparse data. looks like same for kotlin.
a
Pandas has multiple ways to serialize data. Which one are you talking about?
j
serialization is one thing. performing well is quite another with time series pivot and resequencing
these are things pandas commits excessive overhead to
a
The question was about access, not performance.
If you are looking into in-memory formats, you should try Apache arrow. It looks really interesting.
j
in pandas, csv is optimized, but fwf is not. that's actually the less intuitive outcome, since fwf is basically free ISAM to memory map and csv requires extra state to accommodate variable lengths
i know apache arrow. i broke it on the first try in kotlin. there's a few tickets where i help them experience the netty NIO bugs first-hand
yes arrow is very fast.
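A minimal sketch of the fixed-width point above, not anyone's actual code: because every record has the same byte length, record i starts at byte i * RECORD_LEN, so a memory-mapped file can be indexed directly with no parsing state. The column layout, the sample rows, and the names RECORD_FIELDS, write_fwf, and read_record are all hypothetical.

```python
import mmap
import os
import tempfile

# Hypothetical layout: (column name, width in bytes). Not from the discussion.
RECORD_FIELDS = [("sku", 8), ("date", 10), ("qty", 6)]
# Every record is the sum of the column widths plus one newline byte.
RECORD_LEN = sum(width for _, width in RECORD_FIELDS) + 1


def write_fwf(path, rows):
    """Write rows as space-padded fixed-width ASCII records."""
    with open(path, "wb") as f:
        for row in rows:
            line = "".join(
                str(value).ljust(width)[:width]
                for value, (_, width) in zip(row, RECORD_FIELDS)
            )
            f.write(line.encode("ascii") + b"\n")


def read_record(mm, index):
    """Jump straight to record `index` by offset arithmetic and slice its columns."""
    start = index * RECORD_LEN
    raw = mm[start:start + RECORD_LEN - 1]  # drop the trailing newline
    record = {}
    pos = 0
    for name, width in RECORD_FIELDS:
        record[name] = raw[pos:pos + width].decode("ascii").rstrip()
        pos += width
    return record


# Made-up sample data, only to exercise the sketch.
rows = [
    ("A0000001", "2019-01-01", 3),
    ("A0000002", "2019-01-02", 12),
    ("A0000003", "2019-01-03", 7),
]
path = os.path.join(tempfile.mkdtemp(), "sales.fwf")
write_fwf(path, rows)

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    record_count = len(mm) // RECORD_LEN  # file size alone gives the row count
    third = read_record(mm, 2)            # no scan over records 0 and 1
```

A csv reader would have to scan for delimiters and newlines to locate the same row, which is the extra state mentioned above.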
a
Well, talking about performance of text-based formats seems strange to me. They are not made for performance.
j
jdbc->arrow is also a neighbor of the jdbc->fwf code im writing now. the arrow team admits that when you work with tens of thousands of columns it's not ideal or really commonly tested
a
And still, @bjonnh's question was about reading, not performance.
j
what is uncommon about text formats in data science?
@bjonnh this project started as couchdb conversion utilities but lately it's been a catch-all for pandas dataloading headaches as well, chiefly getting jdbc data.
@altavir if you can use the hardware to map a file for random access, and you choose not to, that's not on me.
Alex, as a point of reference: with pandas, I need a 350 gig swapfile to map 4 years of sales data across 70000 products using real world data. the source data is only a few gigs of csv, and even "a few gigs" is enough to rule out loading it successfully with most of the pandas data loading formats
fwf is an ISAM compatible solution, and its on-disk size is within the same order of magnitude as csv. pandas should not be inflating the data by 100 times on pivot, but it does. maybe there's user error, but the pool of examples, and particularly pandas examples at scale, is nearly zero. datascience has a lot of tarpit example cases - a 100k data file works well in tensorflow, where numpy overheads for even those could get into the gigabytes
@altavir also, i can sort fwf files using gnusort tools, etc. there is no better option, even relational is no match for unix on text files
b
"What is the problem with reading dataframes in Kotlin? There are libraries for all possible formats." I didn't know that
a
What format do you have in mind? Pickle?
b
none in particular, I was just interested in being able to exchange data between kotlin and pandas
👍 1
being able to get my df out and have python plot it, for example
a
You just take your favorite serialization format and find a Java library to read it. By the way, beakerx has an autotranslate feature that lets you load data in Java/Groovy/Kotlin and then open it in python in the same notebook.
The favorite format for pandas serialization is csv/tsv. There are a lot of libraries for that in java. There are also python-like libraries such as https://github.com/jtablesaw/tablesaw.
b
I wanted to keep multi-levels, types etc
maybe I'll just use HDF
much faster to query than csv anyway
j
hdf and avro and msgpack have been deprecated for arrow
jdk arrow libs still need some babysitting - the netty NIO libraries require some jvm switches to behave, last i checked. tens of thousands of columns is outside the goals of the format