# datascience
a
After a little time away I started a new Kotlin DataFrame (Beta3) project inside the IDEA built-in Kotlin Notebook - any memory management tips? My data set isn't too large (16 million rows, 8 columns, just short strings, LocalDates, Ints, Longs), but even after doubling the default memory size to 6.5 GB I consistently got heap and out-of-memory errors. Tried loading from a database and from a Parquet file (file = 180 MB). I was only able to get the original dataframe created after doubling the RAM, but doing a simple groupBy/aggregation threw memory errors again (even though this operation would reduce the row count by 1/7). I'm on a Mac with 36 GB RAM and can see the "java" process taking up high 6 or low 7 GB of RAM when the data is first read, before the aggregation step. Loading the same Parquet file in Python sits in the low 2 GB range. I would like to try using the Kotlin Jupyter kernel in VSCode without the overhead of IDEA or Kotlin Notebook, but that will have to wait a few days as I'm running on an older version of the Jupyter kernel. I have used Kotlin DataFrame very successfully for some projects with smaller datasets, so obviously I would like to continue doing so. Thanks!!
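(For reference, a minimal notebook-cell sketch of the kind of pipeline being described. The column names are made up, and `readParquet` stands in for whichever Parquet reader the notebook session already has loaded; this is an illustration, not the actual code.)

```kotlin
// Hypothetical column names; readParquet is assumed to be the Parquet reader
// already available in the notebook session (e.g. via %use dataframe).
val df = DataFrame.readParquet("data.parquet")   // ~16M rows, 8 columns

// The kind of "simple groupBy/aggregation" mentioned above:
val summary = df
    .groupBy("category", "eventDate")            // hypothetical key columns
    .aggregate { count() into "rows" }           // far fewer rows than the input
summary
```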
a
Basically, you should not use DataFrame or other tools that load all the data into memory for this task. Python people do that, but they do not have a lot of choice because they do not have good streaming processors. The correct way is to filter/preprocess the data on loading. I am not sure whether DataFrame can do that by itself, since it uses column storage.
👍 1
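(A minimal sketch of the "filter/preprocess on loading" idea above, assuming a plain JDBC source. The table, column names and connection URL are hypothetical; the point is that only the running totals live in memory, never the full multi-million-row result set.)

```kotlin
import java.sql.DriverManager
import org.jetbrains.kotlinx.dataframe.api.*

data class CategoryCount(val category: String, val rows: Long)

// Hypothetical JDBC URL, table and column; aggregate while streaming the result set.
fun loadPreAggregated(jdbcUrl: String): List<CategoryCount> {
    val counts = HashMap<String, Long>()
    DriverManager.getConnection(jdbcUrl).use { conn ->
        conn.createStatement().use { stmt ->
            stmt.fetchSize = 10_000                        // hint the driver to stream
            stmt.executeQuery("SELECT category FROM events").use { rs ->
                while (rs.next()) {
                    // Keep only a running count per key, row by row.
                    counts.merge(rs.getString("category"), 1L) { a, b -> a + b }
                }
            }
        }
    }
    return counts.map { (category, rows) -> CategoryCount(category, rows) }
}

fun main() {
    // Only the small aggregated result is turned into a dataframe.
    val df = loadPreAggregated("jdbc:postgresql://localhost/analytics").toDataFrame()
    df.print()
}
```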
I use our own tables API implementation: https://github.com/SciProgCentre/dataforge-core/tree/dev/tables-kt. It allows both column-based storage and row-based storage (with lazy loading). It is not well developed though, and there is no support for Parquet yet.
d
I mainly use notebooks to analyze numeric signal recordings that are in some proprietary formats and sometimes 1 GB or more. We have our own readers for these formats using streams. Like Alexander, I use my first cell to preprocess the data by extracting what interests me depending on the task at hand, like trimming cues, etc., before turning them into dataframes. Note that even then I often allocate 16 GB of memory to my Jupyter kernel... 😛
a
Parquet is more of a convenience than a necessity: it lets me read a data file in about 5 seconds instead of re-running the same 30-60 second SQL query each time we reset the analysis and start over. But for the rest, I already filter, group and pre-aggregate a lot of the data, otherwise we'd be talking 200-500 million rows instead of 16. I would be fine with some kind of streaming interface, but all the DataFrame .read() functions pull in the entire dataset. I can write a lot of vendor-specific SQL to create the data structures we need (a lot of Maps), but the further we go down that road, the more it makes sense to just do the whole thing in SQL, albeit quite messily.
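(A hedged sketch of the "pre-aggregate at the source" compromise: push the heavy grouping into SQL and read only the small result into a dataframe. The query, table and connection string are made up, and the `readSqlQuery` call reflects the dataframe-jdbc module as I understand it; check the exact reader against the DataFrame version in use.)

```kotlin
import java.sql.DriverManager
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.io.readSqlQuery

// Hypothetical query and connection; the database does the heavy grouping,
// so the dataframe only ever holds the pre-aggregated rows.
val aggregatedSql = """
    SELECT category, event_date, COUNT(*) AS row_count, SUM(amount) AS total
    FROM events
    GROUP BY category, event_date
""".trimIndent()

val summary = DriverManager.getConnection("jdbc:postgresql://localhost/analytics").use { conn ->
    DataFrame.readSqlQuery(conn, aggregatedSql)
}
```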
@altavir I appreciate your reply here, thank you. Might have to temporarily just follow @Didier Villevalois’s lead and dedicate more RAM (thanks for your reply too)
gratitude thank you 1
a
Indeed, increasing memory is always a possibility. Also, not everything can be done with row-wise operations. Aggregations are better row-wise, but if you want to operate on columns, it is not that easy (in tables-kt we use virtual columns; they are not quite usable in Python). If you could list the operations you want to do, it would be quite interesting.
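(To illustrate the row-wise vs column-wise point in plain Kotlin, with no particular table library assumed: a row-wise aggregation can stream with constant memory, while a column-wise operation such as normalizing needs the whole column materialized.)

```kotlin
// Row-wise: a single streaming pass, constant memory — only the running total is kept.
fun totalAmount(rows: Sequence<Pair<String, Long>>): Long =
    rows.fold(0L) { acc, (_, amount) -> acc + amount }

// Column-wise: normalizing needs the whole column in memory,
// because the maximum is not known until every value has been seen.
fun normalize(column: List<Long>): List<Double> {
    val max = (column.maxOrNull() ?: 1L).toDouble()
    return column.map { it / max }
}
```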
n
Hi! Could you do `Capture Memory Snapshot` in the `Profiler` tool window for the `org.jetbrains.kotlinx.jupyter.IKotlinKt` process (the one running Kotlin Notebook) and share a screenshot? Here I have a dataframe with 10 million rows, mostly String columns. Based on your profile we can figure something out. As an idea, for this specific dataframe custom String interning could reduce the footprint a lot, because most columns only have about 10 unique String values.

> I would like to try using the Kotlin Jupyter kernel in VSCode without the overhead of IDEA or Kotlin Notebook

I want to recommend trying it in a regular Gradle project with the compiler plugin enabled. Read the Parquet data in a notebook, call `df.generateDataClasses()` and copy the schema into the project (see the sketch after this message). Given the initial schema, dataframe will be able to provide typesafe results, much like in notebooks, for lots of operations like add, convert, remove, groupBy+aggregate, ..., with the exceptions being some split overloads, pivot, and similar. As a result:
1. your pipeline will run in its own process
2. gc might collect more intermediate objects
3. maybe it'll work well for you in general? 🙂
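(A sketch of that workflow under stated assumptions: the fields below are hypothetical stand-ins for the output of `df.generateDataClasses()`, the Parquet reader dependency is assumed to be on the project classpath, and the DataFrame compiler plugin is assumed to be enabled in the Gradle build as described in the Kotlin DataFrame docs.)

```kotlin
import java.time.LocalDate
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.annotations.DataSchema
import org.jetbrains.kotlinx.dataframe.api.*
import org.jetbrains.kotlinx.dataframe.io.*

// Paste the schema produced by df.generateDataClasses() here; these fields are made up.
@DataSchema
data class Event(
    val category: String,
    val eventDate: LocalDate,
    val quantity: Int,
    val amount: Long,
)

fun main() {
    // Same Parquet file as in the notebook, now read inside the project's own process.
    val df = DataFrame.readParquet("data.parquet").cast<Event>()

    // With the schema known up front, groupBy + aggregate stays typesafe:
    val summary = df
        .groupBy { category and eventDate }
        .aggregate {
            count() into "rows"
            sumOf { amount } into "total"
        }

    summary.print()
}
```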
a
Thanks again for the replies, I really appreciate you guys offering suggestions to help. I took the memory snapshots but can't open the profiler right this minute; I'll have to figure that out. Yes, I agree a Gradle project would probably be more successful in terms of running straight through, however I'm in dev mode just starting the analysis of the data, which is why I'm utilizing the Notebook. As for classes, I do use @DataSchema to annotate the classes being loaded in, and then also post-transformation. I'll try to test with the compiler plugin and also the String interning. Thanks for the suggestions
👍 2
thank you color 1