Are there any particular considerations in the kotlin datafr kotlinlang #datascience

sean

12/15/2023, 5:22 PM

Are there any particular considerations in the Kotlin DataFrame library for parallel processing of an operation across a dataframe? Likewise for vectorized operations?

altavir

12/15/2023, 7:50 PM

Vectorization works automatically for all primitive arrays. I don't remember if DataFrame uses them by default, but if it does, it should work. Parallel processing usually doesn't help with simple operations.
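
To illustrate the point about primitive arrays, here is a minimal sketch that is not from the thread. It assumes plain Kotlin on the JVM and uses a hypothetical helper named `scaleAndAdd`. A simple counted loop over `DoubleArray` is the kind of code the JIT compiler can auto-vectorize; whether it actually does depends on the JVM and the CPU.

```kotlin
// Hypothetical helper for illustration; not part of Kotlin DataFrame.
// Computes a[i] * scale + b[i] element-wise over primitive arrays.
// DoubleArray compiles to double[] on the JVM, so the loop body works on
// raw numbers with no boxing, which is what makes auto-vectorization possible.
fun scaleAndAdd(a: DoubleArray, b: DoubleArray, scale: Double): DoubleArray {
    require(a.size == b.size) { "arrays must have the same length" }
    val out = DoubleArray(a.size)
    for (i in a.indices) {
        out[i] = a[i] * scale + b[i]
    }
    return out
}
```

Calling `scaleAndAdd(doubleArrayOf(1.0, 2.0, 3.0), doubleArrayOf(10.0, 20.0, 30.0), 2.0)` yields an array holding 12.0, 24.0 and 36.0.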

Jolan Rensen [JB]

01/05/2024, 11:22 AM

Currently no, DataFrame doesn't perform processing in parallel, nor does it use primitive arrays (but we're thinking about it, because indeed vectorization might improve performance). At the moment large-scale performance isn't the main priority of DataFrame. DataFrame's strengths lie in its in-memory capabilities and readable API, which make data exploration easier. For large-scale processing I'd recommend Apache Arrow (for which DataFrame has good interop), or Apache Spark (for which interop is also possible).

altavir

01/05/2024, 11:28 AM

Nice to see my issue still has some traction (I forgot about it). And it is not only about vectorization, it is about memory indirection. We can discuss it separately, but the main problem is boxing. The expected performance gain for processing raw numbers is 4-10 times, but it requires a significant change to the API to make the most of it. Smaller improvements in CPU time (but significant ones in memory consumption) could be achieved by using number arrays as storage. Escape analysis (especially on GraalVM) provides a significant boost even when the array is masked by a boxing class.
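
The boxing problem described above can be sketched as follows. This is an illustration only: `BoxedColumn` and `PrimitiveColumn` are hypothetical classes, not Kotlin DataFrame types, and the sketch makes no claim about how DataFrame stores columns internally beyond what is said in the thread.

```kotlin
// Hypothetical column types for illustration; NOT Kotlin DataFrame classes.

// Boxed storage: List<Double> holds references to java.lang.Double objects,
// so every read follows a pointer and then unboxes the value.
class BoxedColumn(private val values: List<Double>) {
    fun sum(): Double {
        var total = 0.0
        for (v in values) total += v
        return total
    }
}

// Primitive storage: DoubleArray compiles to double[], a contiguous block
// of raw 8-byte numbers with no per-element object.
class PrimitiveColumn(private val values: DoubleArray) {
    val size: Int get() = values.size

    // Element access through the wrapper class: the array stays primitive
    // even though callers only see the column API.
    operator fun get(index: Int): Double = values[index]

    fun sum(): Double {
        var total = 0.0
        for (i in values.indices) total += values[i]
        return total
    }
}
```

Both classes return the same sum for the same data; the difference is in memory layout. On a typical 64-bit HotSpot JVM, each boxed `Double` is a separate heap object plus a reference to it, while the primitive array stores 8 bytes per element.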
