Are there any particular considerations in the kot...
# datascience
s
Are there any particular considerations in the Kotlin DataFrame library for processing an operation in parallel across a dataframe? Likewise with vectorized operations?
a
Vectorization works automatically for all primitive arrays. I don't remember if DataFrame uses them by default, but if it does, it should work. Parallel processing usually doesn't help with simple operations.
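To make that concrete, here is a minimal sketch using only the Kotlin and Java standard libraries, not the DataFrame API. The function names (`scaleInPlace`, `sumSequential`, `sumParallel`) are made up for illustration. The element-wise loop over a `DoubleArray` is the kind of code the JIT can typically auto-vectorize, and the parallel stream shows what parallelizing a simple reduction looks like. Whether either one is actually faster depends on the JVM, the hardware, and the data size, so measure before relying on it.

```kotlin
import java.util.Arrays

// Element-wise loop over a primitive DoubleArray (a JVM double[]).
// Simple counted loops like this are candidates for JIT auto-vectorization.
fun scaleInPlace(values: DoubleArray, factor: Double) {
    for (i in values.indices) {
        values[i] *= factor
    }
}

// Plain sequential reduction.
fun sumSequential(values: DoubleArray): Double {
    var total = 0.0
    for (v in values) total += v
    return total
}

// The same reduction through a parallel stream. For a simple operation
// the cost of splitting and joining the work can outweigh any gain.
fun sumParallel(values: DoubleArray): Double =
    Arrays.stream(values).parallel().sum()

fun main() {
    val values = DoubleArray(1_000) { it.toDouble() }
    scaleInPlace(values, 2.0)
    println(sumSequential(values))
    println(sumParallel(values))
}
```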
j
Currently no, DataFrame doesn't perform processing in parallel, nor does it use primitive arrays (but we're thinking about it, because vectorization might indeed improve performance). At the moment large-scale performance isn't the main priority of DataFrame. DataFrame's strengths lie in its in-memory capabilities and readable API, which make data exploration easier. For large-scale processing I'd recommend Apache Arrow (for which DataFrame has good interop) or Apache Spark (for which interop is also possible).
a
Nice to see my issue still has some traction (I forgot about it). And it is not only about vectorization, it is about memory indirection. We can discuss it separately, but the main problem is boxing. The expected performance gain for processing raw numbers is 4-10 times, but it requires a significant change to the API to make the most of it. Smaller improvements in CPU time (but significant ones in memory consumption) could be achieved by using number arrays as storage. Escape analysis (especially on GraalVM) provides a significant boost even when the array is masked by a boxing class.
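A small sketch of the boxing difference, in plain Kotlin and independent of DataFrame's actual internals. The function names `sumBoxed` and `sumPrimitive` are hypothetical. On the JVM a `List<Double>` holds references to separate `java.lang.Double` objects, while a `DoubleArray` compiles to a contiguous `double[]`, so the second loop reads the numbers directly with no extra pointer hop per element. No speedup figure is claimed here. How much the JIT and escape analysis recover for the boxed version varies by runtime.

```kotlin
// Boxed storage: each element is a reference to a java.lang.Double object,
// so every read follows a pointer and then unboxes the value.
fun sumBoxed(values: List<Double>): Double {
    var total = 0.0
    for (v in values) total += v
    return total
}

// Primitive storage: DoubleArray is a JVM double[], laid out contiguously,
// so the loop reads raw numbers with no indirection.
fun sumPrimitive(values: DoubleArray): Double {
    var total = 0.0
    for (v in values) total += v
    return total
}

fun main() {
    val boxed: List<Double> = List(1_000) { it.toDouble() }
    // Copy the same numbers into primitive storage.
    val primitive: DoubleArray = boxed.toDoubleArray()
    println(sumBoxed(boxed) == sumPrimitive(primitive))
}
```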