# datascience
v
Hi, do you think the DataFrame library is suitable to be used in a backend application as an intermediate data representation? My backend usually does data pipelines that require getting data from multiple sources, joining them, and doing some transformations. As I don't like having classes that don't represent the data, I create multiple DTOs and mappings between them, but a dataframe seems like it could solve this. Anyone used it this way?
i
I'm doing it for work right now, and have been asking multiple questions in this channel because of it. I just posted a long question right after you posted yours
v
And what are your thoughts on it? Do you think it simplified your workflow?
Have you noticed any performance benefits/drawbacks? Also I am concerned that it is still 0.14, so the API might change drastically
i
I'm pretty sure it will change drastically, but for me I'm using it for a prototype, so API changes won't be too disruptive to prod. As far as performance is concerned, I've not noticed anything bad, but this is also because I am dealing with a small amount of data. I would actually be interested to know how Kotlin's dataframes handle large datasets, and whether they are competitive with newer dataframe libraries like polars. Regarding your first question, I'll answer separately, as it might get a bit long
❤️ 2
> And what are your thoughts on it? Do you think it simplified your workflow?

Yes, it has definitely simplified my workflow. Without the dataframes, I would probably have had to spin up a database just to store a tiny amount of data and perform data manipulations like joins. Another nice thing is that, because it's all Kotlin, you don't have to worry about things not integrating well with each other. You can reuse Kotlin syntax like the `filter` method.

This, however, has its drawbacks too. If you look at the documentation, you'll see that the dataframe syntax is very non-standard. An example is the `add` method, which in PySpark would be `withColumn`. I think it is quite reasonable to assume that anyone who's considering Kotlin dataframes would already have exposure to dataframe libraries like pandas, PySpark or polars, all of which generally have similar syntax. Because the syntax here is different, you'll have to spend a lot of time reading the docs to do operations that are simple in SQL. If you're working alone it's not a problem, but when other people have to read your code, I don't know how much friction that might cause.

My biggest pain point is that, when I want to use Kotlin's dataframes for something that's more complicated than the examples in the docs, I often run into issues. See the question I posted just now: the compiler doesn't recognise my dataframe's columns. I was able to keep working in the notebook despite this lack of recognition, while in the Kotlin source file the compiler won't let me do anything else until I resolve it.

So, to conclude, I think Kotlin dataframes have a lot of potential. It is really helpful to approach data manipulation the way that data analysts and scientists do. However, because the library is still not mature, whatever simplifications you enjoy are counterbalanced by an unfamiliar syntax and a lack of suitable example code for what you want to do.
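To make the syntax point concrete, here's a minimal sketch using the String access API (data and column names are made up):

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

fun main() {
    val df = dataFrameOf("name", "age")(
        "Alice", 29,
        "Bob", 41
    )

    // rows can be filtered with a familiar Kotlin-style lambda
    val adults = df.filter { "age"<Int>() >= 18 }

    // `add` creates a derived column; PySpark spells this `withColumn`
    val withDecade = df.add("decade") { "age"<Int>() / 10 * 10 }

    println(adults)
    println(withDecade)
}
```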
In summary: maybe it's not the best idea to use Kotlin dataframes in prod unless you have a lot of time to figure stuff out, and are willing to write documentation for your team to supplement the official docs
Oh btw I still have to use DTOs. I just convert them to dataframes later
v
Yeah, that's what I was afraid of. Probably will try it on something smaller and wait for the API and docs to stabilize.
I hope I would need fewer DTOs: only use them at the start and end of the pipeline, and have dataframes in the middle. My pain point is that I usually have a DTO and want to add one attribute to it, then in the next phase add another one, and that's really annoying as it requires either big hierarchies, a lot of nesting, or a lot of mapping.
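Something like this, to illustrate what I mean (types and columns invented):

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// DTO-per-phase: every added attribute needs a new class plus a mapping
data class Order(val id: Int, val amount: Double)
data class PricedOrder(val id: Int, val amount: Double, val tax: Double)

fun price(o: Order) = PricedOrder(o.id, o.amount, tax = o.amount * 0.2)

// dataframe alternative: add the column in place, no new class or mapping
fun <T> priceDf(orders: DataFrame<T>) = orders.add("tax") { "amount"<Double>() * 0.2 }
```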
Anyways really appreciate your answer, thank you 🙏
👍 1
i
I guess you'd need the DTOs to enable the parsing of your JSONs (or source data). Once they've been converted to Kotlin objects and you convert those into dataframes, the number of DTOs you need to write is reduced. Looking at my code again, I only created the DTOs for I/O. The intermediate dataframe-based operations didn't need them
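In code, the shape of it looks roughly like this (DTO and values invented):

```kotlin
import org.jetbrains.kotlinx.dataframe.api.toDataFrame

// hypothetical DTO, used only at the input boundary
data class OrderDto(val id: Int, val amount: Double)

fun main() {
    // imagine these came from deserializing the source JSON
    val orders = listOf(OrderDto(1, 9.99), OrderDto(2, 24.50))

    // one conversion, then all intermediate steps work on the dataframe
    val df = orders.toDataFrame()
    println(df)
}
```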
v
Exactly what I want to use it for 😄
r
@Václav Škorpil Hello! I probably answered your question in my talk at the latest KotlinConf:

https://youtu.be/vQM6pQF8W1s?si=fsrd08VE7KXVhgnJ

👀 1
Actually, with the DataFrame compiler plugin you will need no DTOs at all in some cases; in others, only for input
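A rough sketch of what that can look like, using a `@DataSchema` declaration in place of hand-written DTOs (schema and columns invented; the plugin generates the typed column accessors):

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.annotations.DataSchema
import org.jetbrains.kotlinx.dataframe.api.filter

// a schema declaration instead of a DTO: no mapping code, just column types
@DataSchema
interface Order {
    val id: Int
    val amount: Double
}

// `amount` resolves as a typed accessor generated from the schema
fun highValue(orders: DataFrame<Order>): DataFrame<Order> =
    orders.filter { amount > 100.0 }
```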
i
Hello Roman, thanks for your talk! It was one of the first things I watched before starting on my project. I actually tried reading in JSON, but because it had a somewhat large schema and three or four levels of nesting, I encountered memory-related errors that didn't go away even after I increased the RAM allocation. It worked for simple schemas though! That's why I had to switch back to DTOs
j
@Ian Koh like I mentioned before in https://kotlinlang.slack.com/archives/C4W52CFEZ/p1725362461925379, this issue can likely be solved by setting `keyValuePaths` in `readJson()`, as JSON can potentially generate hundreds of columns and accessors, which can get heavy really fast if they are not converted to key/value columns
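If it helps, a sketch of the idea (file name and JSON path invented, so adjust them to the real schema):

```kotlin
import java.io.File
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.JsonPath
import org.jetbrains.kotlinx.dataframe.io.readJson

fun main() {
    // everything under the given path is read as key/value columns instead of
    // being exploded into hundreds of individually typed columns and accessors
    val df = DataFrame.readJson(
        File("orders.json"),
        keyValuePaths = listOf(
            JsonPath().append("items").appendArrayWithWildcard().append("attributes")
        )
    )
    println(df)
}
```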
👍 2
i
Oh yeah @Jolan Rensen [JB], you did mention that. Apologies! I was in a rush then and most likely missed a few paragraphs. Great, then we can further simplify our workflows without using DTOs while reading in JSON
j
hopefully 🙂 do let us know if there's anything else you come across!
👍 2