Hi, I'm currently doing a data analysis with Kandy...
# datascience
e
Hi, I'm currently doing a data analysis with Kandy and DataFrame at work. Everything is going quite well so far. Is it possible to solve complex binning + groupby tasks with the DataFrame library? This is what it would look like in Python:
gb = df.groupby([pd.cut(df.col1, 1000), pd.cut(df.col2, 200)])
binned = gb.col3.median() - gb.col4.max()
z
Hi, happy to hear about your experience! Could you please provide more details: what exactly is this Python code doing? What is your goal?
j
While advanced groupBy functionality is available in DataFrame (https://kotlin.github.io/dataframe/groupby.html), we don't have great support for statistics like binning/cutting yet. There is the function col1.digitize(), which can tell you which bin each value in the column belongs to (as attached), but it cannot generate bins for you. There is, however, better binning support in the Kandy library (https://kotlin.github.io/kandy/statistics-guide.html), since it can build histograms etc. from your DataFrames. There's the statBin function, which can distribute data from columns across a given number of bins, like you specified. So, to create the bins you want, you'd need to depend on kandy-statistics, and then you can do something like:
val col1Bins = statBin(
    x = df.col1,
    binsOption = BinsOption.byNumber(1000),
).Stat.x.toList()

val col2Bins = statBin(
    x = df.col2,
    binsOption = BinsOption.byNumber(200),
).Stat.x.toList()
This gives you the x bins for the columns you specify (I'm not 100% sure about binsAlignment, but you can probably figure that out from the Kandy documentation). Then you can groupBy and aggregate those bins like:
df.groupBy {
    col1.digitize(col1Bins) and col2.digitize(col2Bins)
}.aggregate {
    (col3.median() - col4.max()) into "result"
}
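For reference, here is a minimal stdlib-only sketch of what the binning + groupBy + aggregate steps boil down to. It does not use the DataFrame or Kandy API at all; Row, equalWidthEdges, binIndex, median and binnedResult are made-up names for illustration. It also assumes equal-width, left-closed bins, which is only an approximation of what pd.cut does (pandas uses right-closed intervals and slightly widens the range), so results at bin edges may differ.

```kotlin
// Stdlib-only sketch. All names here are hypothetical helpers,
// not part of Kotlin DataFrame or Kandy.
data class Row(val col1: Double, val col2: Double, val col3: Double, val col4: Double)

// n equal-width bins over [min, max] are described by n + 1 edges.
fun equalWidthEdges(values: List<Double>, n: Int): List<Double> {
    val lo = values.minOf { it }
    val hi = values.maxOf { it }
    return List(n + 1) { i -> lo + (hi - lo) * i / n }
}

// 0-based index of the bin containing v (assumption: bins are [left, right),
// and the maximum value is clamped into the last bin).
fun binIndex(v: Double, edges: List<Double>): Int {
    val n = edges.size - 1
    val width = (edges.last() - edges.first()) / n
    if (width == 0.0) return 0 // all values identical: a single effective bin
    return ((v - edges.first()) / width).toInt().coerceIn(0, n - 1)
}

fun median(xs: List<Double>): Double {
    val s = xs.sorted()
    val mid = s.size / 2
    return if (s.size % 2 == 1) s[mid] else (s[mid - 1] + s[mid]) / 2
}

// median(col3) - max(col4) for every non-empty (col1 bin, col2 bin) pair.
fun binnedResult(rows: List<Row>, bins1: Int, bins2: Int): Map<Pair<Int, Int>, Double> {
    val edges1 = equalWidthEdges(rows.map { it.col1 }, bins1)
    val edges2 = equalWidthEdges(rows.map { it.col2 }, bins2)
    return rows
        .groupBy { binIndex(it.col1, edges1) to binIndex(it.col2, edges2) }
        .mapValues { (_, group) -> median(group.map { it.col3 }) - group.maxOf { it.col4 } }
}

fun main() {
    val rows = listOf(
        Row(0.0, 0.0, 1.0, 5.0),
        Row(1.0, 1.0, 3.0, 2.0),
        Row(9.0, 9.0, 10.0, 4.0),
    )
    println(binnedResult(rows, 2, 2))
}
```

Unlike pandas, this only returns groups that actually contain rows, so empty bin combinations simply don't show up in the map.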
(Note: digitize just gives the index of the bin. To get the name (like, this], you'd need to do something like
.map { "(${col1Bins.getOrElse(it - 1) { 0.0 }}, ${col1Bins.getOrElse(it) { 100.0 }}]" }
) So, to conclude, statistics are not yet a strong point of DataFrame, but with Kandy and some workarounds you can hopefully still achieve some advanced things 🙂
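On the bin names: the 0.0 and 100.0 fallbacks in the snippet are just placeholders for the outer limits. A slightly more general sketch of the same idea is below. binLabel is a hypothetical helper, not part of DataFrame or Kandy, and it assumes the index convention implied by that snippet: index i means the value lies between edges[i - 1] and edges[i], with 0 and edges.size meaning "outside the edges". Whether the interval is really closed on the right depends on how digitize is configured, so treat the bracket style as an assumption too.

```kotlin
// Hypothetical helper, not part of Kotlin DataFrame or Kandy.
// Turns a bin index into a label like "(0.0, 50.0]".
// Assumption: index i refers to the interval between edges[i - 1] and edges[i];
// an index with no edge on one side gets an open-ended "-inf" / "+inf" label.
fun binLabel(index: Int, edges: List<Double>): String {
    val left = edges.getOrNull(index - 1)?.toString() ?: "-inf"
    val right = edges.getOrNull(index)?.toString() ?: "+inf"
    return "($left, $right]"
}

fun main() {
    val edges = listOf(0.0, 50.0, 100.0)
    (0..3).forEach { println("$it -> ${binLabel(it, edges)}") }
}
```

You would then map the digitized indices through this, e.g. indices.map { binLabel(it, col1Bins) }, where indices stands for whatever list of bin indices you got back.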