
Hi, I'm currently doing a data analysis with Kandy and DataFrame at work. Everything is going quite well so far. Is it possible to solve complex binning + groupby tasks with the DataFrame library?
This is what it would look like in Python:

```
gb = df.groupby([pd.cut(df.col1, 1000), pd.cut(df.col2, 200)])
binned = gb.col3.median() - gb.col4.max()
```


Hi, happy to hear about your experience! Could you please provide more details: what exactly is this Python code doing, and what is your goal?

image.png

While advanced `groupBy` functionality is available in DataFrame <https://kotlin.github.io/dataframe/groupby.html>, we don't have great support for statistics like binning/cutting yet. There is the function `col1.digitize()`, which can tell you which bin each value in the column belongs to (as attached), but it cannot generate bins for you.
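To make the "which bin does a value belong to" idea concrete, here is a minimal plain-Kotlin sketch. It is not the DataFrame implementation: `equalWidthEdges` and `binIndex` are hypothetical helpers, and the right-closed convention (index `i` means `edges[i - 1] < value <= edges[i]`) is an assumption chosen for illustration. The edge generation is simple equal-width spacing, not an exact reproduction of `pd.cut`.

```kotlin
// Hypothetical helper (not part of DataFrame or Kandy):
// builds bins + 1 equally spaced edges from min to max.
fun equalWidthEdges(min: Double, max: Double, bins: Int): List<Double> {
    require(bins > 0 && max > min) { "need bins > 0 and max > min" }
    val width = (max - min) / bins
    return List(bins + 1) { i -> min + i * width }
}

// Hypothetical helper: returns i such that edges[i - 1] < value <= edges[i].
// 0 means "at or below the first edge"; edges.size means "above the last edge".
// Assumes edges are sorted and strictly increasing.
fun binIndex(value: Double, edges: List<Double>): Int {
    val pos = edges.binarySearch(value)
    // binarySearch returns -(insertionPoint + 1) when the value is not found
    return if (pos >= 0) pos else -(pos + 1)
}

fun main() {
    val edges = equalWidthEdges(min = 0.0, max = 100.0, bins = 4)
    println(edges)                 // [0.0, 25.0, 50.0, 75.0, 100.0]
    println(binIndex(30.0, edges)) // 2
}
```

Whether the library's `digitize` treats bin boundaries the same way is worth checking against its documentation before relying on this convention.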

There is however better binning support from the Kandy library <https://kotlin.github.io/kandy/statistics-guide.html>, since it can build histograms etc. from your DataFrames. There's the `statBin` function which can distribute data from columns across a given number of bins, like you specified.

So, to create the bins you want, you'd need to depend on kandy-statistics, and then you can do something like:
```
val col1Bins = statBin(
    x = df.col1,
    binsOption = BinsOption.byNumber(1000),
).Stat.x.toList()

val col2Bins = statBin(
    x = df.col2,
    binsOption = BinsOption.byNumber(200),
).Stat.x.toList()
```
which gives you the x bins for the columns you specify (I'm not 100% sure about `binsAlignment`, but you can probably figure that out with the Kandy documentation)

and then you can `groupBy` and aggregate those bins like:
```
df.groupBy {
    col1.digitize(col1Bins) and col2.digitize(col2Bins)
}.aggregate {
    (col3.median() - col4.max()) into "result"
}
```
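(note: `digitize` just gives the index of the bin; to get a label in the form `(lower, upper]`, you'd need to do something like `.map { "(${col1Bins.getOrElse(it - 1) { 0.0 }}, ${col1Bins.getOrElse(it) { 100.0 }}]" }`, where `0.0` and `100.0` are the fallback edges for indices that fall outside the bin list)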
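For readers who want to see what the grouping, aggregation and labelling amount to, here is a rough plain-Kotlin sketch using only the standard library. Everything in it is hypothetical (the `Row` type, `median`, `medianMinusMax`, `binLabel`, and the toy edges); it only illustrates the idea of "group by a pair of bin indices, then compute median(col3) - max(col4) per group", and it uses `-inf`/`+inf` for out-of-range bins instead of fixed fallback values.

```kotlin
// Hypothetical row type standing in for the four columns from the thread.
data class Row(val col1: Double, val col2: Double, val col3: Double, val col4: Double)

// Median of a non-empty list: the middle element, or the mean of the two middle elements.
fun median(values: List<Double>): Double {
    val sorted = values.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) sorted[mid] else (sorted[mid - 1] + sorted[mid]) / 2
}

// Groups rows by a pair of bin indices and computes median(col3) - max(col4) per group.
fun medianMinusMax(
    rows: List<Row>,
    key1: (Row) -> Int,
    key2: (Row) -> Int,
): Map<Pair<Int, Int>, Double> =
    rows.groupBy { key1(it) to key2(it) }
        .mapValues { (_, group) -> median(group.map { it.col3 }) - group.maxOf { it.col4 } }

// Turns a bin index into an interval label; indices outside the edge list become open-ended.
fun binLabel(index: Int, edges: List<Double>): String {
    val lower = edges.getOrNull(index - 1)?.toString() ?: "-inf"
    val upper = edges.getOrNull(index)
    return if (upper != null) "($lower, $upper]" else "($lower, +inf)"
}

fun main() {
    val edges = listOf(0.0, 5.0, 10.0) // toy edges, shared by both columns for brevity
    val rows = listOf(
        Row(1.0, 1.0, 10.0, 4.0),
        Row(2.0, 1.5, 20.0, 6.0),
        Row(9.0, 8.0, 5.0, 1.0),
    )
    // Index i such that edges[i - 1] < v <= edges[i] (linear scan, fine for a toy example).
    val toIndex = { v: Double -> edges.count { it < v } }
    val result = medianMinusMax(rows, { toIndex(it.col1) }, { toIndex(it.col2) })
    for ((key, value) in result) {
        println("${binLabel(key.first, edges)} x ${binLabel(key.second, edges)} -> $value")
    }
}
```

With 1000 x 200 bins, most of the possible groups will typically be empty; `groupBy` only produces keys that actually occur in the data, which keeps the result manageable.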

So, to conclude, statistics are not a strong point of DataFrame yet, but with Kandy and some workarounds, you can hopefully still achieve some advanced things :slightly_smiling_face: