esionecneics
07/16/2024, 5:05 PM
gb = df.groupby([pd.cut(df.col1, 1000), pd.cut(df.col2, 200)])
binned = gb.col3.median() - gb.col4.max()
zaleslaw
07/18/2024, 9:05 AM
Jolan Rensen [JB]
07/24/2024, 11:28 AM
groupBy functionality is available in DataFrame: https://kotlin.github.io/dataframe/groupby.html
We don't have great support for statistics like binning/cutting yet. There is the function col1.digitize(), which can tell you which bin each value in the column belongs to (as attached), but it cannot generate bins for you.
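To make the "which bin does each value belong to" idea concrete, here is a minimal stdlib-only sketch. digitizeSketch is a hypothetical helper, not the DataFrame function itself, and it assumes NumPy-style semantics (ascending edges, index of the first edge the value falls below), which may not match the library exactly:

```kotlin
// Hypothetical helper, NOT the DataFrame API: for each value, returns the index of
// the first edge the value falls below (or at, when right = true).
// Assumes `edges` is sorted in ascending order.
// Values beyond the last edge get index edges.size.
fun digitizeSketch(
    values: List<Double>,
    edges: List<Double>,
    right: Boolean = false,
): List<Int> =
    values.map { v ->
        val idx = edges.indexOfFirst { edge -> if (right) v <= edge else v < edge }
        if (idx == -1) edges.size else idx
    }
```

With edges 1.0, 2.0, 3.0, a value of 2.5 lands in bin 2 and anything above 3.0 lands in the overflow bin 3.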
There is, however, better binning support in the Kandy library (https://kotlin.github.io/kandy/statistics-guide.html), since it can build histograms etc. from your DataFrames. There's the statBin function, which can distribute data from columns across a given number of bins, like you specified.
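For intuition about what "a given number of bins" means, here is a rough stdlib-only sketch that computes equal-width bin edges by hand. equalWidthEdges is a hypothetical helper, not a Kandy function, and Kandy's own alignment rules may well differ from this:

```kotlin
// Hypothetical helper, NOT part of Kandy: splits the range [min, max] of the data
// into `binCount` equal-width bins and returns the binCount + 1 edges.
fun equalWidthEdges(values: List<Double>, binCount: Int): List<Double> {
    require(binCount > 0) { "binCount must be positive" }
    require(values.isNotEmpty()) { "values must not be empty" }
    val min = values.minOf { it }
    val max = values.maxOf { it }
    val width = (max - min) / binCount
    // Compute each edge from `min` instead of adding `width` repeatedly,
    // and pin the last edge to `max` so rounding cannot push it off.
    return List(binCount + 1) { i -> if (i == binCount) max else min + i * width }
}
```

For 1000 bins on col1 this would return 1001 edges, which could then be passed to a digitize-style function.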
So, to create the bins you want, you'd need to depend on kandy-statistics, and then you can do something like:
val col1Bins = statBin(
    x = df.col1,
    binsOption = BinsOption.byNumber(1000),
).Stat.x.toList()
val col2Bins = statBin(
    x = df.col2,
    binsOption = BinsOption.byNumber(200),
).Stat.x.toList()
This gives you the x bins for the columns you specify (I'm not 100% sure about binsAlignment, but you can probably figure that out from the Kandy documentation).
Then you can groupBy and aggregate by those bins like:
df.groupBy {
    col1.digitize(col1Bins) and col2.digitize(col2Bins)
}.aggregate {
    (col3.median() - col4.max()) into "result"
}
(Note that digitize just gives the index of the bin. To get the name (like, this], you'd need to do something like .map { "(${col1Bins.getOrElse(it - 1) { 0.0 }}, ${col1Bins.getOrElse(it) { 100.0 }}]" }.)
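The index-to-label step can also be pulled out into a small function. This is only a sketch: binLabel is a hypothetical helper, it assumes right-closed bins and a digitize-style index where bin i spans edges[i - 1] to edges[i], and it uses -inf/+inf for the open ends instead of fixed fallback values like 0.0 and 100.0:

```kotlin
// Hypothetical helper: turns a bin index into an interval label such as "(1.0, 2.0]".
// Assumes bin i covers (edges[i - 1], edges[i]]; index 0 is the underflow bin and
// index edges.size is the overflow bin.
fun binLabel(index: Int, edges: List<Double>): String {
    val lower = edges.getOrNull(index - 1)
    val upper = edges.getOrNull(index)
    val left = lower?.toString() ?: "-inf"
    // An unbounded upper end is written with an open parenthesis.
    return if (upper == null) "($left, +inf)" else "($left, $upper]"
}
```

Whether the intervals should be closed on the right or on the left depends on how the bin indices were produced, so the bracket style here may need flipping.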
So, to conclude, statistics are not a strong point of DataFrame at the moment, but with Kandy and some workarounds, you can hopefully still achieve some advanced things 🙂