Is there a way for Dataframe to remove nesting in ...
# datascience
d
Is there a way for Dataframe to remove nesting in json? Say I have:
{"nodes": {"name": "...", ...., "pods": {...} } }
how can I get a table with all the pods and the name of the node they're on?
j
You can "flatten" the table. It has an optional argument, keepParentNameForColumns = true. However, in version 1.13.1 it puts the name parts in the wrong order, so to get the right order, you can manually rename each column to its path before flattening, like:
df
    .rename { colsAtAnyDepth() }.into { it.path.joinToString(".") }
    .flatten()
However, to give a more specific solution, could you share some more json? If you say "all nodes", I would assume it has an array of objects, like
{ "nodes": [{ "name": "", "pods": [{}, {}] }] }
In that case, to get what you want, you could explode and pivot your table, like in the attached picture.
r
I guess for the original question just the explode function will do the trick.
d
It's the output of kubectl-resource_capacity, the schema DataFrame gives me is this:
nodes: *
    name: String
    cpu:
        requests: String
        requestsPercent: String
        limits: String
        limitsPercent: String
    memory:
        requests: String
        requestsPercent: String
        limits: String
        limitsPercent: String
    pods: *
        name: String
        namespace: String
        cpu:
            requests: String
            requestsPercent: String
            limits: String
            limitsPercent: String
        memory:
            requests: String
            requestsPercent: String
            limits: String
            limitsPercent: String


clusterTotals:
    cpu:
        requests: String
        requestsPercent: String
        limits: String
        limitsPercent: String
    memory:
        requests: String
        requestsPercent: String
        limits: String
        limitsPercent: String
A bunch of explodes did the trick more or less, but it was painful... and took a bunch of cells before I could do any real processing.
Also, the explodes only helped me zoom in on nodes with pods and, say, only the memory columns.
r
To better understand your question: what should the output schema be?
d
I want to ignore the cluster totals, and just have the name of the node beside the pod name, namespace and its stats like memoryRequests, etc., not nested but as columns.
The node name will be repeated for each pod on that node.
@Jolan Rensen [JB] Your flatten() (with the rename) gave this:
name: String
cpu.requests: String
cpu.requestsPercent: String
cpu.limits: String
cpu.limitsPercent: String
memory.requests: String
memory.requestsPercent: String
memory.limits: String
memory.limitsPercent: String
pods: *
    name: String
    namespace: String
    cpu:
        requests: String
        requestsPercent: String
        limits: String
        limitsPercent: String
    memory:
        requests: String
        requestsPercent: String
        limits: String
        limitsPercent: String
I think it just flattened the node's stats, which is already not bad for analysing nodes. But right now, what I need is the pods.
j
I think I understand the question now. You can also use .explode { pods }, which will convert the frame column into a column group, repeating the values of the other columns where needed. I made a small notebook to show what I mean: https://gist.github.com/Jolanrensen/b3f6f4004c57600e25018b19f1f254c7
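To make the "repeating the values of other columns" part concrete, here is a rough plain-Kotlin sketch of the idea. It does not use the DataFrame API at all; the Node, Pod and PodRow classes, the explodePods function and the sample values are made up for illustration.

```kotlin
// Illustrative stand-ins for the nested data; these are NOT DataFrame types.
data class Pod(val name: String, val namespace: String)
data class Node(val name: String, val pods: List<Pod>)

// One output row per pod, with the name of its node next to it.
data class PodRow(val node: String, val pod: String, val namespace: String)

// Mimics the effect of exploding the nested pods:
// every pod gets its own row and the parent node's name is repeated.
fun explodePods(nodes: List<Node>): List<PodRow> =
    nodes.flatMap { node ->
        node.pods.map { pod -> PodRow(node.name, pod.name, pod.namespace) }
    }

// Hypothetical sample data.
val sampleNodes = listOf(
    Node("node-a", listOf(Pod("api", "default"), Pod("db", "default"))),
    Node("node-b", listOf(Pod("dns", "kube-system"))),
)
val podRows = explodePods(sampleNodes)
```

Here podRows has three rows, and "node-a" shows up twice because it has two pods. In this sketch a node without pods produces no rows.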
d
This is MUCH better... thanks! Almost a one-liner for something that took a good part of the notebook. But it gives funny names:
name: String
name.pods: String
namespace.pods: String
requests.cpu: String
requestsPercent.cpu: String
limits.cpu: String
limitsPercent.cpu: String
requests.memory: String
requestsPercent.memory: String
limits.memory: String
limitsPercent.memory: String
And when I tried your renaming trick (which I'm not sure how it works in the first place...), it put pods before and after...
name: String
pods.name.pods: String
pods.namespace.pods: String
pods.cpu.requests.pods.cpu: String
pods.cpu.requestsPercent.pods.cpu: String
pods.cpu.limits.pods.cpu: String
pods.cpu.limitsPercent.pods.cpu: String
pods.memory.requests.pods.memory: String
pods.memory.requestsPercent.pods.memory: String
pods.memory.limits.pods.memory: String
pods.memory.limitsPercent.pods.memory: String
j
Yes, the names are currently funny with the keepParentNameForColumns argument 😅. The trick works by selecting all columns at any depth (meaning also inside column groups, but not in nested dataframes) and renaming them to their path. So to do the renaming trick, you'd need to explode first, so all frame columns are turned into column groups, then rename and flatten, like: https://gist.github.com/Jolanrensen/fd0c3a2e5ff91d3b51fe6b1132d8776e
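As a rough illustration of why the order matters, the sketch below models one already exploded row as nested maps in plain Kotlin and names every leaf by its dotted path. It is not the DataFrame implementation; flattenWithPaths and the sample values are hypothetical. Nested maps stand in for column groups, and any other value (including a list, which would stand in for a not yet exploded frame column) is treated as a leaf and left untouched.

```kotlin
// Collects every leaf value under its dotted path, e.g. "pods.cpu.requests".
@Suppress("UNCHECKED_CAST")
fun flattenWithPaths(
    row: Map<String, Any?>,
    prefix: List<String> = emptyList(),
): Map<String, Any?> {
    val out = LinkedHashMap<String, Any?>()
    for ((key, value) in row) {
        val path = prefix + key
        if (value is Map<*, *>) {
            // A nested map plays the role of a column group: recurse into it.
            out.putAll(flattenWithPaths(value as Map<String, Any?>, path))
        } else {
            // Anything else is a leaf; its new name is the path joined with ".".
            out[path.joinToString(".")] = value
        }
    }
    return out
}

// Hypothetical row after exploding pods: "pods" is now a group, not a list.
val explodedRow = mapOf(
    "name" to "node-a",
    "pods" to mapOf(
        "name" to "api",
        "cpu" to mapOf("requests" to "100m", "limits" to "200m"),
    ),
)
val flatRow = flattenWithPaths(explodedRow)
```

The keys of flatRow come out as name, pods.name, pods.cpu.requests and pods.cpu.limits, so the parent always comes first and appears only once.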