Is there a way for DataFrame to remove nesting in JSON? | kotlinlang #datascience

dave08

06/16/2024, 2:54 PM

Is there a way for Dataframe to remove nesting in json? Say I have:

{"nodes": {"name": "...", ...., "pods": {...} } }

how can I get a table with all the pods and the name of the node they're on?

Jolan Rensen [JB]

06/17/2024, 10:03 AM

You can "flatten" the table. It has an optional argument, keepParentNameForColumns = true. However, in version 0.13.1 it produces the names in the wrong order, so to get the right order, you can manually rename each column to its path before flattening, like:

df
    .rename { colsAtAnyDepth() }.into { it.path.joinToString(".") }
    .flatten()

However, to give a more specific solution, could you share some more JSON? If you say "all nodes", I would assume it has an array of objects, like { "nodes": [{ "name": "", "pods": [{}, {}] }] }. In that case, to get what you want, you could explode and pivot your table, like in the attached picture.

roman.belov

06/17/2024, 10:55 AM

I guess for the original question, just the explode function will do the trick.

dave08

06/17/2024, 10:59 AM

It's the output of kubectl-resource_capacity. The schema DataFrame gives me is this:

nodes: *
    name: String
    cpu:
        requests: String
        requestsPercent: String
        limits: String
        limitsPercent: String
    memory:
        requests: String
        requestsPercent: String
        limits: String
        limitsPercent: String
    pods: *
        name: String
        namespace: String
        cpu:
            requests: String
            requestsPercent: String
            limits: String
            limitsPercent: String
        memory:
            requests: String
            requestsPercent: String
            limits: String
            limitsPercent: String


clusterTotals:
    cpu:
        requests: String
        requestsPercent: String
        limits: String
        limitsPercent: String
    memory:
        requests: String
        requestsPercent: String
        limits: String
        limitsPercent: String

dave08

06/17/2024, 11:00 AM

A bunch of explodes did the trick more or less, but it was painful... and took a bunch of cells before I could do any real processing.

dave08

06/17/2024, 11:01 AM

Also, the explodes only helped to home in on nodes with pods and only the memory columns, say.

roman.belov

06/17/2024, 11:05 AM

To better understand your question: what should the output schema be?

dave08

06/17/2024, 11:07 AM

I want to ignore the cluster totals and just have the name of the node beside the pod name, namespace, and its stats (memoryRequests, etc.), not nested but as columns.

dave08

06/17/2024, 11:08 AM

The node name will be repeated for each pod in that node

dave08

06/17/2024, 11:18 AM

@Jolan Rensen [JB] Your flatten() (with the rename) gave this:

name: String
cpu.requests: String
cpu.requestsPercent: String
cpu.limits: String
cpu.limitsPercent: String
memory.requests: String
memory.requestsPercent: String
memory.limits: String
memory.limitsPercent: String
pods: *
    name: String
    namespace: String
    cpu:
        requests: String
        requestsPercent: String
        limits: String
        limitsPercent: String
    memory:
        requests: String
        requestsPercent: String
        limits: String
        limitsPercent: String

dave08

06/17/2024, 11:20 AM

I think it just flattened the node's stats, which is already not bad for analysing nodes. But right now, what I need is the pods.

Jolan Rensen [JB]

06/17/2024, 11:35 AM

I think I understand the question now. You can also call .explode { pods }, which will convert the frame column into a column group, repeating the values of other columns where needed. I made a small notebook to show what I mean: https://gist.github.com/Jolanrensen/b3f6f4004c57600e25018b19f1f254c7
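To make "repeating the values of other columns where needed" concrete, here is a plain-Kotlin model of what exploding pods does conceptually. It does not use the DataFrame library at all; Node, Pod, PodRow and explodePods are hypothetical names for illustration only.

```kotlin
// Hypothetical types modelling the nested shape: each node holds a list of pods.
data class Pod(val name: String, val namespace: String)
data class Node(val name: String, val pods: List<Pod>)

// One flat output row per pod, carrying the name of the node it runs on.
data class PodRow(val node: String, val pod: String, val namespace: String)

// Conceptual equivalent of exploding "pods": emit one row per pod and
// repeat the node name on each of them. Nodes without pods yield no rows.
fun explodePods(nodes: List<Node>): List<PodRow> =
    nodes.flatMap { node ->
        node.pods.map { pod -> PodRow(node.name, pod.name, pod.namespace) }
    }

fun main() {
    val nodes = listOf(
        Node("node-a", listOf(Pod("api", "default"), Pod("db", "default"))),
        Node("node-b", listOf(Pod("dns", "kube-system"))),
    )
    explodePods(nodes).forEach(::println)
}
```

Note that this model drops nodes with no pods. As far as I know, DataFrame's explode behaves the same way by default (its dropEmpty parameter), but treat that as an assumption to check against the docs.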

dave08

06/17/2024, 11:49 AM

This is MUCH better... thanks! Almost a one-liner for something that took a good part of the notebook. But it gives funny names:

name: String
name.pods: String
namespace.pods: String
requests.cpu: String
requestsPercent.cpu: String
limits.cpu: String
limitsPercent.cpu: String
requests.memory: String
requestsPercent.memory: String
limits.memory: String
limitsPercent.memory: String

And when I tried your renaming trick (which I'm not sure how it works in the first place...), it put pods before and after...

dave08

06/17/2024, 11:53 AM

name: String
pods.name.pods: String
pods.namespace.pods: String
pods.cpu.requests.pods.cpu: String
pods.cpu.requestsPercent.pods.cpu: String
pods.cpu.limits.pods.cpu: String
pods.cpu.limitsPercent.pods.cpu: String
pods.memory.requests.pods.memory: String
pods.memory.requestsPercent.pods.memory: String
pods.memory.limits.pods.memory: String
pods.memory.limitsPercent.pods.memory: String

Jolan Rensen [JB]

06/17/2024, 11:53 AM

Yes, the names are funny with the keepParentNameForColumns argument currently 😅. The trick works by selecting all columns at any depth (meaning also inside column groups, but not in nested dataframes) and renaming them to their path. So to do the renaming trick, you'd need to explode first, so all frame columns are turned into column groups, then rename and flatten, like: https://gist.github.com/Jolanrensen/fd0c3a2e5ff91d3b51fe6b1132d8776e
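A minimal sketch of that order of operations (explode, then rename, then flatten), assuming the Kotlin DataFrame library is on the classpath. The sample JSON is made up to mirror the schema shared earlier, podsTable is a hypothetical helper name, and the string-based column access (remove("clusterTotals"), pathOf) is a choice made for this sketch, not necessarily what the linked gist does.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*
import org.jetbrains.kotlinx.dataframe.io.readJsonStr

// Made-up sample in the shape of the kubectl-resource_capacity schema; values are illustrative.
val sampleJson = """
{
  "nodes": [
    {
      "name": "node-a",
      "memory": { "requests": "1Gi", "limits": "2Gi" },
      "pods": [
        { "name": "api", "namespace": "default", "memory": { "requests": "256Mi", "limits": "512Mi" } },
        { "name": "db", "namespace": "default", "memory": { "requests": "512Mi", "limits": "1Gi" } }
      ]
    }
  ],
  "clusterTotals": { "memory": { "requests": "1Gi", "limits": "2Gi" } }
}
""".trimIndent()

// Hypothetical helper: one row per pod, node values repeated, no nested columns left.
fun podsTable(json: String): DataFrame<*> =
    DataFrame.readJsonStr(json)
        .remove("clusterTotals")              // ignore the cluster totals
        .explode("nodes")                     // frame column -> column group, one row per node
        .explode { pathOf("nodes", "pods") }  // one row per pod, node values repeated
        .rename { colsAtAnyDepth() }.into { it.path.joinToString(".") } // each name becomes its full path
        .flatten()                            // lift every nested column to the top level

fun main() {
    podsTable(sampleJson).print()
}
```

Because every frame column is exploded before the rename, colsAtAnyDepth() can reach the pod columns too, so each one should end up named after its full path rather than with the parent name appended at the end.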
