Using DataFrame v0 14 1 how could I read the classic sleep p kotlinlang #datascience

holgerbrandl

10/14/2024, 5:55 AM

Using DataFrame v0.14.1, how could I read the classic sleep patterns dataset?

DataFrame.readDelim(
    FileInputStream(File("src/main/resources/data/msleep.csv")),
    csvType = CSVType.DEFAULT,
    colTypes = mapOf("brainwt" to ColType.Double),
    parserOptions = ParserOptions(nullStrings = setOf("NA"))
)

Without the colTypes hint, the brainwt column seems to be inferred as BigDecimal (it should be Double). With the hint, parsing also struggles with NA despite the provided parser option:

java.lang.IllegalStateException: Couldn't parse 'NA' into type kotlin.Double

What am I doing wrong?

msleep.csv

👀 2

👍 1

Jolan Rensen [JB]

10/14/2024, 1:30 PM

Hmm, interesting. It seems the Double parser does not take nullStrings into account (in the current CSV implementation); only the date-time parsers do. DataFrame currently uses NumberFormat.getInstance(locale).parse() to parse doubles, with some manual conversions for values like "inf" and "nan". Unfortunately, "NA" is not recognized as a Double; only "NaN" is. We're working on a completely new CSV implementation at the moment, and I'll add this file to it as a test case. The next version of DF will likely include the new implementation as an experimental opt-in. Until then, the best interim solution is to read the column as String and convert it to Double manually. If you want NA to become null, try:

val df = DataFrame.readCSV(
    "path/to/msleep.csv",
    colTypes = mapOf("brainwt" to ColType.String),
).convert { "brainwt"<String>() }.with { it.toDoubleOrNull() }

or if you want to make it NaN:

val df = DataFrame.readCSV(
    "path/to/msleep.csv",
    colTypes = mapOf("brainwt" to ColType.String),
).convert { "brainwt"<String>() }.with { it.toDoubleOrNull() ?: Double.NaN }
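To see what the two conversions above do to individual cell values, here is a minimal sketch in plain Kotlin with no DataFrame dependency. The helper names brainwtOrNull and brainwtOrNaN are hypothetical and only mirror the lambdas passed to .with { } above.

```kotlin
// Hypothetical helpers mirroring the lambdas passed to `.with { ... }` above.
// They are not part of the DataFrame API.

// "NA" (or any other non-numeric text) becomes null.
fun brainwtOrNull(raw: String): Double? = raw.toDoubleOrNull()

// "NA" (or any other non-numeric text) becomes NaN instead of null.
fun brainwtOrNaN(raw: String): Double = raw.toDoubleOrNull() ?: Double.NaN

fun main() {
    // Sample cell values in the style of the brainwt column.
    for (raw in listOf("0.0155", "3e-04", "NA")) {
        println("$raw -> orNull=${brainwtOrNull(raw)}, orNaN=${brainwtOrNaN(raw)}")
    }
}
```

Because toDoubleOrNull() returns null for anything it cannot parse, other malformed values in the column would also be turned into null (or NaN) silently, not just "NA".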

Jolan Rensen [JB]

10/14/2024, 2:41 PM

val df = DataFrame.readCSV(
    "/mnt/data/Projects/dataframe/examples/idea-examples/json/src/main/resources/msleep.csv",
    colTypes = mapOf("brainwt" to ColType.String),
).parse(ParserOptions(nullStrings = setOf("NA")))

Interestingly, the normal .parse operation does understand nullStrings, though the column still becomes a BigDecimal.

Jolan Rensen [JB]

10/14/2024, 2:44 PM

The reason it resorts to BigDecimal for your specific data is that NumberFormat.parse() unfortunately doesn't seem to understand 3e-04. Double.parseDouble("3e-04") does work, which is why my first example returns the correct result.
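For anyone who wants to compare the two parsers on their own JDK, here is a rough sketch. The parseFully helper is hypothetical: it treats a value as parsed only if NumberFormat consumed the whole string. Whether DataFrame checks the parse position in the same way is an assumption, and Locale.US is chosen only for illustration.

```kotlin
import java.text.NumberFormat
import java.text.ParsePosition
import java.util.Locale

// Hypothetical helper (not DataFrame code): returns the parsed number only if
// NumberFormat consumed the entire input, otherwise null.
fun parseFully(text: String, locale: Locale = Locale.US): Number? {
    val position = ParsePosition(0)
    val parsed: Number? = NumberFormat.getInstance(locale).parse(text, position)
    return if (parsed != null && position.index == text.length) parsed else null
}

fun main() {
    // Compare NumberFormat against Kotlin's Double parsing for a few sample cells.
    for (raw in listOf("0.0155", "3e-04", "NA")) {
        println("$raw -> NumberFormat=${parseFully(raw)}, toDoubleOrNull=${raw.toDoubleOrNull()}")
    }
}
```

The output for 3e-04 on the NumberFormat side may depend on the JDK version and locale, so it is worth running this locally instead of relying on a quoted result.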

Jolan Rensen [JB]

10/14/2024, 6:15 PM

Made an issue with all observations: https://github.com/Kotlin/dataframe/issues/921. Thanks for reporting 🙂

👍 1

holgerbrandl

10/14/2024, 9:38 PM

Thanks @Jolan Rensen [JB] for your kind analysis and support
