Using DataFrame v0.14.1, how could I read the clas...
# datascience
h
Using DataFrame v0.14.1, how could I read the classic sleep patterns dataset?
DataFrame.readDelim(
        FileInputStream(File("src/main/resources/data/msleep.csv")),
        csvType = CSVType.DEFAULT,
        colTypes = mapOf("brainwt" to ColType.Double),
        parserOptions = ParserOptions(nullStrings = setOf("NA"))
    )
It seems to infer that brainwt is a BigDecimal (it should be a Double), and it also struggles with the NA values despite the provided parser option:
java.lang.IllegalStateException: Couldn't parse 'NA' into type kotlin.Double
What am I doing wrong?
j
Hmm, interesting. It seems the Double parser does not take nullStrings into account (in the current CSV implementation); only the date-time parsers do. At the moment, DataFrame uses NumberFormat.getInstance(locale).parse() to parse doubles, with some manual conversions for values like "inf" and "nan". Unfortunately, "NA" is not recognized as a Double; only "NaN" is. We're working on a completely new CSV implementation at the moment, and I'll add this file as a test case for it. The next version of DF will likely include the new implementation as an experimental opt-in. Until then, the best interim solution is to read the column as String and convert it to Double manually. If you want NA to become null, try:
val df = DataFrame.readCSV(
    "path/to/msleep.csv",
    colTypes = mapOf("brainwt" to ColType.String),
).convert { "brainwt"<String>() }.with { it.toDoubleOrNull() }
or if you want to make it NaN:
val df = DataFrame.readCSV(
    "path/to/msleep.csv",
    colTypes = mapOf("brainwt" to ColType.String),
).convert { "brainwt"<String>() }.with { it.toDoubleOrNull() ?: Double.NaN }
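To see what the two conversion lambdas do on their own, here is a minimal stdlib-only sketch that leaves DataFrame out entirely. The helper names and the sample values are hypothetical and are not taken from msleep.csv; only toDoubleOrNull() and the Elvis fallback come from the snippets above.

```kotlin
// Hypothetical helper: unparseable strings (such as "NA") become null.
fun toNullableDoubles(values: List<String>): List<Double?> =
    values.map { it.toDoubleOrNull() }

// Hypothetical helper: unparseable strings (such as "NA") become NaN instead.
fun toNaNDoubles(values: List<String>): List<Double> =
    values.map { it.toDoubleOrNull() ?: Double.NaN }

fun main() {
    // Made-up sample values for illustration only.
    val raw = listOf("0.0155", "NA", "3e-04")
    println(toNullableDoubles(raw))
    println(toNaNDoubles(raw))
}
```

One thing to keep in mind with this approach: toDoubleOrNull() returns null for any string it cannot parse, not only for "NA", so genuinely malformed values would be silently turned into null (or NaN) as well.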
val df = DataFrame.readCSV(
    "/mnt/data/Projects/dataframe/examples/idea-examples/json/src/main/resources/msleep.csv",
    colTypes = mapOf("brainwt" to ColType.String),
).parse(ParserOptions(nullStrings = setOf("NA")))
Interestingly, the normal .parse operation does understand nullStrings, though the column still becomes a BigDecimal.
The reason it resorts to BigDecimal for your specific data is that NumberFormat.parse() doesn't seem to understand 3e-04, unfortunately. Double.parseDouble("3e-04") does work, which is why my first example returns the correct result.
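The difference between the two parsers can be checked without DataFrame. The sketch below is not DataFrame's actual parsing code; parseFully is a hypothetical helper that accepts a NumberFormat result only when the whole input was consumed, and Locale.US is an assumed locale.

```kotlin
import java.text.NumberFormat
import java.text.ParsePosition
import java.util.Locale

// Hypothetical helper (not part of DataFrame): returns the parsed number
// only if NumberFormat consumed the entire string, otherwise null.
fun parseFully(text: String, locale: Locale = Locale.US): Number? {
    val pos = ParsePosition(0)
    val number = NumberFormat.getInstance(locale).parse(text, pos)
    return if (pos.index == text.length) number else null
}

fun main() {
    // NumberFormat does not consume the lowercase exponent, so this is rejected.
    println(parseFully("3e-04"))
    // Kotlin's toDouble() delegates to Double.parseDouble, which accepts it.
    println("3e-04".toDouble())
}
```

The full-consumption check matters because NumberFormat.parse(String) on its own can succeed after reading only a prefix of the input, which would hide the problem rather than reveal it.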
I made an issue covering all of these observations (https://github.com/Kotlin/dataframe/issues/921). Thanks for reporting 🙂
h
Thanks @Jolan Rensen [JB] for your kind analysis and support