Anyone knows if DataFrame readExcel supports reading large f kotlinlang #datascience

Richard Glen

04/08/2024, 2:04 AM

Does anyone know if DataFrame.readExcel supports reading large files, like 250k+ rows? How can I read through each row efficiently?

Richard Glen

04/08/2024, 2:12 AM

I see it uses Apache POI?

Richard Glen

04/08/2024, 2:20 AM

For more context, I need to get a reference to the wrapped DataSchema class. I'm using toListOf<SchemaClass>().forEach. Would I run into problems if I use this instead of using forEach directly? I imagine the latter would read line by line, whereas the former looks like it loads the whole file and then converts it to a list.
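
For reference, the two iteration styles being compared can be sketched as below. This is only an illustration under assumptions: `Person` and `sampleFrame` are hypothetical names, and a small in-memory frame built with `dataFrameOf` stands in for the result of `DataFrame.readExcel(...)` so the snippet needs no file. Both styles operate on a DataFrame object that already exists.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical row class; its property names must match the column names.
data class Person(val name: String, val age: Int)

// Stand-in for DataFrame.readExcel("..."), so the sketch runs without a file.
fun sampleFrame(): DataFrame<*> = dataFrameOf("name", "age")(
    "Alice", 30,
    "Bob", 25,
)

fun main() {
    val df = sampleFrame()

    // Style 1: iterate the rows of the DataFrame directly (untyped access by column name).
    df.forEach { row -> println("${row["name"]}: ${row["age"]}") }

    // Style 2: convert every row to a Person first, then iterate the resulting list.
    val people: List<Person> = df.toListOf<Person>()
    people.forEach { println("${it.name}: ${it.age}") }
}
```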

Jolan Rensen [JB]

04/09/2024, 3:26 PM

DataFrame always reads the entire file into memory; this is also the case with `readExcel`. So, in terms of types, there's no difference between running forEach directly on a DF and first converting it to a list. The reason DF always reads the entire file is that it uses every value in a column to infer the type of that column. That sets the `type()` property of each column in the DF, which you can see by calling `schema()` on the DF.

To access the columns in a type-safe manner, you can define your own `@DataSchema` interface representing your data, or you can let DF build this interface for you; DF can build this using either `@file:ImportDataSchema` or a Gradle task (both are described here https://kotlin.github.io/dataframe/schemasgradle.html#schema-inference). Both methods also read the entire file to generate `@DataSchema` interfaces from the inferred types of the column values. As seen in the docs, they also generate direct accessors to your file, so you can simply call `YourGivenName.readExcel()`. Under the hood this will call `DataFrame.readExcel("path/to/the/file").cast<YourGivenName>()`.

250k lines might be a lot, but you can definitely try. It's not millions of rows after all. If it is too large to just infer a data schema, it's also possible to use those schema inference methods with a smaller "sample" Excel file representative of the big file (in terms of types). Something like:

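```kotlin
@file:ImportDataSchema(
    "SchemaName",
    "path/to/small.xlsx",
)

import org.jetbrains.kotlinx.dataframe.annotations.ImportDataSchema

// ...

// The schema is generated from the small sample file but used to read the large one.
val df = SchemaName.readExcel("path/to/large.xlsx")
```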
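
To illustrate why type inference needs every value in a column, and why a sample file has to be representative in terms of types, here is a deliberately simplified sketch in plain Kotlin. It is not DataFrame's actual inference code: `inferColumnType` is a hypothetical helper that only knows a few value kinds and returns the type name as a string.

```kotlin
// Simplified, hypothetical model of column type inference (not the library's implementation).
fun inferColumnType(values: List<Any?>): String {
    // Classify every non-null value; a single odd value anywhere changes the result.
    val kinds = values.filterNotNull().map { v ->
        when (v) {
            is Int -> "Int"
            is Double -> "Double"
            is String -> "String"
            else -> "Any"
        }
    }.toSet()

    val base = when {
        kinds.isEmpty() -> "Nothing"
        kinds.size == 1 -> kinds.single()
        kinds.all { it == "Int" || it == "Double" } -> "Number"
        else -> "Any"
    }

    // Any null in the column makes the inferred type nullable.
    return if (values.any { it == null }) "$base?" else base
}

fun main() {
    println(inferColumnType(listOf(1, 2, 3)))     // Int
    println(inferColumnType(listOf(1, null, 3)))  // Int?
    println(inferColumnType(listOf(1, 2.5)))      // Number
    println(inferColumnType(listOf(1, 2, "n/a"))) // Any
}
```

In this model, a single `"n/a"` in the last row turns an `Int` column into `Any`, so the type cannot be known before the last value has been seen. By the same logic, a sample file that lacks such a value would produce a narrower schema than the large file actually needs.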

Hope this clears up some things 🙂 Let me know if anything is still unclear

Richard Glen

04/09/2024, 5:10 PM

Oh. Okay. Thanks for the reply.
