# datascience
r
Does anyone know if DataFrame.readExcel supports reading large files, like 250k+ rows? How can I read through each row efficiently?
I see it uses Apache POI?
For more context, I need to get a reference to the wrapped DataSchema class. I'm using toListOf<SchemaClass>().forEach. Would I run into problems if I use this instead of calling forEach directly? I imagine the latter would read line by line, whereas the former looks like it loads the whole file and then converts it to a list.
j
DataFrame always reads the entire file into memory, and this is also the case with `readExcel`. So there's no difference in terms of types between running forEach directly on a DF and first converting it to a list. The reason DF always reads the entire file is that it uses every value in a column to infer the type of that column. That sets the `type()` property of each column in the DF, which you can see by calling `schema()` on the DF. To access the columns in a type-safe manner, you can define your own `@DataSchema` interface representing your data, or you can let DF build this interface for you, using either `@file:ImportDataSchema` or a Gradle task (both are described here: https://kotlin.github.io/dataframe/schemasgradle.html#schema-inference). Both methods also work by reading the entire file to generate `@DataSchema` interfaces from the inferred types of the column values. As seen in the docs, they also generate direct accessors to your file, so you can simply call `YourGivenName.readExcel()`. Under the hood this will call `DataFrame.readExcel("path/to/the/file").cast<YourGivenName>()`. 250k rows might be a lot, but you can definitely try. It's not millions of rows after all. If the file is too large to infer a data schema from directly, it's also possible to use those schema inference methods with a smaller "sample" Excel file that is representative of the big file (in terms of types). Something like:
```kotlin
@file:ImportDataSchema(
  "SchemaName",
  "path/to/small.xlsx",
)

...

val df = SchemaName.readExcel("path/to/large.xlsx")
```
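To make the "representative in terms of types" part a bit more concrete, here's a rough sketch in plain Kotlin of why inference has to look at every value in a column. This is not DataFrame's actual implementation; `inferColumnType` is a hypothetical helper made up for illustration, and it only knows about three types:

```kotlin
// Hypothetical helper, NOT part of the DataFrame API: picks the narrowest
// type that fits every value in a column of raw cell strings.
fun inferColumnType(values: List<String>): String = when {
    values.all { it.toIntOrNull() != null } -> "Int"
    values.all { it.toDoubleOrNull() != null } -> "Double"
    else -> "String"
}

fun main() {
    // Column as it appears in the small sample file.
    val sample = listOf("1", "2", "3")
    // Same column in the large file, with one value the sample never showed.
    val full = sample + "4.5"

    println(inferColumnType(sample)) // Int
    println(inferColumnType(full))   // Double
}
```

A single value further down the file can change the inferred type, so a sample file should contain the same kinds of values per column as the big one.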
Hope this clears up some things 🙂 Let me know if anything is still unclear
r
Oh. Okay. Thanks for the reply.