Richard Glen
04/08/2024, 2:04 AM
Richard Glen
04/08/2024, 2:12 AM
Richard Glen
04/08/2024, 2:20 AM
Jolan Rensen [JB]
04/09/2024, 3:26 PM
`readExcel`. So there's no difference, in terms of types, between running `forEach` directly on a DF and first converting it to a list.
The reason DF always reads the entire file is that it uses every value in a column to infer that column's type. This sets the `type()` property of each column in the DF, which you can see by calling `schema()` on the DF.
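For illustration, here's a minimal sketch of inspecting the inferred schema (the file path is hypothetical, and it assumes the `dataframe-excel` dependency is on the classpath):

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.schema
import org.jetbrains.kotlinx.dataframe.io.readExcel

fun main() {
    // DF scans every row of the file to infer each column's type
    val df = DataFrame.readExcel("path/to/sample.xlsx") // hypothetical path

    // Prints the whole inferred schema, e.g. "name: String", "age: Int"
    println(df.schema())

    // The same information is available per column via type()
    df.columns().forEach { col ->
        println("${col.name()}: ${col.type()}")
    }
}
```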
To access the columns in a type-safe manner, you can define your own `@DataSchema` interface representing your data, or you can let DF build this interface for you, using either `@file:ImportDataSchema` or a Gradle task (both are described here: https://kotlin.github.io/dataframe/schemasgradle.html#schema-inference). Both methods also work by reading the entire file, to generate `@DataSchema` interfaces from the inferred types of the column values.
As described in the docs, they also generate direct accessors to your file, so you can simply call `YourGivenName.readExcel()`. Under the hood this will call `DataFrame.readExcel("path/to/the/file").cast<YourGivenName>()`.
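As a rough sketch of the manual `@DataSchema` route (the `Person` interface and its columns are made-up placeholders, and an in-memory frame stands in for the Excel file):

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.annotations.DataSchema
import org.jetbrains.kotlinx.dataframe.api.cast
import org.jetbrains.kotlinx.dataframe.api.dataframeOf
import org.jetbrains.kotlinx.dataframe.api.forEach

// Hypothetical schema matching the columns of your Excel file
@DataSchema
interface Person {
    val name: String
    val age: Int
}

fun main() {
    // In-memory stand-in for DataFrame.readExcel("path/to/file.xlsx").cast<Person>()
    val df: DataFrame<Person> = dataframeOf("name", "age")(
        "Alice", 30,
        "Bob", 25,
    ).cast()

    // With the Gradle/KSP plugin, generated extension properties would allow
    // row.name / row.age; without it you can still access columns by name:
    df.forEach { row -> println("${row["name"]} is ${row["age"]}") }
}
```

The `ImportDataSchema`/Gradle routes effectively generate an interface like this for you from the file's inferred types.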
250k rows might be a lot, but you can definitely try. It's not millions of rows, after all.
If it is too large to just infer a data schema from, it's also possible to use those schema-inference methods with a smaller "sample" Excel file that is representative of the big file (in terms of types).
Something like:

```kotlin
@file:ImportDataSchema(
    "SchemaName",
    "path/to/small.xlsx",
)

// ...

val df = SchemaName.readExcel("path/to/large.xlsx")
```
Hope this clears up some things 🙂 Let me know if anything is still unclear.
Richard Glen
04/09/2024, 5:10 PM