Is there a plugin which can auto generate a data schema based on a CSV file? (kotlinlang #datascience)

# datascience

Yogeshvu

08/13/2024, 9:12 PM

Is there a plugin that can auto-generate a data schema from a CSV file (similar to generating Kotlin data classes from a JSON schema)?

Yogeshvu

08/13/2024, 9:18 PM

Use case: ingesting and transforming various kinds of CSV files and storing the resulting data in a database…

Andrei Kislitsyn

08/14/2024, 10:21 AM

Yes, see https://kotlin.github.io/dataframe/gradlereference.html#dsl-reference

Andrei Kislitsyn

08/14/2024, 10:21 AM

https://github.com/Kotlin/dataframe?tab=readme-ov-file#getting-started-with-data-schema

Yogeshvu

08/15/2024, 2:07 PM

This is neat! Thanks

Yogeshvu

08/19/2024, 8:08 PM

Is there a way to supply a source directory path instead of a file path to the plugin? In my scenario I have a folder with multiple CSV files that I would like to traverse and convert to data schemas.

Jolan Rensen [JB]

08/20/2024, 11:12 AM

Are the CSVs in the directory all of the same type? Then you can point the plugin to any of the CSV files in the directory, which will generate the data schema accordingly. To read all files in the directory adhering to the generated schema, you can then do something like:

import kotlin.io.path.*
import org.jetbrains.kotlinx.dataframe.api.concat

val df = Path("path/to/dir").listDirectoryEntries("*.csv").map {
    YourGeneratedType.readCSV(it.absolutePathString())
}.concat()

It is a good idea though; I made an issue for it: https://github.com/Kotlin/dataframe/issues/826

Yogeshvu

08/20/2024, 11:57 AM

Thanks. But my question was about using the plugin to generate data schema classes for different kinds of CSVs in a folder.

Jolan Rensen [JB]

08/20/2024, 12:06 PM

Ah, so a different data schema for each file in the folder?

Jolan Rensen [JB]

08/20/2024, 12:13 PM

If you're using the Gradle plugin, this can be done fairly easily from build.gradle.kts:

...
dataframes {
    val csvs = Path("path/to/folder").listDirectoryEntries("*.csv")
    for (csv in csvs) {
        schema {
            data = csv.absolutePathString()
            name = "your.package.${csv.toFile().nameWithoutExtension.replaceFirstChar { it.uppercase() }}"
        }
    }
    ...
}
...

is that what you meant? 🙂

Yogeshvu

08/20/2024, 4:34 PM

Yes, this is what I was looking for. thanks!

Yogeshvu

08/20/2024, 7:18 PM

I ran into an issue: when the CSV filename is something like “filename.1.csv”, it fails with the error:

contains illegal characters: .

Jolan Rensen [JB]

08/20/2024, 7:45 PM

Ah, you can also .replace('.', '_'). That should help 🙂 I'll try to see if we have a public API somewhere that can make interface names valid.
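A minimal sketch of a more general name sanitizer, in case file names contain other characters besides dots that are not allowed in a schema name. The function toSchemaName is a hypothetical helper, not part of the DataFrame API, and the set of allowed characters (ASCII letters, digits, underscore) is an assumption that is stricter than what Kotlin identifiers actually permit.

```kotlin
// Hypothetical helper (not a DataFrame API): turns a CSV file name into a
// string that is safe to use as a generated schema/interface name.
fun toSchemaName(fileName: String): String {
    // Drop only the final extension: "filename.1.csv" -> "filename.1"
    val base = fileName.substringBeforeLast('.')
    // Assumption: keep ASCII letters, digits and underscores; replace the rest
    val cleaned = base.replace(Regex("[^A-Za-z0-9_]"), "_")
    // An identifier cannot be empty or start with a digit, so prefix those cases
    val safe = if (cleaned.isEmpty() || cleaned.first().isDigit()) "_$cleaned" else cleaned
    // Capitalize the first character, as in the build.gradle.kts snippet
    return safe.replaceFirstChar { it.uppercase() }
}
```

With this helper, toSchemaName("filename.1.csv") gives "Filename_1". Note that two different file names can map to the same result (for example "a.1.csv" and "a-1.csv"), so a real build script may need to check for collisions.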

Yogeshvu

08/20/2024, 8:45 PM

Thanks. Another question: when generating the schema, it somehow generates the following for the ID column. My ID values look like M123, for example.

Yogeshvu

08/20/2024, 8:45 PM

image.png

Jolan Rensen [JB]

08/21/2024, 10:09 AM

Ah yep, it's a known bug for some CSV files: https://github.com/Kotlin/dataframe/issues/687 Maybe you could edit the files manually first to remove the BOM character. We're working on a solution on our side in the meantime :)

Yogeshvu

08/21/2024, 2:32 PM

thanks.. what is a BOM character?

Jolan Rensen [JB]

08/21/2024, 2:37 PM

it's an invisible character which marks the file encoding, like "UTF-8". It's not that common anymore, but it appears from time to time. If you open your CSV in IntelliJ, there's actually an option to remove it:

File > File Properties > Remove BOM

Yogeshvu

08/21/2024, 2:42 PM

Thanks, I was able to remove the BOM character using the tail command, as discussed here: https://stackoverflow.com/questions/45240387/how-can-i-remove-the-bom-from-a-utf-8-file
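For many files, the same cleanup could also be done from Kotlin before reading. The following is a minimal sketch that assumes the files are UTF-8 encoded and small enough to load into memory; stripUtf8Bom and removeBomInPlace are hypothetical helper names, not DataFrame functions.

```kotlin
import java.nio.file.Path
import kotlin.io.path.readBytes
import kotlin.io.path.writeBytes

// The UTF-8 byte order mark is the byte sequence EF BB BF.
private val UTF8_BOM = byteArrayOf(0xEF.toByte(), 0xBB.toByte(), 0xBF.toByte())

// Hypothetical helper: returns the content without a leading UTF-8 BOM.
// If there is no BOM, the input array is returned unchanged.
fun stripUtf8Bom(bytes: ByteArray): ByteArray {
    val hasBom = bytes.size >= UTF8_BOM.size &&
        UTF8_BOM.indices.all { bytes[it] == UTF8_BOM[it] }
    return if (hasBom) bytes.copyOfRange(UTF8_BOM.size, bytes.size) else bytes
}

// Hypothetical helper: rewrites the file only when a BOM was actually found.
fun removeBomInPlace(file: Path) {
    val original = file.readBytes()
    val stripped = stripUtf8Bom(original)
    if (stripped.size != original.size) file.writeBytes(stripped)
}
```

Calling removeBomInPlace on each entry returned by listDirectoryEntries("*.csv") would clean a whole folder. Since it overwrites files, it is safer to try it on copies first.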

Yogeshvu

08/21/2024, 2:43 PM

So do the files always need to be without a BOM for readCSV to be able to read them?

Jolan Rensen [JB]

08/21/2024, 2:44 PM

For now, yes, but I made a fix for it: https://github.com/Kotlin/dataframe/pull/831 It will likely be fixed once version 0.14 of dataframe hits 🙂

Yogeshvu

08/21/2024, 2:45 PM

Amazing, thanks! Is there a dev version of the build where one can try this out?

Jolan Rensen [JB]

08/21/2024, 2:45 PM

once the PR is merged, yes 🙂

Yogeshvu

08/21/2024, 2:46 PM

cool!

Yogeshvu

08/21/2024, 2:46 PM

thanks

Jolan Rensen [JB]

08/21/2024, 2:46 PM

np 🙂

Yogeshvu

08/21/2024, 7:36 PM

This is most likely a bug in the library… when the CSV file is empty, it generates the following error:
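The error text is not included in this thread, so the exact cause is unknown. As a possible workaround, assuming the failure is triggered by zero-byte files, the folder listing could skip them before a schema block is created for each file. The helper name nonEmptyCsvFiles is hypothetical; whether header-only files also fail is not covered here.

```kotlin
import java.nio.file.Path
import kotlin.io.path.fileSize
import kotlin.io.path.listDirectoryEntries

// Hypothetical helper: lists the CSV files in a folder, skipping files
// with a size of zero bytes. Assumption: only truly empty files cause
// the failure; files that contain just a header row are kept.
fun nonEmptyCsvFiles(dir: Path): List<Path> =
    dir.listDirectoryEntries("*.csv")
        .filter { it.fileSize() > 0L }
        .sorted() // stable order, so generated schemas do not depend on listing order
```

In build.gradle.kts, the result could take the place of the csvs list in the loop shown earlier, with the imports placed at the top of the script.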
