Is there a plugin which can auto generate datasche...
# datascience
y
Is there a plugin which can auto-generate a data schema based on a CSV file? (Similar to JSON schema to Kotlin data classes.)
Use case: ingest and transform various kinds of CSV files and store the resulting data in a database…
a
y
This is neat! Thanks
Is there a way to supply a source directory path instead of a file path to the plugin? In my scenario I have a folder with multiple CSV files which I would like to traverse and convert to data schemas.
j
Are the CSVs in the directory all of the same type? Then you can point the plugin to any of the CSV files in the directory, which will generate the data schema accordingly. To read all files in the directory adhering to the generated schema, you can then do something like:
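import kotlin.io.path.*
import org.jetbrains.kotlinx.dataframe.api.concat

// Read every CSV in the directory with the generated schema, then merge the results
val df = Path("path/to/dir").listDirectoryEntries("*.csv").map {
    YourGeneratedType.readCSV(it.absolutePathString())
}.concat()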
It is a good idea though, I made an issue for it https://github.com/Kotlin/dataframe/issues/826
y
Thanks. But my question was about generating data schema classes for different kinds of CSVs from a folder using the plugin.
j
Ah, so a different data schema for each file in the folder?
If you're using the Gradle plugin, this can be done fairly easily from build.gradle.kts:
...
dataframes {
    val csvs = Path("path/to/folder").listDirectoryEntries("*.csv")
    for (csv in csvs) {
        schema {
            data = csv.absolutePathString()
            name = "your.package.${csv.toFile().nameWithoutExtension.capitalize()}"
        }
    }
    ...
}
...
is that what you meant? 🙂
y
Yes, this is what I was looking for. thanks!
Ran into an issue: when the CSV filename is something like "filename.1.csv", it fails with the error:
contains illegal characters: .
j
Ah, you can also add .replace('.', '_') to the name. That should help 🙂 I'll try to see if we have a public API somewhere that can make interface names valid
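In case other characters show up in filenames too, here is a minimal sketch of a more general cleanup. `toSchemaName` is a hypothetical helper written for illustration, not something from the DataFrame library, and it assumes you only want ASCII letters, digits and underscores in the generated name:

```kotlin
// Hypothetical helper, NOT part of the DataFrame API: turns a file name
// such as "filename.1.csv" into a name usable for a generated schema.
fun toSchemaName(fileName: String): String {
    // Drop the final extension: "filename.1.csv" -> "filename.1"
    val base = fileName.substringBeforeLast('.')
    // Replace anything that is not an ASCII letter, digit or underscore
    val cleaned = base.replace(Regex("[^A-Za-z0-9_]"), "_")
    // Identifiers cannot start with a digit, so prefix an underscore if needed
    val safe = if (cleaned.firstOrNull()?.isDigit() == true) "_$cleaned" else cleaned
    // Capitalize the first character, as in the snippet above
    return safe.replaceFirstChar { it.uppercase() }
}
```

With this, toSchemaName("filename.1.csv") gives "Filename_1", which could be used in place of the nameWithoutExtension expression when building the schema name.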
y
Thanks. Another query: when generating the schema, it somehow generates the following for the ID column. My ID values look like M123, as an example.
image.png
j
Ah yep, it's a known bug for some csv files: https://github.com/Kotlin/dataframe/issues/687 Maybe you could edit the files manually first to remove the BOM character. We're working on a solution on our side in the meantime :)
y
thanks.. what is a BOM character?
j
It's an invisible character (the byte order mark, U+FEFF) at the very start of a file which marks the file encoding, like "UTF-8". It's not that common anymore, but it appears from time to time. If you open your CSV in IntelliJ, there's actually an option to remove it: File -> File Properties -> Remove BOM
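If there are many files, doing it by hand gets tedious, so here is a rough sketch of stripping it in code before reading. It assumes the files are UTF-8, where the BOM is the three bytes EF BB BF; `stripUtf8Bom` and `removeBom` are hypothetical helpers, not DataFrame functions:

```kotlin
import java.io.File

// UTF-8 encoding of the byte order mark U+FEFF
private val UTF8_BOM = byteArrayOf(0xEF.toByte(), 0xBB.toByte(), 0xBF.toByte())

// Hypothetical helper: returns the bytes without a leading UTF-8 BOM, if present.
fun stripUtf8Bom(bytes: ByteArray): ByteArray {
    val hasBom = bytes.size >= 3 && bytes.take(3) == UTF8_BOM.toList()
    return if (hasBom) bytes.copyOfRange(3, bytes.size) else bytes
}

// Hypothetical helper: rewrites the file in place only when a BOM was found.
fun removeBom(file: File) {
    val original = file.readBytes()
    val stripped = stripUtf8Bom(original)
    if (stripped.size != original.size) file.writeBytes(stripped)
}
```

You could call removeBom on each CSV in the folder before pointing the plugin or readCSV at it. Note that it reads the whole file into memory, so it is only meant for reasonably small files.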
y
Thanks, I was able to remove the BOM character using the tail command, as discussed here: https://stackoverflow.com/questions/45240387/how-can-i-remove-the-bom-from-a-utf-8-file
So do the files always need to be without a BOM for readCSV to read them correctly?
j
For now, yes, but I made a fix for it: https://github.com/Kotlin/dataframe/pull/831 It will likely be fixed once version 0.14 of dataframe hits 🙂
y
Amazing, thanks! Is there a dev version of the build where one can try this out?
j
once the PR is merged, yes 🙂
y
cool!
thanks
j
np 🙂
y
This is most likely a bug in the library… when the CSV file is empty, it generates the following error: