Hello friends I beseech your wisdom blush I have to tackle a kotlinlang #datascience

Paulo Cereda

11/25/2022, 8:38 AM

Hello friends! I beseech your wisdom! 😊 I have to tackle a `.csv` file (exported from a third-party system) which has integer columns with `,` as thousands separator (e.g., `47,302`, `48,000`). Needless to say, this is potentially problematic. 😅 When I load my `.csv` file into my Jupyter notebook, I believe `dataframe` relies on my system locale (`pt_BR`) and thus parses these integer columns as doubles: `pt_BR` has `,` as decimal separator and `.` as thousands separator. I end up having wrong values in those columns (`.csv` is of course to blame, not `dataframe`). So I was wondering if I could (a) disable type inference for either the entire `.csv` or selected columns and get everything as string, so I can manually parse these values, (b) change the underlying locale and see if it helps the type inference mechanism, or (c) have parsing rules associated with certain columns. Any suggestions are highly appreciated! I apologise in advance if this is trivial, but I failed to identify a similar scenario in the documentation. Cheers! 🙏

altavir

11/25/2022, 9:16 AM

What is the size of the CSV you are using?

altavir

11/25/2022, 9:18 AM

It is possible to load the file as text and then replace the appropriate values. I usually do that in Excel. You can also change the system locale programmatically before loading.

Paulo Cereda

11/25/2022, 9:20 AM

I have two, actually... one with 200 rows / 58 columns. The other has 300 rows / 70 columns. I am actually parsing the `.csv` files before loading into Jupyter (I wrote a Kotlin script for this), but I was wondering if I could find an easier way. 🙂

Paulo Cereda

11/25/2022, 9:22 AM

I could try changing the system locale! Are there any pointers on how to do that? Sorry, I've never done that before... 😅

altavir

11/25/2022, 9:22 AM

Locale.setDefault()

Paulo Cereda

11/25/2022, 9:23 AM

Cool, will try now! Thanks!

altavir

11/25/2022, 9:23 AM

I've done it for this exact reason. In the Russian locale the default decimal separator is a comma, so it breaks a lot of things.
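For readers who want to see the effect in isolation, here is a minimal sketch using only the JDK's `java.text.NumberFormat`. It is not the `dataframe` parser itself; the assumption is simply that locale-aware parsing behaves along the same lines. The helper name `parseWithDefaultLocale` is hypothetical.

```kotlin
import java.text.NumberFormat
import java.util.Locale

// Hypothetical helper: parses text with whatever the JVM default locale currently is.
fun parseWithDefaultLocale(text: String): Number =
    NumberFormat.getInstance().parse(text)

fun main() {
    // Under pt-BR the comma is the decimal separator.
    Locale.setDefault(Locale.forLanguageTag("pt-BR"))
    println(parseWithDefaultLocale("47,302")) // 47.302

    // Under en-US the comma is the grouping (thousands) separator.
    Locale.setDefault(Locale.US)
    println(parseWithDefaultLocale("47,302")) // 47302
}
```

Note that `Locale.setDefault()` changes the default for the whole JVM, so in a notebook it affects every later cell as well.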

Paulo Cereda

11/25/2022, 9:25 AM

That's awesome, Alexander! It worked like a charm! Thank you so much! 🙏

altavir

11/25/2022, 9:35 AM

If I remember correctly, it should also be possible to define a custom column type. But if you need it only once, it is not worth the effort.

Paulo Cereda

11/25/2022, 9:55 AM

Agreed. 🙂 I think it would be nice to have another way of doing this (e.g., pre-processing the column), but it's good to have a working solution for now. 🙂
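As a rough sketch of what such pre-processing could look like once the values are available as strings: the helper `parseGroupedInt` below is hypothetical and plain Kotlin, not part of `dataframe`. It assumes the grouping separator is known in advance and that the cells contain nothing else besides digits and an optional sign.

```kotlin
// Hypothetical helper: strips the grouping separator and parses the remainder as Int.
// Throws NumberFormatException if the cleaned text is not a valid integer.
fun parseGroupedInt(text: String, groupingSeparator: Char = ','): Int =
    text.trim().replace(groupingSeparator.toString(), "").toInt()

fun main() {
    val raw = listOf("47,302", "48,000", "512")
    val parsed = raw.map { parseGroupedInt(it) }
    println(parsed) // [47302, 48000, 512]
}
```

Because the separator is passed explicitly, the result does not depend on the system locale.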

Nikita Klimenko [JB]

11/25/2022, 7:12 PM

Hi 🙂 a) it's possible to disable type inference for specific columns

val df = DataFrame.readCSV(
    "datasets/decimals.csv",
    colTypes = mapOf("colName" to ColType.String)
)

b) it's also possible to provide locale as a parameter to readCSV

val df = DataFrame.readCSV(
    "datasets/decimals.csv",
    parserOptions = ParserOptions(locale = Locale.UK),
)

Paulo Cereda

11/26/2022, 9:05 AM

That's fantastic, Nikita! Thank you so much! 🥳 🎉
