# datascience
p
Hello friends! I beseech your wisdom! 😊 I have to tackle a `.csv` file (exported from a third-party system) which has integer columns with `,` as the thousands separator (e.g., `47,302`, `48,000`). Needless to say, this is potentially problematic. 😅 When I load my `.csv` file into my Jupyter notebook, I believe `dataframe` relies on my system locale (`pt_BR`) and thus parses these integer columns as doubles: `pt_BR` has `,` as the decimal separator and `.` as the thousands separator. I end up with wrong values in those columns (the `.csv` is of course to blame, not `dataframe`). So I was wondering if I could (a) disable type inference for either the entire `.csv` or selected columns and get everything as strings, so I can manually parse these values, (b) change the underlying locale and see if it helps the type inference mechanism, or (c) have parsing rules associated with certain columns. Any suggestions are highly appreciated! I apologise in advance if this is trivial, but I failed to identify a similar scenario in the documentation. Cheers! 🙏
a
What is the size of the CSV you are using?
It is possible to load the file as text and then replace the appropriate values. I usually do that in Excel. You can also change the system locale programmatically before loading.
p
I have two, actually... one with 200 rows / 58 columns. The other has 300 rows / 70 columns. I am actually parsing the `.csv` files before loading into Jupyter (I wrote a Kotlin script for this), but I was wondering if I could find an easier way. 🙂
I could try changing the system locale! Are there any pointers on how to do that? Sorry, I've never done that before... 😅
a
Locale.setDefault()
p
Cool, will try now! Thanks!
a
I've done it for this exact reason. In the Russian locale the default decimal separator is a comma, so it breaks a lot of things.
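For illustration, here is a minimal sketch of why the default locale matters. It uses only the JDK's `NumberFormat`, not the dataframe library itself (this thread does not show how the library parses numbers internally), and `parseWithDefaultLocale` is a hypothetical helper name. Choosing `Locale.US` is an assumption that fits a file using `,` for thousands.

```kotlin
import java.text.NumberFormat
import java.util.Locale

// Hypothetical helper: parses text using whatever the JVM default locale currently is.
fun parseWithDefaultLocale(text: String): Number =
    NumberFormat.getInstance().parse(text)

fun main() {
    val original = Locale.getDefault()
    try {
        // Under pt-BR the comma is the decimal separator.
        Locale.setDefault(Locale.forLanguageTag("pt-BR"))
        println(parseWithDefaultLocale("47,302"))

        // Under US the comma is the thousands (grouping) separator.
        Locale.setDefault(Locale.US)
        println(parseWithDefaultLocale("47,302"))
    } finally {
        // Restore the previous default so the rest of the session is unaffected.
        Locale.setDefault(original)
    }
}
```

Note that `Locale.setDefault` changes the default for the whole JVM, so in a notebook it affects every later cell until it is changed back.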
p
That's awesome, Alexander! It worked like a charm! Thank you so much! 🙏
a
If I remember correctly, it should also be possible to define a custom column type. But if you only need it once, it is not worth the effort.
p
Agreed. 🙂 I think it would be nice to have another way of doing this (e.g., pre-processing the column), but it's good to have a working solution for now. 🙂
n
Hi 🙂 a) it's possible to disable type inference for specific columns
val df = DataFrame.readCSV(
    "datasets/decimals.csv",
    colTypes = mapOf("colName" to ColType.String)
)
b) it's also possible to provide locale as a parameter to readCSV
val df = DataFrame.readCSV(
    "datasets/decimals.csv",
    parserOptions = ParserOptions(locale = Locale.UK),
)
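With option a), the column arrives as strings such as `47,302` and still has to be converted by hand. A minimal sketch of that conversion step in plain Kotlin is below. `parseGroupedInt` is a hypothetical helper, not part of the dataframe API, and it assumes `,` only ever appears as a thousands separator in the column.

```kotlin
// Hypothetical helper: turns "47,302" into 47302 by dropping the ',' separators.
// Throws NumberFormatException if what remains is not a valid integer.
fun parseGroupedInt(text: String): Int =
    text.trim().replace(",", "").toInt()

fun main() {
    val raw = listOf("47,302", "48,000")
    val parsed = raw.map(::parseGroupedInt)
    println(parsed) // [47302, 48000]
}
```

Stripping the separator explicitly keeps the result independent of the JVM's default locale, which is the source of the original problem.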
p
That's fantastic, Nikita! Thank you so much! 🥳 🎉