h

holgerbrandl

06/26/2020, 6:23 AM
Hi there, what about missing values in Kotlin? I constantly struggle with this question when working on krangl. For doubles we could follow the pythonic way and treat NaN as a missing value, but for integers there is no equivalent. For sure, nullability is a core feature of the language, but it feels wrong to me (from a performance and memory perspective) to replace IntArray with Array<Int?> to support missing values as null (which would be the prettiest option, though). Any ideas, KIPs, etc.? To give another example,
Double.NaN.toInt()
evaluating to 0 is just plain wrong (and a bug imho). (
NaN.roundToInt
at least throws an exception)
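The difference between the two conversions mentioned above can be checked with a small sketch. The helper name roundToIntThrows is made up for this illustration; toInt() and kotlin.math.roundToInt are the standard library calls under discussion.

```kotlin
import kotlin.math.roundToInt

// Hypothetical helper: returns true when roundToInt() rejects the value with an exception.
fun roundToIntThrows(x: Double): Boolean =
    try {
        x.roundToInt()
        false
    } catch (e: IllegalArgumentException) {
        true
    }

fun main() {
    // toInt() silently converts NaN to 0, so the "missing" marker is lost.
    println(Double.NaN.toInt())
    // roundToInt() refuses to convert NaN and throws instead.
    println(roundToIntThrows(Double.NaN))
}
```

So a NaN-as-missing convention does not survive a plain toInt() conversion, which is the concern raised in the message.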
a

altavir

06/26/2020, 6:58 AM
It is a really good question. I do not think that using NaN for something other than Double is the correct way. Personally, I design APIs so that they can handle nulls, but that does not mean that the underlying structure is nullable. There could be an additional method like
notNull()
which gets a non-nullable buffer. From here you can go several ways:
1. A list of nullables: you will suffer performance issues, but only on batch operations; there should be no problems with single-value get.
2. A non-nullable structure optimized for performance but without missing values.
3. A non-nullable structure with an additional missing-value map, which contains the indices of missing values.
It is important, though, that the API for all cases should be the same so that we can substitute the implementation. Maybe this question could be interesting for @elizarov
an accidentally duplicated "not". NaNs are bad.
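A minimal sketch of the three options behind one shared API might look as follows. All type and function names here (IntColumn, BoxedIntColumn, DenseIntColumn, IndexedMissingIntColumn, countMissing) are hypothetical and are not taken from krangl or kmath; only notNull() echoes the method name suggested in the message.

```kotlin
// Hypothetical common API: every implementation exposes values as Int?.
interface IntColumn {
    val size: Int
    operator fun get(index: Int): Int?
}

// Option 1: a list of nullables (boxed values, simplest to write).
class BoxedIntColumn(private val values: List<Int?>) : IntColumn {
    override val size: Int get() = values.size
    override fun get(index: Int): Int? = values[index]
}

// Option 2: a primitive array that cannot hold missing values at all.
class DenseIntColumn(private val values: IntArray) : IntColumn {
    override val size: Int get() = values.size
    override fun get(index: Int): Int? = values[index]

    // Optional access to the non-nullable storage for optimized operations.
    fun notNull(): IntArray = values
}

// Option 3: a primitive array plus a set with the indices of missing values.
class IndexedMissingIntColumn(
    private val values: IntArray,
    private val missing: Set<Int>
) : IntColumn {
    override val size: Int get() = values.size
    override fun get(index: Int): Int? = if (index in missing) null else values[index]
}

// Written once against the shared API, so implementations can be swapped freely.
fun countMissing(column: IntColumn): Int =
    (0 until column.size).count { column[it] == null }
```

Code written against the interface, such as countMissing, does not need to know which storage strategy is used underneath.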
e

elizarov

06/26/2020, 7:58 AM
What are your use-cases for integers in data science?
a

altavir

06/26/2020, 7:59 AM
The most common case is a time-series (it is not only for "data" science). You have a sequence of numbers with fixed time shift between them. Some of those values could be missing from the data set.
The problem is the same for Doubles. NaN is not a solution, since you usually want to do operations like smoothing or window-based processing on those numbers, and a single NaN could lead to terrible consequences: the analysis will continue without errors, but all values will be NaNs.
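To illustrate how a single NaN can spread, here is a sketch of a moving average that keeps a running sum. The function name movingAverage and the sample data are assumptions made for this example; it is not code from any of the libraries discussed.

```kotlin
// Moving average over a sliding window, implemented with a running sum.
fun movingAverage(values: DoubleArray, window: Int): DoubleArray {
    require(window in 1..values.size) { "window must be between 1 and the input size" }
    var sum = 0.0
    for (i in 0 until window) sum += values[i]
    val result = DoubleArray(values.size - window + 1)
    result[0] = sum / window
    for (i in window until values.size) {
        // Slide the window: add the entering value, subtract the leaving one.
        sum += values[i] - values[i - window]
        result[i - window + 1] = sum / window
    }
    return result
}

fun main() {
    val series = doubleArrayOf(1.0, Double.NaN, 3.0, 4.0, 5.0, 6.0)
    // No error is raised, yet the NaN stays in the running sum,
    // so even windows that no longer contain it come out as NaN.
    println(movingAverage(series, 2).joinToString())
}
```

With this kind of implementation the NaN never leaves the accumulator, which matches the "no errors, but everything is NaN" failure mode described above.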
e

elizarov

06/26/2020, 10:39 AM
What’s the accepted solution in other ecosystems? How do you deal with missing values in dataframes?
a

altavir

06/26/2020, 10:46 AM
Well, the other ecosystem is Python, and it does not make much sense as a model since Python does not operate with types. Here is the part of the documentation about that: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html. The basic solution is to use NaN for floating point and to transform integers to floats for data with missing values, which I think is not acceptable. Basically, Python does not guarantee the physical representation or precision of the data. I do not think it is the way we want for Kotlin. Also, the problem of accidentally polluting the result with a NaN (for example in a fold operation) is really annoying both in Python and C++ (we recently had a major problem with that).
e

elizarov

06/26/2020, 10:59 AM
My intuition says that a Kotlin way would be to represent missing ints with
null
and so to expose an
Int?
type to the users if values could be missing. Internally, if a memory-efficient representation is needed, it can be represented as a pair of
IntArray
and
BooleanArray
.
👍 1
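The suggestion above could be sketched roughly like this. MaskedIntColumn and its factory function of are hypothetical names chosen for the example; the only part taken from the discussion is the idea of pairing an IntArray with a BooleanArray and exposing Int? to users.

```kotlin
// Hypothetical column type: primitive storage inside, Int? outside.
class MaskedIntColumn private constructor(
    private val values: IntArray,
    private val present: BooleanArray
) {
    val size: Int get() = values.size

    // Users only ever see Int?; a cleared mask entry means "missing".
    operator fun get(index: Int): Int? = if (present[index]) values[index] else null

    companion object {
        // Builds both arrays from nullable input; missing slots hold a dummy 0.
        fun of(vararg items: Int?): MaskedIntColumn {
            val values = IntArray(items.size) { items[it] ?: 0 }
            val present = BooleanArray(items.size) { items[it] != null }
            return MaskedIntColumn(values, present)
        }
    }
}

fun main() {
    val column = MaskedIntColumn.of(1, null, 3)
    println(column[0])
    println(column[1])
}
```

The dummy value stored in a missing slot is never returned, so any Int, including 0, remains usable as real data.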
a

altavir

06/26/2020, 11:01 AM
Yes, I also think that the API should expose nullables and give optional access to the underlying representation for optimized operations. An implementation with a boolean mask should be rather efficient.
h

holgerbrandl

06/27/2020, 7:45 AM
I do not think that using NaN for something other than Double is a correct way.
+1, I can't imagine an int use case for NaN either. But nullability/NA must be supported for both since missing values are a defining aspect of DS. 🙂
A list of nullables - you will suffer performance issues, but only on batch operations, there should be no problems with single value get.
Since krangl intends to be a pandas/dplyr lib for kotlin, column operations in tables are performance-critical. To me the NA model defines how DS APIs need to be designed in general (i.e. based on which base types).
What are your use-cases for integers in data science?
Life Sciences is all about counts (peptides, sequencing fragments, cells). Also in many other domains (manufacturing, social sciences) count data are omnipresent. So ints are imho mandatory for any stack/language that is serious about DS.
What’s the accepted solution in other ecosystems?
Alex's link is a good starter, but imho this one is better: https://pandas.pydata.org/pandas-docs/dev/user_guide/gotchas.html#choice-of-na-representation . It's a great summary and provides a lot of insight into the matter.
My intuition says that a Kotlin way would be to represent missing ints with null and so to expose an Int? type to the users if values could be missing. Internally, if a memory-efficient representation is needed, it can be represented as a pair of IntArray and BooleanArray
+1, this would be my preferred solution as well, with the addition that the same should apply to Double for the sake of consistency, since NaN != null. I just wonder if I should try to implement this as a library developer (e.g. in krangl) or if this should be baked into the core APIs/language? In particular since Kotlin is pointing to DS as a first-class citizen, I'd favor the latter (but I may be just too lazy here :-)
a

altavir

06/27/2020, 7:48 AM
Core language gives us nullables. We can choose how to work with them. I think you should start with API design and we will think later about how to do it fast. Most of the things could be done under the hood without changing the API.
Staged compilation in Jupyter notebooks opens even broader opportunities. We can optimize existing structures on cell compilation.
h

holgerbrandl

06/27/2020, 8:00 AM
Core language gives us nullables.
1. It does, but performance in numpy/pandas/dplyr is all about vectorization. IntArray/DoubleArray are vectorized whereas Array<Any> is not, and I'm not sure if compilers are clever enough to change this.
2. If Kotlin is targeting DS, then newbies will start with core array types such as IntArray, expecting NA support, which would not be present and would only be provided in different implementations by third-party libs. Not sure if this would be a consistent/convincing picture.
e

elizarov

06/27/2020, 8:04 AM
I’m not sure I’m following your point about novices. Novices in DS should be using high-level libraries that are internally optimized and vectorized for data science. There’s zero chance that novices will manage to write efficient code for DS themselves using core language primitives (this is true in any of the ecosystems I’m aware of).
a

altavir

06/27/2020, 8:05 AM
Again, we have two different things: user API and internal optimizations. The user has two regimes of accessing the data:
• single items
• convolution operations
Nullability does not affect single-item access, so I think that it should be kept nullable. As for convolution operations, we can provide default implementation-independent operations, and then we could have specific optimized operations for specific implementations. In kmath I do it for doubles: I explicitly check whether the implementation supports non-boxing access and use it if possible.
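The "default operation plus optimized path" idea can be sketched as below. This is not the kmath code; NullableDoubleBuffer, BoxedDoubleBuffer, PrimitiveDoubleBuffer and sumSkippingMissing are hypothetical names, and the type check only stands in for "does this implementation support non-boxing access".

```kotlin
// Hypothetical nullable-facing buffer API.
interface NullableDoubleBuffer {
    val size: Int
    operator fun get(index: Int): Double?
}

// Generic implementation backed by boxed, possibly missing values.
class BoxedDoubleBuffer(private val items: List<Double?>) : NullableDoubleBuffer {
    override val size: Int get() = items.size
    override fun get(index: Int): Double? = items[index]
}

// Implementation that has no missing values and exposes its primitive storage.
class PrimitiveDoubleBuffer(val array: DoubleArray) : NullableDoubleBuffer {
    override val size: Int get() = array.size
    override fun get(index: Int): Double? = array[index]
}

fun sumSkippingMissing(buffer: NullableDoubleBuffer): Double {
    // Optimized path: work on the primitive array directly, without boxing.
    if (buffer is PrimitiveDoubleBuffer) return buffer.array.sum()
    // Default, implementation-independent path through the nullable API.
    var sum = 0.0
    for (i in 0 until buffer.size) {
        val value = buffer[i]
        if (value != null) sum += value
    }
    return sum
}
```

Callers use one function either way; which path runs is an internal detail that can change without touching the API.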
The actual problem is to provide a reasonable way to configure the missing data treatment, because the desirable behavior could be different for different operations.
h

holgerbrandl

06/27/2020, 8:07 AM
In particular, performance implications are not theoretical but very real in benchmarks I've implemented comparing krangl with (the C++-backed) dplyr. Clearly, krangl's implementations are not fully there yet, but in many cases (simple column math) I currently suspect the collections APIs (compared to vectorized array operations in dplyr) to be the cause. Maybe I should wrap up my findings and share them to build a more scientifically solid ground for discussion.
Missing data treatment is imho the next step and can be covered by the various DS APIs. However, first we'd need a consistent way to represent NAs.
a

altavir

06/27/2020, 8:09 AM
I think the best way is to schedule a remote meeting about that and discuss the details. I've done some preliminary performance evaluation of different implementations in kmath, and it ranges from numpy-like to something like 2 times worse depending on how it is done.
Also it depends on the VM. Graal has much better escape analysis and it mitigates nullability problems in many cases.
By the way, benchmarking is a sore point. kotlinx-benchmarks fails more often than it works.
And I am pretty sure we (even I) could do a very effective NaN/null implementation using masks, as Roman showed earlier. I will try to add a POC implementation of a kmath buffer later today and get back to you.
e

elizarov

06/27/2020, 8:14 AM
(The alternative is to designate something like Int.MIN_VALUE as “missing”. We need to benchmark which one is better.)
a

altavir

06/27/2020, 8:15 AM
I do not like it this way, because you have the same problems as with NaNs. You can accidentally forget to treat that specific value and get a completely wrong result without knowing about it.
e

elizarov

06/27/2020, 8:15 AM
It can be only internal. You can still expose it as Int? outside.
internalValue.takeIf { it != Int.MIN_VALUE }
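Wrapped into a small class, the sentinel idea might look like the sketch below. SentinelIntColumn and the MISSING constant are hypothetical names for this example; the takeIf expression is the one from the message.

```kotlin
// Hypothetical column type: a reserved Int value marks "missing" internally.
class SentinelIntColumn(private val raw: IntArray) {
    val size: Int get() = raw.size

    // The sentinel never leaks out: callers see Int? and get null instead.
    operator fun get(index: Int): Int? = raw[index].takeIf { it != MISSING }

    companion object {
        // Assumption of this sketch: Int.MIN_VALUE is never needed as real data.
        const val MISSING: Int = Int.MIN_VALUE
    }
}

fun main() {
    val column = SentinelIntColumn(intArrayOf(7, SentinelIntColumn.MISSING, 9))
    println(column[0])
    println(column[1])
}
```

Compared with a separate mask, this needs no second array, at the cost of giving up one representable value.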
a

altavir

06/27/2020, 8:16 AM
Yes. I mean for fold operations. For single value access, performance is not that important.
e

elizarov

06/27/2020, 8:17 AM
You’ll have to check it for folds. Either booleans or designated values: special checking on folds will still be needed to skip missing values (skipping missing values on folds looks like a good default to me).
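Both internal representations need an explicit check inside a fold, as this sketch shows for a sum that skips missing values. The function names sumWithMask and sumWithSentinel are made up for the illustration, and which variant is faster is left open here, as in the discussion.

```kotlin
// Fold over an IntArray + BooleanArray pair, skipping entries marked as missing.
fun sumWithMask(values: IntArray, present: BooleanArray): Long {
    require(values.size == present.size) { "values and mask must have the same size" }
    var sum = 0L
    for (i in values.indices) {
        if (present[i]) sum += values[i]
    }
    return sum
}

// Fold over a sentinel-encoded IntArray, skipping the designated value.
fun sumWithSentinel(values: IntArray, missing: Int = Int.MIN_VALUE): Long {
    var sum = 0L
    for (value in values) {
        if (value != missing) sum += value
    }
    return sum
}

fun main() {
    println(sumWithMask(intArrayOf(1, 2, 3), booleanArrayOf(true, false, true)))
    println(sumWithSentinel(intArrayOf(1, Int.MIN_VALUE, 3)))
}
```

Either way the loop body contains one extra branch per element; only the source of the "is it missing" answer differs.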
a

altavir

06/27/2020, 8:17 AM
OK, let me do that POC, we can ask a student to check performance later.
I was thinking about a byte mask. A single byte could handle more than one flag, like NaN, null, infinity, etc.
Skipping is not always a possibility. For example, it cannot simply be done for time series windowing.
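A byte mask with several flags per value could be sketched like this. The names ValueFlags and FlaggedDoubleColumn and the specific bit assignments are assumptions for this example and are not meant to describe any existing implementation.

```kotlin
// Hypothetical bit flags; each one occupies a separate bit of the mask byte.
object ValueFlags {
    const val MISSING: Int = 1
    const val NAN: Int = 2
    const val INFINITE: Int = 4
}

// Primitive values plus one flag byte per element.
class FlaggedDoubleColumn(
    private val values: DoubleArray,
    private val flags: ByteArray
) {
    init {
        require(values.size == flags.size) { "values and flags must have the same size" }
    }

    val size: Int get() = values.size

    // Tests a single bit, so several flags can be set on the same element.
    fun hasFlag(index: Int, flag: Int): Boolean = (flags[index].toInt() and flag) != 0

    // Only the MISSING flag turns into null; other flags are informational.
    operator fun get(index: Int): Double? =
        if (hasFlag(index, ValueFlags.MISSING)) null else values[index]
}

fun main() {
    val column = FlaggedDoubleColumn(
        doubleArrayOf(1.5, 0.0, Double.NaN),
        byteArrayOf(0, ValueFlags.MISSING.toByte(), ValueFlags.NAN.toByte())
    )
    println(column[1])
    println(column.hasFlag(2, ValueFlags.NAN))
}
```

An operation that cannot simply skip values, such as windowing, could inspect the flags and decide per window how to react.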
e

elizarov

06/27/2020, 8:20 AM
True. It all depends and is also very domain-specific. I’ve worked a lot with financial time series, and they have their own rules and conventions on missing values which do not necessarily apply to other domains.
a

altavir

06/27/2020, 9:22 AM
Here is a POC implementation: https://github.com/mipt-npm/kmath/blob/dev/kmath-core/src/commonMain/kotlin/scientifik/kmath/structures/FlaggedBuffer.kt. I do not have time to write tests, but I will try to add them later.