holgerbrandl
06/26/2020, 6:23 AM
Double.NaN.toInt() evaluating to 0 is just plain wrong (and a bug imho). (NaN.roundToInt() at least throws an exception.)
altavir
06/26/2020, 6:58 AM
notNull() which gets non-nullable buffer. From here you can go several ways:
1. A list of nullables - you will suffer performance issues, but only on batch operations, there should be no problems with single value get.
2. Non-nullable structure optimized for performance but without missing values.
3. Non-nullable structure with an additional missing-value map, which contains the indices of missing values.
It is important, though, that the API should be the same in all cases so that we can substitute the implementation.
Maybe this question could be interesting for @elizarov
elizarov
06/26/2020, 7:58 AM
altavir
06/26/2020, 7:59 AM
elizarov
06/26/2020, 10:39 AM
altavir
06/26/2020, 10:46 AM
elizarov
06/26/2020, 10:59 AM
My intuition says that a Kotlin way would be to represent missing ints with null and so to expose an Int? type to the users if values could be missing. Internally, if a memory-efficient representation is needed, it can be represented as a pair of IntArray and BooleanArray.
altavir
06/26/2020, 11:01 AM
holgerbrandl
06/27/2020, 7:45 AM
> I do not think that using NaN for something other than Double is a correct way.
+1, I can't imagine an int use case for NaN either. But nullability/NA must be supported for both, since missing values are a defining aspect of DS. 🙂
> A list of nullables - you will suffer performance issues, but only on batch operations, there should be no problems with single value get.
Since krangl intends to be a pandas/dplyr lib for Kotlin, column operations in tables are performance-critical. To me the NA model defines how DS APIs need to be designed in general (i.e. based on which base types).
> What are your use-cases for integers in data science?
Life sciences is all about counts (peptides, sequencing fragments, cells). Count data are also omnipresent in many other domains (manufacturing, social sciences). So ints are imho mandatory for any stack/language that is serious about DS.
> What’s the accepted solution in other ecosystems?
Alex's link is a good starter, but imho this one is better: https://pandas.pydata.org/pandas-docs/dev/user_guide/gotchas.html#choice-of-na-representation . It's a great summary and provides a lot of insight into the matter.
> My intuition says that a Kotlin way would be to represent missing ints with null and so to expose an Int? type to the users if values could be missing. Internally, if a memory-efficient representation is needed, it can be represented as a pair of IntArray and BooleanArray
+1, this would be my preferred solution as well, with the addition that the same should apply to Double for the sake of consistency, since NaN != null. I just wonder if I should try to implement this as a library developer (e.g. in krangl) or if this should be baked into the core APIs/language? In particular since Kotlin is pointing to DS as a first-class citizen, I'd favor the latter (but I may be just too lazy here :-)
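The IntArray plus BooleanArray idea quoted above could look roughly like the following. This is only a minimal sketch: NullableIntColumn and its members are hypothetical names made up for illustration, not part of krangl or the Kotlin standard library.

```kotlin
// Hypothetical sketch: NullableIntColumn is an illustrative name, not a krangl or stdlib type.
class NullableIntColumn(
    private val values: IntArray,      // primitive storage, no boxing
    private val missing: BooleanArray  // missing[i] == true means values[i] is NA
) {
    init {
        require(values.size == missing.size) { "values and mask must have the same size" }
    }

    val size: Int get() = values.size

    // Users only ever see Int?; the two primitive arrays stay an implementation detail.
    operator fun get(index: Int): Int? = if (missing[index]) null else values[index]

    // Batch operation that runs on the primitive arrays directly and skips missing entries.
    fun sum(): Long {
        var total = 0L
        for (i in values.indices) {
            if (!missing[i]) total += values[i]
        }
        return total
    }

    companion object {
        // Builds the column from nullable values, recording nulls in the mask.
        fun of(vararg items: Int?): NullableIntColumn {
            val values = IntArray(items.size)
            val missing = BooleanArray(items.size)
            for (i in items.indices) {
                val item = items[i]
                if (item == null) missing[i] = true else values[i] = item
            }
            return NullableIntColumn(values, missing)
        }
    }
}

fun main() {
    val counts = NullableIntColumn.of(3, null, 5)
    println(counts[1])     // null
    println(counts.sum())  // 8
}
```

The same layout would carry over to Double with a DoubleArray, which keeps NaN and "missing" as two distinct states.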
altavir
06/27/2020, 7:48 AM
holgerbrandl
06/27/2020, 8:00 AM
> Core language gives us nullables.
1. It does, but performance in numpy/pandas/dplyr is all about vectorization. IntArray/DoubleArray are vectorized whereas Array<Any> is not, and I'm not sure if compilers are clever enough to change this.
2. If Kotlin is targeting DS, then newbies will start with core array types such as IntArray expecting NA support, which would not be present and would only be provided in different implementations by third-party libs. Not sure if this will be a consistent/convincing picture.
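To make the first point concrete, here is a small sketch contrasting a reduction over IntArray with the same reduction over Array<Int?>. The function names are hypothetical, and the sketch only illustrates the difference in element representation. It does not measure anything.

```kotlin
// On the JVM, IntArray compiles to int[] while Array<Int?> compiles to Integer[],
// so the elements of the latter are references to boxed objects (or null).

// Tight loop over primitives: no boxing and no null checks.
fun sumPrimitive(values: IntArray): Long {
    var total = 0L
    for (v in values) total += v
    return total
}

// Same reduction over nullable elements: each element needs a null check and unboxing.
fun sumNullable(values: Array<Int?>): Long {
    var total = 0L
    for (v in values) if (v != null) total += v
    return total
}

fun main() {
    val dense = intArrayOf(1, 2, 3)
    val sparse: Array<Int?> = arrayOf(1, null, 3)
    println(sumPrimitive(dense))  // 6
    println(sumNullable(sparse))  // 4
}
```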
elizarov
06/27/2020, 8:04 AM
altavir
06/27/2020, 8:05 AM
holgerbrandl
06/27/2020, 8:07 AM
altavir
06/27/2020, 8:09 AM
elizarov
06/27/2020, 8:14 AM
altavir
06/27/2020, 8:15 AM
elizarov
06/27/2020, 8:15 AM
internalValue.takeIf { it != Int.MIN_VALUE }
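The takeIf fragment above appears to map a reserved sentinel value to null when reading a stored Int. A minimal sketch of that idea, assuming Int.MIN_VALUE is reserved as the missing marker, could look like this. SentinelIntColumn and NA are hypothetical names, and the surrounding messages are not preserved, so this is one possible reading, not a reconstruction of the original proposal.

```kotlin
// Hypothetical sketch: SentinelIntColumn and NA are illustrative names only.
class SentinelIntColumn private constructor(private val internalValues: IntArray) {

    val size: Int get() = internalValues.size

    // Reads return Int?: the sentinel is translated to null at the API boundary.
    operator fun get(index: Int): Int? = internalValues[index].takeIf { it != NA }

    companion object {
        // Assumption: Int.MIN_VALUE is reserved and can no longer be stored as a real value.
        const val NA: Int = Int.MIN_VALUE

        fun of(vararg items: Int?): SentinelIntColumn {
            require(items.none { it == NA }) { "Int.MIN_VALUE is reserved as the NA marker" }
            return SentinelIntColumn(IntArray(items.size) { i -> items[i] ?: NA })
        }
    }
}

fun main() {
    val column = SentinelIntColumn.of(7, null, -1)
    println(column[0])  // 7
    println(column[1])  // null
}
```

Compared with a separate mask array, this needs only one IntArray, at the cost of giving up one representable value.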
altavir
06/27/2020, 8:16 AM
elizarov
06/27/2020, 8:17 AM
altavir
06/27/2020, 8:17 AM
elizarov
06/27/2020, 8:20 AM
altavir
06/27/2020, 9:22 AM