# announcements
n
Hi everyone. Is it possible to sum average values of several Int lists in more beautiful way, to reduce code duplication?
val red = data.red.map { it.toUByte().toInt() }.average()
val green = data.green.map { it.toUByte().toInt() }.average()
val blue = data.blue.map { it.toUByte().toInt() }.average()
val luminance = red + green + blue
j
Haven't tested but something like this perhaps?
val luminance = listOf(data.red, data.green, data.blue).sumOf { cs ->
    cs.map { 
        it.toUByte().toInt()
    }.average()
}
m
Or how about this:
val luminance = with(data) {
    sequenceOf(red, green, blue)
        .map { it.map(SomeClass::toUByte).map(UByte::toInt) }
        .map { it.average() }
        .sum()
}
New to Kotlin myself … didn’t know the sumOf function. Nice!
🍻 1
n
Thanks to both of you!
n
one thing to note here: if your data is very large, all 3 approaches will create new lists the same size as the original pixel data
👍 1
n
Is it possible to avoid it?
n
You can avoid this in e.g. Jonathan's approach by changing
cs.map to cs.asSequence().map
now the map becomes lazy, and when you take the average it should just do it via accumulation and never store a new list (assuming a sane approach in the stdlib)
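An untested sketch of what that lazy variant might look like (averageLuminance is just a name I made up, and I'm assuming the channels are List<Int> holding raw byte values, as the original .toUByte() call suggests):

```kotlin
// asSequence() makes the map lazy, so no intermediate list the size
// of the pixel data is allocated; average() accumulates as it goes.
fun averageLuminance(red: List<Int>, green: List<Int>, blue: List<Int>): Double =
    listOf(red, green, blue).sumOf { cs ->
        cs.asSequence()
            .map { it.toUByte().toInt() }  // reinterpret signed byte values as 0..255
            .average()
    }
```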
n
Wow, sounds great, thanks!
n
np
m
@Nir unfortunately sequences don’t support average calculations, do they?
Yes, they do. Neat! 🙂
It would be nice to have an averageOf, but I suppose we cannot have all the agg operations built in 🙂
Something like
listOf("a", "xx").averageOf { it.length }
Makes sense as an extension function, to hide the asSequence call.
n
Yes, that would be a nice addition
thankfully it's easy to write on our own
extension functions are pretty fantastic 🙂
m
Yes. This is how it would look, following the sumOf implementation. Kind of messy, but optimal (inlined) performance.
inline fun <T> Iterable<T>.averageOf(selector: (T) -> Int): Double {
    var sum: Long = 0L
    var count = 0
    for (element in this) {
        sum += selector(element)
        count++
    }
    return sum.toDouble() / count
}

listOf("a", "xx").averageOf { it.length }
n
@Michael Böiers there are actually a bunch of issues with that, FYI
sorry, not a bunch, just one 🙂
you can have overflow issues very easily
You most likely want to use an algorithm where you continuously update the average
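For illustration, a running-mean update might look like this (my own sketch, not stdlib code):

```kotlin
// Welford-style running mean: instead of accumulating a raw sum,
// keep the current estimate and nudge it toward each new value.
// The estimate stays within the range of the data, so it cannot
// overflow the way a raw Long/Int sum can.
fun Iterable<Int>.runningAverage(): Double {
    var mean = 0.0
    var count = 0
    for (x in this) {
        count++
        mean += (x - mean) / count
    }
    require(count > 0) { "average of an empty collection is undefined" }
    return mean
}
```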
m
The obvious issue is dealing with an empty iterable 🙂
Hm … would a sum of ints ever overflow a long?
n
the empty iterable is a fair point, but presumably it could be like the reduce (I think? reduce, not fold) family algorithms where it's just an error to use with an empty container
m
Here’s the official average implementation, I think you had a fair point regarding the overflow. 🙂
n
it's possible that they would not I suppose, I'd have to think about it, but even then, it would only work for the very specific case of Int
there might still be other issues there
that's actually surprisingly naive
m
How so? I don’t think the average can be calculated without storing sum and count.
n
yeah, you use an update algorithm
you basically have an estimate of the mean that you update. And even better than that, there are algorithms like this that further try to compensate for numeric issues: https://en.wikipedia.org/wiki/Kahan_summation_algorithm
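A minimal Kahan-summation sketch in Kotlin (my own, just to show the idea from that article):

```kotlin
// Kahan (compensated) summation: carry the rounding error of each
// addition in a separate compensation term and feed it back in, so
// small addends are not swallowed by a large running sum.
fun kahanSum(values: DoubleArray): Double {
    var sum = 0.0
    var c = 0.0            // running compensation for lost low-order bits
    for (v in values) {
        val y = v - c      // apply the compensation
        val t = sum + y    // big + small: low-order bits of y may be lost
        c = (t - sum) - y  // algebraically 0; recovers what was rounded away
        sum = t
    }
    return sum
}
```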
j
Not sure I’d expect a standard average function to do something fancy at a cost in accuracy, though. 🤔 Don’t think the jdk IntStream implementation uses update? I might be wrong.
n
it's the other way around, it will be more accurate
j
You mean in the overflow case?
n
there are all kinds of cases
not just overflow
there's a whole body of prior art on these things, and usually the most naive approach is not used
j
Even for Ints? I wouldn't expect an issue in most cases (except for overflow). Regardless, I agree that the current implementation seems a bit on the sparse side. The one for Double is exactly the same... Also, shouldn't sum overflow way before count?
Edit: right, sum is Double in the current implementation
(JDK IntPipeline doesn't seem to use compensation, DoublePipeline however does for sum. No overflow checks there at all)
🤷‍♂️ Interesting nonetheless. Kahan summation added to reading list. 👍
👍 1
n
yeah, it's quite the rabbit hole
you can get into the weeds, but I guess one thing that's fairly clear: if you simply feed in the same number over and over and over, you'd want to get that number back as the average, no matter how long the list is. That seems like a reasonable requirement (to me)
and just adding things up and dividing can fail that very easily
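To make that concrete, a quick sketch (my own illustration, function names made up):

```kotlin
// Naively summing the same Double over and over accumulates rounding
// error, so the naive average of N copies of 0.1 drifts away from 0.1.
fun naiveAverage(value: Double, n: Int): Double {
    var sum = 0.0
    repeat(n) { sum += value }
    return sum / n
}

// A running-mean update returns 0.1 exactly: once the estimate equals
// the data, (value - mean) is exactly zero and the mean never moves.
fun runningMean(value: Double, n: Int): Double {
    var mean = 0.0
    for (i in 1..n) mean += (value - mean) / i
    return mean
}
```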
m
I’m not sure. I hear what you’re saying, but another way to look at this is: “I would want a function like average() to do the naive computation that everybody knows, which is summing all the values and dividing by the count. Then we could have other functions that are named by algorithm, so that clients know what they can expect.”
j
I think the point that Nir made is that if you are summing up Doubles the way the stdlib is doing you will end up with unexpected (read imprecise) results for larger data sets. The java stdlib uses long when summing Ints, and compensates for imprecision when summing up Doubles.
In other words: the naive approach doesn't work for floats.
Or at least not very well. 🙂
n
Indeed. I think that a function called average should be a "reasonable" implementation of computing the average. That is, it should do a pretty good job, and perform pretty well. Looking at the prior art, you can hugely improve results at minor performance cost. So I don't see an intrinsic argument for the naive approach.
It's a bit like arguing that sort should be bubble sort, or insertion sort. These are, after all, the simplest algorithms.
In practice it's pretty much the opposite, most standard library sorts tend to use more complex algorithms than any of the basic ones, and/or hybrid algorithms.
They don't use the absolute craziest algorithms that exist, perhaps because of implementation complexity, and because you start to enter a regime where there are major trade-offs between approaches. But they do use reasonably sophisticated approaches.