# stdlib
h
Hi there, why would 2 different lists return the same hash?
import java.util.Arrays

println(listOf(3, 263).hashCode())
println(listOf(5, 201).hashCode())
// -> 1317

println(Arrays.asList(3, 263).hashCode())
println(Arrays.asList(5, 201).hashCode())
// -> 40830
It's a more general problem, as the Java example suffers from exactly the same issue. Interestingly, arrays with the same data give different hashes, as documented at https://github.com/holgerbrandl/krangl/issues/75
i
Given that the `hashCode` function converts an infinite range of input values to the limited range of `Int` values, there could certainly be two lists that produce the same hash code value.
If the question is about why these two particular lists produce the same hash code, it's because for two-element lists the hash code formula is specified as `((1 * 31) + element1.hashCode()) * 31 + element2.hashCode()`, and an `Int` value's hash code is the value itself. Thus `(1 * 31 + 3) * 31 + 263` is the same as `(1 * 31 + 5) * 31 + 201` (both are 1317).
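The arithmetic above can be checked directly in a few lines (a quick sketch; the constants are just the list elements from the question):

```kotlin
fun main() {
    // Standard List.hashCode() for two elements:
    // ((1 * 31) + e1.hashCode()) * 31 + e2.hashCode()
    val h1 = (1 * 31 + 3) * 31 + 263
    val h2 = (1 * 31 + 5) * 31 + 201
    println(h1) // 1317
    println(h2) // 1317
    // Matches the actual stdlib list hash
    println(listOf(3, 263).hashCode() == h1) // true
}
```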
d
I'm impressed you have a collision example. Nvm, just saw the implementation. 😅
Arrays don't really override `hashCode`, which is why they are different.
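A quick sketch of that difference: arrays inherit the identity-based `hashCode` from `Object`, so two equal-content arrays almost certainly hash differently, whereas `contentHashCode()` applies the same element-based formula that lists use:

```kotlin
fun main() {
    val a = intArrayOf(3, 263)
    val b = intArrayOf(3, 263)
    // Identity hash: two distinct instances, almost certainly different
    println(a.hashCode() == b.hashCode()) // almost certainly false
    // Content-based hash: same formula as List.hashCode()
    println(a.contentHashCode() == b.contentHashCode()) // true
    println(a.contentHashCode() == listOf(3, 263).hashCode()) // true
}
```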
g
Looks like a library bug. Even if those lists have the same hashCode, it doesn't mean they are equal; you cannot use hashCode to identify equal objects, only different ones. For example, check the Java HashMap implementation.
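A sketch of that contract in action: the two colliding lists from the question are not equal, and a `HashMap` still keeps them apart because after the hash-based bucket lookup it compares candidate keys with `equals`:

```kotlin
fun main() {
    val k1 = listOf(3, 263)
    val k2 = listOf(5, 201)
    check(k1.hashCode() == k2.hashCode()) // hash collision
    check(k1 != k2)                       // but not equal

    // HashMap resolves the collision via equals(), so both entries survive
    val map = hashMapOf(k1 to "a", k2 to "b")
    println(map.size) // 2
    println(map[k1])  // a
    println(map[k2])  // b
}
```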
k
Indeed, the javadoc is very clear about this. For example, `return 0` is a valid `hashCode` implementation.
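To illustrate with a hypothetical class (not from the thread): a constant hash code never violates the contract "equal objects have equal hash codes", it only degrades hash-based collections to linear scans within one bucket:

```kotlin
class Point(val x: Int, val y: Int) {
    override fun equals(other: Any?) = other is Point && x == other.x && y == other.y
    // Legal but slow: every instance lands in the same bucket
    override fun hashCode() = 0
}

fun main() {
    val set = hashSetOf(Point(1, 2), Point(1, 2), Point(3, 4))
    println(set.size) // 2: equals() still distinguishes the points
}
```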
h
Thanks gentlemen for your help. I now understand that I misused hashCode when I actually wanted to assess equality. So even hashing to `Long` would give rise to an infinite number of collisions.
So it seems that for my use case in krangl, where I want to group data frames by a couple of selector columns (like in SQL), there is no way to avoid working with the complete grouping tuples. I liked the simplification of using the hashes, but as you helped me understand, this approach was just wrong.
k
If I understand your use case, that's just `groupBy` from the stdlib.
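A sketch of that idea with a made-up row type (the `Row` class and column names are assumptions for illustration, not krangl's API): `groupBy` keys each row by the complete selector tuple, so no hashing shortcut is needed, and `Pair` already implements `equals`/`hashCode` correctly for use as a map key:

```kotlin
data class Row(val city: String, val year: Int, val value: Double)

fun main() {
    val rows = listOf(
        Row("Berlin", 2020, 1.0),
        Row("Berlin", 2020, 2.0),
        Row("Munich", 2021, 3.0),
    )
    // Group by the full (city, year) tuple, like GROUP BY in SQL
    val groups = rows.groupBy { it.city to it.year }
    // Aggregate each group, e.g. sum a column
    println(groups.mapValues { (_, g) -> g.sumOf { it.value } })
    // {(Berlin, 2020)=3.0, (Munich, 2021)=3.0}
}
```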
h
Indeed @karelpeeters, `groupBy` seems to be the way forward. My original motivation to reduce the data with `hashCode` was better performance (without much thinking or benchmarking).