# getting-started
c
If I want to track whether I encountered some object already with fast lookups, I'm inclined to write
`val mapOfExistence = mapOf<MyObject, Boolean>()`
but is there a way to just have
`val mapOfExistence = mapOf<MyObject>()`
which would just be my hint if it's true/false depending on whether it exists in the map?
v
You mean `listOf<MyObject>()`?
🚫 1
Or `setOf<MyObject>()`?
👆 1
y
A map of objects to booleans is practically the definition of a set. That's because a set's `contains` is a function `(MyObject) -> Boolean`, which is the equivalent of the map's `get` with a default value of `false`.
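For example (a minimal sketch, using a hypothetical `MyObject` data class just for illustration):
```kotlin
// Hypothetical type, just for illustration.
data class MyObject(val id: Int)

fun main() {
    // The Map<MyObject, Boolean> formulation:
    val seenMap = mutableMapOf<MyObject, Boolean>()
    seenMap[MyObject(1)] = true
    println(seenMap[MyObject(1)] ?: false) // true
    println(seenMap[MyObject(2)] ?: false) // false

    // A MutableSet expresses the same thing directly:
    val seenSet = mutableSetOf<MyObject>()
    seenSet.add(MyObject(1))
    println(MyObject(1) in seenSet) // true
    println(MyObject(2) in seenSet) // false
}
```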
c
wait. can I use a list?
no. I want constant time lookups... right. so a map is what I want. And cool. a set seems like it should work. lemme give it a go
j
Constant time lookup of existence calls for a set
c
thank you all. brain fog is real.
s
TLDR: you can't get constant lookup, the best you can do is O(log n) which is close enough
j
@Stephan Schröder what do you mean? Hash sets and hash maps definitely have constant time existence lookup (hash collisions aside), that's the whole point
s
@Joffrey Hash sets don't point to the element, they point to a short list containing the element. Of course you have a load factor smaller than 1, so you try to have enough lists that most of them contain at most one element, but it's pretty probable (hand-wavy math) that some collisions do occur, so your access time isn't constant. The (extremely improbable) worst case is even that all your items end up in the same list, since all their hashCodes modulo the number of lists just happen to be the same value. I haven't done the math on what the average access time is; I picked O(log n) because for me it's one level above constant access (O(1)), but thinking about it, it could be smaller, like O(log(log(n))). But it's not the same for each element, so it's not constant. Simply super small in practice.
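That worst case is easy to provoke, by the way. A minimal sketch with a hypothetical `ConstantHash` class whose `hashCode` is deliberately constant:
```kotlin
// Hypothetical class: every instance collides on purpose.
class ConstantHash(val id: Int) {
    override fun hashCode() = 42
    override fun equals(other: Any?) = other is ConstantHash && other.id == id
}

fun main() {
    val set = HashSet<ConstantHash>()
    (1..10_000).forEach { set.add(ConstantHash(it)) }
    // All 10_000 entries share one bucket, so this lookup walks that
    // bucket (a linked list, or a tree on modern JDKs) instead of
    // being a constant-time jump.
    println(ConstantHash(9_999) in set) // true, but slowly
}
```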
j
Collisions depend on the hash function of course, but in practice it's considered constant time because there are extremely few collisions with the default hash functions. Definitely not O(log(n))!
v
The essence is, he wanted something that has the same lookup time as a `HashMap`, and that is a `HashSet`. That hash code collisions worsen the access time is of course correct. 🙂
If you have a well-distributed hash code, lookup is O(1); if you have a badly-distributed (i.e. constant) hash code, lookup is O(n).
But all that is irrelevant anyway, as OP does not talk about `HashMap`, but about `mapOf()`, which happens to be implemented using `HashMap` on the JVM backend. But I don't think that is guaranteed. 😄
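If you want the hash-based behaviour guaranteed rather than relying on what `setOf()`/`mapOf()` happen to return, you can ask for it explicitly (a small sketch, reusing the hypothetical `MyObject` from earlier in the thread):
```kotlin
// Hypothetical type from the sketch earlier in the thread.
data class MyObject(val id: Int)

// hashSetOf()/hashMapOf() have HashSet/HashMap as their declared return
// types, so the hash-based lookup is explicit rather than incidental:
val seen: HashSet<MyObject> = hashSetOf()
val flags: HashMap<MyObject, Boolean> = hashMapOf()
```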
s
@Joffrey the effective hash within the collection isn't the hash of the value but the hash of the value modulo the number of slots/lists within the collection. Let's see if I can still do the math on a simple question: how big is the chance that in a hash set of capacity 8 (with a load factor of 0.76), 6 entries aren't colliding (and therefore none are put into the same slot/list)? The answer should be: 8/8 * 7/8 * 6/8 * 5/8 * 4/8 * 3/8 (so the first item has 8 slots to "choose" from, the second 7 because one slot is already filled, the third has 6...) = 7*6*5*4*3 / 8^5 = 2520/32768 ≈ 0.077. So there's only a ~7.7% chance of finding all elements alone in their slots/lists (i.e. in "direct access time"). So my only argument is that "constant access" isn't correct, but my "O(log n)" statement might well also be too pessimistic. I've left university too long ago to remember how to compute probability distributions. HashMap/Set are the best default maps/sets to choose.
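That back-of-the-envelope number is easy to sanity-check with a quick simulation (a throwaway sketch, not measuring any real `HashSet`):
```kotlin
import kotlin.random.Random

fun main() {
    // Probability that 6 entries dropped uniformly into 8 slots all
    // land in distinct slots (the "no collision" case above).
    val trials = 1_000_000
    val collisionFree = (1..trials).count {
        List(6) { Random.nextInt(8) }.toSet().size == 6
    }
    // Analytically: 8/8 * 7/8 * 6/8 * 5/8 * 4/8 * 3/8 ≈ 0.077
    println(collisionFree.toDouble() / trials)
}
```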
v
You are still assuming implementation details, which are not guaranteed. But if we talk about the standard Java `HashMap` (and thus also `HashSet`, as it is implemented using a `HashMap`), even its JavaDoc states:
> This implementation provides constant-time performance for the basic operations (`get` and `put`), assuming the hash function disperses the elements properly among the buckets.
🙂
j
> You are still assuming implementation details, which are not guaranteed

True 😄 but I'm OK to roll with HashMap/HashSet for the sake of this discussion.

> the effective hash within the collection isn't the hash of the value but the hash of the value modulo the number of slots/lists within the collection

Sure, but that doesn't mean the access cost scales with the number of elements, because the capacity is increased on demand in order to keep the size of those lists low. The only valid argument is about real hash collisions (not modulo collisions), because real hash collisions would keep lists long even through multiple capacity bumps (re-hashing wouldn't help in this case). In short, in the limit as the capacity grows, only real hash collisions remain, and modulo collisions disappear. Because for each modulo collision, we can find a capacity that's big enough to solve it and separate the elements into different buckets.

> how big is the chance that in a hash set of capacity 8 (with a load factor of 0.76) 6 entries aren't colliding

That's not what big-O notation (and "constant time access") means. O(1) access time means that the access time doesn't grow indefinitely with the number of elements, and is instead bounded by some constant. So your analysis cannot just analyze one state; it should be measuring how "long" it takes to access an element when there are 10 elements in the map, and comparing it to how long it takes when there are 100, or 1,000, or 10,000 elements. If the access time stops growing with the size, it means it's bounded by a constant. Now, with the theoretical worst hash function (returning the same hash for any value), we do get an indefinitely growing list in a single bucket, independent of resizing. And it's effectively proportional to the number of elements, as it's effectively a linear search in a list. So that does mean O(n), as @Vampire said. With a good hash function dispersing hashes evenly, we should be able to find a constant maximum list size that will never be exceeded thanks to capacity bumps.

> So my only argument is that "constant access" isn't correct, but my "O(log n)" statement might well also be too pessimistic

The thing is, it depends on the hash function. But assuming a well-distributed hash function, it is correct to say O(1) access time.
v
> ...

Actually, no 😄
/**
 * Computes key.hashCode() and spreads (XORs) higher bits of hash
 * to lower.  Because the table uses power-of-two masking, sets of
 * hashes that vary only in bits above the current mask will
 * always collide. (Among known examples are sets of Float keys
 * holding consecutive whole numbers in small tables.)  So we
 * apply a transform that spreads the impact of higher bits
 * downward. There is a tradeoff between speed, utility, and
 * quality of bit-spreading. Because many common sets of hashes
 * are already reasonably distributed (so don't benefit from
 * spreading), and because we use trees to handle large sets of
 * collisions in bins, we just XOR some shifted bits in the
 * cheapest possible way to reduce systematic lossage, as well as
 * to incorporate impact of the highest bits that would otherwise
 * never be used in index calculations because of table bounds.
 */
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
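A tiny illustration of what that spreading buys under power-of-two masking (the values here are picked arbitrarily for the example; `spread` mirrors the `hash()` shown above):
```kotlin
fun main() {
    // HashMap's bucket index is hash and (capacity - 1), which for a
    // 16-bucket table ignores everything above the low 4 bits.
    val mask = 16 - 1
    val a = 0x0000_0001
    val b = 0x7FFF_0001 // differs from a only in the high bits

    println(a and mask) // 1
    println(b and mask) // 1 -> the raw hashes would collide

    // The spread from HashMap.hash() folds the high bits back in:
    fun spread(h: Int) = h xor (h ushr 16)
    println(spread(a) and mask) // 1
    println(spread(b) and mask) // 14 -> no longer colliding
}
```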
j
"No" to what? 🙂
v
> O(1)

As the hashes are not used as-is, you can get collisions even if each instance has a different hash code and capacity is infinite. So to reach O(1) you need not only a well-distributed hash function, but also one that stays in the range 0 - (1 << 16) - 1, i.e. 0 - 65535. So as soon as you have at least 65536 elements, O(1) is automatically busted with `java.util.Hash(Map/Set)`, no matter what you do. 🙂
j
I'm not sure I entirely follow what you mean here. I mean, all the above discussion was theoretical, but in practice a hash is a 32-bit integer, so in any case we already had a limit of 2^32 elements if we don't want any collision (we can't have a different hash code for every value). With this particular implementation detail, it just means we will have collisions for sure starting at 2^16 instead of 2^32. But having collisions doesn't necessarily mean it's not O(1) anymore. It just means the constant is a bit bigger.
It could mean that it's not O(1) if the collisions followed some systematic pattern (and the distribution was thus uneven), and that's what the bit shift + XOR trick that you showed is trying to avoid. Starting from a well-distributed 32-bit hash, using only half of the bits would create systematic collisions for some common patterns that do indeed heavily rely on only differing in half of the bits, so this bit trick is a mitigation of that.
c
I'm glad I asked my question. 😂
😂 2