# getting-started
c
If I want to track whether I encountered some object already with fast lookups, I'm inclined to write
`val mapOfExistence = mapOf<MyObject, Boolean>()`
but is there a way to just have
`val mapOfExistence = mapOf<MyObject>()`
which would just be my hint if it's true/false depending on whether it exists in the map?
v
You mean `listOf<MyObject>()`?
🚫 1
Or `setOf<MyObject>()`?
👆 1
y
A map of objects to booleans is practically the definition of a set. That's because a set's `contains` is a function `(MyObject) -> Boolean`, which is the equivalent of the map's `get` with a default value of `false`.
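For example (a minimal sketch, using a hypothetical `MyObject` data class just for illustration):
```kotlin
// Hypothetical type, just for illustration.
data class MyObject(val id: Int)

fun main() {
    // The Map<MyObject, Boolean> formulation:
    val seenMap = mutableMapOf<MyObject, Boolean>()
    seenMap[MyObject(1)] = true
    println(seenMap[MyObject(1)] ?: false) // true
    println(seenMap[MyObject(2)] ?: false) // false

    // A MutableSet expresses the same thing directly:
    val seenSet = mutableSetOf<MyObject>()
    seenSet.add(MyObject(1))
    println(MyObject(1) in seenSet) // true
    println(MyObject(2) in seenSet) // false
}
```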
c
wait. can I use a list?
no. I want constant time lookups... right. so a map is what I want. And cool. a set seems like it should work. lemme give it a go
j
Constant time lookup of existence calls for a set
c
thank you all. brain fog is real.
s
TLDR: you can't get constant lookup, the best you can do is O(log n) which is close enough
j
@Stephan Schröder what do you mean? Hash sets and hash maps definitely have constant time existence lookup (hash collisions aside), that's the whole point
s
@Joffrey Hash sets don't point to the element, they point to a short list containing the element. Of course you have a load factor smaller than 1, so you try to have enough lists that most of them contain at most one element, but it's pretty probable (hand-wavy math) that some collisions do occur, so your access time isn't constant. The (extremely improbable) worst case is even that all your items end up in the same list, since all their hashCodes modulo the number of lists just happen to be the same value. I haven't done the math on what the average access time is; I picked O(log n) because for me it's one level above constant access (O(1)), but thinking about it, it could be smaller, like O(log(log(n))). But it's not the same for each element, so it's not constant. Simply super small in practice.
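That worst case is easy to provoke, by the way. A minimal sketch with a hypothetical `ConstantHash` class whose `hashCode` is deliberately constant:
```kotlin
// Hypothetical class: every instance collides on purpose.
class ConstantHash(val id: Int) {
    override fun hashCode() = 42
    override fun equals(other: Any?) = other is ConstantHash && other.id == id
}

fun main() {
    val set = HashSet<ConstantHash>()
    (1..10_000).forEach { set.add(ConstantHash(it)) }
    // All 10_000 entries share one bucket, so this lookup walks that
    // bucket (a linked list, or a tree on modern JDKs) instead of
    // being a constant-time jump.
    println(ConstantHash(9_999) in set) // true, but slowly
}
```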
j
Collisions depend on the hash function of course, but in practice it's considered constant time because there are extremely few collisions with the default hash functions. Definitely not O(log(n))!
v
The essence is, he wanted something that has the same lookup time as a `HashMap`, and that is a `HashSet`. That hash code collisions worsen the access time is of course correct. 🙂
If you have a well-distributed hash code, lookup is O(1); if you have a badly-distributed (i.e. constant) hash code, lookup is O(n).
But all that is irrelevant anyway, as OP does not talk about `HashMap`, but about `mapOf()`, which happens to be implemented using `HashMap` on the JVM backend. But I don't think that is guaranteed. 😄
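If you want the hash-based behaviour guaranteed rather than relying on what `setOf()`/`mapOf()` happen to return, you can ask for it explicitly (a small sketch, reusing the hypothetical `MyObject` from earlier in the thread):
```kotlin
// Hypothetical type from the sketch earlier in the thread.
data class MyObject(val id: Int)

// hashSetOf()/hashMapOf() have HashSet/HashMap as their declared return
// types, so the hash-based lookup is explicit rather than incidental:
val seen: HashSet<MyObject> = hashSetOf()
val flags: HashMap<MyObject, Boolean> = hashMapOf()
```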
s
@Joffrey the effective hash within the collection isn't the hash of the value but the hash of the value modulo the number of slots/lists within the collection. Let's see if I can still do the math on a simple question: how big is the chance that in a hash set of capacity 8 (with a load factor of 0.76), 6 entries aren't colliding (and therefore none are put into the same slot/list)? The answer should be: 8/8 * 7/8 * 6/8 * 5/8 * 4/8 * 3/8 (so the first item has 8 slots to "choose" from, the second 7 because one slot is already filled, the third has 6...) = 7*6*5*4*3 / 8^5 = 2520/32768 ≈ 0.077. So there's only a ~7.7% chance of finding all elements alone in their slots/lists (i.e. in "direct access time"). So my only argument is that "constant access" isn't correct, but my "O(log n)" statement might well also be too pessimistic. I've left university too long ago to remember how to compute probability distributions. HashMap/Set are the best default maps/sets to choose.
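That back-of-the-envelope number is easy to sanity-check with a quick simulation (a throwaway sketch, not measuring any real `HashSet`):
```kotlin
import kotlin.random.Random

fun main() {
    // Probability that 6 entries dropped uniformly into 8 slots all
    // land in distinct slots (the "no collision" case above).
    val trials = 1_000_000
    val collisionFree = (1..trials).count {
        List(6) { Random.nextInt(8) }.toSet().size == 6
    }
    // Analytically: 8/8 * 7/8 * 6/8 * 5/8 * 4/8 * 3/8 ≈ 0.077
    println(collisionFree.toDouble() / trials)
}
```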
v
You are still assuming implementation details, which are not guaranteed. But if we talk about the standard Java `HashMap` (and thus also `HashSet`, as it is implemented using a `HashMap`), even its JavaDoc states:
> This implementation provides constant-time performance for the basic operations (`get` and `put`), assuming the hash function disperses the elements properly among the buckets.
🙂
j
> You are still assuming implementation details, which are not guaranteed

True 😄 but I'm OK to roll with HashMap/HashSet for the sake of this discussion.

> the effective hash within the collection isn't the hash of the value but the hash of the value modulo the number of slots/lists within the collection

Sure, but that doesn't mean the access cost scales with the number of elements, because the capacity is increased on demand in order to keep the size of those lists low. The only valid argument is about real hash collisions (not modulo collisions), because real hash collisions would keep lists long even through multiple capacity bumps (re-hashing wouldn't help in this case). In short, in the limit as the capacity grows, only real hash collisions remain, and modulo collisions disappear. Because for each modulo collision, we can find a capacity that's big enough to solve it and separate the elements into different buckets.

> how big is the chance that in a hash set of capacity 8 (with a load factor of 0.76) 6 entries aren't colliding

That's not what big-O notation (and "constant time access") means. O(1) access time means that the access time doesn't grow indefinitely with the number of elements, and is instead bounded by some constant. So your analysis cannot just analyze one state; it should be measuring how "long" it takes to access an element when there are 10 elements in the map, and comparing it to how long it takes when there are 100, or 1,000, or 10,000 elements. If the access time stops growing with the size, it means it's bounded by a constant. Now, with the theoretical worst hash function (returning the same hash for any value), we do get an indefinitely growing list in a single bucket, independent of resizing. And it's effectively proportional to the number of elements, as it's effectively a linear search in a list. So that does mean O(n), as @Vampire said. With a good hash function dispersing hashes evenly, we should be able to find a constant maximum list size that will never be exceeded thanks to capacity bumps.

> So my only argument is that "constant access" isn't correct, but my "O(log n)" statement might well also be too pessimistic

The thing is, it depends on the hash function. But assuming a well-distributed hash function, it is correct to say O(1) access time.
v
> ...

Actually, no 😄
/**
 * Computes key.hashCode() and spreads (XORs) higher bits of hash
 * to lower.  Because the table uses power-of-two masking, sets of
 * hashes that vary only in bits above the current mask will
 * always collide. (Among known examples are sets of Float keys
 * holding consecutive whole numbers in small tables.)  So we
 * apply a transform that spreads the impact of higher bits
 * downward. There is a tradeoff between speed, utility, and
 * quality of bit-spreading. Because many common sets of hashes
 * are already reasonably distributed (so don't benefit from
 * spreading), and because we use trees to handle large sets of
 * collisions in bins, we just XOR some shifted bits in the
 * cheapest possible way to reduce systematic lossage, as well as
 * to incorporate impact of the highest bits that would otherwise
 * never be used in index calculations because of table bounds.
 */
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
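A tiny illustration of what that spreading buys under power-of-two masking (the values here are picked arbitrarily for the example; `spread` mirrors the `hash()` shown above):
```kotlin
fun main() {
    // HashMap's bucket index is hash and (capacity - 1), which for a
    // 16-bucket table ignores everything above the low 4 bits.
    val mask = 16 - 1
    val a = 0x0000_0001
    val b = 0x7FFF_0001 // differs from a only in the high bits

    println(a and mask) // 1
    println(b and mask) // 1 -> the raw hashes would collide

    // The spread from HashMap.hash() folds the high bits back in:
    fun spread(h: Int) = h xor (h ushr 16)
    println(spread(a) and mask) // 1
    println(spread(b) and mask) // 14 -> no longer colliding
}
```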
j
"No" to what? 🙂
v
> O(1)

As the hashes are not used as-is, you can get collisions even if each instance has a different hash code and capacity is infinite. So to reach O(1) you need not only a well-distributed hash function, but also one that stays in the range 0 - (1 << 16) - 1, i.e. 0 - 65535. So as soon as you have at least 65536 elements, O(1) is automatically busted with `java.util.Hash(Map/Set)`, no matter what you do. 🙂
j
I'm not sure I entirely follow what you mean here. I mean, all the above discussion was theoretical, but in practice a hash is a 32-bit integer, so in any case we already had a limit of 2^32 elements if we don't want any collision (we can't have a different hash code for every value). With this particular implementation detail, it just means we will have collisions for sure starting at 2^16 instead of 2^32. But having collisions doesn't necessarily mean it's not O(1) anymore. It just means the constant is a bit bigger.
It could mean that it's not O(1) if the collisions followed some systematic pattern (and the distribution was thus uneven), and that's what the bit shift + XOR trick that you showed is trying to avoid. Starting from a well-distributed 32-bit hash, using only half of the bits would create systematic collisions for some common patterns that do indeed heavily rely on only differing in half of the bits, so this bit trick is a mitigation of that.
c
I'm glad I asked my question. 😂
😂 2