#getting-started

Colton Idle

06/27/2023, 9:48 PM
If I want to track whether I encountered some object already with fast lookups, I'm inclined to write

`val mapOfExistence = mapOf<MyObject, Boolean>()`

but is there a way to just have

`val mapOfExistence = mapOf<MyObject>()`

which would just give me true/false depending on whether it exists in the map?

Vampire

06/27/2023, 9:49 PM
You mean `listOf<MyObject>()`? 🚫 1

Or `setOf<MyObject>()`? 👆 1

Youssef Shoaib [MOD]

06/27/2023, 9:50 PM
A map of objects to booleans is practically the definition of a set. That's because a set's `contains` is a function `(MyObject) -> Boolean`, which is the equivalent of the map's `get` with a default value of false.
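The set-vs-map equivalence described above can be sketched as follows (in Java, since the thread later digs into `java.util.HashMap`; the class and helper names are made up for illustration):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SetVsMap {
    // set membership behaves like a boolean-valued map lookup defaulting to false
    static boolean inSet(Set<String> seen, String key) {
        return seen.contains(key);
    }

    static boolean inMap(Map<String, Boolean> seen, String key) {
        return seen.getOrDefault(key, false);
    }

    public static void main(String[] args) {
        Set<String> seenSet = new HashSet<>();
        Map<String, Boolean> seenMap = new HashMap<>();
        seenSet.add("a");
        seenMap.put("a", true);
        // both styles answer "have I seen this already?" identically
        System.out.println(inSet(seenSet, "a") == inMap(seenMap, "a")); // true
        System.out.println(inSet(seenSet, "b") == inMap(seenMap, "b")); // true
    }
}
```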

Colton Idle

06/27/2023, 9:53 PM
wait. can I use a list?

no. I want constant time lookups... right. so a map is what I want.
And cool. a set seems like it should work. lemme give it a go

Joffrey

06/27/2023, 9:56 PM
Constant-time lookup of existence calls for a set

Colton Idle

06/27/2023, 10:01 PM
thank you all. brain fog is real.

Stephan Schröder

06/28/2023, 7:31 AM
TLDR: you can't get constant lookup, the best you can do is O(log n), which is close enough

Joffrey

06/28/2023, 8:03 AM

Stephan Schröder

06/29/2023, 9:35 AM

Joffrey

06/29/2023, 10:11 AM
Collisions depend on the hash function of course, but in practice it's considered constant time because there are extremely few collisions with the default hash functions. Definitely not O(log(n))!

Vampire

06/29/2023, 10:11 AM
The essence is, he wanted something that has the same lookup time as a `HashMap`, and that is a `HashSet`.
That hash code collisions lower the access time is of course correct. 🙂 If you have a well-distributed hash code, lookup is `O(1)`; if you have a badly distributed (i.e. constant) hash code, lookup is `O(n)`.

But all that is irrelevant anyway, as OP does not talk about `HashMap`, but about `mapOf()`, which happens to be implemented using `HashMap` on the JVM backend. But I don't think that is guaranteed. 😄
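The constant-hash worst case mentioned here can be demonstrated with a deliberately bad key type (a hypothetical sketch; the `Key` class and the constant `42` are made up for illustration):

```java
import java.util.HashSet;
import java.util.Set;

public class ConstantHashDemo {
    // hypothetical key type with the worst possible hash: every instance collides
    static final class Key {
        final int id;
        Key(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).id == id;
        }
        @Override public int hashCode() { return 42; } // constant: all keys share one bucket
    }

    static boolean containsAll(int n) {
        Set<Key> set = new HashSet<>();
        for (int i = 0; i < n; i++) set.add(new Key(i));
        // still functionally correct, but each lookup scans the single colliding bucket: O(n)
        return set.size() == n && set.contains(new Key(n - 1));
    }

    public static void main(String[] args) {
        System.out.println(containsAll(1000)); // prints "true"
    }
}
```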

Stephan Schröder

06/29/2023, 11:19 AM

Vampire

06/29/2023, 11:57 AM
You are still assuming implementation details, which are not guaranteed.
But **if** we talk about the standard Java `HashMap` (and thus also `HashSet`, as it is implemented using a `HashMap`), even its JavaDoc states:

> This implementation provides constant-time performance for the basic operations (`get` and `put`), assuming the hash function disperses the elements properly among the buckets.

🙂

Joffrey

06/29/2023, 1:34 PM
> You are still assuming implementation details, which are not guaranteed
True 😄 but I'm ok to roll with hashmap/hashset for the sake of this discussion.

> the effective hash within the collection isn't the hash of the value but the hash of the value modulo number of slots/lists within the collection
Sure, but that doesn't mean the access cost scales with the number of elements, because the capacity is increased on demand in order to keep the size of those lists low. The only valid argument is about real hash collisions (not modulo collisions), because real hash collisions would keep lists long even through multiple capacity bumps (re-hashing wouldn't help in this case).

> how big is the chan[c]e that in a hashSet of capacity 8 (with a loadFactor of 0,76) 6 entries aren't colliding
That's not what big-O notation (and "constant time access") means. O(1) access time means that the access time doesn't grow indefinitely with the number of elements, and is instead bounded by some constant. So your analysis cannot just analyze one state; it should be measuring how "long" it takes to access an element when there are 10 elements in the map, and comparing it to how long it takes when there are 100, or 1000, or 10000 elements. If the access time stops growing with the size, it means it's bounded by a constant. Now, with the theoretical worst hash function (returning the same hash for any value), we do get an indefinitely growing list in a single bucket, independent of resizing. And it's effectively proportional to the number of elements, as it's effectively a linear search in a list. So that does mean `O(n)`.

> So my only argument is, that "constant access" isn't correct, but my "O(log n)" statement might well also be too pessimistic
The thing is, it depends on the hash function. But assuming a well-distributed hash function, it is correct to say `O(1)` access time.
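The distinction between modulo collisions (resolved by resizing) and real hash collisions can be illustrated with the bucket-index computation that `java.util.HashMap` effectively performs (a sketch; `bucket` is a made-up helper mirroring the spread-then-mask step, and the hash values are chosen for illustration):

```java
public class BucketDemo {
    // the spreading step java.util.HashMap applies before masking
    static int spread(int h) { return h ^ (h >>> 16); }

    // bucket index: spread hash masked by (power-of-two capacity - 1)
    static int bucket(int hash, int capacity) { return spread(hash) & (capacity - 1); }

    public static void main(String[] args) {
        int a = 0x5, b = 0x15; // distinct hashes differing only in bit 4
        // modulo collision at capacity 16: both land in bucket 5
        System.out.println(bucket(a, 16) == bucket(b, 16)); // true
        // after a capacity bump to 32, the extra mask bit separates them
        System.out.println(bucket(a, 32) == bucket(b, 32)); // false
    }
}
```

A real hash collision (identical 32-bit hashes) would keep colliding at every capacity, which is why only those collisions threaten the O(1) bound.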

Vampire

06/29/2023, 1:38 PM
...

Actually, no 😄

```java
/**
 * Computes key.hashCode() and spreads (XORs) higher bits of hash
 * to lower. Because the table uses power-of-two masking, sets of
 * hashes that vary only in bits above the current mask will
 * always collide. (Among known examples are sets of Float keys
 * holding consecutive whole numbers in small tables.) So we
 * apply a transform that spreads the impact of higher bits
 * downward. There is a tradeoff between speed, utility, and
 * quality of bit-spreading. Because many common sets of hashes
 * are already reasonably distributed (so don't benefit from
 * spreading), and because we use trees to handle large sets of
 * collisions in bins, we just XOR some shifted bits in the
 * cheapest possible way to reduce systematic lossage, as well as
 * to incorporate impact of the highest bits that would otherwise
 * never be used in index calculations because of table bounds.
 */
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
```

Joffrey

06/29/2023, 1:44 PM
"No" to what? 🙂

Vampire

06/29/2023, 1:50 PM
`O(1)`

As the hashes are not used as-is, you can get collisions, even if each instance has a different hash code and capacity is infinite.
So to reach `O(1)` you need not only a well-distributed hash function, but also one that stays in the range `0` - `(1 << 16) - 1`, i.e. `0` - `65535`.
So as soon as you have at least `65536` elements, `O(1)` is automatically busted with `java.util.Hash(Map/Set)`, no matter what you do. 🙂

Joffrey

06/29/2023, 2:15 PM
I'm not sure I entirely follow what you mean here. I mean, all the above discussion was theoretical, but in practice a hash is a 32-bit integer, so in any case we already had a limit of 2^32 elements if we don't want *any* collision (we can't have a different hash code for every value). With this particular implementation detail, it just means we will have collisions for sure starting at 2^16 instead of 2^32. But having collisions doesn't necessarily mean it's not `O(1)` anymore. It just means the constant is a bit bigger.

It could mean that it's not `O(1)` if the collisions followed some systematic pattern (and the distribution was thus uneven), and that's what the bit shift + XOR trick that you showed here is trying to avoid. Starting from a well-distributed 32-bit hash, using only half of the bits would create systematic collisions for some common patterns that do indeed heavily rely on only differing in half of the bits, so this bit trick is a mitigation of that.

Colton Idle

06/29/2023, 8:58 PM
I'm glad I asked my question. 😂

😂 2