Will there be any replacement for Arb.distinct and...
# kotest
d
Will there be any replacement for Arb.distinct and distinctBy?
s
We will add a version that takes a limit on the number of attempts to try, and then either errors or just returns duplicates once that limit is reached
the current version is liable to hang
d
👍🏼
s
something like arb.distinctBy(1000)
m
@dave08 you may be able to emulate distinct sampling using flatMap, e.g.
Arb.list(
  arbInput, // the input arb
  50..100 // take 50..100 randomly
).flatMap { randomSample ->
  Arb.of(randomSample.distinctBy { ... })
}
Proper distinct needs a bit of thinking..
d
Yeah, I guess that's possible, but you wouldn't necessarily get the 50..100
Actually, I'm not even sure you'd get them with the current implementation?
But, yeah, distinct isn't easy...
m
An Arb is a population of data of some type, so you're right: maybe 50 samples isn't enough to find the needle in the haystack population, maybe we need 500, maybe 1000
So yeah... It's a can of worms, we need to see what we can come up with.
Another difficulty is that distinct is a stateful operation. In a sequence, for instance, Kotlin has to keep track of every element it has already seen in order to compute distinct, and a finite sequence eventually runs out of new elements. What's problematic is that an Arb is a population, so the expectation is that whenever we ask for a sample from the population, the contract says it will have one ready. For instance, if we had a population of ints 1 to 10 and then called distinct on that arb, what would be the expectation? Would it:
a) produce 10 distinct elements and then stop? Return null? Throw? At that point the arb is no longer a population and can't be resampled..
or b) make sure we sample 1 to 10 in a uniform distribution, i.e. each number 10% of the time. This is akin to having a population of 10 numbers that is equally distributed. This behaviour effectively weeds out imbalances in the initial population, e.g. if the generator of the numbers has a lot of 1s but not many of the other numbers.
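To make the two options concrete, here is a rough sketch in plain Kotlin rather than kotest code. It assumes the whole population can be enumerated as a List, which a real Arb cannot do in general, and the names distinctOnce and uniformOverDistinct are made up for illustration.

```kotlin
import kotlin.random.Random

// Option (a): distinct yields each value once; after that there is nothing left to sample.
fun <A> distinctOnce(population: List<A>): List<A> = population.distinct()

// Option (b): a resampleable source where every distinct value is equally likely,
// regardless of how often it appeared in the original (possibly skewed) population.
// Assumption: the population is small enough to enumerate up front.
fun <A> uniformOverDistinct(population: List<A>, random: Random): () -> A {
    val values = population.distinct()
    require(values.isNotEmpty()) { "population must not be empty" }
    return { values.random(random) }
}
```

With a population made of ninety-one 1s plus 2..10, distinctOnce gives the ten values exactly once, while uniformOverDistinct keeps producing samples forever with the excess of 1s flattened out.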
🤯
d
Agreed. In my case, I need to fake a Repository's contents, so I'm mostly concerned with the table's primary key (but I do have other "fields" that need to be unique...). Ints could always keep sets of ranges that are left to look in (modifying the producing arb...), but not all other types could do that...
Maybe an additional distinct param on Arb.int() might be useful though, if that's a solution...
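For the int primary key case, one possible shape of that idea, sketched in plain Kotlin and not as an existing kotest feature: shuffle the range once and take the first n values, so uniqueness holds by construction and no retries are needed. The function name distinctInts is hypothetical, and the sketch materializes the whole range, so it only suits ranges that fit in memory.

```kotlin
import kotlin.random.Random

// Draws `count` unique ints from `range` by shuffling the range and taking a prefix.
// Assumption: the range is small enough to hold in memory as a list.
fun distinctInts(range: IntRange, count: Int, random: Random): List<Int> {
    val shuffled = range.shuffled(random)
    require(count in 0..shuffled.size) {
        "cannot draw $count distinct values from a range of ${shuffled.size}"
    }
    return shuffled.take(count)
}
```

This only works because ints in a range are enumerable; for types that are not, some bounded retry approach is still needed.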
s
I made a proposal that we just push the decision down to the user: tell us what you want to do if we can't generate n distinct values, return non-distinct values or throw?
d
Yeah, but in my case, with unique db primary keys, it depends on how often it will throw... just having tests fail because of test framework limits might be frustrating if it happens too often...
But then, sometimes it's not too common to have collisions in certain data sets, so that might be a reasonable proposition in those cases
s
Then you would up the limit, set the max attempts to 100000000 or whatever you want
if you're confident it can be filled
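A minimal sketch of what that proposal could look like, written in plain Kotlin over a sampling function instead of a real Arb. Everything here (OnExhausted, sampleDistinctBy, the parameter names) is hypothetical and only illustrates the "max attempts, then throw or allow duplicates" idea from this thread; it is not the kotest API.

```kotlin
// Hypothetical policy for what to do when the attempt budget runs out.
enum class OnExhausted { THROW, ALLOW_DUPLICATES }

// Collects `count` samples whose keys are distinct, trying at most `maxAttempts` draws.
fun <A, K> sampleDistinctBy(
    count: Int,
    maxAttempts: Int,
    onExhausted: OnExhausted,
    sample: () -> A,
    selector: (A) -> K
): List<A> {
    val seen = HashSet<K>()
    val result = ArrayList<A>()
    var attempts = 0
    while (result.size < count && attempts < maxAttempts) {
        attempts++
        val candidate = sample()
        // Keep the candidate only if its key has not been seen before.
        if (seen.add(selector(candidate))) result.add(candidate)
    }
    if (result.size < count) {
        when (onExhausted) {
            OnExhausted.THROW -> throw IllegalStateException(
                "only ${result.size} of $count distinct values found in $maxAttempts attempts"
            )
            OnExhausted.ALLOW_DUPLICATES -> {
                // Fill the remainder without checking for duplicates.
                while (result.size < count) result.add(sample())
            }
        }
    }
    return result
}
```

For unique primary keys you would pick THROW with a generous maxAttempts; where a few collisions are harmless, ALLOW_DUPLICATES keeps the test from failing because of the generator.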