Will there be any replacement for Arb.distinct and distinctBy? (kotlinlang #kotest)

# kotest

dave08

05/19/2021, 10:49 AM

Will there be any replacement for Arb.distinct and distinctBy?

sam

05/19/2021, 11:40 AM

We will add a version that requires a limit on the number of iterations to try, and then will either error or just return duplicates

sam

05/19/2021, 11:40 AM

the current version is liable to hang

dave08

05/19/2021, 11:40 AM

👍🏼

sam

05/19/2021, 11:51 AM

something like arb.distinctBy(1000)
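A minimal sketch of what an attempt-limited distinctBy could look like, written against the plain Kotlin standard library rather than kotest. The name `sampleDistinctBy` and all of its parameters are assumptions made for illustration; the thread does not specify the final signature.

```kotlin
import kotlin.random.Random

// Hypothetical helper, not the kotest API: draws `count` values whose keys are
// all different, giving up after `maxAttempts` draws in total.
fun <T, K> sampleDistinctBy(
    count: Int,
    maxAttempts: Int,
    random: Random,
    draw: (Random) -> T,
    key: (T) -> K
): List<T> {
    val seen = HashSet<K>()
    val result = ArrayList<T>(count)
    var attempts = 0
    while (result.size < count) {
        // Bounding the attempts keeps the loop from spinning forever when the
        // source cannot supply enough distinct keys.
        check(attempts < maxAttempts) {
            "found only ${result.size} of $count distinct values in $maxAttempts attempts"
        }
        attempts++
        val candidate = draw(random)
        if (seen.add(key(candidate))) result.add(candidate)
    }
    return result
}
```

Asking for 5 distinct values out of 1..10 with a limit of 1000 succeeds almost immediately, while asking for 11 fails with an IllegalStateException once the limit is reached instead of hanging.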

sam

05/19/2021, 11:56 AM

https://github.com/kotest/kotest/issues/2262

👍🏼 1

mitch

05/19/2021, 12:50 PM

@dave08 you may be able to emulate distinct sampling using flatMap i.e.

Arb.list(
  arbInput, // the input arb
  50..100 // take 50..100 randomly
).flatMap { randomSample ->
  Arb.of(randomSample.distinctBy { ... })
}

mitch

05/19/2021, 12:51 PM

Proper distinct needs a bit of thinking..

dave08

05/19/2021, 12:52 PM

Yeah, I guess that's possible, but you wouldn't necessarily get the 50..100
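To illustrate the point about not getting the full 50..100, here is a rough stand-in for the sample-then-dedupe step using only the Kotlin standard library. `sampleThenDedupe` and its parameters are hypothetical names, and the function models only the batch and the distinctBy call, not the Arb that would be built from the result.

```kotlin
import kotlin.random.Random

// Stand-in for drawing a list of 50..100 inputs and then calling distinctBy:
// draw a random-sized batch, then drop entries whose key was already seen.
fun <T, K> sampleThenDedupe(
    random: Random,
    sizeRange: IntRange,
    draw: (Random) -> T,
    key: (T) -> K
): List<T> {
    val size = sizeRange.random(random)
    // distinctBy keeps the first element for each key, so the result can be
    // much shorter than the batch that was drawn.
    return List(size) { draw(random) }.distinctBy(key)
}
```

With a source that only produces the ints 1 to 10, a batch of 50 to 100 draws can never yield more than 10 elements after deduplication.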

dave08

05/19/2021, 12:53 PM

Actually, I'm not even sure you'd get them with the current implementation?

dave08

05/19/2021, 12:53 PM

But, yeah, distinct isn't easy...

mitch

05/19/2021, 12:55 PM

Arb is a population of data of some type, so you're right, maybe 50 samples isn't enough to find the needle in the haystack population. Maybe we need 500, maybe 1000

mitch

05/19/2021, 12:55 PM

So yeah... It's a can of worms, we need to see what we can come up with.

mitch

05/19/2021, 1:10 PM

Another difficulty is that distinct is effectively a terminal operation. With a sequence, for instance, Kotlin has to keep track of every element it has already seen in order to compute distinct, and once the source has no unseen values left there is nothing more to emit. What's problematic is that an Arb is a population, so the expectation is that whenever we ask for a sample from the population, the contract says it will have one ready. For instance, if we had a population of ints 1 to 10 and then called distinct on that arb, what would the expectation be? Would it: a) produce 10 distinct elements and then stop (return null? throw?), in which case this arb is no longer a population and can't be resampled, or b) make sure we sample 1 to 10 in a uniform distribution, i.e. each number 10% of the time? That is akin to having a population of 10 numbers that is equally distributed. This behaviour would effectively weed out imbalances in the initial population, e.g. if the generator of the numbers produces a lot of 1s but not many of the other numbers.
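The two readings, (a) and (b), can be sketched side by side with plain Kotlin sequences. Everything here is an assumption for illustration: `skewedDraw` is a made-up population that favours 1, and `distinctOnce` and `distinctUniform` are hypothetical names, not kotest functions.

```kotlin
import kotlin.random.Random

// A deliberately skewed, hypothetical population over 1..10: most draws are 1.
fun skewedDraw(random: Random): Int =
    if (random.nextInt(10) < 7) 1 else random.nextInt(1, 11)

// Option (a): emit each value at most once. The sequence ends after
// `maxAttempts` draws, so it cannot be resampled like a population.
fun distinctOnce(random: Random, maxAttempts: Int): Sequence<Int> =
    generateSequence { skewedDraw(random) }.take(maxAttempts).distinct()

// Option (b): collect the distinct values first, then resample them uniformly
// forever, which flattens the skew of the original population.
fun distinctUniform(random: Random, maxAttempts: Int): Sequence<Int> {
    require(maxAttempts > 0) { "need at least one draw to build the support" }
    val support = distinctOnce(random, maxAttempts).toList()
    return generateSequence { support.random(random) }
}
```

Option (a) yields at most 10 values and then stops, while option (b) never runs dry but no longer reflects how often each value occurred in the original population.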

mitch

05/19/2021, 1:11 PM

🤯

dave08

05/19/2021, 1:23 PM

Agreed. In my case, I need to fake a Repository's contents, so I'm mostly concerned with the table's primary key (but I do have other "fields" that need to be unique...). Ints can always keep sets of ranges that are left to look in (modifying the producing arb...), but not all other types could do that...
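For Int keys, one simple variant of this idea is to hand out values from a shuffled range so that no retries are needed. This is a sketch under assumptions: `UniqueIntPool` is a hypothetical class, and it materialises the whole range in memory instead of tracking the ranges that are left, so it only suits small key spaces.

```kotlin
import kotlin.random.Random

// Hypothetical helper: hands out each Int in `range` at most once, in random
// order, so primary keys can never collide.
class UniqueIntPool(range: IntRange, random: Random) {
    private val shuffled: List<Int> = range.shuffled(random)
    private var nextIndex = 0

    // Number of keys that have not been handed out yet.
    val remaining: Int get() = shuffled.size - nextIndex

    // Returns an unused key, or null once the range is exhausted.
    fun next(): Int? = shuffled.getOrNull(nextIndex)?.also { nextIndex++ }
}
```

A pool built over 1..10 returns each of the ten keys exactly once and then null, which makes the exhaustion case explicit for the caller.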

dave08

05/19/2021, 1:26 PM

Maybe an additional distinct param on Arb.int() might be useful though, if that's a solution...

sam

05/19/2021, 1:28 PM

I made a proposal that we just push the decisions down to the user - tell us what you want to do if we can't generate n distinct values - return non-distinct values, or throw?
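A sketch of how pushing that decision to the caller might look, in plain Kotlin. The enum `OnExhausted`, the function `sampleDistinct`, and the parameter names are all hypothetical; the proposal in the thread only names the two outcomes.

```kotlin
import kotlin.random.Random

// Hypothetical policy names for the two outcomes named in the proposal.
enum class OnExhausted { THROW, ALLOW_DUPLICATES }

fun <T> sampleDistinct(
    count: Int,
    maxAttempts: Int,
    onExhausted: OnExhausted,
    random: Random,
    draw: (Random) -> T
): List<T> {
    val distinct = LinkedHashSet<T>()
    var attempts = 0
    while (distinct.size < count && attempts < maxAttempts) {
        distinct.add(draw(random))
        attempts++
    }
    if (distinct.size == count) return distinct.toList()
    return when (onExhausted) {
        OnExhausted.THROW ->
            error("only ${distinct.size} of $count distinct values after $maxAttempts attempts")
        // Pad with further draws, which may repeat values already present.
        OnExhausted.ALLOW_DUPLICATES ->
            distinct.toList() + List(count - distinct.size) { draw(random) }
    }
}
```

Asking for 15 distinct values from 1..10 then either throws or returns 15 values with repeats, depending on the policy the test author picked.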

dave08

05/19/2021, 1:29 PM

Yeah, but in my case, with unique db primary keys, it depends on how often it will throw... just having tests fail because of test framework limits might be frustrating if it happens too often...

dave08

05/19/2021, 1:30 PM

But then, sometimes it's not too common to have collisions in certain data sets, so that might be a reasonable proposition in those cases

sam

05/19/2021, 2:00 PM

Then you would up the limit, set the max attempts to 100000000 or whatever you want

sam

05/19/2021, 2:00 PM

if you're confident it can be filled
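One rough way to judge how high the limit needs to be is to estimate the expected number of draws, assuming every value in the population is equally likely (an assumption; real generators are often skewed, which makes things worse). `expectedAttempts` is a hypothetical helper, not part of kotest.

```kotlin
// Expected number of uniform draws needed to see `wanted` distinct values out
// of `populationSize` equally likely ones: the sum of N / (N - i) for
// i in 0 until wanted, where N is the population size.
fun expectedAttempts(populationSize: Int, wanted: Int): Double {
    require(wanted in 0..populationSize) {
        "cannot find $wanted distinct values among $populationSize"
    }
    return (0 until wanted).sumOf { i ->
        populationSize.toDouble() / (populationSize - i)
    }
}
```

When the wanted count is tiny compared to the population, the estimate stays close to the count itself; it grows quickly as the count approaches the population size, and the limit should sit well above the estimate because it is only an average.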
