# coroutines
Marcin:
Hey, I am testing the difference in performance between different dispatchers. I got the same results before, but now I have confirmed it with a benchmark: a CPU-intensive operation is faster with more threads than on Dispatchers.Default (it is the fastest on IO, and on an executor with 100 threads it is faster than on Default). My only explanation is that those threads are fighting with other threads on my computer, like IntelliJ, the browser etc. Is it so, or is there any other explanation?
```kotlin
import kotlinx.coroutines.*
import org.openjdk.jmh.annotations.*
import org.openjdk.jmh.infra.Blackhole
import java.util.concurrent.Executors

@State(Scope.Benchmark)
open class KotlinBenchmark {

    private val orders: List<Order> = List(100) { Order("Customer$it") }
    private val singleThread = Executors.newSingleThreadExecutor().asCoroutineDispatcher()
    private val e100Threads = Executors.newFixedThreadPool(100).asCoroutineDispatcher()

    @Benchmark
    fun defaultCpu1(bh: Blackhole) = runBlocking {
        bh.consume(makeCoffee(orders, Dispatchers.Default, ::cpu1))
    }

    @Benchmark
    fun defaultCpu2(bh: Blackhole) = runBlocking {
        bh.consume(makeCoffee(orders, Dispatchers.Default, ::cpu2))
    }

    @Benchmark
    fun defaultBlocking(bh: Blackhole) = runBlocking {
        bh.consume(makeCoffee(orders, Dispatchers.Default, ::blocking))
    }

    @Benchmark
    fun defaultSuspending(bh: Blackhole) = runBlocking {
        bh.consume(makeCoffee(orders, Dispatchers.Default, ::suspending))
    }

    @Benchmark
    fun e100ThreadsCpu1(bh: Blackhole) = runBlocking {
        bh.consume(makeCoffee(orders, e100Threads, ::cpu1))
    }

    @Benchmark
    fun e100ThreadsCpu2(bh: Blackhole) = runBlocking {
        bh.consume(makeCoffee(orders, e100Threads, ::cpu2))
    }

    @Benchmark
    fun e100ThreadsBlocking(bh: Blackhole) = runBlocking {
        bh.consume(makeCoffee(orders, e100Threads, ::blocking))
    }

    @Benchmark
    fun e100ThreadsSuspending(bh: Blackhole) = runBlocking {
        bh.consume(makeCoffee(orders, e100Threads, ::suspending))
    }

    @Benchmark
    fun singleThreadCpu1(bh: Blackhole) = runBlocking {
        bh.consume(makeCoffee(orders, singleThread, ::cpu1))
    }

    @Benchmark
    fun singleThreadCpu2(bh: Blackhole) = runBlocking {
        bh.consume(makeCoffee(orders, singleThread, ::cpu2))
    }

    @Benchmark
    fun singleThreadBlocking(bh: Blackhole) = runBlocking {
        bh.consume(makeCoffee(orders, singleThread, ::blocking))
    }

    @Benchmark
    fun singleThreadSuspending(bh: Blackhole) = runBlocking {
        bh.consume(makeCoffee(orders, singleThread, ::suspending))
    }

    @Benchmark
    fun ioCpu1(bh: Blackhole) = runBlocking {
        bh.consume(makeCoffee(orders, Dispatchers.IO, ::cpu1))
    }

    @Benchmark
    fun ioCpu2(bh: Blackhole) = runBlocking {
        bh.consume(makeCoffee(orders, Dispatchers.IO, ::cpu2))
    }

    @Benchmark
    fun ioBlocking(bh: Blackhole) = runBlocking {
        bh.consume(makeCoffee(orders, Dispatchers.IO, ::blocking))
    }

    @Benchmark
    fun ioSuspending(bh: Blackhole) = runBlocking {
        bh.consume(makeCoffee(orders, Dispatchers.IO, ::suspending))
    }
}

class Order(val customer: String)
class Coffee(val order: Order)

suspend fun makeCoffee(
    orders: List<Order>,
    dispatcher: CoroutineDispatcher,
    makeCoffee: suspend (Order) -> Coffee
) = withContext(dispatcher) {
    orders.map { async { makeCoffee(it) } }
        .map { it.await() } // await() returns the Coffee; join() would discard it
}

fun cpu1(order: Order): Coffee {
    val size = 350 // ~0.1 second on my MacBook
    val list = List(size) { it }
    val listOfLists = List(size) { list }
    val listOfListsOfLists = List(size) { listOfLists }
    listOfListsOfLists.hashCode()
    return Coffee(order)
}

fun cpu2(order: Order): Coffee {
    val size = 820 // ~1 second on my MacBook
    val list = List(size) { it }
    val listOfLists = List(size) { list }
    val listOfListsOfLists = List(size) { listOfLists }
    listOfListsOfLists.hashCode()
    return Coffee(order)
}

fun blocking(order: Order): Coffee {
    Thread.sleep(1000)
    return Coffee(order)
}

suspend fun suspending(order: Order): Coffee {
    delay(1000)
    return Coffee(order)
}
```

```
Benchmark                                Mode  Cnt         Score        Error  Units
KotlinBenchmark.defaultBlocking         thrpt    5         0.077 ±      0.001  ops/s
KotlinBenchmark.defaultCpu1             thrpt    5         0.157 ±      0.100  ops/s
KotlinBenchmark.defaultCpu2             thrpt    5         0.009 ±      0.002  ops/s
KotlinBenchmark.defaultSuspending       thrpt    5         0.998 ±      0.002  ops/s
KotlinBenchmark.e100ThreadsBlocking     thrpt    5         0.997 ±      0.003  ops/s
KotlinBenchmark.e100ThreadsCpu1         thrpt    5         0.173 ±      0.024  ops/s
KotlinBenchmark.e100ThreadsCpu2         thrpt    5         0.012 ±      0.002  ops/s
KotlinBenchmark.e100ThreadsSuspending   thrpt    5         0.998 ±      0.004  ops/s
KotlinBenchmark.ioBlocking              thrpt    5         0.499 ±      0.001  ops/s
KotlinBenchmark.ioCpu1                  thrpt    5         0.185 ±      0.038  ops/s
KotlinBenchmark.ioCpu2                  thrpt    5         0.012 ±      0.006  ops/s
KotlinBenchmark.ioSuspending            thrpt    5         0.998 ±      0.003  ops/s
KotlinBenchmark.singleThreadBlocking    thrpt    5         0.010 ±      0.001  ops/s
KotlinBenchmark.singleThreadCpu1        thrpt    5         0.067 ±      0.006  ops/s
KotlinBenchmark.singleThreadCpu2        thrpt    5         0.005 ±      0.001  ops/s
KotlinBenchmark.singleThreadSuspending  thrpt    5         0.998 ±      0.005  ops/s
SampleBenchmark.fibClassic              thrpt    5       417.312 ±      4.514  ops/s
SampleBenchmark.fibTailRec              thrpt    5  51156850.491 ± 556032.508  ops/s
```
DALDEI:
I'm trying to understand what these functions are doing -- I don't see any "CPU intensive" functions. makeCoffee seems like it does a few list creations -- what am I missing? From my view it looks like these are mostly executing in memory management code and within coroutine framework code, not 'CPU intensive' (except the fib -- not shown). Memory management code is very difficult to profile due to the wide variations in performance as memory gets fragmented, plus add to that the implicit GC. I would suggest: A) to test CPU-intensive code, do CPU-intensive, not memory-intensive, operations; B) profile the tests before running a full benchmark to verify that the majority of the time is spent in the code you want it to be, not 'infrastructure'.
☝️ 2
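Following suggestion A, a CPU-bound task could avoid allocation entirely and do pure arithmetic. A hypothetical sketch (the function name and iteration count are made up and would need calibrating per machine):

```kotlin
// Hypothetical CPU-bound workload: pure arithmetic, no allocation, so the
// benchmark measures computation rather than memory management and GC.
fun cpuBound(order: Order): Coffee {
    var acc = 0L
    for (i in 1L..50_000_000L) {
        acc += i xor (acc shl 1) // loop-carried dependency resists dead-code elimination
    }
    if (acc == 42L) println(acc) // use the result so the JIT cannot drop the loop
    return Coffee(order)
}
```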
Erik:
Also, but this might differ per platform or environment config (e.g. JVM properties, or maybe even the hardware): don't the IO and Default dispatchers share the same thread pool until that thread pool runs out of threads? If I'm not mistaken, this is done to make switching between the IO and Default context fast: if a thread is available, then no thread switch is needed and the coroutine can continue on the same thread.
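This shared-pool behavior is easy to observe. A minimal sketch (assuming kotlinx.coroutines 1.6+, where Dispatchers.IO is documented to share threads with Dispatchers.Default):

```kotlin
import kotlinx.coroutines.*

fun main() = runBlocking {
    withContext(Dispatchers.Default) {
        println("Default: ${Thread.currentThread().name}")
        withContext(Dispatchers.IO) {
            // Typically prints the same DefaultDispatcher-worker-N name:
            // no thread switch is needed because both dispatchers are
            // views over one shared scheduler.
            println("IO:      ${Thread.currentThread().name}")
        }
    }
}
```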
So try to limit the number of threads in the pool and go over the limit of parallelism with your benchmark, so that the coroutines infrastructure must either spawn new threads or switch between threads in the pool. That is where the efficiency differences between Default and IO should show: IO is made for work that might suspend for a long time without doing work (e.g. idle waiting for a network / disk response), while Default should be able to efficiently do (parallel/concurrent) operations.
And agreed: allocating lists (of lists (of lists)) isn't very CPU intensive, right? Instead, have the CPU-intensive tasks actually calculate something. IDK, maybe:
```kotlin
var i = Int.MAX_VALUE // ~25 seconds on my Macbook
while (i > 0) i -= 1
```
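For the thread-limiting suggestion above, a hypothetical setup might cap parallelism well below the number of coroutines (limitedParallelism is available since kotlinx.coroutines 1.6, experimental in early versions; the numbers here are arbitrary):

```kotlin
import kotlinx.coroutines.*

// 100 coroutines compete for at most 2 worker threads, forcing the
// dispatcher to multiplex and switch between threads in the pool.
val twoThreads = Dispatchers.Default.limitedParallelism(2)

fun main() = runBlocking {
    List(100) {
        launch(twoThreads) {
            Thread.sleep(10) // blocking work exaggerates the contention
        }
    }.joinAll()
}
```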
Marcin:
Thanks @DALDEI @Erik, I updated the benchmarks. I renamed the previous case to memory-intensive and introduced better CPU-intensive functions. Here is the current code: https://github.com/MarcinMoskala/coroutines-benchmarks/blob/master/src/jmh/java/me/champeau/jmh/KotlinBenchmark.kt I needed to update them, because
```kotlin
var i = Int.MAX_VALUE // ~25 seconds on my Macbook
while (i > 0) i -= 1
```
was optimized away at a low level. The second function checks different numbers for whether they are prime or not. The results are now much more as expected: https://kt.academy/article/cc-dispatchers#performance-of-dispatchers-against-different-tasks Does it look good now?
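The prime check presumably looks something along these lines -- a sketch in the spirit of the description, not the exact code from the repository:

```kotlin
// Trial division: branchy integer work that the JIT cannot optimize away
// as easily as the decrementing loop above.
fun isPrime(n: Int): Boolean {
    if (n < 2) return false
    var i = 2
    while (i * i <= n) {
        if (n % i == 0) return false
        i++
    }
    return true
}

fun cpuIntensive(order: Order): Coffee {
    var count = 0
    for (n in 2 until 200_000) { // bound is arbitrary; tune per machine
        if (isPrime(n)) count++
    }
    require(count > 0) // use the result so the loop is not eliminated
    return Coffee(order)
}
```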
DALDEI:
The one odd measurement is CPU1 rows 2 and 3 vs. CPU2. Looking at the code, I would not expect them to scale differently (I'd expect both to behave like CPU1 for some high number of threads). The most obvious reason for the times to go UP with more threads is that it takes time to create threads, as well as losing good memory locality wrt CPU caches. The 'ideal' pure-CPU-load sweet spot is 1 CPU core per parallel function. It rarely turns out to be that in practice -- but it's a good ballpark. You can see that from 8 to 64 threads there was no benefit. Run it with 100 threads (or, if you did this at bigger scale, 1000 or 10000 threads) and it will slow everything down. That is happening with CPU1 and CPU2 -- but why is CPU1 settling out equal to CPU2? Some weirdness with how the OS scheduler, along with the coroutine dispatcher, is finding a way to be more efficient with CPU2 at 100 threads than CPU1. My guess is that if you tweak CPU1 by various means -- such as requiring more memory/variables in the inner loop -- that will change dramatically. It may be a consequence of GC as well, hitting a 'magic time' to do stuff. All in all, this is what I would expect to see.
Another observation -- at 100 threads you are not seeing the vast performance gains often cited for coroutines over threads. This is as I would expect: you don't see that gain until a much higher number of threads, or coroutines that do much less per function. All these tests do no context switching at all in the loops, so you're only counting 1000 total context switches -- which is negligible for threads or coroutines. Do a million -- you will see a difference (IF you have enough RAM for a million threads).
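For scale, the classic demonstration of that difference -- a minimal sketch: a million coroutines that each suspend for a second finish comfortably, while a million platform threads would exhaust memory on most machines:

```kotlin
import kotlinx.coroutines.*

fun main() = runBlocking {
    // A million suspended coroutines are cheap, because a suspended
    // coroutine holds no thread; the thread-per-task equivalent would
    // need gigabytes of stack space and usually fails with OutOfMemoryError.
    List(1_000_000) {
        launch { delay(1000) }
    }.joinAll()
}
```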