# coroutines
l
Hi, folks. I don't know if this is the right place to ask, sorry if it's not. Can coroutines run in parallel? After a lot of searching, I found many answers saying "no, by design", but with no explanation. How can that be, since we can run two coroutines in two different threads? And multiple coroutines can run in a single thread, right?
👌 3
j
The parallelism is controlled by the coroutine dispatcher.
j
That can definitely run in parallel. Sometimes articles show simple examples using runBlocking, and in that specific case the dispatcher is a single-threaded event loop, so there is no parallelism. But multi-threaded dispatchers are very common, for instance Dispatchers.Default or Dispatchers.IO.
👍 1
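A quick way to see the difference is to print the thread name from inside each coroutine. This is a minimal sketch, assuming kotlinx.coroutines is on the classpath; it is not from the thread itself:

import kotlinx.coroutines.*

fun main() = runBlocking {
    // These all run on the single thread that called runBlocking: concurrent, not parallel.
    repeat(3) { id ->
        launch { println("runBlocking: $id on ${Thread.currentThread().name}") }
    }

    // Dispatchers.Default is backed by a pool with roughly one thread per CPU core,
    // so these coroutines can run on different threads at the same time.
    repeat(3) { id ->
        launch(Dispatchers.Default) { println("Default: $id on ${Thread.currentThread().name}") }
    }
}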
c
> We can run two coroutines in two different threads?
Yes, in this case they run in parallel (unless you add some kind of lock yourself).
> And multiple coroutines can run in a single thread, right?
Yes, but then they cannot run in parallel: it's the thread that does the work, and a thread can only do one thing at a time. For example, these are concurrent but not parallel, because the dispatcher only has a single thread (playground):
val singleThread = newSingleThreadContext("single")
val singleThreadScope = CoroutineScope(singleThread)

repeat(3) { id ->
    singleThreadScope.launch {
        var counter = 10
        while (isActive && counter > 0) {
            println("Running in $id")
            counter--
            delay(1)
        }
    }
}
These are concurrent and parallel, because the dispatcher has multiple threads (as already pointed out above) (playground):
val multiThreadScope = CoroutineScope(Dispatchers.IO)

repeat(3) { id ->
    multiThreadScope.launch {
        var counter = 10
        while (isActive && counter > 0) {
            println("Running in $id")
            counter--
            delay(1)
        }
    }
}
👍 1
l
I was trying to do a little experiment to see how much faster parallel is than sequential, but the results are the opposite: sequential is faster than parallel. Did I do something wrong? How can this be?
import kotlinx.coroutines.*
import kotlin.system.measureTimeMillis

suspend fun main(): Unit = runBlocking {
    val numberOfTimes = 1_000_000
    val arraySize = 10

    measureTimeMillis {
        coroutineScope {
            repeat(numberOfTimes){
                launch {
                    Array(arraySize) { it.hashCode() }
                }
            }
        }
    }.apply {
        println("With corroutines: $this ms")
    }

    measureTimeMillis {
        repeat(numberOfTimes){
            Array(arraySize) { it.hashCode() }
        }
    }.apply {
        println("Without corroutines: $this ms")
    }
}
c
A few points:
• Benchmarking is hard! In particular, the JVM gets faster and faster as it runs, so benchmarks that are later in the code tend to be faster than benchmarks at the start of the code.
• Here, your first example is not parallel. You are using the dispatcher from runBlocking, which uses a single thread. Try using withContext(Dispatchers.Default) { … } instead of just coroutineScope { … } to use a different dispatcher.
• Since both your examples use a single thread, you are comparing "running everything sequentially in a single for loop" (repeat is compiled to a regular for loop) to "running everything sequentially, but each iteration has to be submitted to the dispatcher's event queue, run, and then the next one must be scheduled", which is much harder for the JVM to optimize.
Your snippet doesn't run on the Kotlin playground because it takes too much memory 😅 I don't have anything on hand to run it myself and try to understand it further
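Putting those two points together, here is a minimal sketch of what the coroutine half of the benchmark could look like on a multi-threaded dispatcher (the numbers are just the ones from the snippet above; this is an illustration, not the exact code from the thread):

import kotlinx.coroutines.*
import kotlin.system.measureTimeMillis

suspend fun main() {
    val numberOfTimes = 1_000_000
    val arraySize = 10

    val elapsed = measureTimeMillis {
        // withContext switches to a multi-threaded dispatcher and, like coroutineScope,
        // waits for all launched children before returning, so the measurement includes them.
        withContext(Dispatchers.Default) {
            repeat(numberOfTimes) {
                launch {
                    Array(arraySize) { i -> i.hashCode() }
                }
            }
        }
    }
    println("With coroutines: $elapsed ms")
}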
j
Ivan, I think you meant coroutineScope(Dispatchers.Default) -> withContext(Dispatchers.Default)
👍 1
c
Ah yes, thanks. I edited the message.
l
Ivan, I made the modifications you suggested, but the sequential approach still wins. This raises a question: is my example just not suited to this kind of optimization, or were Kotlin coroutines not built for heavy CPU processing? I've changed the example so it can be run in the playground:
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext
import kotlin.system.measureTimeMillis


suspend fun main(): Unit = withContext(Dispatchers.Default){
    val numberOfTimes = 10_000
    val arraySize = 10

    measureTimeMillis {
        repeat(numberOfTimes){
            Array(arraySize) { it.hashCode() }
        }
    }.apply {
        println("Without corroutines: $this ms")
    }

    measureTimeMillis {
        repeat(numberOfTimes){
            launch {
                Array(arraySize) { it.hashCode() }
            }
        }
    }.apply {
        println("With corroutines: $this ms")
    }
}
And thanks in advance for the interactions
j
My guess is that the workload is spending much more time allocating memory than doing CPU work, so parallel processing doesn't really help, and the additional memory allocations from the coroutines make the coroutine version take longer. Also, you need some mechanism to scope your coroutines so that they all complete within your time measurement block, e.g.:
measureTimeMillis {
    coroutineScope {
        repeat(numberOfTimes) {
            launch {
                doWork()
            }
        }
    }
}
Also note, the level of parallelism is very much dependent on the system you're running on. I have no idea how many CPU cores the Kotlin Playground executes with.
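For reference, a one-liner using the standard JVM API shows how many cores the runtime sees (and what Dispatchers.Default sizes its thread pool to by default):

// Prints the number of CPU cores visible to the JVM.
println(Runtime.getRuntime().availableProcessors())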
l
I ran it on a 4-core CPU. I also think the allocation workload is increasing the time, but doesn't that mean sequential will always win? I'm just trying to figure out what kind of CPU problems coroutines can solve.
j
Any CPU-bound workload should benefit: algorithms that are highly computational and able to utilize as many CPU cores as the system offers. At the same time, CPU-bound workloads can only benefit by as many CPU cores as are available. So on a 4-core system, you should be able to achieve close to 4x performance by running in parallel (the threading overhead will keep it somewhat below 4x). If your work is not especially CPU intensive, you'll see less benefit from parallelism. Coroutines offer some benefits over using threads directly, but you'll generally see similar performance from both approaches to parallelism, if you use a coroutine dispatcher that utilizes all available cores or that many threads directly. One of the advantages of coroutines is that they're cheap to create. If you tried that same code creating 10,000 threads, you'd likely run out of memory. But to really see the benefit of parallelism, you'd only need 4 coroutines running concurrently on the default dispatcher to max out a 4-core system.
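As a rough sketch of that last point (the helper name and parameters here are hypothetical, not from the thread): splitting the work into one chunk per core gets the full parallelism benefit without paying for a coroutine per iteration.

import kotlinx.coroutines.*

// Hypothetical helper: run `totalIterations` executions of `work`, chunked across
// one coroutine per available CPU core. For CPU-bound work, more coroutines than
// cores only adds scheduling overhead without adding parallelism.
suspend fun runChunked(totalIterations: Int, work: () -> Unit) = coroutineScope {
    val workers = Runtime.getRuntime().availableProcessors()
    repeat(workers) {
        launch(Dispatchers.Default) {
            // Give each worker its share of the iterations (remainder ignored for simplicity).
            repeat(totalIterations / workers) { work() }
        }
    }
}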
l
I got it. Thanks Jeff for the explanation
j
I put a fibonacci workload in your benchmark to demonstrate what I was describing, as a simple CPU intensive function.
import kotlinx.coroutines.*
import kotlin.time.measureTime

suspend fun main() {
    val numberOfTimes = 16
    val value = 40

    measureTime {
        repeat(numberOfTimes) {
            val time = measureTime {
                fibonacci(value)
            }
            println("$it $time")
        }
    }.apply {
        println("Without coroutines: $this")
    }

    measureTime {
        coroutineScope {
            repeat(numberOfTimes) {
                launch(Dispatchers.Default) {
                    val time = measureTime {
                        fibonacci(value)
                    }
                    println("$it $time")
                }
            }
        }
    }.apply {
        println("With coroutines: $this")
    }
}

fun fibonacci(n: Int): Long {
    return if (n <= 1) {
        n.toLong()
    } else {
        fibonacci(n - 1) + fibonacci(n - 2)
    }
}
Running on my 16-core machine, I get this output:
0 249.901979ms
1 248.530566ms
2 254.228630ms
3 249.449091ms
4 239.983553ms
5 243.419348ms
6 243.652377ms
7 249.540971ms
8 252.376248ms
9 256.178971ms
10 252.962235ms
11 245.185620ms
12 250.231187ms
13 246.561204ms
14 251.857170ms
15 247.229061ms
Without coroutines: 4.012906210s
0 276.228992ms
3 277.375637ms
15 275.671535ms
12 277.222348ms
10 278.674351ms
1 280.761071ms
2 281.079150ms
7 281.659647ms
5 281.960807ms
11 281.052791ms
14 281.033821ms
9 282.077916ms
4 282.861293ms
8 283.675079ms
6 285.752790ms
13 286.216757ms
With coroutines: 348.880909ms
The slightly faster individual execution times when running sequentially are likely due to CPU single-threaded performance boosting.
The odd thing is, I first put this code into my KMP library's tests, which I had open, and when I ran it there as a JVM test, I got these unexpected results:
0 570.481784ms
1 495.259729ms
2 498.058526ms
3 493.501416ms
4 488.237780ms
5 486.254588ms
6 495.960375ms
7 502.377977ms
8 484.046018ms
9 503.903930ms
10 489.172675ms
11 503.375202ms
12 505.227134ms
13 490.989507ms
14 486.554507ms
15 491.134007ms
Without coroutines: 8.013552646s
0 14.861984062s
14 14.867873495s
9 14.915033116s
8 14.933522893s
11 14.955004188s
7 14.974335832s
2 14.985772431s
6 14.991468946s
10 15.022878786s
5 15.025456945s
3 15.026649970s
1 15.041704843s
4 15.044281802s
13 15.042278650s
12 15.047098948s
15 15.046905120s
With coroutines: 15.084903500s
Everything is taking longer, and the more coroutines running in parallel, the longer each of them takes to complete. So now I'm baffled about what's going on in my library's JVM test environment! 😂 I don't get the same results running the same code on native targets or from the library directly. Only JVM tests. 😕
l
I got these results running on my machine as a normal JVM build:
0 1.600314200s
1 1.493324800s
2 1.520053800s
3 1.514458400s
4 1.472961100s
5 1.479469700s
6 1.507933200s
7 1.469162900s
8 1.490816500s
9 1.489173900s
10 1.482828100s
11 1.495576700s
12 1.532814300s
13 1.523432700s
14 1.555138600s
15 1.513885900s
Without coroutines: 24.196978500s
3 1.805911900s
6 1.779178100s
7 1.785112900s
2 1.947047900s
0 1.965964900s
1 2.071362300s
4 2.068797800s
5 2.060026600s
9 1.677584200s
8 1.711236500s
10 1.694684300s
11 1.647069500s
12 1.754483200s
15 1.725787300s
14 1.749260900s
13 1.761648s
With coroutines: 4.006675500s
But the results you got are a little curious.
j
Yep, that's what I'd expect. If that's your 4-core system, it's getting a 6x performance gain running in parallel, probably because each core's two hardware threads provide an additional boost to parallelism.
So it turns out Kover is the culprit for why things were running so much slower as JVM tests in my KMP library! I didn't even realize Kover was running outside of generating its reports, let alone that it had such a big performance impact, so this is quite the discovery. It turns out my JVM tests run 15% faster without Kover. There seems to be an especially large hit to performance when running many parallel coroutines, though.
l
Good to know that. How did you discover this?
j
I just removed things until the code performed as expected. I created an issue and found there's a workaround to selectively enable Kover only when running the report task.
👍 1
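For anyone hitting the same thing, one way such a workaround can look is to only apply the Kover plugin when a flag is passed. This is a hedged sketch in Gradle's Kotlin DSL, not the exact fix from the issue, and the enableKover property name is made up:

// build.gradle.kts (sketch): apply Kover only when -PenableKover is passed,
// e.g. ./gradlew koverHtmlReport -PenableKover
// Assumes the plugin is already on the build classpath (e.g. declared with `apply false`
// in the root plugins block).
if (providers.gradleProperty("enableKover").isPresent) {
    apply(plugin = "org.jetbrains.kotlinx.kover")
}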