# coroutines
l
Hi, folks. I don't know if this is the right place to ask, sorry if it's not. Can coroutines run in parallel? After a lot of searching, I found many answers saying "no, by design", but with no explanation. How can that be, since we can run two coroutines in two different threads? And multiple coroutines can run in a single thread, right?
👌 3
j
The parallelism is controlled by the coroutine dispatcher.
j
That can definitely run in parallel. Sometimes articles show simple examples using runBlocking, and in that specific case the dispatcher is a single-threaded event loop, so there is no parallelism. But multi-threaded dispatchers are very common, for instance Dispatchers.Default or Dispatchers.IO.
👍 1
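A quick way to see the difference is to print the thread name from inside each coroutine. This is a minimal sketch, assuming kotlinx.coroutines is on the classpath; it is not from the thread itself:

import kotlinx.coroutines.*

fun main() = runBlocking {
    // These all run on the single thread that called runBlocking: concurrent, not parallel.
    repeat(3) { id ->
        launch { println("runBlocking: $id on ${Thread.currentThread().name}") }
    }

    // Dispatchers.Default is backed by a pool with roughly one thread per CPU core,
    // so these coroutines can run on different threads at the same time.
    repeat(3) { id ->
        launch(Dispatchers.Default) { println("Default: $id on ${Thread.currentThread().name}") }
    }
}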
c
> We can run two coroutines in two different threads?
Yes, in this case they run in parallel (unless you add some kind of lock yourself).
> And multiple coroutines can run in a single thread, right?
Yes, but then they cannot run in parallel: it's the thread that does the work, and a thread can only do one thing at a time. For example, these are concurrent but not parallel, because the dispatcher only has a single thread (playground):
val singleThread = newSingleThreadContext("single")
val singleThreadScope = CoroutineScope(singleThread)

repeat(3) { id ->
    singleThreadScope.launch {
        var counter = 10
        while (isActive && counter > 0) {
            println("Running in $id")
            counter--
            delay(1)
        }
    }
}
These are concurrent and parallel, because the dispatcher has multiple threads (as already pointed out above) (playground):
val multiThreadScope = CoroutineScope(Dispatchers.IO)

repeat(3) { id ->
    multiThreadScope.launch {
        var counter = 10
        while (isActive && counter > 0) {
            println("Running in $id")
            counter--
            delay(1)
        }
    }
}
👍 1
l
I was trying to do a little experiment to see how much faster parallel is than sequential, but the results are the opposite: sequential is faster than parallel. Did I do something wrong? How can this be?
import kotlinx.coroutines.*
import kotlin.system.measureTimeMillis

suspend fun main(): Unit = runBlocking {
    val numberOfTimes = 1_000_000
    val arraySize = 10

    measureTimeMillis {
        coroutineScope {
            repeat(numberOfTimes){
                launch {
                    Array(arraySize) { it.hashCode() }
                }
            }
        }
    }.apply {
        println("With corroutines: $this ms")
    }

    measureTimeMillis {
        repeat(numberOfTimes){
            Array(arraySize) { it.hashCode() }
        }
    }.apply {
        println("Without corroutines: $this ms")
    }
}
c
A few points:
• Benchmarking is hard! In particular, the JVM gets faster and faster as it runs, so benchmarks that are later in the code tend to be faster than benchmarks at the start of the code.
• Here, your first example is not parallel. You are using the dispatcher from runBlocking, which uses a single thread. Try using withContext(Dispatchers.Default) { … } instead of just coroutineScope { … } to use a different dispatcher.
• Since both your examples use a single thread, you are comparing "running everything sequentially in a single for loop" (repeat is compiled to a regular for loop) to "running everything sequentially, but each iteration has to be submitted to the dispatcher's event queue, run, and then the next one must be scheduled", which is much harder for the JVM to optimize.
Your snippet doesn't run on the Kotlin playground because it takes too much memory 😅 I don't have anything on hand to run it myself and try to understand it further
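Putting those two points together, here is a minimal sketch of what the coroutine half of the benchmark could look like on a multi-threaded dispatcher (the numbers are just the ones from the snippet above; this is an illustration, not the exact code from the thread):

import kotlinx.coroutines.*
import kotlin.system.measureTimeMillis

suspend fun main() {
    val numberOfTimes = 1_000_000
    val arraySize = 10

    val elapsed = measureTimeMillis {
        // withContext switches to a multi-threaded dispatcher and, like coroutineScope,
        // waits for all launched children before returning, so the measurement includes them.
        withContext(Dispatchers.Default) {
            repeat(numberOfTimes) {
                launch {
                    Array(arraySize) { i -> i.hashCode() }
                }
            }
        }
    }
    println("With coroutines: $elapsed ms")
}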
j
Ivan, I think you meant coroutineScope(Dispatchers.Default) -> withContext(Dispatchers.Default)
👍 1
c
Ah yes, thanks. I edited the message.
l
Ivan, I made the modifications you suggested, but the sequential approach still wins. This raises a question: is my example just not suited to this kind of optimization, or were Kotlin coroutines not built for heavy CPU processing? I've changed the example so it can be run in the playground:
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext
import kotlin.system.measureTimeMillis


suspend fun main(): Unit = withContext(Dispatchers.Default){
    val numberOfTimes = 10_000
    val arraySize = 10

    measureTimeMillis {
        repeat(numberOfTimes){
            Array(arraySize) { it.hashCode() }
        }
    }.apply {
        println("Without corroutines: $this ms")
    }

    measureTimeMillis {
        repeat(numberOfTimes){
            launch {
                Array(arraySize) { it.hashCode() }
            }
        }
    }.apply {
        println("With corroutines: $this ms")
    }
}
And thanks in advance for the interactions
j
My guess is that the workload is spending much more time allocating memory than doing CPU work, so parallel processing doesn't really help, and the additional memory allocations from the coroutines make the coroutine version take longer. Also, you need some mechanism to scope your coroutines so that they all complete within your time measurement block, e.g.:
measureTimeMillis {
    coroutineScope {
        repeat(numberOfTimes) {
            launch {
                doWork()
            }
        }
    }
}
Also note, the level of parallelism is very much dependent on the system you're running on. I have no idea how many CPU cores the Kotlin Playground executes with.
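For reference, a one-liner using the standard JVM API shows how many cores the runtime sees (and what Dispatchers.Default sizes its thread pool to by default):

// Prints the number of CPU cores visible to the JVM.
println(Runtime.getRuntime().availableProcessors())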
l
I ran it on a 4-core CPU. I also think the allocation workload is increasing the time, but doesn't that mean sequential will always win? I'm just trying to figure out what kind of CPU problems coroutines can solve.
j
Any CPU-bound workload should benefit: algorithms that are highly computational and able to utilize as many CPU cores as the system offers. At the same time, CPU-bound workloads can only benefit by as many CPU cores as are available. So on a 4-core system, you should be able to achieve close to 4x performance by running in parallel (the threading overhead will keep it somewhat below 4x). If your work is not especially CPU intensive, you'll see less benefit from parallelism. Coroutines offer some benefits over using threads directly, but you'll generally see similar performance from both approaches to parallelism, if you use a coroutine dispatcher that utilizes all available cores or that many threads directly. One of the advantages of coroutines is that they're cheap to create. If you tried that same code creating 10,000 threads, you'd likely run out of memory. But to really see the benefit of parallelism, you'd only need 4 coroutines running concurrently on the default dispatcher to max out a 4-core system.
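As a rough sketch of that last point (the helper name and parameters here are hypothetical, not from the thread): splitting the work into one chunk per core gets the full parallelism benefit without paying for a coroutine per iteration.

import kotlinx.coroutines.*

// Hypothetical helper: run `totalIterations` executions of `work`, chunked across
// one coroutine per available CPU core. For CPU-bound work, more coroutines than
// cores only adds scheduling overhead without adding parallelism.
suspend fun runChunked(totalIterations: Int, work: () -> Unit) = coroutineScope {
    val workers = Runtime.getRuntime().availableProcessors()
    repeat(workers) {
        launch(Dispatchers.Default) {
            // Give each worker its share of the iterations (remainder ignored for simplicity).
            repeat(totalIterations / workers) { work() }
        }
    }
}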
l
I got it. Thanks Jeff for the explanation
j
I put a fibonacci workload in your benchmark to demonstrate what I was describing, as a simple CPU intensive function.
import kotlinx.coroutines.*
import kotlin.time.measureTime

suspend fun main() {
    val numberOfTimes = 16
    val value = 40

    measureTime {
        repeat(numberOfTimes) {
            val time = measureTime {
                fibonacci(value)
            }
            println("$it $time")
        }
    }.apply {
        println("Without coroutines: $this")
    }

    measureTime {
        coroutineScope {
            repeat(numberOfTimes) {
                launch(Dispatchers.Default) {
                    val time = measureTime {
                        fibonacci(value)
                    }
                    println("$it $time")
                }
            }
        }
    }.apply {
        println("With coroutines: $this")
    }
}

fun fibonacci(n: Int): Long {
    return if (n <= 1) {
        n.toLong()
    } else {
        fibonacci(n - 1) + fibonacci(n - 2)
    }
}
Running on my 16-core machine, I get this output:
0 249.901979ms
1 248.530566ms
2 254.228630ms
3 249.449091ms
4 239.983553ms
5 243.419348ms
6 243.652377ms
7 249.540971ms
8 252.376248ms
9 256.178971ms
10 252.962235ms
11 245.185620ms
12 250.231187ms
13 246.561204ms
14 251.857170ms
15 247.229061ms
Without coroutines: 4.012906210s
0 276.228992ms
3 277.375637ms
15 275.671535ms
12 277.222348ms
10 278.674351ms
1 280.761071ms
2 281.079150ms
7 281.659647ms
5 281.960807ms
11 281.052791ms
14 281.033821ms
9 282.077916ms
4 282.861293ms
8 283.675079ms
6 285.752790ms
13 286.216757ms
With coroutines: 348.880909ms
The slightly faster individual execution times when running sequentially are likely due to CPU single-threaded performance boosting.
The odd thing is, I first put this code into my KMP library's tests, which I had open, and when I ran it there as a JVM test, I got these unexpected results:
0 570.481784ms
1 495.259729ms
2 498.058526ms
3 493.501416ms
4 488.237780ms
5 486.254588ms
6 495.960375ms
7 502.377977ms
8 484.046018ms
9 503.903930ms
10 489.172675ms
11 503.375202ms
12 505.227134ms
13 490.989507ms
14 486.554507ms
15 491.134007ms
Without coroutines: 8.013552646s
0 14.861984062s
14 14.867873495s
9 14.915033116s
8 14.933522893s
11 14.955004188s
7 14.974335832s
2 14.985772431s
6 14.991468946s
10 15.022878786s
5 15.025456945s
3 15.026649970s
1 15.041704843s
4 15.044281802s
13 15.042278650s
12 15.047098948s
15 15.046905120s
With coroutines: 15.084903500s
Everything is taking longer, and the more coroutines running in parallel, the longer each of them takes to complete. So now I'm baffled about what's going on in my library's JVM test environment! 😂 I don't get the same results running the same code on native targets or from the library directly. Only JVM tests. 😕
l
I got these results running on my machine as a normal JVM build:
0 1.600314200s
1 1.493324800s
2 1.520053800s
3 1.514458400s
4 1.472961100s
5 1.479469700s
6 1.507933200s
7 1.469162900s
8 1.490816500s
9 1.489173900s
10 1.482828100s
11 1.495576700s
12 1.532814300s
13 1.523432700s
14 1.555138600s
15 1.513885900s
Without coroutines: 24.196978500s
3 1.805911900s
6 1.779178100s
7 1.785112900s
2 1.947047900s
0 1.965964900s
1 2.071362300s
4 2.068797800s
5 2.060026600s
9 1.677584200s
8 1.711236500s
10 1.694684300s
11 1.647069500s
12 1.754483200s
15 1.725787300s
14 1.749260900s
13 1.761648s
With coroutines: 4.006675500s
But the results you got are a little curious.
j
Yep, that's what I'd expect. If that's your 4-core system, it's getting a 6x performance gain running in parallel, probably because each core's two hardware threads provide an additional boost to parallelism.
So it turns out Kover is the culprit for why things were running so much slower as JVM tests in my KMP library! I didn't even realize Kover was running outside of generating its reports, let alone that it had such a big performance impact, so this is quite the discovery. It turns out my JVM tests run 15% faster without Kover. There seems to be an especially large hit to performance when running many parallel coroutines, though.
l
Good to know that. How did you discover this?
j
I just removed things until the code performed as expected. I created an issue and found there's a workaround to selectively enable Kover only when running the report task.
👍 1
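For anyone hitting the same thing, one way such a workaround can look is to only apply the Kover plugin when a flag is passed. This is a hedged sketch in Gradle's Kotlin DSL, not the exact fix from the issue, and the enableKover property name is made up:

// build.gradle.kts (sketch): apply Kover only when -PenableKover is passed,
// e.g. ./gradlew koverHtmlReport -PenableKover
// Assumes the plugin is already on the build classpath (e.g. declared with `apply false`
// in the root plugins block).
if (providers.gradleProperty("enableKover").isPresent) {
    apply(plugin = "org.jetbrains.kotlinx.kover")
}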