I think I found a bug in coroutines but I m not sure if it i kotlinlang #coroutines

I think I found a bug in coroutines... but I'm not...

MrPowerGamerBR

06/16/2019, 8:51 PM

I think I found a bug in coroutines... but I'm not sure if it is really a bug or a mistake I made: After the JVM is up for +- 3 days,

delay(...)

stops working, no matter what dispatcher I use (

Dispatchers.Default

<http://Dispatchers.IO|Dispatchers.IO>

, custom dispatcher from a custom executor, etc). It just hangs on the

delay

call and never continues. The dispatcher/thread pool isn't overloaded, because the application is still alive and processing every request without any delay... except when there is a

delay

call somewhere, then the code hangs) Restarting the JVM magically fixes the issue, until after a few days it stops working again. Using JVM 12, Kotlin 1.3.21 + Coroutines 1.2.1 (Yes, I know, it is outdated, I didn't update yet because I couldn't get Kotlin's script engine working on 1.3.31... if reeeally needed I can update, but I decided to ask here before I do it to figure out if anyone knows what could cause this issue (or if it was already fixed))

😱 3

🍿 2

MrPowerGamerBR

06/16/2019, 9:07 PM

Here's a GIF showing the issue, the code is evaluated, executed buuut it gets stuck on the

delay(...)

, no matter what dispatcher I use.

https://cdn.discordapp.com/attachments/393332226881880074/589923294916902961/SayRQ1OGN5.gif▾

Of course, maybe it is a issue on my end... but I can't figure out what would cause this issue because restarting the application fixes the issue (until it breaks after a few days...). However sometimes (very rarely) the tasks works fine, they execute, are delayed and then they exit... but sometimes they get stuck.

MrPowerGamerBR

06/16/2019, 11:27 PM

What looks that it happens is that after a veeery long time (way more than the specified delay), the coroutine does continue. Here I executed the code:

https://cdn.discordapp.com/attachments/392476688614948897/589958622000185549/unknown.png▾

And after almost one hour later... the code was executed.

https://cdn.discordapp.com/attachments/392476688614948897/589958706410553356/unknown.png▾

Very strange, as I said it isn't that there isn't available threads to process the requests, because launching tasks works, only

delay

seems to have that strange issue. (even suspending functions works fine, except if they call

delay

)

😱 1

elizarov

06/16/2019, 11:34 PM

Something is bogging down the thread the does

delay

schedule. It is called “Default Executor”.

elizarov

06/16/2019, 11:35 PM

What’s your execution infrastructure? Can you take a thread-dump to see what that thread is busy with?

MrPowerGamerBR

06/17/2019, 12:52 AM

Yeah sure, let me upload it to gist

MrPowerGamerBR

06/17/2019, 12:52 AM

Here it is https://gist.github.com/MrPowerGamerBR/fcf307ae40daa69fbee6680f3d067a57

MrPowerGamerBR

06/17/2019, 12:53 AM

I'm using a cached thread pool for the coroutines:

Copy code

val coroutineExecutor = createThreadPool("Coroutine Executor Thread %d")
	val coroutineDispatcher = coroutineExecutor.asCoroutineDispatcher()

	fun createThreadPool(name: String): ExecutorService {
		return Executors.newCachedThreadPool(ThreadFactoryBuilder().setNameFormat(name).build())
	}

elizarov

06/17/2019, 12:54 AM

That should not affect

delay

in the way you describe.

elizarov

06/17/2019, 12:54 AM

I don’t see anything unusual in the thread-dump, though. The “kotlinx.coroutines.DefaultExecutor” does not seem to be busy with anything — parked as it is supposed to be

MrPowerGamerBR

06/17/2019, 12:55 AM

Strange that everything works (calling

suspend

methods, etc), but not delay

MrPowerGamerBR

06/17/2019, 12:58 AM

I can run the Kotlin Coroutines debug JVM agent, however I don't want to reboot the JVM if possible (if I reboot, the issue goes away until it randomly goes back)

elizarov

06/17/2019, 12:58 AM

Any chance you can attach a debugger this running JVM?

elizarov

06/17/2019, 1:02 AM

One theory I have is that somehow your system becomes slow at thread creation…. Because after 1 sec when you don’t use

delay

its thread is automatically shutsdown, so next time you need to delay it needs to create thread……. But at this thread dump I see the thread up & running

MrPowerGamerBR

06/17/2019, 1:03 AM

Yes, but that would require a JVM restart (and that would "fix" the issue until it breaks again, which looks like it takes about three/four days for some reason)

elizarov

06/17/2019, 1:03 AM

Can you get a memory dump with

jmap

elizarov

06/17/2019, 1:03 AM

(it works on a running process)

elizarov

06/17/2019, 1:04 AM

jmap -dump:live,format=b,file=heap.bin <pid>

MrPowerGamerBR

06/17/2019, 1:04 AM

Yeah, the only issue is the size of the memory dump, I don't think I would be able to open it because the app has 36GBs memory allocated, but I could try

elizarov

06/17/2019, 1:06 AM

Hm… I will not be able to open it…

MrPowerGamerBR

06/17/2019, 1:07 AM

I can try reproducing on a smaller instance (this happened before on one of my test instances) but the issue is waiting until the problem happens again 😛

elizarov

06/17/2019, 1:08 AM

Is the process remote or local? If it is local, then you can attach debugger from IDEA without restart

MrPowerGamerBR

06/17/2019, 1:11 AM

remote

elizarov

06/17/2019, 1:12 AM

Bad luck. I have an interesting theory, though.

elizarov

06/17/2019, 1:15 AM

Do you control all the code that uses coroutine? Do you know all the uses of

delay

in your code?

MrPowerGamerBR

06/17/2019, 1:17 AM

Yes

MrPowerGamerBR

06/17/2019, 1:17 AM

(Well, I think I do, I don't think I do use any other external dependency that uses

delay

)

elizarov

06/17/2019, 1:17 AM

Do you use something like

delay(Long.MAX_VALUE)

to wait forever?

MrPowerGamerBR

06/17/2019, 1:24 AM

No, I don't... but what about "very big values that aren't `Long.MAX_VALUE`"? (Because if a user says "hey, wait 100 years because idk" and the code waits for... well,

now - 100 years in milliseconds

)

elizarov

06/17/2019, 1:27 AM

Neither should be really a problem… I’m still puzzled how to explain it… Especially this “1 hour later wakeup”

MrPowerGamerBR

06/17/2019, 1:31 AM

Also, it isn't exactly 1 hour, it looks like it varies

MrPowerGamerBR

06/17/2019, 1:32 AM

And what is interesting is that the tasks doesn't follow a "order" (This task was delayed first, so it will be executed first), it seems to be random

elizarov

06/17/2019, 1:32 AM

That is extremely puzzling. It basically means that something got broken inside the delay’s binary heap data-structure, but I cannot figure out how this could happen — it is pretty robust data structure, simply synchronized and well covered with tests.

MrPowerGamerBR

06/17/2019, 1:33 AM

Here's a example: https://gist.github.com/MrPowerGamerBR/2e5aa548745e923074604da37197fc7e

elizarov

06/17/2019, 1:34 AM

What could break is a completely broken implementation of

System.nanoTime()

(and jumps back and forth and something like that), but that kind of broken nano-time would wreak a lot of havoc elsewhere. I doubt a running system can have a problem with it.

MrPowerGamerBR

06/17/2019, 1:34 AM

And here's the code itself: https://github.com/LorittaBot/Loritta/blob/master/loritta-core/src/main/java/com/mrpowergamerbr/loritta/modules/WelcomeModule.kt#L180

MrPowerGamerBR

06/17/2019, 1:35 AM

What I did notice is that it happens after the JVM was up for more than 3 days

elizarov

06/17/2019, 1:38 AM

I cannot grasp a significance of “3 days” in this context….

elizarov

06/17/2019, 1:38 AM

In your last log. That “Coroutine Executor Thread”. What are those? I don’t see them in your thread dump

elizarov

06/17/2019, 1:39 AM

Also scary big numbers… Like “Coroutine Executor Thread 149779”. Are they sequentially numbered? Looks like quite a big thread churn…

elizarov

06/17/2019, 1:42 AM

From that log it follows that a new thread was created approx every 2 seconds. That’s not by itself a big number (would not bog down a system) unless something else is going on…

MrPowerGamerBR

06/17/2019, 1:42 AM

Yes, they are sequentially numbered

elizarov

06/17/2019, 1:43 AM

Why don’t I see them in the thread-dump? What code creates them? Maybe it’s not the

delay

after all, but those threads were not available (missing) at the time?

MrPowerGamerBR

06/17/2019, 1:48 AM

This is the code that creates them:

Executors.newCachedThreadPool(ThreadFactoryBuilder().setNameFormat(name).build())

Only Kotlin Coroutines uses that executor, so it can't be something else. Also, wouldn't GC remove cached thread (that are unused) after 60s, causing them to create new threads (which will have a new ID assigined to them)? And if it was the dispatcher, wouldn't using a different dispatcher work? (like

Dispatchers.Default

<http://Dispatchers.IO|Dispatchers.IO>

)

elizarov

06/17/2019, 3:36 AM

Can it be some kind of contention in somewhere else? Can you try with unconfined dispatcher and with direct printing to console:

Copy code

GlobalScope.launch(Dispatchers.Unconfined) { 
    println("(0) at ${Instant.now()}")
    delay(1)
    println("(1) at ${Instant.now()}")
}

Do you have console output of that process redirected somewhere?

MrPowerGamerBR

06/17/2019, 10:03 AM

I will try doing that asap, while I don't redirect the console output (except when logging with logback) I can see the console output.

MrPowerGamerBR

06/17/2019, 11:01 AM

Tested it right now and yeah, it doesn't work too, it displays the first

(0)

line but never displays the

(1)

MrPowerGamerBR

06/17/2019, 3:14 PM

Because the coroutines are still stuck (and that breaks some of the features on my application) I will restart the JVM and update all the dependencies (Kotlin, Kotlin Coroutines, etc) + enable coroutine debug + JVM debug (to remote attach to it)

elizarov

06/17/2019, 3:25 PM

It means that indeed something somehow got broken inside the DefaultExecutor’s queue that

delay

uses.

MrPowerGamerBR

06/17/2019, 3:40 PM

I hope updating everything + enabling debug will help trying to track down what is causing the issue 🙂

MrPowerGamerBR

06/17/2019, 4:31 PM

Something strange that I noticed: It looks like that when the tasks are eventually executed, a lot of them are executed at once (even if they weren't delayed at the same time)

MrPowerGamerBR

06/17/2019, 10:55 PM

I restarted the JVM and did the following changes: 1. Updated Kotlin to 1.3.31 2. Updated Kotlin Coroutines to 1.3.0-M1 3. Updated all dependencies (well, that doesn't even matter but hey) 4. Increased the thread pool cache to keep for 5 minutes (from 60s), set the core pool size to 10 (from 0) 5. Added

-Dkotlinx.coroutines.debug

to the JVM arguments 6. Added a task that keeps creating coroutines, waits 60s, then prints how much time it took and logs it. 7. Added a way to dump currently running coroutines (with the DebugProbe) 8. Enabled remote debugging I know some of the changes doesn't have any relation to the issue, but I was like "let's try changing everything and see what sticks"

elizarov

06/17/2019, 11:04 PM

Thanks. Let us see how it goes. Ping me if it reproduces. In debugger I’ll be interested to examine the values of the fields of

DefaultExecutor

object instance.

MrPowerGamerBR

06/18/2019, 12:34 AM

I ended up taking out the debug flags, it was slowing everything down and causing issues on my production instance 😞 I will keep it enabled on my test instance (which I hope I can reproduce it there)

MrPowerGamerBR

06/19/2019, 8:35 PM

@elizarov yeah looks like the uptime doesn't have anything to do with it, my production instance is now borked again while the development instance works fine. I will try enabling debug on the production and see if it doesn't have a too big performance impact (maybe it was the coroutines' debug stuff that has a big performance impact? I didn't test with only one of them disabled because after almost one hour trying to get the production instance to work without having 500%+ CPU usage was hard) But before I restart, is there's any way to get the values of the fields of the DefaultExecutor object instance with reflection? Because if it is, wouldn't it worth a shot to try getting it via reflection (since I can evaluate code without restarting) without enabling debug? Probably it is impossible, but just asking before I restart (and waiting until it decides to break again)

elizarov

06/19/2019, 9:01 PM

Yes. You can use “getDeclaredField” and “setAccesible(true)” to get hold on the private fields via reflection

MrPowerGamerBR

06/19/2019, 11:31 PM

Oh okay, that would be helpful, can you share what fields should be get from the DefaultExecutor object that would be helpful to debug the issue + how can I get a reference to the DefaultExecutor object? I tried searching around the internet and documentation about it but I couldn't find anything about it

MrPowerGamerBR

06/20/2019, 2:06 AM

Also, this is just an theory but... if there was a lot (I mean, 100k+ of them) of suspended coroutines, would that affect the delay? Maybe something is broken on my code and is creating a lot of coroutines that stay on the suspended state

elizarov

06/20/2019, 3:42 AM

Suspended where? In

delay()

? In theory it should not be a problem, since

delay

queue uses O(log n) algorithm.

MrPowerGamerBR

06/20/2019, 1:26 PM

@elizarov I'm trying to dump the variables from the

DefaultExecutor

, while most of them I was able to, getting the

_queue

and

_delayed

values from the

EventLoopImplBase

(which I guess are the ones that you will need) is kinda difficult, they don't show up on the

declaredMembers

list, only on the

declaredFields

list, which then returns a

AtomicReferenceFieldUpdater

on the

_queue$FU

field... but what are the variable's types? In Kotlin it is a

atomic<Any?>(null)

with AtomicFu, but what would that be in Java itself?

elizarov

06/20/2019, 1:29 PM

This type is

Object

in Java. Atomics are erased during compilation.

MrPowerGamerBR

06/20/2019, 5:08 PM

Yeah, but doesn't the

AtomicReferenceFieldUpdater

has two types?

<*, *>

... well, actually I don't need to know the two types because you can just cast it with

<*, *>

, however I wanted to know what value should be used on the

_queue.get(SomeObjectHere)

. I tried looking at the Kotlin Coroutines code but because it uses AtomicFu, it is just a

_queue.value

call

elizarov

06/20/2019, 7:30 PM

You don’t need field updater. What you see in code as

_queue.value

is actually stored in

_queue

field.

MrPowerGamerBR

06/20/2019, 8:54 PM

ooooh, sorry for my mistake! I was running my test code on a test project and there the

_delayed

and

_queue

fields were null, but that's because I never spawned any coroutine task. Anyway, here's some of the dumped variables from the

EventLoopImplBase

Copy code

EventLoopImplBase:
_queue = kotlinx.coroutines.internal.LockFreeTaskQueueCore@44170e37
Size: 0
_queue$FU = java.util.concurrent.atomic.AtomicReferenceFieldUpdater$AtomicReferenceFieldUpdaterImpl@671fab50
_delayed = kotlinx.coroutines.internal.ThreadSafeHeap@3d30a5ae
Size: 3300
_delayed$FU = java.util.concurrent.atomic.AtomicReferenceFieldUpdater$AtomicReferenceFieldUpdaterImpl@3ad4a630
isCompleted = false

MrPowerGamerBR

06/20/2019, 9:02 PM

The only thing that seems... "off" is the size of the

_delayed

ThreadSafeHeap, but maybe that's a normal size if you are running a big application (my test instance returns

for that same var)

elizarov

06/20/2019, 9:05 PM

Can you do me a favor. There’s an array

inside

ThreadSafeHeap

instance. It contains instance of

DelayedTask

. Can you dump all of them? I’d need

index

and

nanoTime

for each one to see if it is consistent

MrPowerGamerBR

06/20/2019, 10:05 PM

Sure! I will do that asap

MrPowerGamerBR

06/20/2019, 10:56 PM

https://gist.github.com/MrPowerGamerBR/81da0b372584729fe5735a91a207edf2

MrPowerGamerBR

06/20/2019, 10:58 PM

Aside from some tasks that seem to have negative

nanoTime

, there doesn't seem to be anything wrong

MrPowerGamerBR

07/01/2019, 1:32 PM

@elizarov after adding some "sanity" checks to a few places where coroutines are created (A function was constantly creating coroutines that were shut down right after they were created... moved the "should I create the coroutine?" check to before the coroutine was created. Added a check to avoid creating tasks that are going to be delayed for >6 months) I cannot reproduce the bug anymore. I'm still trying to debug the issue, but the JVM is currently almost 6 days up and running and the coroutines are still working fine and well 🙂

elizarov

07/01/2019, 4:53 PM

Thanks for additional info.

elizarov

07/01/2019, 5:23 PM

Something is clearly wrong with the heap structure you’ve dumped. It is broken at that explains all the strange effects you saw. However, I cannot yet figure out how this could have happened. The only plausible explanation is that

System.nanoTime()

is not monothonic at your system.

elizarov

07/01/2019, 5:53 PM

You were using

delay(Long.MAX_VALUE)

, weren’t you?

elizarov

07/01/2019, 5:54 PM

(What i see in your data could result from it plus slightly backwards-going nanoTime)

elizarov

07/01/2019, 5:55 PM

If so, i know how to fix it.

elizarov

07/02/2019, 9:33 AM

https://github.com/Kotlin/kotlinx.coroutines/issues/1312

5 Views

Open in Slack

Previous Next