I think I found a bug in coroutines... but I'm not...
# coroutines
m
I think I found a bug in coroutines... but I'm not sure if it is really a bug or a mistake I made: After the JVM is up for +- 3 days,
delay(...)
stops working, no matter what dispatcher I use (
Dispatchers.Default
,
<http://Dispatchers.IO|Dispatchers.IO>
, custom dispatcher from a custom executor, etc). It just hangs on the
delay
call and never continues. The dispatcher/thread pool isn't overloaded, because the application is still alive and processing every request without any delay... except when there is a
delay
call somewhere, then the code hangs) Restarting the JVM magically fixes the issue, until after a few days it stops working again. Using JVM 12, Kotlin 1.3.21 + Coroutines 1.2.1 (Yes, I know, it is outdated, I didn't update yet because I couldn't get Kotlin's script engine working on 1.3.31... if reeeally needed I can update, but I decided to ask here before I do it to figure out if anyone knows what could cause this issue (or if it was already fixed))
đŸ˜± 3
🍿 2
Here's a GIF showing the issue, the code is evaluated, executed buuut it gets stuck on the
delay(...)
, no matter what dispatcher I use.

https://cdn.discordapp.com/attachments/393332226881880074/589923294916902961/SayRQ1OGN5.gifâ–Ÿ

Of course, maybe it is a issue on my end... but I can't figure out what would cause this issue because restarting the application fixes the issue (until it breaks after a few days...). However sometimes (very rarely) the tasks works fine, they execute, are delayed and then they exit... but sometimes they get stuck.
What looks that it happens is that after a veeery long time (way more than the specified delay), the coroutine does continue. Here I executed the code:

https://cdn.discordapp.com/attachments/392476688614948897/589958622000185549/unknown.pngâ–Ÿ

And after almost one hour later... the code was executed.

https://cdn.discordapp.com/attachments/392476688614948897/589958706410553356/unknown.pngâ–Ÿ

Very strange, as I said it isn't that there isn't available threads to process the requests, because launching tasks works, only
delay
seems to have that strange issue. (even suspending functions works fine, except if they call
delay
)
đŸ˜± 1
e
Something is bogging down the thread the does
delay
schedule. It is called “Default Executor”.
What’s your execution infrastructure? Can you take a thread-dump to see what that thread is busy with?
m
Yeah sure, let me upload it to gist
I'm using a cached thread pool for the coroutines:
Copy code
val coroutineExecutor = createThreadPool("Coroutine Executor Thread %d")
	val coroutineDispatcher = coroutineExecutor.asCoroutineDispatcher()

	fun createThreadPool(name: String): ExecutorService {
		return Executors.newCachedThreadPool(ThreadFactoryBuilder().setNameFormat(name).build())
	}
e
That should not affect
delay
in the way you describe.
I don’t see anything unusual in the thread-dump, though. The “kotlinx.coroutines.DefaultExecutor” does not seem to be busy with anything — parked as it is supposed to be
m
Strange that everything works (calling
suspend
methods, etc), but not delay
I can run the Kotlin Coroutines debug JVM agent, however I don't want to reboot the JVM if possible (if I reboot, the issue goes away until it randomly goes back)
e
Any chance you can attach a debugger this running JVM?
One theory I have is that somehow your system becomes slow at thread creation
. Because after 1 sec when you don’t use
delay
its thread is automatically shutsdown, so next time you need to delay it needs to create thread

. But at this thread dump I see the thread up & running
m
Yes, but that would require a JVM restart (and that would "fix" the issue until it breaks again, which looks like it takes about three/four days for some reason)
e
Can you get a memory dump with
jmap
?
(it works on a running process)
jmap -dump:live,format=b,file=heap.bin <pid>
m
Yeah, the only issue is the size of the memory dump, I don't think I would be able to open it because the app has 36GBs memory allocated, but I could try
e
Hm
 I will not be able to open it

m
I can try reproducing on a smaller instance (this happened before on one of my test instances) but the issue is waiting until the problem happens again 😛
e
Is the process remote or local? If it is local, then you can attach debugger from IDEA without restart
m
remote
e
Bad luck. I have an interesting theory, though.
Do you control all the code that uses coroutine? Do you know all the uses of
delay
in your code?
m
Yes
(Well, I think I do, I don't think I do use any other external dependency that uses
delay
)
e
Do you use something like
delay(Long.MAX_VALUE)
to wait forever?
m
No, I don't... but what about "very big values that aren't `Long.MAX_VALUE`"? (Because if a user says "hey, wait 100 years because idk" and the code waits for... well,
now - 100 years in milliseconds
)
e
Neither should be really a problem
 I’m still puzzled how to explain it
 Especially this “1 hour later wakeup”
m
Also, it isn't exactly 1 hour, it looks like it varies
And what is interesting is that the tasks doesn't follow a "order" (This task was delayed first, so it will be executed first), it seems to be random
e
That is extremely puzzling. It basically means that something got broken inside the delay’s binary heap data-structure, but I cannot figure out how this could happen — it is pretty robust data structure, simply synchronized and well covered with tests.
m
e
What could break is a completely broken implementation of
System.nanoTime()
(and jumps back and forth and something like that), but that kind of broken nano-time would wreak a lot of havoc elsewhere. I doubt a running system can have a problem with it.
What I did notice is that it happens after the JVM was up for more than 3 days
e
I cannot grasp a significance of “3 days” in this context
.
In your last log. That “Coroutine Executor Thread”. What are those? I don’t see them in your thread dump
Also scary big numbers
 Like “Coroutine Executor Thread 149779”. Are they sequentially numbered? Looks like quite a big thread churn

From that log it follows that a new thread was created approx every 2 seconds. That’s not by itself a big number (would not bog down a system) unless something else is going on

m
Yes, they are sequentially numbered
e
Why don’t I see them in the thread-dump? What code creates them? Maybe it’s not the
delay
after all, but those threads were not available (missing) at the time?
m
This is the code that creates them:
Executors.newCachedThreadPool(ThreadFactoryBuilder().setNameFormat(name).build())
Only Kotlin Coroutines uses that executor, so it can't be something else. Also, wouldn't GC remove cached thread (that are unused) after 60s, causing them to create new threads (which will have a new ID assigined to them)? And if it was the dispatcher, wouldn't using a different dispatcher work? (like
Dispatchers.Default
or
<http://Dispatchers.IO|Dispatchers.IO>
)
e
Can it be some kind of contention in somewhere else? Can you try with unconfined dispatcher and with direct printing to console:
Copy code
GlobalScope.launch(Dispatchers.Unconfined) { 
    println("(0) at ${Instant.now()}")
    delay(1)
    println("(1) at ${Instant.now()}")
}
Do you have console output of that process redirected somewhere?
m
I will try doing that asap, while I don't redirect the console output (except when logging with logback) I can see the console output.
Tested it right now and yeah, it doesn't work too, it displays the first
(0)
line but never displays the
(1)
Because the coroutines are still stuck (and that breaks some of the features on my application) I will restart the JVM and update all the dependencies (Kotlin, Kotlin Coroutines, etc) + enable coroutine debug + JVM debug (to remote attach to it)
e
It means that indeed something somehow got broken inside the DefaultExecutor’s queue that
delay
uses.
m
I hope updating everything + enabling debug will help trying to track down what is causing the issue 🙂
Something strange that I noticed: It looks like that when the tasks are eventually executed, a lot of them are executed at once (even if they weren't delayed at the same time)
I restarted the JVM and did the following changes: 1. Updated Kotlin to 1.3.31 2. Updated Kotlin Coroutines to 1.3.0-M1 3. Updated all dependencies (well, that doesn't even matter but hey) 4. Increased the thread pool cache to keep for 5 minutes (from 60s), set the core pool size to 10 (from 0) 5. Added
-Dkotlinx.coroutines.debug
to the JVM arguments 6. Added a task that keeps creating coroutines, waits 60s, then prints how much time it took and logs it. 7. Added a way to dump currently running coroutines (with the DebugProbe) 8. Enabled remote debugging I know some of the changes doesn't have any relation to the issue, but I was like "let's try changing everything and see what sticks"
e
Thanks. Let us see how it goes. Ping me if it reproduces. In debugger I’ll be interested to examine the values of the fields of
DefaultExecutor
object instance.
m
I ended up taking out the debug flags, it was slowing everything down and causing issues on my production instance 😞 I will keep it enabled on my test instance (which I hope I can reproduce it there)
@elizarov yeah looks like the uptime doesn't have anything to do with it, my production instance is now borked again while the development instance works fine. I will try enabling debug on the production and see if it doesn't have a too big performance impact (maybe it was the coroutines' debug stuff that has a big performance impact? I didn't test with only one of them disabled because after almost one hour trying to get the production instance to work without having 500%+ CPU usage was hard) But before I restart, is there's any way to get the values of the fields of the DefaultExecutor object instance with reflection? Because if it is, wouldn't it worth a shot to try getting it via reflection (since I can evaluate code without restarting) without enabling debug? Probably it is impossible, but just asking before I restart (and waiting until it decides to break again)
e
Yes. You can use “getDeclaredField” and “setAccesible(true)” to get hold on the private fields via reflection
m
Oh okay, that would be helpful, can you share what fields should be get from the DefaultExecutor object that would be helpful to debug the issue + how can I get a reference to the DefaultExecutor object? I tried searching around the internet and documentation about it but I couldn't find anything about it
Also, this is just an theory but... if there was a lot (I mean, 100k+ of them) of suspended coroutines, would that affect the delay? Maybe something is broken on my code and is creating a lot of coroutines that stay on the suspended state
e
Suspended where? In
delay()
? In theory it should not be a problem, since
delay
queue uses O(log n) algorithm.
m
@elizarov I'm trying to dump the variables from the
DefaultExecutor
, while most of them I was able to, getting the
_queue
and
_delayed
values from the
EventLoopImplBase
(which I guess are the ones that you will need) is kinda difficult, they don't show up on the
declaredMembers
list, only on the
declaredFields
list, which then returns a
AtomicReferenceFieldUpdater
on the
_queue$FU
field... but what are the variable's types? In Kotlin it is a
atomic<Any?>(null)
with AtomicFu, but what would that be in Java itself?
e
This type is
Object
in Java. Atomics are erased during compilation.
m
Yeah, but doesn't the
AtomicReferenceFieldUpdater
has two types?
<*, *>
... well, actually I don't need to know the two types because you can just cast it with
<*, *>
, however I wanted to know what value should be used on the
_queue.get(SomeObjectHere)
. I tried looking at the Kotlin Coroutines code but because it uses AtomicFu, it is just a
_queue.value
call
e
You don’t need field updater. What you see in code as
_queue.value
is actually stored in
_queue
field.
m
ooooh, sorry for my mistake! I was running my test code on a test project and there the
_delayed
and
_queue
fields were null, but that's because I never spawned any coroutine task. Anyway, here's some of the dumped variables from the
EventLoopImplBase
Copy code
EventLoopImplBase:
_queue = kotlinx.coroutines.internal.LockFreeTaskQueueCore@44170e37
Size: 0
_queue$FU = java.util.concurrent.atomic.AtomicReferenceFieldUpdater$AtomicReferenceFieldUpdaterImpl@671fab50
_delayed = kotlinx.coroutines.internal.ThreadSafeHeap@3d30a5ae
Size: 3300
_delayed$FU = java.util.concurrent.atomic.AtomicReferenceFieldUpdater$AtomicReferenceFieldUpdaterImpl@3ad4a630
isCompleted = false
The only thing that seems... "off" is the size of the
_delayed
ThreadSafeHeap, but maybe that's a normal size if you are running a big application (my test instance returns
9
for that same var)
e
Can you do me a favor. There’s an array
a
inside
ThreadSafeHeap
instance. It contains instance of
DelayedTask
. Can you dump all of them? I’d need
index
and
nanoTime
for each one to see if it is consistent
m
Sure! I will do that asap
Aside from some tasks that seem to have negative
nanoTime
, there doesn't seem to be anything wrong
@elizarov after adding some "sanity" checks to a few places where coroutines are created (A function was constantly creating coroutines that were shut down right after they were created... moved the "should I create the coroutine?" check to before the coroutine was created. Added a check to avoid creating tasks that are going to be delayed for >6 months) I cannot reproduce the bug anymore. I'm still trying to debug the issue, but the JVM is currently almost 6 days up and running and the coroutines are still working fine and well 🙂
e
Thanks for additional info.
Something is clearly wrong with the heap structure you’ve dumped. It is broken at that explains all the strange effects you saw. However, I cannot yet figure out how this could have happened. The only plausible explanation is that
System.nanoTime()
is not monothonic at your system.
You were using
delay(Long.MAX_VALUE)
, weren’t you?
(What i see in your data could result from it plus slightly backwards-going nanoTime)
If so, i know how to fix it.