# coroutines
j
I have a suite of common tests that run on iOS and JVM that use `kotlinx-coroutines-test`. Each test is run within a `runTest()` coroutine. All the tests execute and pass on JVM. On iOS they usually pass as well, but often (~30% of the time) one of the tests hangs, blocking the test suite's completion. I've seen almost all the tests cause this, from the first to the last, so it's not caused by a specific test. If I add a print statement to the bottom of each of the `runTest()` calls, the print statement is always executed, so it's hanging after the test code completes, but `runTest()` apparently isn't returning for some reason. Any idea what could be the cause of this? The code is based on this test suite from SQLDelight, but using multiplatform paging and a different database.
Even weirder, it's not even `runTest()` that's hanging. If I wrap the call with:
fun runTestAndLogCompletion(testBody: suspend TestScope.() -> Unit) {
    runTest(testBody = testBody)
    println("completed")
}
"completed" is logged before the test goes on to hang indefinitely. It's not clear what could be causing this, other than the iOS test runner itself then.
All the other tests in my project run without issue. It's only these ones that use `kotlinx-coroutines-test` that hang like this.
a
I have experienced this before. I never considered it to be the library's problem, just my own partial knowledge. But maybe, just maybe, the library doesn't play very well yet with some Apple targets.
j
Interesting. So these tests you've experienced this with are also using `kotlinx-coroutines-test` and hang on the iOS target?
a
Not the iOS target per se, but I have had a test that would pass on every target except watchOS (I've forgotten if it was arm64 or the other one). When I ran it again it passed, so it was really flaky.
j
Oh, ok. And when the test doesn't pass, does it not complete at all, just hangs without completing?
a
exactly that, it just hangs
doesn't run to completion somehow
j
Definitely sounds like the same thing!
@Dmitry Khalanskiy [JB] have you seen this hanging behavior with `kotlinx-coroutines-test` on Apple targets? Do you have any suggestions on how we might diagnose the cause?
d
Nope, your report is the first one I've seen. The most straightforward way to diagnose this is to send us some code that triggers this (it's okay if you do this in private as well), preferably small and self-contained. I don't think coroutines have anything to do with this, given that `runTest` does finish, but then again, the flakiness does mean that there's some non-determinism involved. Very odd indeed.
a
Let me see if I can put together something. I have an OSS lib I can share
j
Thanks. My library isn't currently open source, but I'm working on getting it there. The fact that everything finishes and it still hangs is certainly baffling. `kotlinx-coroutines-test` is just one of the things specific to this code, vs other tests in my project that haven't ever experienced this. The other thing specific to these tests would be the paging extension code itself. But again, it also completes execution before going on to hang. I'll see if I can put together something for you to be able to take a look at.
Revisiting this, I've found that by introducing a 1ms delay after `runTest { ... }`, I'm able to work around the hanging on iOS. I haven't been able to get any of the tests to hang after replacing `runTest { ... }` with this `runTestAndPause { ... }`:
fun runTestAndPause(
    testBody: suspend TestScope.() -> Unit
) {
    runTest(testBody = testBody)
    runBlocking { delay(1) }
}
Without this workaround, 1 of the 23 tests in this specific test suite that uses `kotlinx-coroutines-test` will almost always hang, preventing the suite from completing (although occasionally they will all complete).
d
What if you replace `runBlocking { delay(1) }` with something other than `runBlocking`? I suppose there is a way to sleep for a given amount of time on iOS. Worst comes to worst, there's the non-optimizable busy-loop:
repeat(10000) {
  assertTrue(Random.nextInt(until = 100) < 100)
}
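A time-bounded variant of the same busy-loop idea, as a minimal sketch: it spins on the monotonic clock from the Kotlin standard library, so it needs neither `runBlocking` nor a platform sleep call and can live in common code. The name `busyWait` is hypothetical, and `kotlin.time.TimeSource` is assumed to be available (stable since Kotlin 1.9).

```kotlin
import kotlin.time.Duration
import kotlin.time.Duration.Companion.milliseconds
import kotlin.time.TimeSource

// Hypothetical helper (not from this thread): blocks the calling thread by
// spinning until at least `duration` has elapsed on the monotonic clock.
fun busyWait(duration: Duration) {
    val start = TimeSource.Monotonic.markNow()
    while (start.elapsedNow() < duration) {
        // Intentionally empty: the clock read in the loop condition is the work.
    }
}
```

In the `runTestAndPause` wrapper, `busyWait(1.milliseconds)` would stand in for `runBlocking { delay(1) }`. Note that it burns CPU for the whole interval, so it is only reasonable for very short pauses.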
j
I used `runBlocking { delay(1) }` for the ease of multiplatform support in common tests. I just tested with a `ThreadUtils.sleep(1)` expect function where the iOS implementation is `NSThread.sleepForTimeInterval(millis.toDouble() / 1000)` and this also works to prevent the tests from hanging.
If I run the tests enough, after dozens of runs, they still will occasionally hang on the 21st test (without the delay, they usually hang sooner). So the small delay seems to usually allow whatever causes the deadlock to clear up. If I add a 50ms or 100ms delay, I haven't been able to get the tests to hang. But of course now the suite takes 1-2 seconds longer to run, which is considerably longer than the tests themselves take (~300-400ms).
d
j
Thank you! Just to clarify, the `println("completed")` statement wasn't enough to prevent the tests from hanging. The odd thing was just that the print statement logged and then the test still went on to hang indefinitely.
I tried removing the coroutines test dependency, to use just pure coroutines with `runBlocking`. But I ended up not being able to find a good replacement for `TestScope.advanceUntilIdle()`, essentially await until the coroutine suspends to perform a check. I'll have to play with the tests some more to see if I can rework this part and see if it's still reproducible without the coroutines test dependency.
d
Got it, fixed the issue description. Without knowing what your tests do exactly, tough to say how you can get rid of `advanceUntilIdle`, but as a (typically) non-idiomatic but robust approach, a large enough `delay` does the trick.
j
The tests are based on this test suite from SQLDelight multiplatform paging extension, modified to use a different database. `TestScope.advanceUntilIdle()` is used here. I haven't been able to reproduce the hanging with the SQLDelight tests (I ported the SQLDelight paging extension to multiplatform and ran the tests a bunch in the process). The SQLDelight tests run faster than my other database tests though. So could just be different timing conditions.
d
Hello! Does anyone have a publicly available project where this reproduces?
j
I open sourced my library recently, although I haven't been experiencing tests hanging on the most recent versions of my code, even after removing the delay workaround. I'm no longer using the coroutines-test library as well, which could be a contributing factor. (I'm no longer using coroutines-test because `TestScope.advanceUntilIdle()` no longer does what I need it to, so I've had to replace it with an arbitrary delay now.) I went back to an older commit before I removed coroutines-test and reproduced this again. If you run `./gradlew :couchbase-lite-paging:cleanAllTests :couchbase-lite-paging:iosX64Test` repeatedly on the *paging-ios-tests-hang* branch, eventually the iOS tests will hang indefinitely. Based on other times I've experienced this same issue of iOS tests hanging indefinitely, it seems to be caused by some background thread still being active when the test execution completes, which would explain why delaying a short period at the end of the test may resolve the problem. JVM tests don't exhibit this same behavior though.
d
`TestScope.advanceUntilIdle()` no longer does what I need it to
We take breakage seriously, so if, after upgrading to some version of the test library, the behavior of `advanceUntilIdle` changed, it's a cause for concern. Could you describe how exactly the behavior changed so that we could decide if it's a regression or intended behavior that should be documented?
the paging-ios-tests-hang branch, eventually the iOS tests will hang indefinitely
Thank you for the reproducer! We'll look into it.
How long does the bug usually take to reproduce? It's been going on for ten minutes, but tests consistently pass without issues, with the command you provided and on the correct branch.
Never mind, it reproduced after an hour of attempts!
a
Now that is one tricky bug
d
@Jeff Lockhart, are you sure this is still the same issue? Yes, your tests do occasionally hang, but when I attach a debugger to one of them, the test is not finished; instead, the main thread hangs inside calls to `CBLQuery.execute`.
j
Could you describe how exactly the behavior changed so that we could decide if it's a regression or intended behavior that should be documented?
Sorry, to clarify, it didn't break after a coroutines-test update, but after a change in my library's code. In order to avoid double-querying, with both the initial query execute and the query change listener, the code now uses the query change listener for the initial query as well. But because the query is not executed on the coroutine directly anymore, `TestScope.advanceUntilIdle()` doesn't do what it was doing before, waiting for the completion of the query work before checking the results. I need to find a better way to definitively wait for the paging query work to complete, rather than an arbitrary delay, as every once in a while a test will fail.
How long does the bug usually take to reproduce?
Usually it happens within a dozen or so runs. My computer has 16 cores / 32 threads. So not sure if that makes a difference in reproducing more often. I just ran again on that branch and it happened on the first run, then on the fifth, sixth, and tenth runs after that.
are you sure this is still the same issue? ...the test is not finished
Interesting, I'll have to look into this some more. I just tried adding the `println("completed")` log to see if this was doing what it was doing before. The first time the tests hung, I didn't see "completed" logged. But the second time, I did. Maybe there are multiple possible causes for a deadlock going on.
It definitely seems to happen less frequently when run on a debugger! I managed to get it to hang in two ways: in the first, the test is still running inside `CBLQuery.execute`, and in the second, it logs "completed" but actually hangs during the `@AfterTest` database deletion. I didn't think about how this could be the cause! I should have used the debugger to check the deadlocked stack trace originally (this didn't use to work well for iOS). The fact I haven't been seeing these hanging tests in my latest code leads me to believe the locking issue may have been resolved with another change. I'll let you know if I see this again and can confirm the code causing it is from Kotlin or coroutines. Thank you for your help looking into this!
d
I need to find a better way to definitively wait for the paging query work to complete, rather than an arbitrary delay, as every once in a while a test will fail.
You may be interested in https://github.com/Kotlin/kotlinx.coroutines/issues/3919
It definitely seems to happen less frequently when run on a debugger!
It doesn't have to run in a debugger from the start. I ran `while true; do $your_command; done` in the terminal, and when the test hung, I used the Xcode debugger to attach to the already-running process `test.kexe`: https://stackoverflow.com/questions/9721830/how-to-attach-debugger-to-ios-app-after-launch.
j
Thanks for the links. I did read that coroutines issue, but it seems my use case isn't covered by the solutions described. If I replace the `delay` with `awaitAllChildren`, the tests just hang. I need to look into possible APIs to get a proper signal for when the pager has its results.
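One possible shape for such a signal, as a minimal sketch rather than anything taken from the library discussed here: bridge the listener callback into a `CompletableDeferred` and await it under a timeout, so the test waits exactly as long as the query work takes and fails instead of hanging if the callback never fires. `FakeListenerQuery` and `awaitFirstResults` are hypothetical stand-ins, not the Couchbase Lite or paging APIs. The sketch uses `runBlocking` so that `withTimeout` measures real time; inside `runTest`, timeouts run on virtual time, which would likely not suit work happening on a real background thread.

```kotlin
import kotlinx.coroutines.CompletableDeferred
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withTimeout

// Hypothetical stand-in for a query that delivers its results to a change
// listener from a background thread (not a real Couchbase Lite or paging API).
class FakeListenerQuery(private val scope: CoroutineScope) {
    fun addChangeListener(listener: (List<String>) -> Unit) {
        scope.launch(Dispatchers.Default) {
            delay(20) // simulated query work off the test thread
            listener(listOf("a", "b"))
        }
    }
}

// Waits for the first listener callback instead of sleeping for an arbitrary time.
fun awaitFirstResults(): List<String> = runBlocking {
    val results = CompletableDeferred<List<String>>()
    FakeListenerQuery(this).addChangeListener { results.complete(it) }
    // Real-time timeout: a missing signal throws here rather than hanging the suite.
    withTimeout(5_000) { results.await() }
}
```

A test would then assert on the returned list directly, for example that `awaitFirstResults()` equals the expected page contents.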