# ktor
a
Hi all. We are using Ktor in a multiplatform library that shares code between our mobile and TV app platforms; it makes our API calls and handles authentication. We are currently on Ktor 3.0.3 and use OAuth through the Auth plugin with the Bearer configuration.

For 1-2 months we have been investigating an issue where users are logged out because a `refreshTokens` call fails with a 401. We reproduce it by reloading a page in our app very quickly: each reload starts a lot of API calls at the same time, and the next reload cancels them shortly afterwards and replaces them with new ones.

The issue generally follows this pattern:
1. The app makes API calls intermittently, the accessToken expires, and a call gets a 401 response.
2. Ktor triggers the Bearer config's `refreshTokens` lambda, and we call our backend with the supplied `oldTokens.refreshToken` to refresh the token.
3. Step 2 succeeds, the token is updated, and the call from step 1 completes successfully.
4. At some point the app again makes 10-15 simultaneous requests just after the accessToken has expired, and shortly after starting them we cancel them because the page is refreshed again.
5. Ktor triggers the `refreshTokens` lambda twice (A and B) in quick succession.
6. Call A takes `oldTokens.refreshToken`, refreshes successfully against our backend, and returns the BearerTokens object with the updated token from the lambda.
7. Call B takes `oldTokens.refreshToken` as well, but it receives the token from before step 6 ran, which is now invalid. Its refresh call to our backend gets a 401 because the supplied token was replaced about 15 ms earlier.
8. Because the refresh call got a 401, we return null from the `refreshTokens` lambda, and the user is logged out and sent to the login screen.

We have read up on the possible causes, and multiple threads on YouTrack and in this channel say that the Bearer auth handling in Ktor should be thread safe by design, but that does not match our findings.
• We tried a locking mechanism using a Mutex that locks the `loadTokens` lambda while the `refreshTokens` lambda is refreshing a token. This does not work, because the plugin documentation wrongly states that `loadTokens` is called before every call. It is only called once; subsequent tokens are loaded from an internal cache in Ktor.
• We tried unmarshalling the JWT at the start of the `refreshTokens` lambda and checking the token's expiry time. If the token has not expired yet, we skip the backend refresh and immediately return the same tokens that came in as `oldTokens`. We did this because our locking mechanism did not prevent multiple `refreshTokens` calls from happening at almost the same time. This does not work either; the issue still happens.
• We tried a Mutex that blocks every read of a token from our internal token storage while the `refreshTokens` lambda is running and, vice versa, blocks `refreshTokens` until any ongoing read is done. This makes parallel calls impossible: the calls lock each other, become effectively synchronous and extremely slow, and the app locks up until an ANR prompts the user to close it.

So we have tried quite a few things, but the issue persists. Somehow the `refreshTokens` lambda is triggered twice in a very short timeframe. The first request goes well and refreshes the token; the second does not receive the new token in `oldTokens` but the now invalidated refreshToken, so it makes a call, gets a 401, and logs the user out.

Our working hypothesis is that the two calls happen because Ktor allows the `refreshTokens` lambda to be called again immediately, but with the old token: Ktor treats the token that was just returned as invalid because the call was cancelled, although in reality our refresh API call completed before the cancellation, so the token was valid and should have replaced the one in the internal token cache.

We have avoided the `clearToken` function, which seems to cause more issues than it solves and, according to other users, also seems to cause multiple refresh calls. We use the `markAsRefreshTokenRequest()` function on the request builder for our refresh token call. We use a single Ktor HttpClient for all authenticated requests, as recommended in the threads we could find on YouTrack.

Here is our current Bearer auth config:
install(Auth) {
    bearer {
        // Decide which requests get the Authorization header up front instead of waiting for a 401.
        sendWithoutRequest { request ->
            // Returns false only for the password-grant login request, which must be sent without a token; every other request gets the token immediately.
            val result = !(request.url.encodedPath.contains("auth/token")
                    && (request.body as? PostAuthenticateUserRequestNetworkModel)?.grantType == GrantTypeNetworkModel.PASSWORD)

            Napier.d(tag = "Kmm.Auth", message = "SendWithoutRequest: $result")

            return@sendWithoutRequest result
        }

        // Loads the initial tokens; Ktor caches the result instead of calling this before every request
        loadTokens {
            Napier.d(tag = "Kmm.Auth", message = "LoadTokens")
            onLoadTokens(oAuthTokenStorage)
        }

        // Refresh invalid access token
        refreshTokens {
            refreshMutex.withLock {
                if (jwtUtils.parseJwtPayloadToExpireTime(oldTokens!!.accessToken)!!.epochSeconds > Clock.System.now().epochSeconds) {
                    return@refreshTokens oldTokens
                }
                PerformedCall.from(
                    jsonRequestBody = response.request.content.toString(),
                    response = response.bodyAsText(),
                    endpointName = response.request.url.encodedPathAndQuery,
                    headers = response.request.headers.entries(),
                    url = response.request.url.toString(),
                )?.let {
                    options.onCallPerformed.invoke(it)
                }
                Napier.d(tag = "Kmm.Auth", message = "RefreshTokens $oldTokens")
                withContext(NonCancellable) {
                    onRefreshToken(
                        options = options,
                        baseParamsProvider = ntvbBaseParamsProvider,
                        oAuthTokenStorage = oAuthTokenStorage,
                        oldTokens = oldTokens,
                    )
                }
            }
        }
    }
}
I hope someone can help us, as we cannot seem to fix the issue no matter what we try.
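One way to picture the A/B race from steps 5-7 is a guard that compares the incoming `oldTokens` against the app's own storage instead of checking expiry. The following is a minimal sketch, not a confirmed fix: `Tokens`, `RefreshGuard` and `backendRefresh` are hypothetical names, and it assumes the app keeps its own copy of the latest tokens and needs only `kotlinx.coroutines`.

```kotlin
import kotlinx.coroutines.NonCancellable
import kotlinx.coroutines.async
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock
import kotlinx.coroutines.withContext

// Hypothetical token pair; stands in for Ktor's BearerTokens.
data class Tokens(val access: String, val refresh: String)

// Hypothetical guard around the backend refresh call.
class RefreshGuard(
    initial: Tokens,
    private val backendRefresh: suspend (refreshToken: String) -> Tokens?,
) {
    private val mutex = Mutex()

    // The app's own view of the latest tokens (assumption: kept outside Ktor's cache).
    var stored: Tokens = initial
        private set

    suspend fun refresh(old: Tokens?): Tokens? = mutex.withLock {
        // A caller holding a stale refresh token reuses the result of the
        // caller that already rotated it, instead of hitting the backend again.
        if (old != null && old.refresh != stored.refresh) return@withLock stored

        // Persist the new tokens even if the calling coroutine is cancelled mid-refresh.
        withContext(NonCancellable) {
            backendRefresh(stored.refresh)?.also { stored = it }
        }
    }
}

fun main() = runBlocking {
    var backendCalls = 0
    val guard = RefreshGuard(Tokens("access-1", "refresh-1")) { refreshToken ->
        backendCalls++
        delay(20) // simulated network round trip
        if (refreshToken == "refresh-1") Tokens("access-2", "refresh-2") else null
    }
    val old = Tokens("access-1", "refresh-1")
    val a = async { guard.refresh(old) } // call A
    val b = async { guard.refresh(old) } // call B, same stale input
    println("A=${a.await()} B=${b.await()} backendCalls=$backendCalls")
}
```

Inside a `refreshTokens` lambda this would be called as `guard.refresh(oldTokens?.let { ... })` after mapping to the hypothetical `Tokens` type. Whether it helps depends on Ktor actually handing the stale `oldTokens` to the second call, as described in step 7.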
🧵 2
v
We were also struggling with token updates in my app. I'm not sure I can help you solve it, but I share your feeling that the token update mechanism in Ktor needs some love, improvements and locks. For us the main misunderstanding was that even though `loadTokens()` takes a lambda, Ktor caches the tokens under the hood and doesn't call the lambda every time tokens are needed. The only way to force an update is to call `httpClient.invalidateBearerTokens()` every time your tokens change. But you probably already know that.
💯 2
a
We have an issue to address this general problem by providing control over the tokens' storage. It would be great if you could create a sample project or write a self-contained code snippet that reproduces the `refreshTokens` block being triggered twice in a very short timeframe.
a
@Vita Sokolova I had not heard about the invalidate function you mentioned, and I can't seem to find it. I searched the entire Ktor GitHub repository for it without success. Is it called what you stated, or was it maybe a function in an earlier version of Ktor?
@Aleksei Tirman [JB] I am working on a unit test that will recreate the issue, but I have not cracked it yet. I will let you know when I have any updates.
🙏 1
v
Sorry, my bad, it’s my extension function:
fun HttpClient.invalidateBearerTokens() {
    authProviders
        .filterIsInstance<BearerAuthProvider>()
        .singleOrNull()?.clearToken()
}
a
@Vita Sokolova I thought it was the clearToken function you were using. Thanks for the feedback anyway.
@Aleksei Tirman [JB] I have been trying to recreate the issue in a clean HttpClient with a MockEngine simulating a Bearer auth scheme, but I have been completely unable to reproduce it outside of our app environment. It is really weird that we can't recreate it in a test while it happens all the time in the app.
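For readers who want a starting point for such a test, here is a minimal sketch of a MockEngine-backed client with Bearer auth. It only exercises the plain 401, refresh, retry path and does not reproduce the double refresh. The endpoint paths, token values and the `buildClient` helper are all hypothetical, and it assumes the `ktor-client-mock` and `ktor-client-auth` artifacts are on the classpath.

```kotlin
import io.ktor.client.HttpClient
import io.ktor.client.engine.mock.MockEngine
import io.ktor.client.engine.mock.respond
import io.ktor.client.plugins.auth.Auth
import io.ktor.client.plugins.auth.providers.BearerTokens
import io.ktor.client.plugins.auth.providers.bearer
import io.ktor.client.request.get
import io.ktor.client.statement.bodyAsText
import io.ktor.http.HttpHeaders
import io.ktor.http.HttpStatusCode
import io.ktor.http.headersOf
import kotlinx.coroutines.runBlocking

// Hypothetical helper: builds a client whose mock backend accepts only "access-2".
fun buildClient(onRefresh: () -> Unit): HttpClient {
    val engine = MockEngine { request ->
        val auth = request.headers[HttpHeaders.Authorization]
        when {
            // Hypothetical refresh endpoint: always hands out the new access token.
            request.url.encodedPath == "/refresh" -> respond("access-2", HttpStatusCode.OK)
            // Any other endpoint succeeds only with the refreshed token.
            auth == "Bearer access-2" -> respond("data", HttpStatusCode.OK)
            else -> respond(
                content = "",
                status = HttpStatusCode.Unauthorized,
                headers = headersOf(HttpHeaders.WWWAuthenticate, "Bearer"),
            )
        }
    }
    return HttpClient(engine) {
        install(Auth) {
            bearer {
                // Start with an access token the mock backend rejects.
                loadTokens { BearerTokens("access-1", "refresh-1") }
                refreshTokens {
                    onRefresh()
                    val newAccess = client
                        .get("http://localhost/refresh") { markAsRefreshTokenRequest() }
                        .bodyAsText()
                    BearerTokens(newAccess, "refresh-2")
                }
            }
        }
    }
}

fun main() = runBlocking {
    var refreshCalls = 0
    val client = buildClient { refreshCalls++ }
    val response = client.get("http://localhost/data")
    println("status=${response.status} refreshCalls=$refreshCalls")
}
```

To get closer to the app's behaviour, one would presumably launch many such requests in a scope and cancel it while the refresh is in flight.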
a
Would giving users control over the token storage (KTOR-8180) solve your problem? Or would you rather have a fix for the current problem?
a
We are working with our BE team to implement a grace period of tokens so that we can solve the problem there until we have more control of what tokens are sent using ktor
Untitled.kt
@Aleksei Tirman [JB] I have an update on the matter. I found the reason for our problem and also wrote a test that confirms it, so you can reproduce it yourself.

The problem arises because coroutine scopes tied to a UI, on Android for example, are often cancelled when the user navigates to a different page while networking is in progress. When the cancellation lands between the actual refresh request to the backend and the return of the `refreshTokens` lambda, a cancellation exception is thrown and the newly issued tokens never make it into Ktor's token cache, so the cache is not updated and still holds the old, now invalid tokens.

Wrapping our refresh logic in a `withContext(NonCancellable)` block does not help: the parent scope that calls our `refreshTokens` lambda is still cancelled, so cancellation exceptions still occur inside the Auth plugin code and Ktor completely ignores the new tokens returned from the lambda. We are therefore unable to resolve this on the client side; it happens both when we allow cancellation and when we don't, because the problem sits inside the Auth plugin.

You can run the test I attached above, in which I mock a server with one endpoint for refreshing tokens and one for getting some data.

In regard to your question above about whether making the client the TokenHolder would help: it would not solve the problem, since it seems to happen inside the Auth plugin itself. Even if you fix this issue, the root cause could still occur: the server receives an actual request and creates a new token, but the client never gets the response. That can still happen if the client loses its internet connection at about the moment the cancellation happens in the test. It would be a much rarer case though, so I still believe a fix in Ktor would be very beneficial.
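The mechanism described above can be illustrated without Ktor at all. The sketch below uses only `kotlinx.coroutines`; `serverToken`, `cachedToken` and both functions are hypothetical stand-ins, and the `yield()` call is an assumption standing in for whatever cancellable suspension point the caller reaches after the refresh lambda returns. It is not the Auth plugin's actual code.

```kotlin
import kotlinx.coroutines.NonCancellable
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext
import kotlinx.coroutines.yield

// Hypothetical stand-ins, not Ktor APIs.
var serverToken = "refresh-1" // the refresh token the backend currently accepts
var cachedToken = "refresh-1" // the refresh token held in the client-side cache

// Simulated backend call that rotates the refresh token.
suspend fun refreshOnServer(): String {
    delay(50) // simulated network round trip
    serverToken = "refresh-2"
    return serverToken
}

// Stand-in for a caller that stores the lambda's result after it returns.
suspend fun callerThatCachesResult() {
    // NonCancellable lets the refresh itself run to completion...
    val fresh = withContext(NonCancellable) { refreshOnServer() }
    // ...but the caller is still cancelled: the next cancellable suspension
    // point throws CancellationException before the result is stored.
    yield()
    cachedToken = fresh
}

fun main() = runBlocking {
    val job = launch { callerThatCachesResult() }
    delay(10)    // let the refresh request start
    job.cancel() // e.g. the user navigates away
    job.join()
    println("server=$serverToken cache=$cachedToken")
    // prints "server=refresh-2 cache=refresh-1"
}
```

The backend has rotated the token while the client still holds the old one, which matches the state in which the next refresh attempt gets a 401.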
a
Thank you for the thorough explanation and the test. I've filed an issue to address this problem.
🙌 1
a
Thank you to you too Aleksei, I will follow the issue for any further updates while we fix the issue in our BE temporarily too.
k
My team would also be very happy to see this issue addressed. We're facing a similar problem where refresh tokens can be reused due to coroutine cancellation during the token refresh flow, which sometimes leads to forced logouts. This is something we want to avoid at all costs to ensure a smooth user experience. A fix would prevent valid refresh responses from being discarded and reduce the risk of token reuse errors. I believe many teams would benefit from this, as most common auth providers either support or expect this kind of behavior. I appreciate the investigation and the proposed improvements; this would be a great addition to the plugin.