Hello! Did any of you encounter number parsing err...
# apollo-kotlin
e
Hello! Did any of you encounter number parsing errors on release builds only, on Android 5 devices? 😄 Something like
com.apollographql.apollo3.exception.JsonDataException: Expected an int but was 4294967297 at path [errors, 0, locations, 0, line]
(full stacktrace in 🧵)
Copy code
com.apollographql.apollo3.exception.JsonDataException: Expected an int but was 4294967297 at path [errors, 0, locations, 0, line]
  	at com.apollographql.apollo3.api.json.BufferedSourceJsonReader.nextInt(BufferedSourceJsonReader.kt:615)
  	at com.apollographql.apollo3.api.internal.ResponseParser.readErrorLocation(ResponseParser.kt:137)
  	at com.apollographql.apollo3.api.internal.ResponseParser.readErrorLocations(ResponseParser.kt:124)
  	at com.apollographql.apollo3.api.internal.ResponseParser.readError(ResponseParser.kt:80)
  	at com.apollographql.apollo3.api.internal.ResponseParser.readErrors(ResponseParser.kt:62)
  	at com.apollographql.apollo3.api.internal.ResponseParser.parse(ResponseParser.kt:34)
  	at com.apollographql.apollo3.api.Operations.parseJsonResponse(Operations.kt:63)
  	at com.apollographql.apollo3.network.http.HttpNetworkTransport.singleResponse(HttpNetworkTransport.kt:104)
  	at com.apollographql.apollo3.network.http.HttpNetworkTransport.access$singleResponse(HttpNetworkTransport.kt:37)
  	at com.apollographql.apollo3.network.http.HttpNetworkTransport$execute$1.invokeSuspend(HttpNetworkTransport.kt:91)
  	at com.apollographql.apollo3.network.http.HttpNetworkTransport$execute$1.invoke(HttpNetworkTransport.kt:0)
  	at com.apollographql.apollo3.network.http.HttpNetworkTransport$execute$1.invoke(HttpNetworkTransport.kt:0)
  	at kotlinx.coroutines.flow.SafeFlow.collectSafely(Builders.kt:61)
  	at kotlinx.coroutines.flow.AbstractFlow.collect(Flow.kt:230)
  	at kotlinx.coroutines.flow.internal.ChannelFlowOperatorImpl.flowCollect(ChannelFlow.kt:195)
  	at kotlinx.coroutines.flow.internal.ChannelFlowOperator.collect$suspendImpl(ChannelFlow.kt:167)
  	at kotlinx.coroutines.flow.internal.ChannelFlowOperator.collect(ChannelFlow.kt:0)
  	at kotlinx.coroutines.flow.FlowKt__ErrorsKt.catchImpl(Errors.kt:156)
  	at kotlinx.coroutines.flow.FlowKt.catchImpl(Unknown Source)
  	at kotlinx.coroutines.flow.FlowKt__ErrorsKt$catch$$inlined$unsafeFlow$1.collect(SafeCollector.common.kt:114)
  	at com.apollographql.apollo3.cache.normalized.internal.ApolloCacheInterceptor$interceptMutation$1.invokeSuspend(ApolloCacheInterceptor.kt:160)
  	at com.apollographql.apollo3.cache.normalized.internal.ApolloCacheInterceptor$interceptMutation$1.invoke(ApolloCacheInterceptor.kt:0)
  	at com.apollographql.apollo3.cache.normalized.internal.ApolloCacheInterceptor$interceptMutation$1.invoke(ApolloCacheInterceptor.kt:0)
  	at kotlinx.coroutines.flow.SafeFlow.collectSafely(Builders.kt:61)
...
The body being parsed, as logged by OkHttp, is this:
{"errors":[{"message":"Enter your secret code.","locations":[{"line":1,"column":134}],"path":["signup"],"code":"needs-pin"}],"data":{"signup":null}}
I can probably build a reproducer, but I doubt that someone still has an Android 5 device around. But FWIW, I can always reproduce on a Nexus 7 downgraded to Android 5.
Oh, this is happening on Apollo 3.8.4. We didn't have this error when on version 3.7.0
b
Odd! I wonder where
4294967297
comes from
apparently it's 2^32 + 1. The plot thickens 😅
☝️ 2
e
yeah, for some other values as well it's
2^31 + value
🙃
b
could be interesting to have some breakpoints inside
BufferedSourceJsonReader.peekNumber
if that's possible, or open a reproducer indeed
oh I missed the "release builds only" part, so I suppose debugging is not really feasible
e
yes, as soon as I set
isDebuggable = true
for the release build variant, the bug is gone
b
I guess it's one of these cases where good old
println
is the only tool that can help us 🙂
bored party parrot 1
e
yup, println will never fail us 😄 here's the repro, btw: https://github.com/eduardb/ApolloAndroidLollipopParsingBugRepro would you like me to open an issue on GH?
b
great 👍 Yes please do 🙏
👍 1
thank you color 1
e
b
Alas, the ARMv7 api 21 emulator is unusably slow to boot, and I don't have my hands on a real device. Can you try to repro with
4.0.0-beta.6
by any chance?
e
Yeah, I couldn't use that emulator either 😞 Sure thing, I will test various different versions 👍
🙏 1
m
Looks vaguely R8/D8 related? Try disabling fullMode (if you have it) or optimizations?
🤔 1
e
3.7.0: no crash 3.7.5: no crash 3.8.2: no crash 3.8.3: crash – could it be that the Kotlin update introduced the issue? Testing v4 beta now
👀 2
m
could it be that the Kotlin update introduced the issue?
Looks like a good candidate
Might be worth dumping the bytecode before/after the upgrade
e
how do you do that? 😄
m
Copy code
git checkout sha1_at_1.9
./gradlew :apollo-api:compileKotlinJvm
javap -v libraries/apollo-api/build/classes/kotlin/jvm/main/com/apollographql/apollo3/api/json/BufferedSourceJsonReader.class
git checkout sha1_before_1.9
./gradlew :apollo-api:compileKotlinJvm
javap -v libraries/apollo-api/build/classes/kotlin/jvm/main/com/apollographql/apollo3/api/json/BufferedSourceJsonReader.class
thank you color 1
It’s exactly the same thoug...
🤔 1
e
All V4 versions seem to have the same issue. I tested on alpha.1, beta.6, and some in-between. I pushed the code on the
v4
branch in the repo linked above, fwiw
👍 1
thank you color 1
oh, and I tested with
android.enableR8.fullMode
set to both true and false respectively, same result
👀 1
thank you color 1
m
The more I think about it the more I think it’s an art bug on those devices?
🤔 1
Like this code doesn’t use any specific API I think
b
yes, looks like it to me too
with a few println we could maybe understand where it goes wrong, and even perhaps find a workaround
m
Currently thinking whether https://github.com/romainguy/kotlin-explorer can help 🤔
👍 1
IIRC, the tool uses an actual device to generate the assembly
1
Don’t have much time to dive into this right now but trying to isolate the issue even more would be super helpful. Pulling the BufferedSourceJsonReader code out of Apollo and directly into the repro app and removing extra code (this code doesn’t have a lot of dependencies). We can try to see if we can order a device here to reproduce too.
If we can reduce it down to a function we can make a compelling issue for Android/D8/other (if the issue is really there)
Now is there even an
oatdump
on lollipop devices, that is the question...
@Eduard Boloș where did you get your API 21 device from? 😅
a
@mbonnin we're using a Nexus 7 with an Android 5.0.2 factory image from https://developers.google.com/android/images Other google phones might also work. Or if you mean in prod - idk, we have users running stuff like Galaxy Note3.
🙏 2
e
Hmm, I don't know if it's an art bug though, would that explain that v3.8.2 works?
m
Nice! I have a Nexus 7 in a box somewhere... Only problem is I need to find the box 😅 and... the USB connector was shaky...
🤩 1
👍 1
e
I fiddled with Gradle for quite a bit to get composite builds working. If there's something specific you want me to println, now it's a good time 😄
m
would that explain that v3.8.2 works?
My hunch is something in the dex triggers an optimization that doesn’t work on 21
Because the same dex works fine on other devices
💡 1
Just a hunch TBH, it’s really hard to tell.
e
yeah, of course 😄
m
@Aleksandrs Vitjukovs looks like you’re using the “razor” build
5.0.2 (LRX22G)
?
e
Yes (I am using Alex's device over adborc 😄)
thank you color 1
👍 2
m
TIL adborc
1
e
Yeah, adborc is great, but you'll need a patched version to have it working with the latest version of scrcpy
m
Makes me want to play Warcraft blob upside down
😄 2
a
If you do get around to flashing android 5, the included
flash-all.sh
might not work. For me,

these instructions

worked.
thank you color 1
e
Hmm, I am debugging BufferedSourceJsonReader, and I have so many questions 😄 For instance, is it me, or in
peekNumber()
, the constructed
value
is negative (for a positive number, I mean):
value = *-*(c - '0'.code.toByte()).toLong()
,
val newValue = value * 10 *-* (c - '0'.code.toByte())
? Is there a reason for that? If I change the sign, then the bug goes away, fwiw.
m
There’s a comment in there but TBH I’m not sure I have all the background on this
This is code that was taken over from moshi that was taken over from gson 😄
e
ah, right:
var value: Long = 0 // Negative to accommodate Long.MIN_VALUE more easily.
it's almost like a game of Chinese whispers 😄
😄 1
e
now I wonder if moshi has the same issue on Android 5. but I don't think that's a rabbit hole I'll fall in today
🐰 1
m
I’m tempted to say if the issue was present on moshi, we’d have noticed by now but never know...
e
oh, well, I dropped using a negative value, and it all seems to be working fine 😬 would you accept such a patch in the library? 😄
this is the patch: https://github.com/apollographql/apollo-kotlin/pull/5938 tests should pass, let's see 😄
m
Given the amount of mileage in the moshi code, I wouldn’t change this without a deep understanding of what’s going on.
Almost sure there’s an edge case where the negative value is required
Alright, this is where it was first introduced 12 years ago!
👀 1
🕵️‍♂️ 1
(don’t hold your breath, there’s not too much context in there)
😅 1
🥲 1
e
I added tests for min and max values, they pass, gonna push as soon as the current tests finish running
m
At the very least, your PR should update this statement:
Copy code
fitsInLong = fitsInLong and (value > MIN_INCOMPLETE_INTEGER) || value == MIN_INCOMPLETE_INTEGER && newValue < value
🤔 1
It should be inverted if now we’re saying we’re keeping track of the positive value
Copy code
fitsInLong = fitsInLong and (value < MAX_INCOMPLETE_INTEGER) || value == MAX_INCOMPLETE_INTEGER && newValue > value
thank you color 1
Also the test must check that we’re not falling back to the
PEEKED_NUMBER
(“parse string”) case.
This code is highly optimized to not allocate a string by keeping track of numbers that can fit in a long (fun fact, this is also why there’s a
nextLong()
in the API)
e
Also the test must check that we’re not falling back to the
PEEKED_NUMBER
(“parse string”) case.
Hmm, not sure how to write a test for that, both
doPeek
and
peekNumber
are private. Should I just do a step-by-step through them to see? Or should I make them internal? 😬
m
I have absolutely no idea 😅
Probably breakpoint is enough
👍 1
My guess is that `-9223372036854775808L`wraps to 0 Long.MIN_VALUE
e
no, it doesn't in my unit test 😛
m
So
value == MAX_INCOMPLETE_INTEGER && newValue < value
and so
fitInLongs
becomes false and so we allocate a string
e
ah
let me see
m
Cause
9223372036854775808
cannot be represented in a Long
e
you were right 😭
m
lolsob
At least we understand why the negative value...
e
yeah....
m
Maybe put a println in the loop, see at what point
value
becomes
0x100000001
?
e
but how often do ppl need to deserialize Long.MIN_VALUE? 😄
😅 1
m
😄
I mean in our case, we should never enter the
fitsInLong
branch anyway. It’s a single “1”
We should go here
Copy code
value = -(c - '0'.code.toByte()).toLong()
And that’s about it
• We have
"1"
• We track the negative value so
value = -1
. In hex, this is
0xffffffffffffffff
• Then we invert this ◦ complement:
0x00000000000000000
◦ add 1:
0x0000000000000001
• And we have
"1"
as a long
Or I think man...
e
I changed a println, and now the bug doesn't happen anymore 🤦‍♂️ before:
XXX: c: 49, c - '0'.code.toByte(): 1, -(c - '0'.code.toByte()).toLong(): -1371107853288341505
after:
XXX: c: 49, c - '0'.code.toByte(): 1, (c - '0'.code.toByte()).toLong(): 1, -(c - '0'.code.toByte()).toLong(): -1
in the
NUMBER_CHAR_SIGN, NUMBER_CHAR_NONE
case
😭 1
definitely something very low level messing things up
nod 1
Copy code
(c - '0'.code.toByte()).toLong().also { value = -it }
This works lol
😅 1
1
how about this for a patch? 🤣
m
Not really sure how to debug this further TBH...
I guess looking at the produced assembly is the way to go but no idea how to do this, even less on API level 15
Might be worth opening an issue on the Google issue tracker as a reference?
e
API level 21 😛 But yeah...
Google issue tracker, the place where bugs are forever forgotten 😅
😅 1
m
Yeaaaaaaa
e
Now seriously, I highly doubt that someone from google will look into such an issue on Android 5. I just checked with Kotlin Explorer, and the DEX code for both
value = -(c - '0'.code.toByte()).toLong()
and
(c - '0'.code.toByte()).toLong().also { value = -it }
is the same when code is optimized with R8. Given these, how keen would you be to merging that change in? 🙏 This would make a couple thousand people very happy, and it would give them back access to affordable financial services 😄
is the same when code is optimized with R8
Hmm, actually, if I turn on R8 in the repro project, no bug... which prompted me to check the ProGuard output in our project, where we have R8 enabled... and it looks like the Apollo classes are not being optimized 😅 So that explains the bug too... but no idea why they are not optimized. Gonna investigate some more.
m
The plot thickens....
Let us know what you find out. I’m not super keen of merging things without really knowing the root cause but if there’s no other way I guess it’s on the table
b
does changing
"proguard-android-optimize.txt"
to
"proguard-android.txt"
change anything?
e
Sorry, I was actually looking at the wrong build, Apollo classes are optimized obfuscated and shrank actually, so now I am clueless @bod we use
"proguard-android.txt"
. Maybe we should use
"proguard-android-optimize.txt"
instead? 😄
m
FWIW, I could boot a 5.1 x86 emulator image but this doesn’t reproduce the problem kodee sad
thank you color 1
😭 1
b
we use
"proguard-android.txt"
. Maybe we should use
"proguard-android-optimize.txt"
instead? 😄
your repro project uses the optimize one so I guess that's not the culprit. Was worth a try I guess 😅
e
ah, but in my repro project I actually had R8 off, and as soon as I enabled it, the bug was gone, so maybe
"proguard-android-optimize.txt"
does help? In our real project we use the other file. Unfortunately, I can't test this anymore today, I will have to wait until tomorrow to get access to the Nexus 7 again.
b
oooh sorry I missed that bit - interesting 👀 ah yes indeed
isMinifyEnabled = false
😅 1
a
@Eduard Boloș don't leave people hanging. It did work. (I came here because I saw the result but not the entire journey) What a thread. Huge thanks for indulging us and for your patience. Still curious to understand why tho.
👀 1
1
m
So enabling r8 made it work??
So weird....
a
Eduard changed
proguard-android.txt
to
proguard-android-optimize.txt
. That's it. I'm not sure if that's the same as enabling R8?
s
r8 + "proguard-android-optimize.txt"? I don't even know what adding the -optimize one instead really does edit: Beat me to it ^^
👍 1
m
Not sure either 😅. I think R8 runs always? Either way, thanks for the follow up!
b
I think it's just a more 'agressive' set of optimizations enabled in R8? This probably changes something in the bytecode that "un-triggers" the bug (just adding some
println
also un-triggered it IIUC?) . However, to be honest, this could potentially trigger other issues 🙈
e
don't leave people hanging. It did work
Yes, sorry, Alex was able to test the change very last minute 😄 proguard-android.txt contains
-dontoptimize
, whereas proguard-android-optimize.txt does not, but it has some exceptions (
-optimizations !code/simplification/arithmetic,!code/simplification/cast,!field/*,!class/merging/*
). I really don't expect issues with these optimizations, as now they are the default. This is what ultimately triggered me to get to the solution, the fact that in Kotlin Explorer (thanks Martin for reminding me about it) the DEX code for both
value = -(c - '0'.code.toByte()).toLong()
and
(c - '0'.code.toByte()).toLong().also { value = -it }
were the same when "Optimize with R8" was checked, and obviously Benoit's question about which rules were used (thanks! 😄). Now I no longer have the DEX code around, but what I noticed was that the optimized dex code was using one fewer register to calculate the value than the non-optimized version. So I imagine that somehow that extra register was getting corrupted? Not sure though, just an assumption. Anyway, thanks Martin and Benoit for all your support, and your patience! You were great as always! I just wish I would've noticed earlier that I didn't have ProGuard enabled on the repro project, so that we wouldn't have reached 100+ messages on this thread 😄 But oh, well, at least I feel like a learned a lot on this journey 🙂 Until next time! 👋
🙏 1
❤️ 2