# apollo-kotlin
t
Hello! I have been working on upgrading my team’s Apollo codegen and have unfortunately hit some roadblocks. During the validation steps, when calling `GQLDocument.validateAsExecutable` and again when building IR, our build is severely slowed by heavy fragment re-use, which results in very long validation times. These validation calls also recursively call `validate` on every selection set, so it seems some entities are being validated multiple times. More in 🧵
I’ve forked the repo at version v3.8.2 and have retrofitted `CodegenTest` to run over our source files. Are there changes in v4.0.0 that could have improved anything? Java codegen finishes and builds in about 8 minutes on v2.3.1 (super old I know 😅), but on v3.8.2, fragment validation on the first pass takes about 41 minutes and operation validation on the first pass takes about 29 minutes. Some of the fragments and operations are validated once more when building IR, so validation takes close to 2 hours. After getting through document validation and building IR, we ended up with an OOM error while writing file infos after a grand total of 1 hour and 58 minutes.
Can the validation calls be memoized? I’m not really familiar with document validation, or with whether every single call to `validate` is necessary to maintain the integrity of the document. Could we optimize our fragment/operation definitions to help too? I’m currently working on getting an obfuscated version of our source. I also saw this issue and was surprised not to find anything about the validation steps taking a long time. For context, we have over 5k type definitions in our schema, including over 1k fragments defined and over 200 operations. `CodegenTest` was altered to generate Kotlin models and use operation-based codegen when testing our build. Sorry for the super long read, but hoping we can get somewhere!
m
Yikes! 2h is not great. Codegen performance hasn't really been a problem so far. Usually, build time is dominated by the Kotlin compiler, so we haven't spent a lot of time improving codegen performance, but it looks like we should.
One expensive part of validation is `fieldsCanMerge`. We have an issue to speed things up there but haven't got to it yet. It's hard to memoize things because every fragment needs to be validated in the context of its operation.
But it might very well be something completely different. 2h sounds like way too much TBH, so the code must be hitting a busy loop somewhere or something like that. If you can share your schema/operations, I'm pretty sure we can get to the bottom of this. There is graphql-anonymizer to anonymize your schema + queries. Or if you don't want to share publicly, feel free to share at martin@apollographql.com, that works too.
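A rough Kotlin sketch of the memoization trade-off mentioned above. Everything in it (`Fragment`, `countVisitsNaive`, `countVisitsMemoized`, `buildLayeredFragments`, the `context` string) is hypothetical and is not Apollo Kotlin's validation code. It only illustrates how re-expanding shared fragment spreads multiplies the work, and why a cache would need to be keyed on the context as well as the fragment if the result depends on the enclosing operation.

```kotlin
// Hypothetical model: a fragment is a name plus the names of the fragments it spreads.
data class Fragment(val name: String, val spreads: List<String>)

// Naive traversal: every spread is re-expanded each time it is reached,
// so a fragment shared by many parents is walked once per path leading to it.
fun countVisitsNaive(name: String, fragments: Map<String, Fragment>): Long {
    val fragment = fragments.getValue(name)
    return 1 + fragment.spreads.sumOf { countVisitsNaive(it, fragments) }
}

// Memoized traversal: each (fragment, context) pair is processed at most once.
// `context` is an assumed stand-in for whatever the check depends on,
// for example the enclosing operation.
fun countVisitsMemoized(
    name: String,
    context: String,
    fragments: Map<String, Fragment>,
    seen: MutableSet<Pair<String, String>> = mutableSetOf(),
): Long {
    if (!seen.add(name to context)) return 0 // already handled in this context
    val fragment = fragments.getValue(name)
    return 1 + fragment.spreads.sumOf { countVisitsMemoized(it, context, fragments, seen) }
}

// Builds F0..F(depth-1), where each fragment spreads the next one twice.
fun buildLayeredFragments(depth: Int): Map<String, Fragment> =
    (0 until depth).associate { i ->
        val next = if (i + 1 < depth) listOf("F${i + 1}", "F${i + 1}") else emptyList()
        "F$i" to Fragment("F$i", next)
    }
```

With `buildLayeredFragments(20)`, the naive walk from `F0` makes 1,048,575 visits, while the memoized walk for a single context makes 20. A second operation would need its own context key, so the saving applies per operation rather than across the whole document.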
Did you get any chance to look into it? I'm quite curious where the bottleneck might be now 🙂
t
Ah yes! I’ve been working on obfuscating our biggest query and just got it properly obfuscated and building again. This one takes about 20 minutes alone if it’s put through `CodegenTest`. Will upload here shortly!
Here are the scalars we use.
scalarMapping = mapOf(
    "ykvogxdjcr" to ScalarInfo("java.lang.String"),
    "dgsalnbfnk" to ScalarInfo("java.lang.String"),
    "owblnrurns" to ScalarInfo("java.lang.String"),
    "fxvpqiduxs" to ScalarInfo("java.lang.String"),
    "ebjtupyrji" to ScalarInfo("java.lang.String"),
    "vtxouwljjr" to ScalarInfo("java.lang.String"),
    "inpciuewhq" to ScalarInfo("java.lang.String"),
    "fkukbftrbi" to ScalarInfo("java.lang.String"),
    "buoguxadgo" to ScalarInfo("java.lang.String"),
    "dujbacbaoz" to ScalarInfo("java.lang.String"),
),
I’m still stepping through the code to see what the actual bottleneck is for this example, although it seems to be `selectionSet.validate` and not `fieldsInSetCanMerge`. Not 100% sure yet though.
m
Nice! `fieldsInSetCanMerge` sounds like a good candidate!
Calling just validate is ok-ish, takes 11s on my M2 laptop (test here)
Sanity check: are you using `responseBased` codegen by any chance?
I need to call it a day, will look into more details tomorrow. Let me know if you find anything!
t
Oh ok so I just ran the test on my machine and got pretty fast results as you did. It seems like the latest 4.x version might’ve fixed something, because changing the version back to 3.8.2 in the test causes validation to take a long time again. Wondering if the `detectCycles` addition in 4.x is short-circuiting our validation chain? Sounds good! Thanks for taking a look. Will do 👍
Oh and also, I was using `operationBased`. `responseBased` didn’t finish (or I gave up) when I tried it on the obfuscated example.
m
Makes sense. I started a profile run with IJ, see if we can get a flame graph, will post that tomorrow (if the test finishes 😅 )
PS: I don't think detectCycle would help. It's doing additional checks so most likely slowing things down if anything
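For context, a cycle check over fragment spreads is commonly written as a depth-first walk like the sketch below. This is a generic illustration and not Apollo Kotlin's actual implementation; the function name `hasSpreadCycle` and the plain `Map` of spreads are assumptions made for the example. It shows that such a check is an extra traversal of the spread graph, which fits the point that it adds work rather than removing any.

```kotlin
// Hypothetical input: fragment name -> names of the fragments it spreads.
// Returns true if the spreads reachable from `start` contain a cycle.
fun hasSpreadCycle(start: String, spreads: Map<String, List<String>>): Boolean {
    val onPath = mutableSetOf<String>() // fragments on the current DFS path
    val done = mutableSetOf<String>()   // fragments fully explored, known cycle-free

    fun visit(name: String): Boolean {
        if (name in done) return false
        if (!onPath.add(name)) return true // reached a fragment already on the path
        val cyclic = spreads[name].orEmpty().any { visit(it) }
        onPath.remove(name)
        done.add(name)
        return cyclic
    }

    return visit(start)
}
```

A diamond such as `A -> B, C` with `B -> C` is reported as cycle-free, while `A -> B -> C -> A` is reported as a cycle.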
t
So it looks like 4.x is much faster! Really curious to know what changed between 3.8.2 and the latest beta.
m
Flame graph doesn't tell much, it just spends a lot of time traversing the GraphQL tree. Maybe `possibleTypes`, but it looks more like v3 was traversing more than needed.
t
Interesting…well looks like we’ll go straight to v4! How long did that run take? Also curious to know what kinds of improvements made it into v4 that would have fixed this.
m
~11min IIRC (but I cut this run short because I didn't want to wait 😄 )
Also very curious about what changed but not much time to dive into this and since we have a solution I'm tempted to not look 😄
t
Haha agreed. Thanks for all the help! Much appreciated.