This paper has a pretty good explanation of multistage programming using Java as an example:
https://sci-hub.tw/https://doi.org/10.1007/978-3-662-44202-9_16
Basically, as your host program runs, nothing "happens" -- it just accumulates eDSL code into some data structure (e.g. a stack, a queue, or an AST/DAG). At periodic intervals, you compile that down to an intermediate representation, apply some optimizations (e.g. constant propagation, common subexpression elimination), and maybe specialize it to another target (emitting e.g. CUDA or WebAssembly). Then at some predetermined time, you execute the whole chunk at once. You could also do this for each line of code (in which case it's just an interpreter), but the execution of the eDSL does not need to follow the program counter of the host program.
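A minimal sketch of the "accumulate, then execute" idea, in Java since that's the language the paper uses. The names here (Expr, Lit, Add, Mul) are made up for illustration, not from any real library:

```java
// Host code that *looks* like arithmetic but only builds an AST;
// nothing is computed until we evaluate the whole tree at once.
interface Expr {
    double eval();
}

record Lit(double v) implements Expr {
    public double eval() { return v; }
}

record Add(Expr l, Expr r) implements Expr {
    public double eval() { return l.eval() + r.eval(); }
}

record Mul(Expr l, Expr r) implements Expr {
    public double eval() { return l.eval() * r.eval(); }
}

public class Staging {
    public static void main(String[] args) {
        // As the host program runs, it only accumulates eDSL nodes.
        Expr staged = new Add(new Mul(new Lit(2), new Lit(3)), new Lit(4));
        // At some later, predetermined point, execute the whole chunk.
        System.out.println(staged.eval()); // 10.0
    }
}
```

Instead of calling eval() directly, a real multistage system would walk the tree and emit optimized code for some backend at this point.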
The benefit is mainly performance -- instead of carrying around the original data structure and interpreting exactly what the user wrote, you translate the eDSL code into specialized machine code on the fly. There has been a lot of progress in making these systems run at close to native speed.
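To make the "optimize before executing" step concrete, here is a sketch of one such optimization, constant folding, as a rewrite pass over the same kind of expression tree (again with illustrative, made-up names); a real staging compiler would run this and more over its intermediate representation before emitting code:

```java
interface Expr { double eval(); }
record Lit(double v) implements Expr { public double eval() { return v; } }
record Add(Expr l, Expr r) implements Expr { public double eval() { return l.eval() + r.eval(); } }
record Mul(Expr l, Expr r) implements Expr { public double eval() { return l.eval() * r.eval(); } }

class Folder {
    // Collapse any subtree whose operands are all literals into a single literal.
    static Expr fold(Expr e) {
        if (e instanceof Add a) {
            Expr l = fold(a.l()), r = fold(a.r());
            return (l instanceof Lit x && r instanceof Lit y)
                ? new Lit(x.v() + y.v()) : new Add(l, r);
        }
        if (e instanceof Mul m) {
            Expr l = fold(m.l()), r = fold(m.r());
            return (l instanceof Lit x && r instanceof Lit y)
                ? new Lit(x.v() * y.v()) : new Mul(l, r);
        }
        return e; // a literal (or, in a richer eDSL, a free variable) is left alone
    }

    public static void main(String[] args) {
        // (2 * 3) + 4 collapses to a single node before any "execution".
        Expr folded = fold(new Add(new Mul(new Lit(2), new Lit(3)), new Lit(4)));
        System.out.println(folded); // Lit[v=10.0]
    }
}
```

Because the eDSL is just data at this point, passes like this are ordinary tree transformations, which is what makes retargeting the same program to, say, a GPU backend plausible.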
I guess the benefit for Kotlingrad is that it could leverage the GPU. There's been some discussion about it here:
https://github.com/breandan/kotlingrad/issues/8