Thanks for your feedback. I think you would have to take into account more than just the activation function.
Arguments for other initialization strategies with ReLU activations have also been made in the literature for other contexts, e.g. for recurrent neural nets:
https://arxiv.org/pdf/1511.03771.pdf
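Just to illustrate the kind of difference I mean (this is only a rough sketch, not the scheme from the linked paper; all names below are made up for illustration), here is He-style fan-in scaling for a feed-forward ReLU layer next to an identity-based recurrent initialization in the spirit of ReLU-RNN work:

```scala
import scala.util.Random

object InitSketch {
  val rng = new Random(42)

  // He-style initialization: zero-mean Gaussian with variance 2 / fanIn,
  // commonly recommended for feed-forward layers with ReLU activations.
  def heInit(fanIn: Int, fanOut: Int): Array[Array[Double]] =
    Array.fill(fanOut, fanIn)(rng.nextGaussian() * math.sqrt(2.0 / fanIn))

  // Identity-based recurrent initialization in the spirit of ReLU-RNN work:
  // the recurrent weight matrix starts as (a scaled) identity so the hidden
  // state is roughly preserved across time steps early in training.
  def identityRecurrentInit(hiddenSize: Int, scale: Double = 1.0): Array[Array[Double]] =
    Array.tabulate(hiddenSize, hiddenSize)((i, j) => if (i == j) scale else 0.0)

  def main(args: Array[String]): Unit = {
    val w = heInit(fanIn = 256, fanOut = 128)
    val r = identityRecurrentInit(hiddenSize = 128)
    println(f"He init std ~ ${math.sqrt(2.0 / 256)}%.4f, recurrent diag = ${r(0)(0)}")
  }
}
```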
Again, rather than worrying about heuristics and arguments tied to particular contexts, I would rather invest the time in neural architecture search. I'm not a fan of magic numbers 😉
For some time I've been considering global optimizer settings that can be overridden locally. This might be added soon, but there are dozens of other tasks that seem more important right now.
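Purely as a hypothetical sketch of what that could look like (none of these names reflect an actual or planned API), the idea is global hyperparameters held in one place, with each layer optionally overriding individual fields:

```scala
// Hypothetical sketch only: global optimizer settings with per-layer overrides.
final case class OptimizerSettings(learningRate: Double, momentum: Double)

final case class OptimizerOverride(
    learningRate: Option[Double] = None,
    momentum: Option[Double] = None
) {
  // Fall back to the global value for any field that is not overridden locally.
  def resolve(global: OptimizerSettings): OptimizerSettings =
    OptimizerSettings(
      learningRate = learningRate.getOrElse(global.learningRate),
      momentum = momentum.getOrElse(global.momentum)
    )
}

object OverrideDemo extends App {
  val global = OptimizerSettings(learningRate = 0.01, momentum = 0.9)
  // This layer overrides only the learning rate; momentum comes from the global settings.
  val convOverride = OptimizerOverride(learningRate = Some(0.001))
  println(convOverride.resolve(global)) // OptimizerSettings(0.001,0.9)
}
```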
Using the new higher-order continuation base classes, it's relatively easy to define custom layers as a sequence of lower-level layers, such as a convolution followed by an activation.
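As a rough, library-agnostic sketch of the composition idea only (the actual base classes and their continuation-based signatures look different; every name below is made up for illustration):

```scala
// Illustrative only: composing a custom layer from lower-level layers.
trait Layer {
  def forward(input: Array[Double]): Array[Double]
}

// A simple "valid" 1-D convolution (cross-correlation) with a fixed kernel.
final class Conv1D(kernel: Array[Double]) extends Layer {
  def forward(input: Array[Double]): Array[Double] =
    Array.tabulate(input.length - kernel.length + 1) { i =>
      kernel.indices.map(k => kernel(k) * input(i + k)).sum
    }
}

// Element-wise ReLU activation.
final class ReLU extends Layer {
  def forward(input: Array[Double]): Array[Double] =
    input.map(math.max(0.0, _))
}

// A custom layer defined as a sequence of lower-level layers.
final class Sequential(layers: Layer*) extends Layer {
  def forward(input: Array[Double]): Array[Double] =
    layers.foldLeft(input)((x, layer) => layer.forward(x))
}

object ConvReluDemo extends App {
  val convRelu = new Sequential(new Conv1D(Array(1.0, -1.0)), new ReLU)
  println(convRelu.forward(Array(0.0, 2.0, 1.0, 3.0)).mkString(", ")) // prints: 0.0, 1.0, 0.0
}
```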