Hey - trying to work out whether I'm being pedanti...
# getting-started
r
Hey - trying to work out whether I'm being pedantic or not when I insist on extracting inlined
"[a-z]+".toRegex()
into a memoized
val
. Decompiling the generated bytecode shows that the compiler is generating byte code that instantiates a new instance of the Regex every time the code is called (as you'd expect). Is there some other optimisation taking place (in the Regex class's constructor? In the JVM?) which would prevent it endlessly recompiling the same regex? Is compiling the regex a sufficiently fast operation that it's stupid to be wasting time extracting a constant? Or am I right to care about this?
👀 1
h
We noticed the same thing, also curious about this At the moment we have it wrapped in a
lazy
Not sure about whether this is a correct approach too
r
I should really profile it rather than being lazy and asking...
s
I was wondering about exact same thing. I've created a stupidly simplified benchmark for it:
Copy code
private val regexStrings = """
    (\W|^)stock\s{0,3}tips(\W|${'$'})
    (\W|^)stock\s{0,3}tip(s){0,1}(\W|${'$'})
    (\W|^)[\w.\-]{0,25}@(yahoo|hotmail|gmail)\.com(\W|${'$'})
    (\W|^)po[#\-]{0,1}\s{0,1}\d{2}[\s-]{0,1}\d{4}(\W|${'$'})
    ^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})${'$'}
""".trimIndent().lines().map { it.trim() }

@State(Scope.Benchmark)
class RegexBenchmark {
    private val regexString0 = regexStrings[0]
    private val regexString1 = regexStrings[1]
    private val regexString2 = regexStrings[2]
    private val regexString3 = regexStrings[3]

    private val regex0 = regexString0.toRegex()
    private val regex1 = regexString1.toRegex()
    private val regex2 = regexString2.toRegex()
    private val regex3 = regexString3.toRegex()

    @Benchmark
    fun memoized(): List<Regex> = buildList {
        add(regex0)
        add(regex1)
        add(regex2)
        add(regex3)
    }

    @Benchmark
    fun createdOnDemand(): List<Regex> = buildList {
        add(regexString0.toRegex())
        add(regexString1.toRegex())
        add(regexString2.toRegex())
        add(regexString3.toRegex())
    }
}
and results suggest that unless I over simplified it, there is no magic optimization under the hood and it is often better to cache regexes performance-wise:
Copy code
benchmarks summary:
Benchmark                       Mode  Cnt     Score     Error  Units
RegexBenchmark.createdOnDemand  avgt    5  2552.654 ± 269.883  ns/op
RegexBenchmark.memoized         avgt    5    12.873 ±   1.689  ns/op
thank you color 2
r
Awesome, thanks!
k
I think a realistic test would compare the time to compile a regex with the time to search a typical string using that regex. If searching takes 1000 times longer than compiling the regex, it would be pointless memoizing it (except for readability purposes).
r
I did something like that, though with a very simple regex & random string - compilation takes roughly 10% of the search time. So not nothing.
Copy code
import org.apache.commons.lang3.RandomStringUtils
import kotlin.time.measureTime

fun main() {
  val elapsed = measureTime {
    repeat(1_000_000) {
      if (RandomStringUtils.random(10).matches("[a-z]+".toRegex())) {
        println("matched!")
      }
    }
  }
  println("took $elapsed")
}
👍 1
k
That's good. I wonder how much time is spent in RandomStringUtils though.
r
Good point!
Taking RandomStringUtils out of the equation it's more dramatic - the inline version takes > 3 times as long as the memoized version. So compiling the regex takes more than twice as long as matching on it. Having said with it's still individually tiny - compilation is costing roughly 112 nanoseconds on my machine, matching 46 nanoseconds.
Unsurprisingly this gets worse as the regex gets more complicated - 340 nanoseconds for
^[a-zA-Z0-9]+([-.][a-zA-Z0-9]+)*\.?$
s
how long regex operations take greatly depends on both the complexity of Regexes, and the inputs they're being fed with. I guess it's very hard if not impossible to provide any universal benchmarks for pattern compilation vs regex operations performance. For one case regex operation might be 100 times longer than pattern compilation, for the other pattern compilation may be 10 times longer. IMO if you want reliable benchmarks, you should perform them in context of the real code base and real inputs. Also, given some realistic scenario it might be the case that all this regex stuff doesn't matter at all since 99% of business operation is being spent on DB or HTTP calls. To me, it looks like caching regexes can sometimes save you some time, so why not just consider it a good practice and go for it most of the times? Anyways, I also modified the benchmark to cover pattern compilation vs match/findAll performance. It's not really realistic, but just in case you wanted to see the results, here they are 🙂
Copy code
private val simpleRegexStrings = """
    [a-zA-Z]{2,}
    (\w+)\s*(\w+)
    \d{3}-\d{3}-\d{4}
    \d+
""".trimIndent().lines().map { it.trim() }

private val complexRegexStrings = """
    (\W|^)stock\s{0,3}tips(\W|${'$'})
    (\W|^)stock\s{0,3}tip(s){0,1}(\W|${'$'})
    (\W|^)[\w.\-]{0,25}@(yahoo|hotmail|gmail)\.com(\W|${'$'})
    (\W|^)po[#\-]{0,1}\s{0,1}\d{2}[\s-]{0,1}\d{4}(\W|${'$'})
    ^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})${'$'}
""".trimIndent().lines().map { it.trim() }

private val testedStrings = with(RandomStringUtils.secure()) {
    buildList {
        repeat(10) { this += next(Random.nextInt(10, 100)) }
        repeat(10) { this += nextAscii(Random.nextInt(10, 100)) }
        repeat(10) { this += nextAlphabetic(Random.nextInt(10, 100)) }
        repeat(10) { this += nextNumeric(Random.nextInt(10, 100)) }
        repeat(10) { this += nextAlphanumeric(Random.nextInt(10, 100)) }
    }
}


@State(Scope.Benchmark)
class RegexBenchmark {
    private val memoizedSimpleRegexes = simpleRegexStrings.map { it.toRegex() }
    private val memoizedComplexRegexes = complexRegexStrings.map { it.toRegex() }

    @Benchmark
    fun simpleMemoizedPlusMatches(): List<Boolean> =
        testedStrings.flatMap { string ->
            memoizedSimpleRegexes.map { it.matches(string) }
        }

    @Benchmark
    fun simpleCreatedOnDemandPlusMatches(): List<Boolean> =
        testedStrings.flatMap { string ->
            simpleRegexStrings.map { it.toRegex().matches(string) }
        }


    @Benchmark
    fun simpleMemoizedPlusFindAll(): List<MatchResult> =
        testedStrings.flatMap { string ->
            memoizedSimpleRegexes.flatMap { it.findAll(string).toList() }
        }

    @Benchmark
    fun simpleCreatedOnDemandPlusFindAll(): List<MatchResult> =
        testedStrings.flatMap { string ->
            simpleRegexStrings.flatMap { it.toRegex().findAll(string).toList() }
        }

    @Benchmark
    fun complexMemoizedPlusMatches(): List<Boolean> =
        testedStrings.flatMap { string ->
            memoizedComplexRegexes.map { it.matches(string) }
        }

    @Benchmark
    fun complexCreatedOnDemandPlusMatches(): List<Boolean> =
        testedStrings.flatMap { string ->
            complexRegexStrings.map { it.toRegex().matches(string) }
        }


    @Benchmark
    fun complexMemoizedPlusFindAll(): List<MatchResult> =
        testedStrings.flatMap { string ->
            memoizedComplexRegexes.flatMap { it.findAll(string).toList() }
        }

    @Benchmark
    fun complexCreatedOnDemandPlusFindAll(): List<MatchResult> =
        testedStrings.flatMap { string ->
            complexRegexStrings.flatMap { it.toRegex().findAll(string).toList() }
        }
}
Copy code
benchmarks summary:
Benchmark                                         Mode  Cnt       Score       Error  Units
RegexBenchmark.complexCreatedOnDemandPlusFindAll  avgt    5  421853.928 ± 18149.835  ns/op
RegexBenchmark.complexMemoizedPlusFindAll         avgt    5  246762.121 ± 33902.417  ns/op

RegexBenchmark.complexCreatedOnDemandPlusMatches  avgt    5  235193.925 ± 13763.335  ns/op
RegexBenchmark.complexMemoizedPlusMatches         avgt    5   64234.799 ± 13706.492  ns/op

RegexBenchmark.simpleCreatedOnDemandPlusFindAll   avgt    5  190437.414 ±  7547.736  ns/op
RegexBenchmark.simpleMemoizedPlusFindAll          avgt    5  175379.259 ±  4604.692  ns/op

RegexBenchmark.simpleCreatedOnDemandPlusMatches   avgt    5   68223.847 ±  1464.906  ns/op
RegexBenchmark.simpleMemoizedPlusMatches          avgt    5   31435.911 ±  1861.502  ns/op
r
(But that still means you could compile > 3,000 similar regexes in the course of serving a single request and only add 1ms to the response time, so... meh.)
d
I generally like to pull regexes out anyway and give them semantic names. It's easy enough to understand what
Regex("\\w+")
matches, but
val userNameRegex = Regex("\\w+")
tells me WHY we're matching it. It also has the benefit – as minuscule as it may be – of reducing number of compiles.
3