Hey trying to work out whether I m being pedantic or not whe kotlinlang #getting-started

Hey - trying to work out whether I'm being pedanti...

Rob Elliot

09/18/2024, 10:03 AM

Hey - trying to work out whether I'm being pedantic or not when I insist on extracting inlined

"[a-z]+".toRegex()

into a memoized

val

. Decompiling the generated bytecode shows that the compiler is generating byte code that instantiates a new instance of the Regex every time the code is called (as you'd expect). Is there some other optimisation taking place (in the Regex class's constructor? In the JVM?) which would prevent it endlessly recompiling the same regex? Is compiling the regex a sufficiently fast operation that it's stupid to be wasting time extracting a constant? Or am I right to care about this?

👀 1

Hristijan

09/18/2024, 10:40 AM

We noticed the same thing, also curious about this At the moment we have it wrapped in a

lazy

Not sure about whether this is a correct approach too

Rob Elliot

09/18/2024, 10:41 AM

I should really profile it rather than being lazy and asking...

Szymon Jeziorski

09/18/2024, 10:52 AM

I was wondering about exact same thing. I've created a stupidly simplified benchmark for it:

Copy code

private val regexStrings = """
    (\W|^)stock\s{0,3}tips(\W|${'$'})
    (\W|^)stock\s{0,3}tip(s){0,1}(\W|${'$'})
    (\W|^)[\w.\-]{0,25}@(yahoo|hotmail|gmail)\.com(\W|${'$'})
    (\W|^)po[#\-]{0,1}\s{0,1}\d{2}[\s-]{0,1}\d{4}(\W|${'$'})
    ^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})${'$'}
""".trimIndent().lines().map { it.trim() }

@State(Scope.Benchmark)
class RegexBenchmark {
    private val regexString0 = regexStrings[0]
    private val regexString1 = regexStrings[1]
    private val regexString2 = regexStrings[2]
    private val regexString3 = regexStrings[3]

    private val regex0 = regexString0.toRegex()
    private val regex1 = regexString1.toRegex()
    private val regex2 = regexString2.toRegex()
    private val regex3 = regexString3.toRegex()

    @Benchmark
    fun memoized(): List<Regex> = buildList {
        add(regex0)
        add(regex1)
        add(regex2)
        add(regex3)
    }

    @Benchmark
    fun createdOnDemand(): List<Regex> = buildList {
        add(regexString0.toRegex())
        add(regexString1.toRegex())
        add(regexString2.toRegex())
        add(regexString3.toRegex())
    }
}

and results suggest that unless I over simplified it, there is no magic optimization under the hood and it is often better to cache regexes performance-wise:

Copy code

benchmarks summary:
Benchmark                       Mode  Cnt     Score     Error  Units
RegexBenchmark.createdOnDemand  avgt    5  2552.654 ± 269.883  ns/op
RegexBenchmark.memoized         avgt    5    12.873 ±   1.689  ns/op

thank you color 2

Rob Elliot

09/18/2024, 10:54 AM

Awesome, thanks!

Klitos Kyriacou

09/18/2024, 11:01 AM

I think a realistic test would compare the time to compile a regex with the time to search a typical string using that regex. If searching takes 1000 times longer than compiling the regex, it would be pointless memoizing it (except for readability purposes).

Rob Elliot

09/18/2024, 11:07 AM

I did something like that, though with a very simple regex & random string - compilation takes roughly 10% of the search time. So not nothing.

Copy code

import org.apache.commons.lang3.RandomStringUtils
import kotlin.time.measureTime

fun main() {
  val elapsed = measureTime {
    repeat(1_000_000) {
      if (RandomStringUtils.random(10).matches("[a-z]+".toRegex())) {
        println("matched!")
      }
    }
  }
  println("took $elapsed")
}

👍 1

Klitos Kyriacou

09/18/2024, 11:19 AM

That's good. I wonder how much time is spent in RandomStringUtils though.

Rob Elliot

09/18/2024, 11:24 AM

Good point!

Rob Elliot

09/18/2024, 12:00 PM

Taking RandomStringUtils out of the equation it's more dramatic - the inline version takes > 3 times as long as the memoized version. So compiling the regex takes more than twice as long as matching on it. Having said with it's still individually tiny - compilation is costing roughly 112 nanoseconds on my machine, matching 46 nanoseconds.

Rob Elliot

09/18/2024, 12:14 PM

Unsurprisingly this gets worse as the regex gets more complicated - 340 nanoseconds for

^[a-zA-Z0-9]+([-.][a-zA-Z0-9]+)*\.?$

Szymon Jeziorski

09/18/2024, 12:16 PM

how long regex operations take greatly depends on both the complexity of Regexes, and the inputs they're being fed with. I guess it's very hard if not impossible to provide any universal benchmarks for pattern compilation vs regex operations performance. For one case regex operation might be 100 times longer than pattern compilation, for the other pattern compilation may be 10 times longer. IMO if you want reliable benchmarks, you should perform them in context of the real code base and real inputs. Also, given some realistic scenario it might be the case that all this regex stuff doesn't matter at all since 99% of business operation is being spent on DB or HTTP calls. To me, it looks like caching regexes can sometimes save you some time, so why not just consider it a good practice and go for it most of the times? Anyways, I also modified the benchmark to cover pattern compilation vs match/findAll performance. It's not really realistic, but just in case you wanted to see the results, here they are 🙂

Copy code

private val simpleRegexStrings = """
    [a-zA-Z]{2,}
    (\w+)\s*(\w+)
    \d{3}-\d{3}-\d{4}
    \d+
""".trimIndent().lines().map { it.trim() }

private val complexRegexStrings = """
    (\W|^)stock\s{0,3}tips(\W|${'$'})
    (\W|^)stock\s{0,3}tip(s){0,1}(\W|${'$'})
    (\W|^)[\w.\-]{0,25}@(yahoo|hotmail|gmail)\.com(\W|${'$'})
    (\W|^)po[#\-]{0,1}\s{0,1}\d{2}[\s-]{0,1}\d{4}(\W|${'$'})
    ^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})${'$'}
""".trimIndent().lines().map { it.trim() }

private val testedStrings = with(RandomStringUtils.secure()) {
    buildList {
        repeat(10) { this += next(Random.nextInt(10, 100)) }
        repeat(10) { this += nextAscii(Random.nextInt(10, 100)) }
        repeat(10) { this += nextAlphabetic(Random.nextInt(10, 100)) }
        repeat(10) { this += nextNumeric(Random.nextInt(10, 100)) }
        repeat(10) { this += nextAlphanumeric(Random.nextInt(10, 100)) }
    }
}


@State(Scope.Benchmark)
class RegexBenchmark {
    private val memoizedSimpleRegexes = simpleRegexStrings.map { it.toRegex() }
    private val memoizedComplexRegexes = complexRegexStrings.map { it.toRegex() }

    @Benchmark
    fun simpleMemoizedPlusMatches(): List<Boolean> =
        testedStrings.flatMap { string ->
            memoizedSimpleRegexes.map { it.matches(string) }
        }

    @Benchmark
    fun simpleCreatedOnDemandPlusMatches(): List<Boolean> =
        testedStrings.flatMap { string ->
            simpleRegexStrings.map { it.toRegex().matches(string) }
        }


    @Benchmark
    fun simpleMemoizedPlusFindAll(): List<MatchResult> =
        testedStrings.flatMap { string ->
            memoizedSimpleRegexes.flatMap { it.findAll(string).toList() }
        }

    @Benchmark
    fun simpleCreatedOnDemandPlusFindAll(): List<MatchResult> =
        testedStrings.flatMap { string ->
            simpleRegexStrings.flatMap { it.toRegex().findAll(string).toList() }
        }

    @Benchmark
    fun complexMemoizedPlusMatches(): List<Boolean> =
        testedStrings.flatMap { string ->
            memoizedComplexRegexes.map { it.matches(string) }
        }

    @Benchmark
    fun complexCreatedOnDemandPlusMatches(): List<Boolean> =
        testedStrings.flatMap { string ->
            complexRegexStrings.map { it.toRegex().matches(string) }
        }


    @Benchmark
    fun complexMemoizedPlusFindAll(): List<MatchResult> =
        testedStrings.flatMap { string ->
            memoizedComplexRegexes.flatMap { it.findAll(string).toList() }
        }

    @Benchmark
    fun complexCreatedOnDemandPlusFindAll(): List<MatchResult> =
        testedStrings.flatMap { string ->
            complexRegexStrings.flatMap { it.toRegex().findAll(string).toList() }
        }
}

Copy code

benchmarks summary:
Benchmark                                         Mode  Cnt       Score       Error  Units
RegexBenchmark.complexCreatedOnDemandPlusFindAll  avgt    5  421853.928 ± 18149.835  ns/op
RegexBenchmark.complexMemoizedPlusFindAll         avgt    5  246762.121 ± 33902.417  ns/op

RegexBenchmark.complexCreatedOnDemandPlusMatches  avgt    5  235193.925 ± 13763.335  ns/op
RegexBenchmark.complexMemoizedPlusMatches         avgt    5   64234.799 ± 13706.492  ns/op

RegexBenchmark.simpleCreatedOnDemandPlusFindAll   avgt    5  190437.414 ±  7547.736  ns/op
RegexBenchmark.simpleMemoizedPlusFindAll          avgt    5  175379.259 ±  4604.692  ns/op

RegexBenchmark.simpleCreatedOnDemandPlusMatches   avgt    5   68223.847 ±  1464.906  ns/op
RegexBenchmark.simpleMemoizedPlusMatches          avgt    5   31435.911 ±  1861.502  ns/op

Rob Elliot

09/18/2024, 12:17 PM

(But that still means you could compile > 3,000 similar regexes in the course of serving a single request and only add 1ms to the response time, so... meh.)

Daniel Pitts

09/18/2024, 1:49 PM

I generally like to pull regexes out anyway and give them semantic names. It's easy enough to understand what

Regex("\\w+")

matches, but

val userNameRegex = Regex("\\w+")

tells me WHY we're matching it. It also has the benefit – as minuscule as it may be – of reducing number of compiles.

➕ 3

14 Views

Open in Slack

Previous Next