go regexp.Replace notes
I had some code which did some repeated regexp.Replace operations. This is not the best way to do things, but it worked. It wasn’t noticeably slow, but just from inspection there were some inefficiencies. It’s possible to speed things up without rewriting it substantially, at some cost to clarity.
code
The code works with strings, because that’s what I’m used to dealing with. This is part of the problem.
s = re1.ReplaceAllString(s, "this")
s = re2.ReplaceAllString(s, "$3$2$1")
s = re3.ReplaceAllString(s, "that")
And so forth. It looks fairly straightforward, but what’s really happening?
string
The first thing to note is that the regexp package works with bytes internally, and all the regexp String functions convert to byte slices coming and going. This is not immediately visible, but when expanded inline it’s more obviously wasteful.
s = string(re1.ReplaceAll([]byte(s), []byte("this")))
s = string(re2.ReplaceAll([]byte(s), []byte("$3$2$1")))
s = string(re3.ReplaceAll([]byte(s), []byte("that")))
That’s a lot of conversions, each of which involves a copy. The compiler can optimize some of the simple cases, but not across function calls.
One fix is to do the conversion once, then do the regexp replacement.
b := []byte(s)
b = re1.ReplaceAll(b, []byte("this"))
b = re2.ReplaceAll(b, []byte("$3$2$1"))
b = re3.ReplaceAll(b, []byte("that"))
s = string(b)
This works well, although it gets tedious if the replacements are strings, and especially when using a replacement function, which now needs further conversion.
replace
The second thing I noticed is that regexp.Replace allocates a new copy even if there are no replacements to be made. It’s possible to avoid this by calling Match first.
if re1.Match(b) {
b = re1.ReplaceAll(b, []byte("this"))
}
It’s interesting that the plain strings.Replace function implements this optimization already. In optimizing my regexp calls by converting to bytes, I pessimized the plain strings.Replace calls by replacing them with bytes.Replace (which always allocates a copy).
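A small wrapper can restore that optimization on the bytes side, mirroring what strings.Replace does internally; the helper name here is mine, not from any library:

```go
package main

import (
	"bytes"
	"fmt"
)

// replaceIfPresent skips the allocating bytes.ReplaceAll call when
// old isn't in b, returning the input slice unchanged.
func replaceIfPresent(b, old, repl []byte) []byte {
	if !bytes.Contains(b, old) {
		return b
	}
	return bytes.ReplaceAll(b, old, repl)
}

func main() {
	b := []byte("no fruit here")
	out := replaceIfPresent(b, []byte("banana"), []byte("apple"))
	// no match: same backing array comes back, no copy made
	fmt.Println(&b[0] == &out[0]) // true
	fmt.Println(string(replaceIfPresent([]byte("one banana"),
		[]byte("banana"), []byte("apple")))) // one apple
}
```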
match
The Go regexp package has good worst case performance, but not great best case performance. It turned out quite a bit of time was spent in Match as well. In my case, I had a regexp like ...banana which one can clearly tell only matches if the string (or byte slice) contains “banana”, but the regexp seemed to spend quite some time matching the dots. A simple bytes.Contains call sped things up considerably.
if bytes.Contains(b, []byte("banana")) && re1.Match(b) {
b = re1.ReplaceAll(b, []byte("this"))
}
This doesn’t do much for every regexp, but for a select few, provided one can identify a decent anchor, there’s a lot to be gained.
results
All together, this was worth a 25x speedup in the common case where only a subset of the regexes matched. Most of the gains came from the final optimization. Overall, though, this code was mostly inconsequential for performance, and the souped up version was growing increasingly unsightly. But in other circumstances, it may be useful to bypass regexp matching.
thoughts
The regexp package itself could take advantage of the unsafe package to eliminate some copies. The regexp code will not write to its argument, so it’s safe to create a byte slice with the same backing store as the string. And we know the returned byte slice is freshly allocated with no aliases, so it will be safe to use as the backing store for a string.
The second problem really flows from the first. Replace needs to return a new slice for each call, lest further changes modify the initial argument. That’s not a problem with strings, however. The String versions of these functions could adopt the Match then Replace optimization. But it’s also fair to argue that the expected case is to do the replacement, and preemptive Match checking slows things down.
I’m not surprised or disappointed that the regexp library doesn’t do more aggressive optimizations to lookahead for impossible matches, but it would be nice to have.