go regexp.Replace notes
I had some code which did some repeated regexp.Replace operations. This is not the best way to do things, but it worked. It wasn’t noticeably slow, but just from inspection there were some inefficiencies. It’s possible to speed things up without rewriting it substantially, at some cost to clarity.
code
The code works with strings, because that’s what I’m used to dealing with. This is part of the problem.
s = re1.ReplaceAllString(s, "this")
s = re2.ReplaceAllString(s, "$3$2$1")
s = re3.ReplaceAllString(s, "that")
And so forth. It looks fairly straightforward, but what’s really happening?
string
The first thing to note is that the regexp package works with bytes internally, and all the regexp String functions convert to byte slices coming and going. This is not immediately visible, but when expanded inline it’s more obviously wasteful.
s = string(re1.ReplaceAll([]byte(s), []byte("this")))
s = string(re2.ReplaceAll([]byte(s), []byte("$3$2$1")))
s = string(re3.ReplaceAll([]byte(s), []byte("that")))
That’s a lot of conversions, each of which involves a copy. The compiler can optimize some of the simple cases, but not across function calls.
One fix is to do the conversion once, then do the regexp replacement.
b := []byte(s)
b = re1.ReplaceAll(b, []byte("this"))
b = re2.ReplaceAll(b, []byte("$3$2$1"))
b = re3.ReplaceAll(b, []byte("that"))
s = string(b)
This works well, although it gets tedious if the replacements are strings, and especially when using a replacement function, which now needs further conversion.
replace
The second thing I noticed is that regexp.Replace allocates a new copy even if there are no replacements to be made. It’s possible to avoid this by calling Match first.
if re1.Match(b) {
b = re1.ReplaceAll(b, []byte("this"))
}
It’s interesting that the plain strings.Replace function implements this optimization already. In optimizing my regexp calls by converting to bytes, I pessimized the plain strings.Replace calls by replacing them with bytes.Replace (which always allocates a copy).
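A small wrapper can restore that optimization on the bytes side, mirroring what strings.Replace does internally; the helper name here is mine, not from any library:

```go
package main

import (
	"bytes"
	"fmt"
)

// replaceIfPresent skips the allocating bytes.ReplaceAll call when
// old isn't in b, returning the input slice unchanged.
func replaceIfPresent(b, old, repl []byte) []byte {
	if !bytes.Contains(b, old) {
		return b
	}
	return bytes.ReplaceAll(b, old, repl)
}

func main() {
	b := []byte("no fruit here")
	out := replaceIfPresent(b, []byte("banana"), []byte("apple"))
	// no match: same backing array comes back, no copy made
	fmt.Println(&b[0] == &out[0]) // true
	fmt.Println(string(replaceIfPresent([]byte("one banana"),
		[]byte("banana"), []byte("apple")))) // one apple
}
```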
match
The Go regexp package has good worst case performance, but not great best case performance. It turned out quite a bit of time was spent in Match as well. In my case, I had a regexp like ...banana which one can clearly tell only matches if the string (or byte slice) contains “banana”, but the regexp seemed to spend quite some time matching the dots. A simple bytes.Contains call sped things up considerably.
if bytes.Contains(b, []byte("banana")) && re1.Match(b) {
b = re1.ReplaceAll(b, []byte("this"))
}
This doesn’t do much for every regexp, but for a select few, provided one can identify a decent anchor, there’s a lot to be gained.
results
All together, this was worth a 25x speedup in the common case where only a subset of the regexes matched. Most of the gains came from the final optimization. Overall, though, this code was mostly inconsequential for performance, and the souped up version was growing increasingly unsightly. But in other circumstances, it may be useful to bypass regexp matching.
thoughts
The regexp package itself could take advantage of the unsafe package to eliminate some copies. The regexp code will not write to its argument, so it’s safe to create a byte slice with the same backing store as the string. And we know the returned byte slice is freshly allocated with no aliases, so it will be safe to use as the backing store for a string.
The second problem really flows from the first. Replace needs to return a new slice for each call, lest further changes modify the initial argument. That’s not a problem with strings, however. The String versions of these functions could adopt the Match then Replace optimization. But it’s also fair to argue that the expected case is to do the replacement, and preemptive Match checking slows things down.
I’m not surprised or disappointed that the regexp library doesn’t do more aggressive optimizations to lookahead for impossible matches, but it would be nice to have.