Tags

, , ,

I just thought of a way to avoid recursion or unintended replacement when doing a search & replace routine over a set of strings (or even one big string).

I thought I’d write it down here in case I ever want to come back to it. (Also as a potential point of conversation.)

A search & replace routine doing multiple replacements can suffer from two problems if it re-scans text it’s already altered. If the routine re-applies all match patterns after every change, it can recurse infinitely if a replacement string re-creates a matched pattern.

Even if it avoids recursion, applying one replacement after another can cause problems earlier replacements create patterns matched by latter ones.

For example, when converting plain text to text useable in XML or HTML, there is a well-known pitfall to avoid: Ampersands (&) must be converted to special strings that look like this: &amp; The less-than sign (<) must also be converted; this time the strings look like: &lt;

Now consider a standard S&R routine that first changes all ‘<‘ characters to &lt; strings and then changes all ‘&‘ characters to &amp; strings. Any &lt; strings from the first step now become &amp;lt; strings. Thus the original ‘<‘ now looks like: &lt;

The trick, of course, is to make sure you change the ‘&‘ characters first so that any latter changes that generate new&‘ characters don’t have those changes corrupted.

(Note that changing ‘&‘ to &amp; offers a threat of infinite recursion if the S&R routine doesn’t avoid changing &amp; to &amp;amp; and that to &amp;amp;amp; and so on forever. Fortunately, this is fairly easy to avoid and not a risk in most situations. But be aware of it, especially if you roll your own search & replace routine.)

((You’d think it’d be easy enough to avoid simply by only applying each match pattern just once. That does avoid recursion, but there are times when you actually want recursion (assuming the design has some way of stopping it). Complex outputs can be created if you apply all matches after every change, but it does require the author of the match patterns to be aware of potential infinite recursion.))

In any event, an idea occurred to me that should make the order of replacements irrelevant.

Just think of the process as progressively building an output string from an input string. Replacements remove segments from the input string and insert them into the output string. After all replacements are applied, any remaining bits of string are removed and inserted into the output.

No search pattern ever matches any changed string, so there’s no way for patterns to interfere. (Again, there are situations when you do want later patterns to see and potentially change earlier changes, but in my experience most times not.)

Assume you have a list of strings to process. Or assume you have a single string to process; it the same either way.

From the pristine input string, create a “virtual” output string that is empty, but has an indexed slot for each input character.

My thought is to implement the input and output strings as linked lists. A simple forward-linked list might work just fine. Each segment in the output list has index values for its start and stop position in the old list. As the output string grows, its segments have gaps between them representing unchanged text.

When a pattern finds a match, it removes the matched string from the input — linking the previous and subsequent strings across the gap. Index values in those list segments record the missing text. The previous string might end at character 26, and the following string might start at character 31, so characters 27-30 have been matched and replaced by some new string (which is in the output).

As the input string shrinks, the output string grows. It doesn’t matter if the inserts don’t have the same length. The output string is virtual and nothing needs to be shifted. The index values in the output segment reflect where the matched pattern was, so they’ll line up with all the unmatched text when it’s copied.

After all the patterns are applied, presumably both the input and output strings are not empty, but it’s possible they could be.

If the output is empty, no matches occurred, so just copy the input to the output, and we’re done.

If the input is empty, the entire string was matched and replaced (either all at once or over many matches). In this case the output string is “full,” so we’re done.

Assuming some segments remain in the input, just transfer them to their slots in the output. Effectively this step matches all remaining strings (virtually, either with themselves or with a wildcard, same either way) and replaces them with themselves.

In fact, if the “wildcard match” is applied as the final pattern, there is no need to consider the three possible final states. In all cases the final state will be an empty input string and a “full” output.

It’s one of those algorithms that you think of and — if you are really a programmer — you immediately think: “That would be fun to implement!” It’s actually a fairly simple one to do — simple enough for a semi-advanced student.

Or maybe it’s just me. Even after all this time, even in this day and age, I still like tinkering with linked lists and doing list- and string-processing routines. I just can’t stop coding! 🙂