Monday, May 16, 2011

Regex not so Regular?

A couple of days ago, on the VWNC mailing list, there was a discussion about the Regex11 package. Regex11 was originally written by Vassili Bykov as a Regex library for Smalltalk. It's actually been around a while and served the Smalltalk community pretty well, I think, in part due to Vassili's excellent coding abilities.

I think it's really cool that someone can sit down and write a complete Regex implementation in Smalltalk. This is part of the Smalltalk ethos. Being able to invent your own future. Write your own software, in any area you like. On the other hand, as I read the comments, I found myself wondering, "it's great that you can do this, but does that mean you should?" Of course, the answer varies. It depends on what your goals are. A certain part of me feels that I don't really want to care this much about how the implementation works. Isn't Regex one of those things that's grown up enough now, that we can just use off-the-shelf services provided by the host platforms? If we just reuse those, we might get some speed (benefits of thousands of code monkeys tuning and tweaking them over the years), and some standardization.

So I went on a little journey. Read on if what I discovered is of interest...

The Package

I put my work in a package called ExRegex. As in "External Regex", using the ExternalInterface features of VisualWorks to bind to an outside-of-smalltalk shared library. Plus I just liked the way ExRegex looked almost palindromic. And I made a Test package called ExRegex-Tests. No surprises here, pretty standard operating procedure.

I only began to implement the substitution stuff; for now, it's just good for matching. One thing I did do differently than Regex11 was employ a symmetric pair of binary selectors for doing Regex matches.
    '(a|b)c' ?= 'ac' --> true
'Travis' =? '(Senior|Griggs)' --> false


The ? character always leans towards the side with the regex pattern.

The Tests

I stole the tests from the Regex11-Testing package. I made my own SUnitToo copy of it. I soon found I wasn't satisfied with them. The original tests have a single test method. But said method fetches a clauseArray which is a series of inputs, and expected outputs. There's 3 or 4 relatively involved methods between the loop in the single test method and actually executing a single regex match.

What this means, is that when you get a failure, you don't know much. First you have to dig through the little framework of interpreting the clauseArray. When I'm trying to understand how someone's Regex framework works, I don't want to understand how their Regex test framework works as well. I just want to see very simple lines of code, the kind I would put in a workspace.

Secondly, as soon, as you hit a failure, you're done. The clauseArray has 137 entries in it. So if it fails for any reason (error or assertion failure) on test number 22... you have no idea how the others work. Is it just this one test? Or do others fail too? Is there a common pattern amongst the failures?

I think we forget sometimes, that just as easy as it is to write a little DSL or interpretation framework, it's just as easy to write a little bit of code that interprets that into real Smalltalk methods. It's easy to generate code and it's easy to compile it and it;s easy to install it. Here's what I did for this particular case:

RegexTest new clauseArray keysAndValuesDo:
[:index :array |
array size > 2
ifTrue:
[ws := String new writeStream.
ws
nextPutAll: 'match';
print: index.
ws nextPutAll: ' <test> '.
ws nextPutAll: ' | regex | '.
ws nextPutAll: ' regex := ' , array first printString , ' exRegex. '.
2 to: array size
by: 3
do:
[:n |
ws nextPutAll: ' self '.
ws
nextPutAll: ((array at: n + 1) ifTrue: [' assert: '] ifFalse: [' deny: ']).
ws
nextPutAll: ' (regex match: ';
print: (array at: n);
nextPutAll: ').'].
RegexTest compile: (RBParser parseMethod: ws contents) formattedCode
classified: 'tests match']]


I didn't even have to bother to format the code, I let the RB services do that for me. What I ended up with was nice easy methods to read like this:


And when I run the tests from within the IDE I get nice feedback like this:



This made it much easier to unravel things that, well, started to unravel after that.

So much for Regularity

Regularity means (among others) "conforming to a standard or pattern." I'm not sure a more ironic name was ever chosen as a moniker for a "standard set" of pattern matching. The C library interface is pretty standard. You have 4 basic functions:

  • regcomp() - builds up a regex structure based on a pattern and some flags
  • regexec() - compares an input source against the regex structure for matches
  • regerror() - makes human readable strings for your regex structure when parse errors exist
  • regfree() - used to free the various chunks of memory associated with the regfree structure

I knew Regex's varied a little from platform to platform. What surprised me was that the Regex11 did stuff the C library one couldn't (e.g. \d as sugar for [0-9] and \s for separators). And vice versa, the C library variant will do [0-9]{1,3} (match 1 to 3 digits), but the Smalltalk Regex11 can't do the {1,3} thing. You'd have to live with + (1 or more) or repeat it 3 times.

But the fun didn't stop there. While both are part of the standard libc, the BSD variant that one finds on MacOSX, is different than the one finds under GNU platforms, such as Linux. And not just in their behavior. While the 4 C functions are the same between the two, and the same symbolic defines for the flags exist, they actually map to different values. For example, the option RE_NOSUB (what you use when you just care if it matches or not, not where), is a 0x08 on a GNU platform and 0x04 on a BSD platform. And the regex_t structure, which you're responsible to allocate the memory for, is entirely different between the two. For a C program using these libraries, you just use the defines, and you're fine, but for a Smalltalk (or any other not-compiled-by-the-C-Compiler/Preprocessor language), that more intimate knowledge of what the symbolic values mean, becomes more important. Getting around these differences was interesting.

Regular? Anything but. This is kind of alarming actually. It's not like these things give you an error when you try to use something like \d. It really just tries to match an escaped d (rather than a digit), and now it fails. The upshot is that you don't have much portability with these things, and as you move from environment to environment, language to language, platform to platform, you'd better have a pretty good test suite around your regexes to make sure they're still matching the way they were when you developed your original application. On this aspect, this is one time where the "do it all in Smalltalk" actually has a compelling advantage, depending on your constraints.

There's an excellent exhaustive table at this link that shows a number of common implementations, and how they compare across a whole plethora of different options.

The Speed Factor

What about the speed issue? Again, it depends. For doing simple matches, Regex11 will come out faster. Less overhead in going across the C boundaries. Let's take a regex for matching IP addresses:
'.*(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?).*'.

If we match this against simple strings like '192.168.3.128' and 'abc', Regex11 is faster. As much as 3 times faster. But if we match against larger strings, such as the guts of the IPSocketAddress class>>allAboutHostAddresses method (an expected match) and the result of Object comment (shouldn't match), then the tables turn quite a bit. The Smalltalk version is suddenly 210 times slower!. This measures both the time to create the Regex object, as well as do the match.

So, I guess it depends on what kind of Regex's you're doing, and what kinds of frequencies (remember, it should always be "Make it Work, Make it Right, Make it Fast").

I don't know why exactly, the pattern above actually requires the leading and trailing .* values to match the ip patterns mid source, I had to add it for Regex11, but the ExRegex interface didn't need it.

7 comments:

  1. But if we match against larger strings [...] The Smalltalk version is suddenly 210 times slower!.

    This is to be expected, the Regex11 Matcher implementation is a naive implementation. Compiling down to more elaborate matcher architectures (NFAs or DFAs) will make it faster on larger inputs (and known problematic inputs!). See the first article listed here: http://swtch.com/~rsc/regexp/.

    I don't know why exactly, the pattern above actually requires the leading and trailing .* values to match the ip patterns mid source

    There are various use cases for pattern matching: one is filtering entire strings, another is substring search (what you seem to expect). So it seems you are using the filtering API instead of the substring matching API.

    ReplyDelete
  2. You do realize that "regular" in "regular expresion" refers to the set of languages that can be described via a regular expression (as per Chomsky's classification), and not any kind of standard feature set? ;-)

    ReplyDelete
  3. I've always wondered, if an implementation using finite state automata would be faster. My gut feeling tells me that perhaps parsing a regex could be slightly slower but matching should be much faster. Have you any insights on this? (Actually I've thought of implementing RegEx11 this way as a personal exercise. Just like Vassili's package comment states.)

    ReplyDelete
  4. Steffen, please do :-)

    How: http://swtch.com/~rsc/regexp/

    Why:


    ['aaaaaaaaaaaaaaaaaaaaaa'
    matchesRegex: 'a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaa']
    microsecondsToRun / 1000000.0

    ReplyDelete
  5. Regex11 does not support backreferences, does it? (I've never needed this feature yet.) A FSA approach is only possible if backreferences are not needed.

    ReplyDelete
  6. Two features I found seriously lacking from Regex11/VB-Regex are non-greedy qualifiers and match objects instead of plain strings, e. g.

    '<p>just a test</p><p>another one</p>'
    allRegexMatches: '<p>(.*?)</p>'
    do: [ :m | Transcript nextPutAll: (m at: 1)]

    This should print 'just a test' and 'another one' from the match instances.

    Maybe it's not a good idea to parse non-regular languages like XML this way - better use SAX/DOM parsers but sometimes it comes in handy.

    Actually I just found out GNU-Smalltalk implements both features this way. Although it relies on a low-level C implementation tied to the String class (#searchRegexInternal:from:to:) which is "derived from GNU libc, with modifications made originally for Ruby to support Perl-like syntax" (105KB C source, 4K LOC).

    see GNU Smalltalk Manual

    ReplyDelete