Comments on Objology: Regex not so Regular?

Two features I found seriously lacking from Regex1...

2011-05-18T09:20:38.894-07:00

Two features I found seriously lacking from Regex11/VB-Regex are non-greedy quantifiers and match objects instead of plain strings, e.g.

'just a testanother one'
allRegexMatches: '(.*?)'
do: [ :m | Transcript nextPutAll: (m at: 1)]

This should print 'just a test' and 'another one' from the match instances.
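For comparison, here is a minimal Python sketch of both features, using the standard `re` module as a stand-in rather than Regex11 or GNU Smalltalk. The `<b>` delimiters and the `captured_groups` helper are assumptions made up for this illustration.

```python
import re

def captured_groups(pattern: str, text: str) -> list:
    """Return group 1 of every match object found in text."""
    return [m.group(1) for m in re.finditer(pattern, text)]

# Hypothetical tagged input; the <b> delimiters are an assumption for illustration.
text = "<b>just a test</b><b>another one</b>"

# Non-greedy: (.*?) stops at the first closing delimiter it can.
lazy = captured_groups(r"<b>(.*?)</b>", text)

# Greedy: (.*) runs on to the last closing delimiter in the string.
greedy = captured_groups(r"<b>(.*)</b>", text)

print(lazy)
print(greedy)
```

With the non-greedy quantifier each match object yields one of the two phrases; with the greedy one there is a single match spanning both.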

Maybe it's not a good idea to parse non-regular languages like XML this way (SAX/DOM parsers are the better choice), but sometimes it comes in handy.

Actually, I just found out that GNU Smalltalk implements both features this way, although it relies on a low-level C implementation tied to the String class (#searchRegexInternal:from:to:) which is "derived from GNU libc, with modifications made originally for Ruby to support Perl-like syntax" (105KB C source, 4K LOC).

See the GNU Smalltalk Manual.

Regex11 does not support backreferences, does it? ...

2011-05-18T08:24:30.315-07:00

Regex11 does not support backreferences, does it? (I've never needed this feature so far.) An FSA approach is only possible if backreferences are not needed.
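To make the limitation concrete: a backreference such as `\1` has to repeat whatever text the group captured, and in general that cannot be tracked with a fixed, finite number of states. A small sketch in Python (its backtracking `re` module, not Regex11; the pattern and sample strings are made up for illustration):

```python
import re

# (a+) captures a run of a's; \1 must repeat exactly that run after the b.
same_run = re.compile(r"(a+)b\1")

balanced = same_run.fullmatch("aaabaaa") is not None    # same run on both sides
unbalanced = same_run.fullmatch("aaabaa") is not None   # run lengths differ

print(balanced, unbalanced)
```

The strings accepted here have the form "n a's, b, n a's", which is a textbook example of a language that is not regular.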

Steffen, please do :-) How: http://swtch.com/~rsc...

2011-05-18T06:08:20.284-07:00

Steffen, please do :-)

How: http://swtch.com/~rsc/regexp/

Why:

['aaaaaaaaaaaaaaaaaaaaaa'
matchesRegex: 'a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaa']
microsecondsToRun / 1000000.0
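The articles linked above describe matching by tracking a set of live automaton states instead of backtracking. Below is a rough Python sketch of that idea, restricted to patterns built only from `a` and `a?`. It is not a general regex engine and not how Regex11 works; the names `closure`, `matches` and `pathological` are made up here.

```python
def closure(states, items):
    """Add every state reachable by skipping optional items."""
    result = set(states)
    stack = list(states)
    while stack:
        i = stack.pop()
        # items[i][1] is True when item i is optional and may be skipped.
        if i < len(items) and items[i][1] and i + 1 not in result:
            result.add(i + 1)
            stack.append(i + 1)
    return result

def matches(items, text):
    """Whole-string match by advancing a set of states one character at a time."""
    current = closure({0}, items)
    for ch in text:
        # State i means "items[:i] are done"; consuming ch moves it to i + 1.
        stepped = {i + 1 for i in current
                   if i < len(items) and items[i][0] == ch}
        current = closure(stepped, items)
        if not current:
            return False
    return len(items) in current

def pathological(n):
    """Items for n optional a's followed by n required a's."""
    return [("a", True)] * n + [("a", False)] * n

n = 200  # far beyond what a backtracking matcher handles comfortably
print(matches(pathological(n), "a" * n))
```

Each input character is processed once against at most `2n + 1` states, so the work grows roughly with pattern length times input length rather than exponentially in `n`.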

I've always wondered, if an implementation usi...

2011-05-18T04:42:48.121-07:00

I've always wondered whether an implementation using finite state automata would be faster. My gut feeling is that parsing a regex might be slightly slower, but matching should be much faster. Do you have any insights on this? (Actually, I've thought of implementing Regex11 this way as a personal exercise, just as Vassili's package comment states.)

k: yes, i do.

2011-05-17T07:40:46.805-07:00

k: yes, i do.

You do realize that "regular" in "r...

2011-05-17T03:00:04.629-07:00

You do realize that "regular" in "regular expression" refers to the class of languages that can be described by a regular expression (the regular languages in Chomsky's classification), and not to any kind of standard feature set? ;-)

But if we match against larger strings [...] The S...

2011-05-17T02:52:24.159-07:00

But if we match against larger strings [...] The Smalltalk version is suddenly 210 times slower!

This is to be expected: the Regex11 matcher is a naive implementation. Compiling down to more elaborate matcher architectures (NFAs or DFAs) will make it faster on larger inputs (and on known problematic inputs!). See the first article listed here: http://swtch.com/~rsc/regexp/.

I don't know why exactly, the pattern above actually requires the leading and trailing .* values to match the ip patterns mid source

There are various use cases for pattern matching: one is filtering entire strings, another is substring search (what you seem to expect). So it seems you are using the filtering API instead of the substring matching API.
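The same distinction exists in most regex libraries. As an illustration in Python's `re` module (a stand-in, not the Regex11 API; the dotted-quad pattern and the sample line are made up):

```python
import re

# Hypothetical pattern and input line, made up for illustration.
ip = re.compile(r"\d+\.\d+\.\d+\.\d+")
line = "src 10.0.0.1 dst"

# Filtering: the entire string has to match, so a bare pattern fails here.
whole = ip.fullmatch(line)

# The filtering call only succeeds once the pattern is padded with .* on both sides.
padded = re.fullmatch(r".*\d+\.\d+\.\d+\.\d+.*", line)

# Substring search: finds the pattern anywhere, no padding needed.
found = ip.search(line)

print(whole, padded is not None, found.group(0))
```

If the leading and trailing `.*` are needed, that is usually a sign the whole-string call is being used where a search call was intended.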