tag:blogger.com,1999:blog-7515306875906042828.post6110135138524490515..comments2024-02-05T11:56:11.174-08:00Comments on Objology: Regex not so Regular?Travis Griggshttp://www.blogger.com/profile/01599271142862167244noreply@blogger.comBlogger7125tag:blogger.com,1999:blog-7515306875906042828.post-87078885035358518852011-05-18T09:20:38.894-07:002011-05-18T09:20:38.894-07:00Two features I found seriously lacking from Regex1...Two features I found seriously lacking from Regex11/VB-Regex are non-greedy qualifiers and match objects instead of plain strings, e. g.<br /><br />'<p>just a test</p><p>another one</p>'<br />allRegexMatches: '<p>(.*?)</p>'<br />do: [ :m | Transcript nextPutAll: (m at: 1)]<br /><br />This should print 'just a test' and 'another one' from the match instances.<br /><br />Maybe it's not a good idea to parse non-regular languages like XML this way - better use SAX/DOM parsers but sometimes it comes in handy.<br /><br />Actually I just found out GNU-Smalltalk implements both features this way. Although it relies on a low-level C implementation tied to the String class (#searchRegexInternal:from:to:) which is "derived from GNU libc, with modifications made originally for Ruby to support Perl-like syntax" (105KB C source, 4K LOC).<br /><br />see <a href="http://www.gnu.org/software/smalltalk//manual/html_node/Regular-expressions.html" rel="nofollow">GNU Smalltalk Manual</a>Berndhttps://www.blogger.com/profile/16137959958057744615noreply@blogger.comtag:blogger.com,1999:blog-7515306875906042828.post-12602886238112853252011-05-18T08:24:30.315-07:002011-05-18T08:24:30.315-07:00Regex11 does not support backreferences, does it? ...Regex11 does not support backreferences, does it? (I've never needed this feature yet.) A FSA approach is only possible if backreferences are not needed.Steffennoreply@blogger.comtag:blogger.com,1999:blog-7515306875906042828.post-12964903166011776172011-05-18T06:08:20.284-07:002011-05-18T06:08:20.284-07:00Steffen, please do :-)
How: http://swtch.com/~rsc...Steffen, please do :-)<br /><br />How: <a href="http://swtch.com/~rsc/regexp/" rel="nofollow">http://swtch.com/~rsc/regexp/</a><br /><br />Why:<br /><br /><br />['aaaaaaaaaaaaaaaaaaaaaa'<br />matchesRegex: 'a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaa']<br />microsecondsToRun / 1000000.0Reinout Heeckhttp://www.desk.org/~reinznoreply@blogger.comtag:blogger.com,1999:blog-7515306875906042828.post-11733683026788878622011-05-18T04:42:48.121-07:002011-05-18T04:42:48.121-07:00I've always wondered, if an implementation usi...I've always wondered, if an implementation using finite state automata would be faster. My gut feeling tells me that perhaps parsing a regex could be slightly slower but matching should be much faster. Have you any insights on this? (Actually I've thought of implementing RegEx11 this way as a personal exercise. Just like Vassili's package comment states.)Steffennoreply@blogger.comtag:blogger.com,1999:blog-7515306875906042828.post-53879759736783218682011-05-17T07:40:46.805-07:002011-05-17T07:40:46.805-07:00k: yes, i do.k: yes, i do.Travis Griggshttps://www.blogger.com/profile/01599271142862167244noreply@blogger.comtag:blogger.com,1999:blog-7515306875906042828.post-46202592315638393532011-05-17T03:00:04.629-07:002011-05-17T03:00:04.629-07:00You do realize that "regular" in "r...You do realize that "regular" in "regular expresion" refers to the set of languages that can be described via a regular expression (as per Chomsky's classification), and not any kind of standard feature set? ;-)Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-7515306875906042828.post-29443146707926175862011-05-17T02:52:24.159-07:002011-05-17T02:52:24.159-07:00But if we match against larger strings [...] The S...<i>But if we match against larger strings [...] The Smalltalk version is suddenly 210 times slower!.</i><br /><br />This is to be expected, the Regex11 Matcher implementation is a naive implementation. Compiling down to more elaborate matcher architectures (NFAs or DFAs) will make it faster on larger inputs (and known problematic inputs!). See the first article listed here: <a href="http://swtch.com/~rsc/regexp/" rel="nofollow">http://swtch.com/~rsc/regexp/</a>.<br /><br /><i>I don't know why exactly, the pattern above actually requires the leading and trailing .* values to match the ip patterns mid source</i><br /><br />There are various use cases for pattern matching: one is filtering entire strings, another is substring search (what you seem to expect). So it seems you are using the filtering API instead of the substring matching API.Reinout Heecknoreply@blogger.com