Tuesday, May 31, 2011

Discovering Uniscribe

I've been working on a Uniscribe interface to VisualWorks Smalltalk off and on for the last two weeks or so. One of the frustrating things for me so far is the lack of a good beyond-the-docs resource to go to for help with this stuff. There are the MSDN docs, which are OK, but when I have questions beyond that, I have not yet found any of the resources I'm used to using for this kind of thing, such as mailing lists. And the number of "tutorial" style pages out there is pretty small. So I thought I'd at least leave a trail of breadcrumbs here.

There is of course the Uniscribe Reference MSDN Docs.

Another valuable resource I found was Uniscribe: The Missing Documentation & Examples, a post written by one of the folks working on Chrome, once upon a time.

The other thing I discovered recently is that where there are OpenType variants of Uniscribe functions, you want to use those. For example, use ScriptShapeOpenType() instead of just ScriptShape().

Do I have a working binding between VisualWorks Smalltalk and Uniscribe? At a limited level, yes. I can deal with the standard kinds of Text emphases (#bold, #italic, #family, etc). Even #color. I'm not yet using any of the line breaking abilities (so it all comes out on one line). What can't I deal with? Any string that results in more than one item. An item, in Uniscribe speak, occurs every time the writing system changes. So if you mix Arabic and English, you get separate items for each range. But even more commonly, you get changes on any English punctuation. So the string 'abc,xyz' actually produces 3 items: one for the 'abc', one for the ',', and one for the 'xyz'. And currently... something isn't right with my interface when this happens. Item one is displayed correctly, but not the following items. When I figure it out, you can be sure there will be a followup post here.

Thursday, May 19, 2011


Each time I do one of these external interface things, I wrestle with the finalization problem. I had to deal with it in Cairo. In Pango. I'm currently working on a Uniscribe interface, and have worked on a CoreText interface. They all have this same pattern: some external structure or opaque pointer which the library either initializes or creates for you, some functions that take the structure or opaque pointer and make stuff happen with it, and some function that frees the same. It's not enough to just use the free() interface that C provides.

The common pattern on the Smalltalk side is that you have some sort of Smalltalk object that acts as a front (facade) for your external structure/opaque pointer. And the interesting part becomes: how do I make it so that when I don't need the Smalltalk object anymore, and the wonderful garbage collector makes life easy for me by erasing it from existence, the related external resources are let go via the appropriate release/free C function?

In some part, I tried to deal with this in the Weaklings package, but the pattern there is "built on top" of the idea of weak slots. And it leaves part of the job to the programmer. Boris Popov tried to fix it in his version 18.1, but I resisted it, because it fundamentally changed the API.

I had yet another go at this in the recent ExRegex package. I don't know if this is the right solution yet, but I like it best so far. What I want is a nice simple uncomplicated, not bound up in other subclasses or frameworks, way to indicate that when an object is ready to go away, some arbitrary action happens.

The class FinalizeAction in that package was my solution. It's a simple #ephemeron behavior (nothing to do with the ill-named Ephemeron class). The ephemeral slot (the first instance variable of an #ephemeron type object) is a reference to the object you wish to perform finalization services for. Its other instance variable is action, which can be any object that responds to #value:, things like Blocks or MessageChannels. Or even Symbols if you're using newer versions of VisualWorks (or load the simplest of all packages, SymbolValue, in older versions). What I liked about FinalizeAction was that it solved my problem with 1 class, 2 instance variables, 2 class variables, and 8 methods. Not bad. I like small solutions.

So here's some simple examples:

| b |
b := Pixmap extent: 10 @ 10.
(FinalizeAction for: b)
action: (MessageChannel new receiver: Transcript selector: #print:)

Block (note we're using a zero arg block which demonstrates we're actually using the cull: API)
| d |
d := ByteArray new: 100000.
(FinalizeAction for: d)
action: [ObjectMemory current spaceSummaryOn: Transcript]

Simple Unary Message Send
| f |
f := 'howdy.txt' asFilename.
(FinalizeAction for: f) action: #out

FinalizeActions have an instance registry to keep them alive until they fire. I have sought and sought for a scheme that doesn't involve some sort of registry. I thought I had one in the early part of the ExRegex spike, until the basic invariant of #ephemeron based finalization sank in:

You have to guarantee that your finalizer will stay alive longer than the object it's performing services for, and without some sort of other path back to the roots of the garbage collector, there is no way to do that.

I'm ruminating on where to go with this. Fold it into the Weaklings package? Or just clone it when I need it (given its small size)? I'm also curious whether the basic use API (as shown in the examples above) couldn't use some work to make it seem more natural. Putting a helper on Object would of course make things simpler, but I do try to keep that kind of thing to a minimum.

Monday, May 16, 2011

Regex not so Regular?

A couple of days ago, on the VWNC mailing list, there was a discussion about the Regex11 package. Regex11 was originally written by Vassili Bykov as a Regex library for Smalltalk. It's actually been around a while and served the Smalltalk community pretty well, I think, in part due to Vassili's excellent coding abilities.

I think it's really cool that someone can sit down and write a complete Regex implementation in Smalltalk. This is part of the Smalltalk ethos. Being able to invent your own future. Write your own software, in any area you like. On the other hand, as I read the comments, I found myself wondering, "it's great that you can do this, but does that mean you should?" Of course, the answer varies. It depends on what your goals are. A certain part of me feels that I don't really want to care this much about how the implementation works. Isn't Regex one of those things that's grown up enough now, that we can just use off-the-shelf services provided by the host platforms? If we just reuse those, we might get some speed (benefits of thousands of code monkeys tuning and tweaking them over the years), and some standardization.

So I went on a little journey. Read on if what I discovered is of interest...

The Package

I put my work in a package called ExRegex. As in "External Regex", using the ExternalInterface features of VisualWorks to bind to an outside-of-smalltalk shared library. Plus I just liked the way ExRegex looked almost palindromic. And I made a Test package called ExRegex-Tests. No surprises here, pretty standard operating procedure.

I only began to implement the substitution stuff; for now, it's just good for matching. One thing I did do differently than Regex11 was employ a symmetric pair of binary selectors for doing Regex matches.
    '(a|b)c' ?= 'ac' --> true
    'Travis' =? '(Senior|Griggs)' --> false

The ? character always leans towards the side with the regex pattern.

The Tests

I stole the tests from the Regex11-Testing package. I made my own SUnitToo copy of it. I soon found I wasn't satisfied with them. The original tests have a single test method. But said method fetches a clauseArray which is a series of inputs, and expected outputs. There's 3 or 4 relatively involved methods between the loop in the single test method and actually executing a single regex match.

What this means, is that when you get a failure, you don't know much. First you have to dig through the little framework of interpreting the clauseArray. When I'm trying to understand how someone's Regex framework works, I don't want to understand how their Regex test framework works as well. I just want to see very simple lines of code, the kind I would put in a workspace.

Secondly, as soon as you hit a failure, you're done. The clauseArray has 137 entries in it. So if it fails for any reason (error or assertion failure) on test number 22... you have no idea how the others work. Is it just this one test? Or do others fail too? Is there a common pattern amongst the failures?

I think we forget sometimes that just as easy as it is to write a little DSL or interpretation framework, it's just as easy to write a little bit of code that translates that into real Smalltalk methods. It's easy to generate code, it's easy to compile it, and it's easy to install it. Here's what I did for this particular case:

RegexTest new clauseArray keysAndValuesDo:
    [:index :array |
    array size > 2 ifTrue:
        ["Build the source of one test method per clauseArray entry"
        ws := String new writeStream.
        ws
            nextPutAll: 'match';
            print: index.
        ws nextPutAll: ' <test> '.
        ws nextPutAll: ' | regex | '.
        ws nextPutAll: ' regex := ' , array first printString , ' exRegex. '.
        "Entries after the pattern come in groups of 3: input, expected match, extras"
        2 to: array size by: 3 do:
            [:n |
            ws
                nextPutAll: ' self ';
                nextPutAll: ((array at: n + 1) ifTrue: [' assert: '] ifFalse: [' deny: ']);
                nextPutAll: ' (regex match: ';
                print: (array at: n);
                nextPutAll: ').'].
        "Let the RB parser pretty print the source, then install it"
        RegexTest compile: (RBParser parseMethod: ws contents) formattedCode
            classified: 'tests match']]

I didn't even have to bother to format the code, I let the RB services do that for me. What I ended up with was nice easy methods to read like this:

And when I run the tests from within the IDE I get nice feedback like this:

This made it much easier to unravel things that, well, started to unravel after that.

So much for Regularity

Regularity means (among others) "conforming to a standard or pattern." I'm not sure a more ironic name was ever chosen as a moniker for a "standard set" of pattern matching. The C library interface is pretty standard. You have 4 basic functions:

  • regcomp() - builds up a regex structure based on a pattern and some flags
  • regexec() - compares an input source against the regex structure for matches
  • regerror() - makes human readable strings for your regex structure when parse errors exist
  • regfree() - used to free the various chunks of memory associated with the regex structure
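To make that four-call life cycle concrete, here is a minimal C sketch of it. This is not how ExRegex does it (that goes through the VisualWorks external interface); the helper name regex_matches and its return convention are made up for illustration. Only regcomp(), regexec(), regerror() and regfree() are the real library calls.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if pattern matches somewhere in input,
   0 if it does not, and -1 if the pattern fails to compile. On a compile
   failure a readable message is copied into errbuf (when one is given). */
int regex_matches(const char *pattern, const char *input,
                  char *errbuf, size_t errlen)
{
    regex_t re;  /* the caller owns this memory; the library fills it in */

    /* REG_NOSUB: we only care whether it matches, not where */
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc != 0) {
        if (errbuf != NULL && errlen > 0)
            regerror(rc, &re, errbuf, errlen);
        return -1;  /* nothing to regfree() after a failed regcomp() */
    }

    rc = regexec(&re, input, 0, NULL, 0);
    regfree(&re);  /* releases whatever regcomp() allocated inside re */
    return rc == 0 ? 1 : 0;
}
```

Note that regfree() only releases what the library hung off the structure. The regex_t itself still belongs to whoever allocated it, which is exactly the part a Smalltalk binding has to manage on its own.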

I knew regexes varied a little from platform to platform. What surprised me was that Regex11 did stuff the C library one couldn't (e.g. \d as sugar for [0-9] and \s for separators). And vice versa: the C library variant will do [0-9]{1,3} (match 1 to 3 digits), but the Smalltalk Regex11 can't do the {1,3} thing. You'd have to live with + (1 or more) or repeat it 3 times.

But the fun didn't stop there. While both are part of the standard libc, the BSD variant that one finds on MacOSX is different from the one found on GNU platforms, such as Linux. And not just in their behavior. While the 4 C functions are the same between the two, and the same symbolic defines for the flags exist, they actually map to different values. For example, the option REG_NOSUB (what you use when you just care whether it matches or not, not where) is 0x08 on a GNU platform and 0x04 on a BSD platform. And the regex_t structure, which you're responsible for allocating the memory for, is entirely different between the two. For a C program using these libraries, you just use the defines, and you're fine. But for Smalltalk (or any other not-compiled-by-the-C-Compiler/Preprocessor language), that more intimate knowledge of what the symbolic values mean becomes more important. Getting around these differences was interesting.
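One way to get at those values without reading headers by hand is to let the C compiler report them on each platform. The sketch below is just that, a sketch: the function name describe_regex_abi is invented here, and it is not necessarily how ExRegex obtains its constants. The REG_* names and the regex_t and regmatch_t types are the standard ones from <regex.h>.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: formats the flag values and structure sizes that a
   non-C binding would otherwise have to hard-code per platform.
   Returns what snprintf() returns (the length the full text needs). */
int describe_regex_abi(char *buf, size_t len)
{
    return snprintf(buf, len,
                    "REG_EXTENDED 0x%02x\n"
                    "REG_ICASE 0x%02x\n"
                    "REG_NOSUB 0x%02x\n"
                    "REG_NEWLINE 0x%02x\n"
                    "sizeof(regex_t) %zu\n"
                    "sizeof(regmatch_t) %zu\n",
                    (unsigned)REG_EXTENDED, (unsigned)REG_ICASE,
                    (unsigned)REG_NOSUB, (unsigned)REG_NEWLINE,
                    sizeof(regex_t), sizeof(regmatch_t));
}
```

Compiled and run once per platform, the output gives the numbers to plug into the binding, including how many bytes to allocate for a regex_t.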

Regular? Anything but. This is kind of alarming actually. It's not like these things give you an error when you try to use something like \d. It really just tries to match an escaped d (rather than a digit), and now it fails. The upshot is that you don't have much portability with these things, and as you move from environment to environment, language to language, platform to platform, you'd better have a pretty good test suite around your regexes to make sure they're still matching the way they were when you developed your original application. On this aspect, this is one time where the "do it all in Smalltalk" actually has a compelling advantage, depending on your constraints.
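The kind of check I mean can be tiny. Here is a hedged C sketch of a probe that asks the local library how it treats \d, next to the POSIX spelling that should behave the same everywhere. The function names are invented for this example, and I deliberately don't state what the \d probe returns, since that is the part that varies by platform.

```c
#include <regex.h>
#include <stddef.h>

/* 1 if the ERE pattern matches somewhere in input, 0 if not,
   -1 if the library rejects the pattern. */
static int probe(const char *pattern, const char *input)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, input, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}

/* Platform dependent: 1 means \d is treated as a digit class here,
   0 means it is treated as something else (such as a literal d),
   -1 means the library refused the pattern outright. */
int backslash_d_is_digit(void)
{
    return probe("^\\d$", "7");
}

/* The POSIX way to say "1 to 3 digits"; this one should be portable. */
int posix_digit_class_works(void)
{
    return probe("^[[:digit:]]{1,3}$", "42");
}
```

Run at startup or in the test suite, a probe like this at least tells you which dialect you landed on before your patterns quietly stop matching.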

There's an excellent exhaustive table at this link that shows a number of common implementations, and how they compare across a whole plethora of different options.

The Speed Factor

What about the speed issue? Again, it depends. For doing simple matches, Regex11 will come out faster. Less overhead in going across the C boundaries. Let's take a regex for matching IP addresses:

If we match this against simple strings like '' and 'abc', Regex11 is faster. As much as 3 times faster. But if we match against larger strings, such as the guts of the IPSocketAddress class>>allAboutHostAddresses method (an expected match) and the result of Object comment (shouldn't match), then the tables turn quite a bit. The Smalltalk version is suddenly 210 times slower! This measures both the time to create the Regex object and the time to do the match.

So, I guess it depends on what kind of regexes you're doing, and how often you run them (remember, it should always be "Make it Work, Make it Right, Make it Fast").

I don't know exactly why the pattern above requires the leading and trailing .* to match the IP patterns mid-source. I had to add them for Regex11, but the ExRegex interface didn't need them.
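One plausible explanation, offered as a guess rather than a diagnosis: regexec() searches for a match anywhere in the input unless the pattern is anchored with ^ and $, so a C-backed matcher finds an address in the middle of a long string on its own. If the Regex11 call in use tests the whole string instead, the .* padding would be needed there. The C sketch below shows the regexec() side of that only. The IP-like pattern and the helper name ere_search are stand-ins made up for the example, not the pattern from the measurements.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if the ERE pattern matches somewhere in input,
   0 if it does not, -1 if the pattern does not compile. */
int ere_search(const char *pattern, const char *input)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, input, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}

/* Illustrative patterns only: four dot-separated groups of 1 to 3 digits */
static const char *const LOOSE_IP = "[0-9]{1,3}(\\.[0-9]{1,3}){3}";
static const char *const WHOLE_IP = "^[0-9]{1,3}(\\.[0-9]{1,3}){3}$";
```

With these, ere_search(LOOSE_IP, "host at 192.168.0.1 today") finds the address mid-string with no .* padding, while the anchored WHOLE_IP only accepts input that is an address and nothing else.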

Thursday, May 5, 2011

Easiest Example That Could Possibly Click

I'm a big fan of unit test cases. Mapped 1:1 with the objects they test (as opposed to "let's say it tests a unit of functionality"). But I also write a lot of class side example methods, for the sake of prototyping, feedback, and ad hoc testing. Especially as you get into UI development, this becomes very helpful. It's hard (not impossible) to test how the pixels end up looking and interacting with unit tests.

The classic pattern you see in Smalltalk is

  "self exampleMethod"


Then what you do, is navigate to the method, highlight the comment and do it. If you're doing it a lot, it gets tedious to select the comment and then execute it.

So I made a simple little package called RBEasyExampleMethods, and put it in the repository. It makes it so you can just double click on a method in the browser, and if it looks like an example, it runs it. It looks like an example when it's

  • a class side method

  • unary (no arguments)

  • starts with 'example' OR lives in a category that starts with 'example'

It should work with VisualWorks versions from quite a ways back. Enjoy!