Ron Burk: March 2010

One of the first programming books I ever bought was Software Tools, by Kernighan and Plauger (which cost me a hefty $11.95 back in 1980). Part design instruction, part anthropology, it was an amazing exercise -- expert programmers presenting actual code to do something useful, and walking you through their thought processes. Software tools let us bootstrap ourselves up to solving more complex problems with less effort. What could be better?

So what does a few decades of technological acceleration do to the realm of software tools? The developments are not all happy ones, I would say. At least not in the area I want to talk about: parser generator tools. (Warning: basic understanding of yacc and language theory required.)

The Bad Old Days

John Aycock has written about the stodgy old tools we used to have to rely on to implement languages, epitomized by yacc and lex and thereafter mimicked in many similar descendants. He skewers yacc with a short example. Imagine you want to implement an assembly language whose syntax looks something like this:

loop:  lda   foo
       shift bar
       halt

He then shows how one might naively capture the syntax with a grammar (I've used yacc syntax here):

%token IDENT
%%
statement
    : label opcode operand
    ;
label
    : IDENT ':'
    | /* empty */
    ;
opcode
    : IDENT
    ;
operand
    : IDENT
    | /* empty */
    ;

As Aycock points out, this plausible first crack at capturing the syntax with a yacc grammar will get you the less than helpful message:

example.y contains 1 shift/reduce conflict

The Good New Days?

All is not lost, though. As an academic interested in parsers, Aycock reports that brand new modern parser generators can blow dear old yacc and lex into the weeds. See, yacc can't handle all context-free grammars, just a subset of them. In Aycock's narrative, the reason such inferior algorithms were used in the bad old days was merely a lack of memory, CPU speed, and state of the art. Nowadays, he says, we've got better algorithms, and memory and CPU are cheap, so why are you living in the 1970's with these old parser generators when you can use a tool that will take just about any context-free grammar you can throw at it? He offers multiple tools that can perform much more generally than yacc, and asks why we aren't all using them.

It is certainly true that experienced programmers need to keep up with the times, or risk being betrayed by our own experience. For example, cache memory speed has continued to outpace advances in main memory speed to such a degree that, if you're an old programmer who just assumes main memory is fast, your attempts at speed optimization may go awry. Likewise, the balance has tilted towards machines with an embarrasment of riches of both CPU speed and memory for many applications. Not so many years ago, I would have always thought of streaming or paging schemes if working on an application related to documents. Today, I would start with the assumption that 99% of all documents fit in main memory with no problem. But having spent some time using and writing parser generator tools, I'm just a wee bit suspicious of the utopia Aycock (and others) paint.

And of course, it is a very old truism in computers that making CPU fast and memory cheap doesn't make all hardware fast and memory-rich -- it also makes slow CPUs with limited memory cheap enough to use in new situations. It's a good rule of thumb that CPU is fast and memory cheap, so long as you don't forget there will always be common cases where that is not the case. For example, if I'm building a standalone application, I probably really don't care if it uses a generated parser that's a few times slower than what yacc can generate. On the other hand, if I turn that same application into a web service that I plan to put into the cloud (where I'm renting CPU/memory by the minute from Microsoft/Google/Amazon), then suddenly I may begin to care quite a bit about what were previously too-cheap-to-care inefficiencies.

The Bad New Days

Let's go back to that embarrassing inability of yacc to handle this natural-looking grammar:

%token IDENT
%%
statement
    : label opcode operand
    ;
label
    : IDENT ':'
    | /* empty */
    ;
opcode
    : IDENT
    ;
operand
    : IDENT
    | /* empty */
    ;

What went wrong here? Well, when yacc first sees an IDENT, it doesn't know whether it is seeing a label or an opcode. Aycock would say the heck with yacc, then, and use a cool new tool that will look further ahead, find that ':', and then decide whether that initial identifier is a label or an opcode. Stupid ol' yacc.

But wait just a minute. This is the kind of toy grammar that shows up a lot when academics talk about parsing. Let's move to the real world. One real world problem is that you kinda want to reserve opcode identifiers and not let the user accidentally use a label name that is the same as an opcode identifier. Otherwise, the user is one goofed colon away from not being able to figure out what they did wrong (But that's an opcode! Oh... I guess not -- where did that colon come from?). In fact, you might say that yacc's cryptic error message could be interpreted as I'm a little confused because I can't tell a 'label' from an 'opcode'! So somebody has to keep a table of all the strings that represent opcode names, and be able to give us some unique convenient integer ID for each one, since we don't want to be doing string compares all the time. Oh wait, in the bad old days, we have a tool that does all that: lex. Let's suppose we've typed our opcodes into lex and it's identifying them for us. That might lead us to this slightly more realistic grammar:

%token IDENT
%token OPCODE
%%
statement
    : label OPCODE operand
    ;
label
    : IDENT ':'
    | /* empty */
    ;
operand
    : IDENT
    | /* empty */
    ;

With this tiny move towards realism, the most amazing thing happens: yacc can now handle the grammar without the slightest problem! Now remember, Aycock picked the grammar, I didn't. But still, if you're thinking you can't really say the constraints of yacc are related in a sensible way to the constraints of real-world parsing problems based on a single example, I'll agree. I think there's a fairly obvious general principle at work here.

Why Modern Parser Generators Are Awful

Aycock was dead-on in mocking the horrifically bad error messages and general user-unfriendliness of yacc and its descendants. But the brand of modern parser generator he recommends doesn't really address those problems -- instead, they address the problem of accepting grammars with fewer constraints. Are fewer constraints in language design an unabashed Good Thing? Let me quote from the decades-old yacc manual:

moreover, the constructions which are difficult for Yacc to handle are also frequently difficult for human beings to handle. Some users have reported that the discipline of formulating valid Yacc specifications for their input revealed errors of conception or design early in the program development.

So even 30 years go, folks were quite aware that constraints on grammar were often a Good Idea. Indeed, FORTRAN is often offered as the poster child for what kind of ambiguous ad-hoc monstrosity could arise before the mild constraints of Backus-Naur Form became widely accepted.

But Aycock puts it forth as a Good Thing that the average programmer can grab a Cool Modern Parser Generator and have it accept any old context-free grammar -- even an ambiguous grammar. Let's think about that last statement. Ambiguous grammar means... well, could mean this, could mean that. The intent of the grammar is not clear. One thing that hasn't improved one whit in 30 years is that software isn't getting any less buggy. But now we have tools that will generate a parser for you based on, well, from most perspectives in most situations, what I would call a buggy grammar.

Now wait just a minute. If a parser generator accepts an ambiguous grammar, then how the heck do you ever test the thing? Well, if one gigantic modern tool causes you a problem, you just need another gigantic modern tool! So, for example, the Accent compiler-compiler comes with Amber, a separate tool that

allows the user to annote grammars to resolve ambiguities. It also offers a default strategy. Unless an ambiguity is resolved in this way, it is detected at parsing time whether a given text is ambiguous.

So, with yacc, you have to tweak your grammar to make it acceptable. With the Cool Modern Tools, you have to tweak your grammar to make it acceptable, or I guess you can just emit a runtime error for the user to try to deal with (what would it say? Hey, I couldn't decide what the grammar should do at this point -- email me your vote!). This is progress? Well, in some sense, it's the opposite of progress. yacc will flat-out tell you (in about a fraction of second) if your grammar won't work. But with the Cool Modern Tool (the separate one you have to remember to use to check for problems in your grammar):

if the grammar is unambiguous the algorithm may not terminate. [...] one has a good chance to detect a problem

Cool, so you have a good chance of being able to check your grammar. As you can see, the new generation of parser generator tools places a very low priority on a variety of aspects of software quality.

Now, Aycock's not an idiot. There are particular uses for ambiguous grammars and, for those (narrow, out-of-the ordinary, specialized) applications, I say God bless the tool that can generate you a solution. But there are no such qualifications in his endorsement of using these tools. In fact, what Aycock encourages is exactly what's happening. Programmers with little or no education about computer languages and the parsing thereof are grabbing all sorts of parser generator tools that give them more power than yacc -- more power to implement lousy languages, more power to generate parsers for grammars that are full of bugs, and more power to create software whose behavior they really can't specify at all.

In Aycock's narrative, programmers who can't deal with those confusing yacc error messages will just get a Cool Modern Tool, which will accept any grammar they throw it at it, so no error messages, and no problems! But if you hang out with the people actually using these tools, you see a very different narrative. First, the tools are still darn complicated and documentation is, well, about what you would expect from programmers, so people are still struggling to use the tool. But more disturbingly, lots of folks can't figure out what the tool is actually doing. Why does it accept this input and not that one? The purpose of a tool that generates a parser from a grammar specification is precisely to increase the ease of specifying and understanding exactly what syntax will be accepted. The direction of modern tools of this ilk is contradictory to the fundamental original purpose of this category of tools.

The Bigger Picture

I'm picking on parser generators because one of my back burner projects I'll never get finished is a little parser generator (welcome to Google Code -- the programmer equivalent of 1950's puttering around in the garage with the ol' Chevy!) But the problem of software tools (sometimes in the form of frameworks) in the modern age is a general one.

Consider the lowly and well-understood Remote Procedure Call. Writing networking applications is complicated, what with having to shove packets around asynchronously and whatnot. Let's simplify by making network I/O look like a simple procedure call. Is this an advance? Absolutely. Right up until programmers use it without understanding how it works beneath the covers. In my own state (Washington), we got to pay many millions of dollars to cancel a networked computer system that (when last audited before they decided it was cheaper to just throw it all away) had an order of magnitude too slow a response time to be usable. Was that because programmers were using software tools for networking without understanding them? I don't know, but that's certainly the first place I would look.

We want tools to hide complexity from us. Unfortunately, hiding complexity means the tool is responsible for making complex choices on our behalf, and we're not getting better at creating tools that don't regularly screw up those choices. We also want tools to save us from having to understand the complexity the tool is designed to hide from us. And we're not getting better at creating tools that make their limitations obvious to us, that inform us when we've asked it to do something that a human had better take a look at first.

The new generation of parser generator tools epitomizes the wrong direction for software tool development. Programmers naturally want to focus on new algorithms and taking advantage of new hardware. But the problems that need solving lie more in the direction of human factors. If programmers make poor interfaces, documentation, and error handling for end users, we make spectacularly bad interfaces, documentation, and error handling for ourselves.

The acceleration of technology means that the software world is incredibly more complex now than it was 30 years ago. Back then, it only took 4 years to get a CS degree. Now... it still only takes 4 years -- good luck out there, new graduates! We are too focused on creating software tools that put more and more power into the hands of (from the perspective of any particular narrow tool) less and less well-educated practitioners. Instead, we need much more focus on making tools that make their own limitations transparent, that educate the tool user on how to make good choices, and that work tirelessly to keep the tool user from creating bad code through ignorance. These are largely not problems that require new algorithms and breakthrough ideas, they mostly just require hard work and continual refinement.

As part of reading the book Technopoly, I'm listening to the old author interview from Booknotes on C-SPAN. Because Neil Postman (before his death in 2003) provided a convenient anti-thesis to Ray Kurzweil's Snoopy-like dance of technological elation, I am sympathetic, looking for things to agree with, stretching to find common ground. But then he had to go and bring up Mozart.

Poor dead Mozart gets dragged into more cultural battles than just about anybody except Jesus and the Founding Fathers. In 1998, the governor of Georgia would send out copies of Mozart for pregnant mothers to listen to (and hopefully take their minds off of Georgia's third-world level health care for neonates). Postman argued no less than that when children hear Mozart, they can't help but feel that there is order in the universe, whereas when they listen to rock'n'roll they feel that life is just one damn thing after another. OK, it's a 1992 interview, but I really thought all this musical snobbery by intellectuals was reasonably passé by the time the Beatles broke up.

If structure makes you feel there's order in the universe, surely a simpler structure makes more people feel more order more... um, better? How about "My Sharona"? Plenty of variation (guitar break in the middle, pregnant pauses, time changes, etc.), but plenty of structure and order. Light that up around some kids the right age (before others teach them to be snobs about what they are willing to listen to) and soon you'll see them all shout "Wooo!" right on cue -- if that's not evidence they hear "order", then what is?

We've been through all this rock-doesn't-have-this-or-that sniping decades ago, right? Yeah, the Beatles used three chords a lot on the Red/Blue albums, so what? Modern musical historians aren't that impressed with young Mozart (and more than a few think his Dad was helping ol' Amadeus out with his homework!). The Beatles learned a lot more chords and a lot more techniques, just like the Beach Boys with their journey from barbershop to complex tonal progressions. And do you think you can tap out all the time changes in Zeppelin's "Black Dog"? OK, in all fairness, Zeppelin couldn't really do "Black Dog" that well themselves live, but still, they got it on tape at least once.

Of course, since Postman rejected the notion that psychology is a "science", it would have been no use pointing out that psychology supports what common experience tells us -- your view of music is strongly influenced by what you listen to in your youth. Fortunately, I was blessed with a large palette of music in my youth: Gospel, country (old-time country, before it became just rock with a twang), bluegrass, rock, classical, swing, jazz -- and that was well before technology made it easy to tune into any type of music you wanted at any time. I presume Postman was not so blessed, and his formal education did nothing to make up for it. I recall that my overtly liberal public education included instruction on defending ourselves against media manipulation, and one class in particular showed how the same song could be recast as rock, country, or R&B to sell the same product to different audiences. I suppose that sort of useful relativism has since been outlawed by the Texas Board of Education.

I do agree with Postman that you can't appreciate Mozart if you don't have the necessary training. It's unfortunate he didn't see that the same is true for most any music of any complexity -- and all genres offer complexity. What you hear if you try to listen to Led Zeppelin without ever having heard Delta blues or read Tolkien, I have no idea, but it will be something short of what is actually present in the music. It's just plain sad that someone who decried "the surrender of culture to technology" didn't have the necessary training to appreciate the large body of culture represented by modern music.

No doubt the deeper underlying myth here is the familiar pining for a "simpler time". It is undeniable that there were some simpler and happier times (with more "order") in this country -- for some. But when Lead Belly sang "The Midnight Special", he was not celebrating the special kind of "order" African Americans could expect in Houston. It's always hard to separate this brand of nostalgia from one implicit -ism (racism, sexism, rankism) or another. And technology is making that even harder.

As technology accelerates, the conservative desire for "simpler times" is made both more acute and more contradictory. For people are rarely willing to give up anything to get back to simpler times -- they only want undesirable things taken away. Note that at its peak, conservatism in America did absolutely nothing to swell the ranks of the Amish. The folks most deeply concerned about retaining "traditional" gun rights are still pretty much eating corn-fed meat that was killed in a pen with a steel slug coming out of a compressed air hose, friendo.

Here, then, is my common ground with Neil Postman. For he was clear that technology always gets used, and it always brings both advantages and drawbacks. His call was for a discussion on how to minimize those drawbacks before a new technology (like electronic voting machines) takes over and has its way. Unfortunately, such a discussion is virtually impossible. America has a highly optimized machine for taking any national topic of discussion and turning it immediately into a red-vs.-blue search for advantage or slander. There just ain't nearly enough of us watching C-SPAN, I'm afraid, instead of the daily news frenzy shows.

Ron Burk

Saturday, March 13, 2010

Software Tools For Buggier Bloated Software

The Bad Old Days

The Good New Days?

The Bad New Days

Why Modern Parser Generators Are Awful

The Bigger Picture

Wednesday, March 10, 2010

Let Mozart Be Dead Already

Blog Archive

Links

About Me