Saturday, March 13, 2010

Software Tools For Buggier Bloated Software

One of the first programming books I ever bought was Software Tools, by Kernighan and Plauger (which cost me a hefty $11.95 back in 1980). Part design instruction, part anthropology, it was an amazing exercise -- expert programmers presenting actual code to do something useful, and walking you through their thought processes. Software tools let us bootstrap ourselves up to solving more complex problems with less effort. What could be better?

So what have a few decades of technological acceleration done to the realm of software tools? The developments are not all happy ones, I would say. At least not in the area I want to talk about: parser generator tools. (Warning: basic understanding of yacc and language theory required.)

The Bad Old Days

John Aycock has written about the stodgy old tools we used to have to rely on to implement languages, epitomized by yacc and lex and thereafter mimicked in many similar descendants. He skewers yacc with a short example. Imagine you want to implement an assembly language whose syntax looks something like this:

loop:  lda   foo
       shift bar
       halt
He then shows how one might naively capture the syntax with a grammar (I've used yacc syntax here):
%token IDENT
%%
statement
    : label opcode operand
    ;
label
    : IDENT ':'
    | /* empty */
    ;
opcode
    : IDENT
    ;
operand
    : IDENT
    | /* empty */
    ;
As Aycock points out, this plausible first crack at capturing the syntax with a yacc grammar will get you the less than helpful message:
example.y contains 1 shift/reduce conflict

The Good New Days?

All is not lost, though. As an academic interested in parsers, Aycock reports that brand-new modern parser generators can blow dear old yacc and lex into the weeds. See, yacc can't handle all context-free grammars, just a subset of them. In Aycock's narrative, the reason such inferior algorithms were used in the bad old days was merely a lack of memory, CPU speed, and state of the art. Nowadays, he says, we've got better algorithms, and memory and CPU are cheap, so why are you living in the 1970s with these old parser generators when you can use a tool that will take just about any context-free grammar you can throw at it? He offers multiple tools that can perform much more generally than yacc, and asks why we aren't all using them.

It is certainly true that experienced programmers need to keep up with the times, or risk being betrayed by our own experience. For example, cache memory speed has continued to outpace advances in main memory speed to such a degree that, if you're an old programmer who just assumes main memory is fast, your attempts at speed optimization may go awry. Likewise, the balance has tilted towards machines with an embarrassment of riches in both CPU speed and memory for many applications. Not so many years ago, I would always have thought about streaming or paging schemes if I were working on an application related to documents. Today, I would start with the assumption that 99% of all documents fit in main memory with no problem. But having spent some time using and writing parser generator tools, I'm just a wee bit suspicious of the utopia Aycock (and others) paint.

And of course, it is a very old truism in computing that making CPUs fast and memory cheap doesn't make all hardware fast and memory-rich -- it also makes slow CPUs with limited memory cheap enough to use in new situations. It's a fine rule of thumb that CPU is fast and memory is cheap, so long as you don't forget there will always be common cases where that just isn't so. For example, if I'm building a standalone application, I probably really don't care if it uses a generated parser that's a few times slower than what yacc can generate. On the other hand, if I turn that same application into a web service that I plan to put into the cloud (where I'm renting CPU and memory by the minute from Microsoft/Google/Amazon), then suddenly I may begin to care quite a bit about what were previously too-cheap-to-care-about inefficiencies.

The Bad New Days

Let's go back to that embarrassing inability of yacc to handle this natural-looking grammar:

%token IDENT
%%
statement
    : label opcode operand
    ;
label
    : IDENT ':'
    | /* empty */
    ;
opcode
    : IDENT
    ;
operand
    : IDENT
    | /* empty */
    ;
What went wrong here? Well, when yacc first sees an IDENT, it doesn't know whether it is seeing a label or an opcode: with only one token of lookahead, it has to choose between shifting that IDENT as the start of a label and reducing the empty label production so the IDENT can be the opcode -- and it can't see far enough ahead to find out whether a ':' follows. Aycock would say the heck with yacc, then, and use a cool new tool that will look further ahead, find that ':', and then decide whether that initial identifier is a label or an opcode. Stupid ol' yacc.
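As an aside, you don't even have to abandon yacc to get past this particular conflict. Left-factor the grammar by hand, so the parser doesn't have to commit to "label or no label" until a ':' does or doesn't show up, and yacc is perfectly content. A quick sketch (my restructuring, not something Aycock proposes):

%token IDENT
%%
statement
    : IDENT ':' IDENT operand   /* labeled: first IDENT is the label */
    | IDENT operand             /* unlabeled: first IDENT is the opcode */
    ;
operand
    : IDENT
    | /* empty */
    ;
The price is that the nice, readable label and opcode nonterminals disappear, which is a cost of its own. But hold that thought; there's a better fix coming.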

But wait just a minute. This is the kind of toy grammar that shows up a lot when academics talk about parsing. Let's move to the real world. One real-world problem is that you kinda want to reserve opcode identifiers and not let the user accidentally use a label name that is the same as an opcode identifier. Otherwise, the user is one goofed colon away from not being able to figure out what they did wrong ("But that's an opcode! Oh... I guess not -- where did that colon come from?"). In fact, you might say that yacc's cryptic error message could be interpreted as "I'm a little confused because I can't tell a 'label' from an 'opcode'!"

So somebody has to keep a table of all the strings that represent opcode names, and be able to give us some unique convenient integer ID for each one, since we don't want to be doing string compares all the time. Oh wait, in the bad old days, we already had a tool that does all that: lex. Let's suppose we've typed our opcodes into lex and it's identifying them for us. That might lead us to this slightly more realistic grammar:

%token IDENT
%token OPCODE
%%
statement
    : label OPCODE operand
    ;
label
    : IDENT ':'
    | /* empty */
    ;
operand
    : IDENT
    | /* empty */
    ;
With this tiny move towards realism, the most amazing thing happens: yacc can now handle the grammar without the slightest problem! Now remember, Aycock picked the grammar, I didn't. But still, if you're thinking a single example can't really show that yacc's constraints line up in any sensible way with the constraints of real-world parsing problems, I'll agree. I think there's a fairly obvious general principle at work here, though.
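In case you've never touched lex, here is roughly what that opcode-reserving half might look like -- a minimal sketch, with the opcode list made up to match the toy example (a real assembler would more likely look each identifier up in an opcode table and pass the opcode's number back through yylval):

%{
#include "y.tab.h"   /* token numbers (IDENT, OPCODE) generated by yacc */
%}
%%
lda|shift|halt          { return OPCODE; }   /* reserved opcode names */
[A-Za-z_][A-Za-z0-9_]*  { return IDENT; }
":"                     { return ':'; }
[ \t\n]+                { /* skip whitespace */ }
%%
However you slice it, the division of labor is the classic lex/yacc one: the lexer owns the question "is this identifier special?", and the grammar gets to stay simple.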

Why Modern Parser Generators Are Awful

Aycock was dead-on in mocking the horrifically bad error messages and general user-unfriendliness of yacc and its descendants. But the brand of modern parser generator he recommends doesn't really address those problems -- instead, it addresses the problem of accepting grammars with fewer constraints. Are fewer constraints in language design an unabashed Good Thing? Let me quote from the decades-old yacc manual:

moreover, the constructions which are difficult for Yacc to handle are also frequently difficult for human beings to handle. Some users have reported that the discipline of formulating valid Yacc specifications for their input revealed errors of conception or design early in the program development.
So even 30 years ago, folks were quite aware that constraints on grammar were often a Good Idea. Indeed, FORTRAN is often offered as the poster child for the kind of ambiguous, ad-hoc monstrosity that could arise before the mild constraints of Backus-Naur Form became widely accepted.

But Aycock puts it forth as a Good Thing that the average programmer can grab a Cool Modern Parser Generator and have it accept any old context-free grammar -- even an ambiguous grammar. Let's think about that last statement. Ambiguous grammar means... well, could mean this, could mean that. The intent of the grammar is not clear. One thing that hasn't improved one whit in 30 years is that software isn't getting any less buggy. But now we have tools that will generate a parser for you based on, well, from most perspectives in most situations, what I would call a buggy grammar.
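If "ambiguous" sounds abstract, here is the textbook specimen -- an expression grammar with no precedence information (just a sketch):

%token NUM
%%
expr
    : expr '+' expr
    | expr '*' expr
    | NUM
    ;
Hand this grammar the input 1 + 2 * 3 and it honestly does not say whether you meant 1 + (2 * 3) or (1 + 2) * 3; both parse trees are equally legal. yacc warns about the shift/reduce conflicts the moment you run it over this grammar. A tool that cheerfully accepts the grammar as-is has to settle that question for you somehow, somewhere -- possibly at runtime.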

Now wait just a minute. If a parser generator accepts an ambiguous grammar, then how the heck do you ever test the thing? Well, if one gigantic modern tool causes you a problem, you just need another gigantic modern tool! So, for example, the Accent compiler-compiler comes with Amber, a separate tool that

allows the user to annote grammars to resolve ambiguities. It also offers a default strategy. Unless an ambiguity is resolved in this way, it is detected at parsing time whether a given text is ambiguous.
So, with yacc, you have to tweak your grammar to make it acceptable. With the Cool Modern Tools, you have to tweak your grammar to make it acceptable, or I guess you can just let the parser emit a runtime error for the user to try to deal with (what would it say? "Hey, I couldn't decide what the grammar should do at this point -- email me your vote!"). This is progress? Well, in some sense, it's the opposite of progress. yacc will flat-out tell you (in a fraction of a second) if your grammar won't work. But with the Cool Modern Tool (the separate one you have to remember to use to check for problems in your grammar):
if the grammar is unambiguous the algorithm may not terminate. [...] one has a good chance to detect a problem
Cool, so you have a good chance of being able to check your grammar. As you can see, the new generation of parser generator tools places a very low priority on a variety of aspects of software quality.
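For contrast, the crusty old yacc answer to that same expression ambiguity is to make you write your decision down, once, in the grammar, where it is settled before the parser ever meets a user:

%token NUM
%left '+'
%left '*'   /* declared later, so it binds tighter than '+' */
%%
expr
    : expr '+' expr
    | expr '*' expr
    | NUM
    ;
Crude? Sure. But the precedence declarations sit right there for the next maintainer to read, yacc stops complaining, and nobody discovers the ambiguity by way of a runtime error.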

Now, Aycock's not an idiot. There are particular uses for ambiguous grammars and, for those (narrow, out-of-the-ordinary, specialized) applications, I say God bless the tool that can generate you a solution. But there are no such qualifications in his endorsement of using these tools. In fact, what Aycock encourages is exactly what's happening. Programmers with little or no education about computer languages and the parsing thereof are grabbing all sorts of parser generator tools that give them more power than yacc -- more power to implement lousy languages, more power to generate parsers for grammars that are full of bugs, and more power to create software whose behavior they really can't specify at all.

In Aycock's narrative, programmers who can't deal with those confusing yacc error messages will just get a Cool Modern Tool, which will accept any grammar they throw at it, so no error messages, and no problems! But if you hang out with the people actually using these tools, you see a very different narrative. First, the tools are still darn complicated and documentation is, well, about what you would expect from programmers, so people are still struggling to use them. But more disturbingly, lots of folks can't figure out what the tool is actually doing. Why does it accept this input and not that one? The purpose of a tool that generates a parser from a grammar specification is precisely to make it easier to specify, and to understand, exactly what syntax will be accepted. The direction of modern tools of this ilk runs contrary to the fundamental, original purpose of this category of tools.

The Bigger Picture

I'm picking on parser generators because one of the back-burner projects I'll never get finished is a little parser generator (welcome to Google Code -- the programmer equivalent of 1950s puttering around in the garage with the ol' Chevy!). But the problem of software tools (sometimes in the form of frameworks) in the modern age is a general one.

Consider the lowly and well-understood Remote Procedure Call. Writing networking applications is complicated, what with having to shove packets around asynchronously and whatnot. Let's simplify by making network I/O look like a simple procedure call. Is this an advance? Absolutely. Right up until programmers use it without understanding how it works beneath the covers. In my own state (Washington), we got to pay many millions of dollars to cancel a networked computer system that (when last audited before they decided it was cheaper to just throw it all away) had a response time an order of magnitude too slow to be usable. Was that because programmers were using software tools for networking without understanding them? I don't know, but that's certainly the first place I would look.

We want tools to hide complexity from us. Unfortunately, hiding complexity means the tool is responsible for making complex choices on our behalf, and we're not getting better at creating tools that don't regularly screw up those choices. We also want tools to save us from having to understand the complexity the tool is designed to hide from us. And we're not getting better at creating tools that make their limitations obvious to us, that inform us when we've asked them to do something a human had better take a look at first.

The new generation of parser generator tools epitomizes the wrong direction for software tool development. Programmers naturally want to focus on new algorithms and taking advantage of new hardware. But the problems that need solving lie more in the direction of human factors. If programmers make poor interfaces, documentation, and error handling for end users, we make spectacularly bad interfaces, documentation, and error handling for ourselves.

The acceleration of technology means that the software world is incredibly more complex now than it was 30 years ago. Back then, it only took 4 years to get a CS degree. Now... it still only takes 4 years -- good luck out there, new graduates! We are too focused on creating software tools that put more and more power into the hands of (from the perspective of any particular narrow tool) less and less well-educated practitioners. Instead, we need much more focus on making tools that make their own limitations transparent, that educate the tool user on how to make good choices, and that work tirelessly to keep the tool user from creating bad code through ignorance. These are largely not problems that require new algorithms and breakthrough ideas; they mostly just require hard work and continual refinement.

5 comments:

Vince A said...

Computing has progressed so much, it is now the equivalent of us letting 17-year-olds drive 700 horsepower cars.

BTW, you have vastly improved your presence in the new video (less swaying about, etc).

Ron Burk said...

Thanks -- I had multiple weeks to rehearse the last talk instead of barely one, and was healthy instead of just getting over the flu. I'm happy to think the extra work invested might have shown. :-)

Anonymous said...

Learning a compiler compiler is more work than just doing it, for the rare times you need it. Silliest thing I've heard of.

tim said...

Would anyone seriously use yacc for that grammar anyway? Seems like overkill, just read a line and split on whitespace.

Ron Burk said...

The grammar Aycock chose was, of course, tiny for didactic purposes. And while using a parser generator may be overkill for parsing a .ini file, most programmers cannot write correct code to parse simple expressions (although they often convince themselves they can), so knowing how to use a parser generator can be a useful skill for a wide variety of programmers. Most domain-specific languages (see Martin Fowler's upcoming book!) of much use will need to parse one variety of mathematical expression or another. I have actually seen people try to connect to an SQL database to get it to evaluate simple math expressions entered by a user. It violates some sense of proportion, I think.
