Strictness and correctness

Sunday 29 April 2007

Jeff Atwood, in "JavaScript and HTML: Forgiveness by Default", writes about how a design decision in XML doomed XHTML:

Unfortunately, the Draconians won: when rendering as strict XHTML, any error in your page results in a page that not only doesn’t render, but also presents a nasty error message to users.

They may not have realized it at the time, but the Draconians inadvertently destroyed the future of XHTML with this single, irrevocable decision.

The lesson here, it seems to me, is that forgiveness by default is absolutely required for the kind of large-scale, worldwide adoption that the web enjoys.

Getting upset now about the draconian error handling of XML seems kind of quaint.

At this point, I think it is clear that XML’s strictness about well-formedness is very easy to satisfy. It is easy to write automatic producers of XML that do it correctly, and hand-edited XML is also easy to fix when it has missing angle brackets or mismatched tags.
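To make the "easy to check" claim concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The helper name `check_well_formed` is hypothetical and chosen only for illustration; it is not from the original post.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return None if text parses as XML, else the parser's error message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # The message names the problem along with its line and column.
        return str(err)
    return None


# The first document is well-formed; the second never closes its <b> tag.
print(check_well_formed("<p>Hello, <b>world</b></p>"))
print(check_well_formed("<p>Hello, <b>world</p>"))
```

Because the parser reports where the first error is, a hand-edited file with a mismatched tag can usually be repaired in one or two passes.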

The main problem with XHTML has nothing to do with XML’s strictness. The problem is that it’s XML masquerading as HTML. HTML has different lexical rules than XML. Writing a single document that is both valid XHTML and an acceptable HTML document that will be understood by legacy browsers is very difficult, if not impossible. It’s essentially a polyglot programming exercise, where one file can be interpreted correctly according to two different sets of rules. Except that we all kidded ourselves into thinking it wasn’t, because HTML and XML both use tags.

HTML is derived from SGML, which has a dizzying array of shortcuts to minimize the markup in a document. Take a quick look at Tag Minimization from Martin Bryan’s book to see the kind of stuff SGML lets you do. Some of this is still in HTML, which is why XML’s <br/> doesn’t do what you think in an HTML document.
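As a rough illustration of how far apart the two sets of lexical rules are, the sketch below feeds the same minimized markup (list items with their end tags omitted, which HTML permits) to a strict XML parser and to Python's lenient `html.parser` tokenizer. The names `TagLogger` and `xml_accepts` are hypothetical. Note that `html.parser` only tokenizes and does not build a tree the way a browser does, so this shows the difference in strictness, not a browser's actual behavior.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record start and end tags as the lenient HTML tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def xml_accepts(text):
    """Return True if text is well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Omitting </li> is allowed in HTML, but it is not well-formed XML.
MINIMIZED = "<ul><li>one<li>two</ul>"

logger = TagLogger()
logger.feed(MINIMIZED)
logger.close()

print(xml_accepts(MINIMIZED))
print(logger.events)
```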

Other issues include the special treatment browsers give to script content, where less-thans really are less-thans, while in XML, they have to be escaped as &lt;. A fuller run-down of the problems is in Ian Hickson’s Sending XHTML as text/html Considered Harmful.
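The script difference can be sketched the same way. This is a small example under stated assumptions: the class name `ScriptText` and the sample script are invented for illustration, and Python's `html.parser` stands in for a browser's HTML tokenizer only loosely.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class ScriptText(HTMLParser):
    """Collect the raw text found inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.chunks.append(data)


RAW = "<script>if (a < b) { go(); }</script>"
ESCAPED = "<script>if (a &lt; b) { go(); }</script>"

# HTML rules: the less-than inside the script is just a less-than.
html_view = ScriptText()
html_view.feed(RAW)
html_view.close()
html_script = "".join(html_view.chunks)

# XML rules: the raw less-than is an error, so it must be written as &lt;.
try:
    ET.fromstring(RAW)
    raw_is_xml = True
except ET.ParseError:
    raw_is_xml = False
xml_script = ET.fromstring(ESCAPED).text

print(html_script)
print(raw_is_xml)
print(xml_script)
```

The same script source has to be written two different ways depending on which parser will read it, which is one reason a single file rarely satisfies both.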

So to my mind, the problem here is not that XML is strict, but that it is different from HTML. You can’t easily write a page which works as both. Jeff gives the example of an author publishing a page and then finding out from his horde of angry readers that the page won’t display. This is not the kind of problem that actually happens in practice: well-formedness is easy to check and fix.

That said, it’s also true that being strict about well-formedness does nothing to help with checking validity, and beyond that, nothing to help with checking for correct rendition. It’s that last level of correctness that is the hobgoblin of web development: once the tag stream is correct according to some criteria, the browser must then draw a page, and that is where things really run off the rails.

Certainly invalid pages will have more rendering problems than valid pages, but validity is not enough to guarantee that the page will look correct. So XML’s strictness is easy to achieve, and also fairly useless. In the end, Jeff is right:

Even though programmers have learned to like draconian strictness, forgiveness by default is what works. It’s here to stay. We should learn to love our beautiful soup instead.


Comments

[gravatar]
Ned, The lack of permalinks in your blog means that I can't submit your most recent post to reddit, because I already submitted one this month. reddit strips the data after the #, and thinks I've already submitted 200704.html.

Is there another link I can use?
[gravatar]
Thanks, Bill. I've republished the entry on its own page.

I guess I should think about how to provide individual entries with unique URLs...
[gravatar]
Many times common browsers cheat anyway. Quirks mode is a good example of this: browsers can easily be made to enter it when strict XHTML is bad.
[gravatar]
Hi Ned,

Great response. One caveat though:

> Jeff gives the example of an author publishing a page and then finding out from his horde of angry readers that the page won't display.

Actually, the example is a bit more specific than that: most web pages these days use content from other sites in some form, so they're also banking on the fact that these external sources won't accidentally introduce some malformed markup into their own page. So if you're using draconian error handling, you better pray every bit of markup in your page, from whatever source you're getting it from, is 100% compliant, forever.

http://diveintomark.org/archives/2004/01/14/thought_experiment
[gravatar]
Strictness is only easy to achieve if you are a programmer. Plenty of non-programmers know a small amount of good old HTML. Part of the reason they were able to pick it up was its forgiving nature. The first time they popped open Notepad and typed in some text and a few tags, they got something back in the browser.

In fact no matter how many mistakes they made, they still got something back, and no scary intimidating error messages. There are plenty of professional programmers today who started battering together bad HTML in notepad, and then taught themselves a little PHP and several years later are turning out respectable code. For the amateur, with the most primitive tools, Strictness is not easy.
[gravatar]
So "the web is full of JavaScript errors" but we'd rather not hear about them, even though this probably means that a load of flashy AJAX-only applications don't work properly (and explains why serious Web applications, eg. Internet banking, don't rely on or, in some cases, even use JavaScript).

Why did the W3C try and move everyone up to a well-formed markup language? The author should try reading up on HTML parsing and having to write tools and browsers for all the sloppy output produced by Web authors who seem to think that closing their tags is just too much work (and yet seem to have the inclination to dive into one of the most inconsistently implemented programming languages ever produced and tell us all about it).

Reliable systems require parts that behave consistently and interact coherently, not some bag of "do what I mean" bricks where the end result needs a "quirks mode" guide because a bunch of people couldn't face seeing error messages.
[gravatar]
The point about beginners getting started with web pages is a good one, but I suspect that the draw of the web was and is strong enough that people would manage to produce well-formed XML in any case.
[gravatar]
@Ned:

It *might* have caught on, but it certainly wouldn't have been the open forum for explosive innovation that it was for anyone with a text editor.
