Tuesday, October 12, 2004

Where XML goes astray...

It seems like every programmer and their brother has picked up XML and is using it as the proverbial hammer to nail some solution. Sometimes it works, sometimes it doesn’t. A lot of people have written about how XML doesn’t scale, or how XML isn’t the right solution for problem X, but for all those complaints, XML has helped solve a lot of problems. What is more interesting is to see which problems it has gained the most traction on.

First, some background: XML was originally designed as an evolution of SGML, a simplification that mostly matched a lot of then-existing common usage patterns. Most of its creators saw XML as evolving and expanding the role of SGML, namely text markup. XML was primarily intended to support taking a stream of text intended to be interpreted as a human-readable document, and delineating portions of it according to some role. This sequence of characters is a paragraph. That sequence should be displayed with a link to some other information. Et cetera, et cetera. Much of the process of defining XML was based on the assumption that the text in an XML document would eventually be exposed for human consumption. You can see this in the rules for which characters are allowed in XML content, which characters are valid in Names, and even in “</tagname>” being required rather than just “</>”.

All of that is why I find it so interesting that XML has become so popular for such things as SOAP. XML was not designed with the SOAP scenarios in mind. Other examples of popular scenarios which deviate from XML’s original goals are configuration files, quick-n-dirty databases, and RDF. I’ll call these ‘data’ scenarios, as opposed to the ‘document’ scenarios for which XML was originally intended. In fact, I think it is safe to say that today there is more usage of XML for ‘data’ scenarios than for ‘document’ scenarios. I chose the terms ‘data’ and ‘document’ because these are the terms most often used when this issue is discussed on the XML-DEV mailing list and at work. Personally, I dislike the terminology, because there are many cases where a single document mixes both usage patterns, and because (strictly speaking) documents are data.

As often happens when an existing tool is reused for something beyond its original purpose, XML is not exactly a perfect fit. It is a surprisingly good fit, but far from perfect. In fact, one of the few things that mess with XML’s fit for these applications isn’t even in the original XML specification. It got its own specification, released less than a year later: XML Namespaces.
The two main things that XML 1.0 (pre-Namespaces) mucked up are whitespace and allowed characters. I’ll go through these issues in the reverse of the order I just listed them.

Allowed Characters

The logic went something like this: XML is all about marking up text documents, so the characters in an XML document should conform to what Unicode says is reasonable for a text document. That rules out most control characters, and means that surrogate pairs should be checked. It all sounds good until you see some of the consequences. For example, most databases allow any character in a text column. What happens when you publish your database as XML? What do you do about values that include control characters that the XML specification disallows? XML did not provide any escaping mechanism for them, and if you ask many XML experts, they will tell you to base64 encode your data if it may include invalid characters. It gets worse.
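To make the control-character problem concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree parser (which follows the XML 1.0 rules). The element name, the `encoding` attribute, and the sample value are made up for illustration. It shows that a 0x01 character is rejected both as a literal and as a character reference, and that base64 encoding the value is one way to get it through.

```python
import base64
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical database value that uses the control character 0x01 as a delimiter.
raw_value = "field1\x01field2"

# The literal control character is rejected...
literal_ok = is_well_formed(f"<value>{raw_value}</value>")
# ...and so is a numeric character reference to it.
char_ref_ok = is_well_formed("<value>field1&#1;field2</value>")

# Workaround: base64 encode the value. The 'encoding' attribute is only a
# convention invented for this example; the consumer has to know to decode it.
encoded = base64.b64encode(raw_value.encode("utf-8")).decode("ascii")
wrapped = f'<value encoding="base64">{encoded}</value>'
round_tripped = base64.b64decode(ET.fromstring(wrapped).text).decode("utf-8")
```

The cost of the workaround is that the value is no longer readable or searchable as text in the XML, and both ends must agree on the convention out of band.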

The characters allowed in an XML name are far more limited. Basically, when designing XML, they allowed everything that Unicode (as defined then) considered a ‘letter’ or a ‘number’. There are only two problems with that: (1) It turns out many characters common in Asian texts were left out of that category by the then-current Unicode specification. (2) The list of characters is sparse and random, making implementation slow and error-prone. Issue (1) has been a significant problem for a number of customers I have worked with, and the only options are either to avoid the character ranges that are not allowed or to implement an application-specific escaping mechanism. The fact that many early parsers (including some of Microsoft’s) did not correctly enforce the rules made the problem worse. I have looked at the code for countless XML parsers, and this is one of the areas that many parsers skimp on. The major supported parsers typically implement this properly, but it is still a source of constant bugs and unexpected complexity, as well as a constraint on performance.
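An application-specific escaping mechanism for names could look something like the following sketch. Everything here is an assumption made for illustration: the `_xHHHH_` escape convention, the function names, and above all the allowed-character sets, which are a deliberately conservative ASCII subset rather than the real (much larger and sparser) XML 1.0 name tables.

```python
import re

# ASSUMPTION: conservative ASCII-only stand-ins for the XML 1.0 name tables.
# The real tables allow many more characters; this sketch escapes them all.
_NAME_START = re.compile(r"[A-Za-z_]")
_NAME_CHAR = re.compile(r"[A-Za-z0-9_.\-]")
_ESCAPE = re.compile(r"_x([0-9A-F]{4,6})_")


def escape_name(raw: str) -> str:
    """Turn an arbitrary string into a name built only from 'safe' characters.

    Disallowed characters become _xHHHH_ (hypothetical convention). An
    underscore followed by 'x' is escaped too, so decoding stays unambiguous.
    """
    out = []
    for i, ch in enumerate(raw):
        allowed = _NAME_START if i == 0 else _NAME_CHAR
        looks_like_escape = ch == "_" and raw[i + 1 : i + 2] == "x"
        if allowed.fullmatch(ch) and not looks_like_escape:
            out.append(ch)
        else:
            out.append(f"_x{ord(ch):04X}_")
    return "".join(out)


def unescape_name(escaped: str) -> str:
    """Reverse escape_name by replacing each _xHHHH_ with its character."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), escaped)


# A column name with a space becomes a usable element name.
column_as_element = escape_name("order total")
```

Note that every producer and consumer of the document has to share the convention, which is exactly the kind of hidden complexity described above.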

Whitespace

When we were first coding up MSXML, whitespace was one of our perpetual nightmares. In hand-authored XML documents (the most common form of documents back then), there tended to be a great deal of whitespace. Humans have a hard time reading XML if everything is jammed on one line. We like a tag per line and indenting. All those extra characters, there only so that our feeble minds could make sense of this awkward jumble of characters, ended up contributing significantly to our memory footprint, and caused many problems for our users. Consider this example:

<customer>
	<name>Joe Schmoe</name>
	<addr>123 Seattle Ave</addr>
</customer>

A customer coming to XML from a database background would normally expect that the first child of the <customer> element would be the <name> element. I can’t count how many times I had to explain that it was actually a text node with the value newline+tab. For the first official release version of MSXML, we found an awkward compromise that confuses customers to this day, because it depends on some unexposed internal hints. It works great, so long as you don’t edit the DOM and write it out expecting a pretty format like the original version. It has been interesting to talk with people about this issue over the intervening years. I have had people claim that we violated the XML specification, and had others thank us for saving them from having to care about all that extra noise in the DOM.
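The surprise is easy to reproduce with any DOM that preserves whitespace. The sketch below uses Python's standard xml.dom.minidom rather than MSXML, so it shows only the specification-level behavior, not MSXML's compromise; the variable names are made up for the example.

```python
from xml.dom import minidom

# The customer example, with each child indented by a newline and a tab.
doc = minidom.parseString(
    "<customer>\n"
    "\t<name>Joe Schmoe</name>\n"
    "\t<addr>123 Seattle Ave</addr>\n"
    "</customer>"
)
customer = doc.documentElement

# The first child is a whitespace-only text node, not the <name> element.
first = customer.firstChild
first_is_text = first.nodeType == first.TEXT_NODE
first_value = first.data

# Two elements plus three whitespace text nodes.
child_count = len(customer.childNodes)

# Code that wants "the first element" has to skip the text nodes explicitly.
first_element = next(
    node for node in customer.childNodes if node.nodeType == node.ELEMENT_NODE
)
```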

The problem is that XML doesn’t know the difference between the above scenario and something more like this (using the HTML tag vocabulary):

<ul>
	<li><pre>
<b>this</b> is a test</pre></li>
</ul>

This last example is actually quite interesting. The whitespace between the <ul> and the <li> tags is not significant, yet the whitespace between the <pre> and <b> tags is significant. The only way to know this is to have an innate understanding of the semantics of the tag vocabulary. That means that there is effectively no universal answer, and it is up to the application to do the right thing… an almost universal guarantee of application bugs.

XML Namespaces

XML Namespaces is still, years after its release, a source of problems and disagreement. The XML Namespaces specification itself is simple and gets the job done with minimum fuss. The problem? It pushes an immense burden of complexity onto the APIs and XML reader/writer implementations. Supporting XML Namespaces introduces significant complexity in parsers, because it forces them to parse the entire start-tag before returning any information about it: a namespace declaration can appear after the names that depend on it. It complicates XML stores, such as DOM implementations, because the XML Namespaces specification only discusses parsing XML, and introduces a number of serious complications for edit scenarios. It complicates XML writers, because it introduces new constraints and ambiguities.
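As a small illustration of why a parser cannot report even the element name until it has read the whole start-tag, consider a tag whose prefix is declared by its last attribute. This is a minimal sketch using Python's standard xml.etree.ElementTree; the prefixes `p` and `q` and the namespace URI `urn:example:inventory` are invented for the example.

```python
import xml.etree.ElementTree as ET

# The prefix 'p' is used by the element name and the first attribute,
# but it is only declared by the LAST attribute in the start-tag.
elem = ET.fromstring('<p:item p:id="42" xmlns:p="urn:example:inventory"/>')

# ElementTree reports names in {namespace-uri}local-name form, which it can
# only compute after seeing the xmlns:p declaration at the end of the tag.
resolved_tag = elem.tag
resolved_attrs = dict(elem.attrib)

# The same document written with a different prefix is equivalent, which is
# one of the ambiguities a writer has to deal with: which prefix to emit?
other = ET.fromstring('<q:item q:id="42" xmlns:q="urn:example:inventory"/>')
same_names = other.tag == elem.tag and other.attrib == elem.attrib
```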

Then there is the issue of the ‘default namespace’. I still see regular emails from people confused about why their XPath doesn’t work because of namespace issues. Namespaces are possibly the single largest obstacle for people new to XML. So much else about XML seems like common sense, and then XML Namespaces rears its ugly head. I still regularly argue about how our code should handle odd edge cases introduced by namespaces.
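The classic default-namespace confusion can be sketched as follows, again with Python's standard xml.etree.ElementTree and its limited XPath support. The namespace URI `urn:example:crm` and the prefix `c` are made up for the example; other XPath APIs differ in detail but behave the same way in principle.

```python
import xml.etree.ElementTree as ET

# The default namespace declaration puts <customer> AND <name> in the namespace,
# even though neither tag carries a prefix.
doc = ET.fromstring(
    '<customer xmlns="urn:example:crm"><name>Joe Schmoe</name></customer>'
)

# The "obvious" query finds nothing: an unprefixed name in the path means
# "name in no namespace", not "name in the document's default namespace".
unqualified = doc.find("name")

# The query has to bind its own prefix to the same URI. The prefix is local
# to the query and need not match anything in the document.
ns = {"c": "urn:example:crm"}
qualified = doc.find("c:name", ns)
customer_name = qualified.text
```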

Conclusion

Note that nowhere above do I talk about how XML should have handled these issues. In most cases, when the original decisions were made, they made sense to me. I like to believe that I have learned a lesson or two since, but who knows. My purpose in writing this was to educate people about where XML goes astray from what you expect. Proposing solutions is of no real use, since XML is a standard and isn’t changing significantly anytime soon. It is worth understanding where we made our worst mistakes, to avoid making similar mistakes again. The above are some of the hard lessons I have learned from implementing XML APIs for customers for almost 7 years. These are not the only issues I have with the XML 1.0 specification; they are only the most glaring. If I could go back in time, these are the areas I would most have attempted to influence in a different direction.

10 Comments:

Anonymous said...

I personally use the notation "container style" or "overlay style" for data and document respectively.

2:10 PM  
derek said...

Oleg, XML 1.1 addresses the problem with control characters, but does not provide a normative solution for how to encode invalid name characters in names, and does nothing to address the complexity of either whitespace or XML Namespaces. Honestly, I don't think there is anything that can really be done about either, short of a major overhaul. The confusion customers experience with prefixes vs. namespace URIs is something we are stuck with. Great for the consultants making money teaching this stuff, but not great for API designers or users. The fact that 'a:b' can mean one thing in my document, and something completely different in my XPath query, is just hard to wrap your head around. None of this is to say we should give up... just keep these issues in mind when building a new XML system, and make sure to handle both sides of the issue.

7:13 PM  
Anonymous said...

Derek,
May we republish this splendid blog, under your byline of course, at sys-con.com/xml?

We'd need a brief author bio + contact e-mail.

Let me know, yes?

Thnx in advance! :)

--
Jeremy Geelan
Group Publisher, SYS-CON Media
http://sys-con.com

email: jeremy@sys-con.com

Web Services Edge 2005 East - International Web Services Conference & Expo
Hynes Convention Center, Boston, MA - February 15 - 17, 2005

Call for papers now open!

Tuesday -> 2/15, Conference & Expo
Wednesday -> 2/16, Conference & Expo
Thursday -> 2/17, Conference & Expo

http://sys-con.com/edge

5:59 AM  
Anonymous said...

Could you please explain in more detail the complexities introduced by XML namespaces? Specifically, what do you mean by: "It complicates XML stores, such as DOM implementations, because the XML Namespace specification only discusses parsing XML, and introduces a number of serious complications to edit scenarios. It complicates XML writers, because it introduces new constraints and ambiguities."
Moreover, when you say namespaces force parsers to parse the entire start-tag before returning any text information, are you talking about the performance overhead or about something else as well? Please clarify.

Thanks,
Venkat

7:44 AM  
Anonymous said...

Enterprise best practice is to use XML whenever it is obviously unsuitable, such as a syntax for scripting languages, build files, and configuration information. It is also an enterprise best practice to not use XML for what it is well designed for, such as document management.

3:13 AM  
derek said...

I have a new post specifically on Namespaces that I will be posting soon.

7:33 PM  
Anonymous said...

I don't buy your argument at all that
[1] people put garbage in their databases
[2] XML is to blame for not accepting garbage in element content.

If it's garbage, don't put it there. If you have a legitimate use for it, i.e. it's representing information, you can convert it, e.g. into elements.

On the subject of whitespace, you don't need "intimate knowledge" of a vocabulary -- the DTD (or Schema) tells you which whitespace is significant, and in a standard way.

I won't claim XML to be perfect, but let's not invent problems with it.

Liam [Liam Quin, liam at w3 dot org]

5:09 PM  
Rakesh Pai said...

Nice read. I've been battling with character encoding for XML myself. When you complicate it with database driven XML files and entries coming from other (badly encoded) sites, you just want to throw your hands up.

3:50 AM  
Mike Dierken said...

> I don't buy your argument at all that
> [1] people put garbage in their databases
Well, sad to say, but I work at a popular bookstore in Seattle and this issue actually happens. A legacy data file used control characters as delimiters and that gets loaded directly into the RDBMS. The service exposes the data as XML and those text characters cannot be represented. I think it was 0x01 or something like that.

I don't remember the resolution, but just wanted to mention that it does happen.

12:24 AM  
Anonymous said...

I also have to say that the problems presented aren't really with XML, but are design flaws arising from bad usage. I agree that people abuse XML by using it when something else would be more appropriate (that's true for everything), but I don't completely agree with your examples. XML as a language is perfectly suitable for databases and settings, but it may not be appropriate to use an XML parsing library which was designed with formatted text in mind, rather than one which was intended to deliver XML markup as a hierarchical structure of elements and attributes. The only criticisms that I have of XML myself are minor redundancies: the XML declaration doesn't need an extra question mark towards the end, and the common comment syntax takes 8 keystrokes to type. Also the lack of a single-line comment. I don't see any problem in the way whitespace is handled either; all whitespace should be delivered without significant alteration, and interpretation should be application-specific to allow for many possible applications.

6:47 AM  
