
Practical XML with Linux

October 1, 1999
Web posted at: 12:30 p.m. EDT (1630 GMT)

by Uche Ogbuji

From LinuxWorld

(IDG) -- In March, I wrote an article (see "There is more to XML than roll-your-own HTML," link below) about the Extensible Markup Language and its affinity to Linux and the Linux way of doing things. Due to overwhelming reader feedback, LinuxWorld has scheduled a series of follow-up articles. In this article and others to follow, I'll take a closer look at some of the practical things you can do with XML.

Luckily for this purpose, the Linux community has taken to XML as well as I could have hoped. Many Linux development projects and languages use XML processors and libraries. The Cocoon project is building around Apache an XML-processing system that is, in at least one area, ahead of most commercial equivalents. KDE, the K Desktop Environment, uses XML as the native file format for its impressive KOffice. GNOME has an entire menagerie of XML tools, libraries, and applications, some part of the general releases and some strictly in CVS, the Concurrent Versions System. It is also nice to see that a trickle of apps -- mostly GNOME Toolkit apps using libxml -- are moving to XML-based config files, as I advocated in April.

In the commercial space, Oracle 8i has come to Linux with an aggressive adoption of XML and many tools for XML document management. As I write, IBM's DB2 Universal Database 6.1 should be winging its way by courier to those who signed up for development copies; it, too, makes a strong commitment to XML. The other Universal DBMS engines for Linux are not far behind.

But there's no need for me to trot out a long list of the XML projects for Linux. A quick search of Freshmeat with keywords such as XML, DOM, and XSL will yield riches for XML newbies and gurus alike.

What's new in tag land?

The standards bodies have been just as busy as the Linux hackers. It's pretty safe to say that right now the key standards for XML are either at 1.0 or getting there quickly. These are the standards that will help truly establish XML's interoperability:
  • XML: the core standard

  • Namespaces: a way to resolve tag-name clashes within XML documents

  • DOM: the Document Object Model, a set of standard interfaces for accessing XML and HTML document components

  • XSLT: Extensible Stylesheet Language Transformations, a general language that is the best approach for exporting XML to an HTML-browser world

  • XLink: the extended linking capabilities discussed in the April article

  • XML Schema: perhaps not as far along as the others, this important spec addresses several shortcomings of DTD, the document type definition, for expressing constraints on XML documents

Other standards are complete or in process; they include XML Fragments, XML Query Language, and XHTML, an XML-compliant and thus thankfully strict dialect of HTML. But those are not as central as the standards in the list.

This article briefly describes these standards. Later, I'll discuss how all these technologies can come together under Linux to lend a great deal of power to Linux applications.

First a few words on the font of all this activity: The XML 1.0 specification is still an amazing piece of work. It tackles the core language and the normal behavior of XML processors, dealing with such complex issues as character encodings along the way. If you want an esoteric discussion of what the standard should or should not have inherited from SGML, and other such minutiae, you could always watch such expert mailing lists as xml-dev, but for most purposes the core XML spec is a rock -- and a particularly good example for other standards in every area except readability.

Namespaces in XML

One key area that the core XML didn't address is name clashes. Suppose we are using tags from two specifications; perhaps one is for general document formats (for emphasis, titles, paragraphs, etc.) and another is for marketing terminology. In the former, the element-type name code refers to text that should be formatted as computer source code; in the latter, code is part of a product specification.

<document>
  <description>Memo</description>
  <title>Re: Widget 404 Request</title>
  <paragraph>
    We need 5 of
      <product>
        <code>00808</code>
        <name>Widget 404</name>
        <description>Gee-gaw and doo-dad</description>
      </product>
    to send out to reviewers this week.
  </paragraph>
</document>

Although it might be clear to a person that code here refers to a product code and not to computer source code, it might not be clear to search engines and style processors, and the results of their processing might reflect the confusion. Notice that the element-type name description suffers from the same problem. XML needs the capability, common in programming languages, of specifying universally unique names. Enter the Namespaces in XML recommendation from the W3C, the World Wide Web Consortium. Under this spec, the memo could be rewritten as follows.


<document xmlns:mkt="http://our.industry.org/schema/product-info" xmlns="http://xml-typographers.org/typo-markup-standard/1.0">
  <description>Memo</description>
  <title>Re: Widget 404 Request</title>
  <paragraph>
    We need 5 of
      <mkt:product>
        <mkt:code>00808</mkt:code>
        <mkt:name>Widget 404</mkt:name>
        <mkt:description>Gee-gaw and doo-dad</mkt:description>
      </mkt:product>
    to send out to reviewers this week.
  </paragraph>
</document>

Now all the potentially ambiguous names are qualified in a standard manner. The <document> tag declares two namespaces. The first is marked by the prefix mkt, an alias for the URI http://our.industry.org/schema/product-info. A special attribute name starting with xmlns: indicates that a namespace is being declared, and the rest of the attribute name specifies the prefix to be used in the names of elements in that namespace. To an XML processor that handles namespaces, the tag <mkt:product> is qualified by that URI. Note that URIs (Uniform Resource Identifiers) are a superset of URLs. Also note that a namespace name is a URI by format, but the spec disavows any particular meaning for the URI. If it is a URL, there is no guarantee you will find anything at that URL: it is merely a unique string. There has been some debate about this lack of namespace meaning, and about the choice of URIs for naming, but the system does work well.
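As a rough illustration of what "qualified by that URI" means to a namespace-aware processor, here is a minimal sketch using the ElementTree module from a current Python standard library. That module is my choice for the sketch and is not one of the tools surveyed in this article. It reports each element name as the namespace URI in braces followed by the local name, so the two description elements no longer collide.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the example document above.
MKT = "http://our.industry.org/schema/product-info"
TYPO = "http://xml-typographers.org/typo-markup-standard/1.0"

MEMO = f"""\
<document xmlns:mkt="{MKT}" xmlns="{TYPO}">
  <description>Memo</description>
  <title>Re: Widget 404 Request</title>
  <paragraph>We need 5 of
    <mkt:product>
      <mkt:code>00808</mkt:code>
      <mkt:name>Widget 404</mkt:name>
      <mkt:description>Gee-gaw and doo-dad</mkt:description>
    </mkt:product>
    to send out to reviewers this week.
  </paragraph>
</document>"""

root = ET.fromstring(MEMO)

# ElementTree stores names as "{namespace-uri}local-name"; the prefix is
# only shorthand and does not survive parsing.
product_code = root.find(f".//{{{MKT}}}code")
memo_description = root.find(f"{{{TYPO}}}description")
product_description = root.find(f".//{{{MKT}}}description")

print(product_code.tag)          # {http://our.industry.org/schema/product-info}code
print(memo_description.text)     # Memo
print(product_description.text)  # Gee-gaw and doo-dad
```

A search tool built this way can ask for the marketing code specifically and never be handed a block of source code by mistake.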

The second attribute of <document> defines the default namespace marked by xmlns with no prefix indicated. The XML processor assumes that all nonprefixed element-type names are in that default namespace. Namespaces, including default namespaces, can be overridden. For instance, the code above is basically equivalent to the following:


<document xmlns:mkt="http://bogus.com/name-space-squatters" xmlns="http://xml-typographers.org/typo-markup-standard/1.0">
  <description>Memo</description>
  <title>Re: Widget 404 Request</title>
  <paragraph>
    We need 5 of
      <mkt:product xmlns:mkt="http://our.industry.org/schema/product-info">
        <mkt:code>00808</mkt:code>
        <mkt:name>Widget 404</mkt:name>
        <mkt:description>Gee-gaw and doo-dad</mkt:description>
      </mkt:product>
    to send out to reviewers this week.
  </paragraph>
</document>
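One way to convince yourself that the two forms are equivalent is to compare the expanded names a namespace-aware parser reports for each. The sketch below again assumes Python's standard ElementTree module and trims the memo to the elements that matter; the helper name expanded_names is made up for this illustration.

```python
import xml.etree.ElementTree as ET

MKT = "http://our.industry.org/schema/product-info"
TYPO = "http://xml-typographers.org/typo-markup-standard/1.0"

# The mkt prefix is bound once, on the root element.
declared_on_root = f"""\
<document xmlns:mkt="{MKT}" xmlns="{TYPO}">
  <paragraph><mkt:product><mkt:code>00808</mkt:code></mkt:product></paragraph>
</document>"""

# The mkt prefix is bound to another URI on the root, then overridden
# on <mkt:product> itself.
rebound_on_product = f"""\
<document xmlns:mkt="http://bogus.com/name-space-squatters" xmlns="{TYPO}">
  <paragraph><mkt:product xmlns:mkt="{MKT}"><mkt:code>00808</mkt:code></mkt:product></paragraph>
</document>"""

def expanded_names(xml_text):
    """Return every element name in document order, in {uri}local form."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]

names_a = expanded_names(declared_on_root)
names_b = expanded_names(rebound_on_product)
print(names_a == names_b)  # True
```

The bogus binding on the root never applies to any element, because the inner declaration takes over before the prefix is first used.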

Attribute names can also be in a particular namespace.


<foo xmlns:a="http://www.a.cd/ns1" xmlns:b="http://www.b.de/ns2">
  <a:bar a:spam="ha" b:eggs="ho"/>
</foo>

The spam attribute name is in the namespace http://www.a.cd/ns1, and the eggs name in the namespace http://www.b.de/ns2. The rules for default namespaces differ a bit between elements and attributes, however. For more information, read James Clark's tutorial on namespaces (see link below). Versions 1.4 and later of libxml support XML namespaces, although some bug fixes for attribute namespaces have been earmarked for the 1.6 release. This is pretty good progress considering that not many XML processors support namespaces yet. However, the popular Simple API for XML (SAX), which many parsers use as a frontend, does not yet support namespaces. Nor does the DOM support namespaces. And speaking of the DOM ...
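Before moving on to the DOM, the difference between elements and attributes is easy to see in a short experiment. This sketch assumes Python's standard ElementTree module and a version of the attribute example in which both prefixes are declared with xmlns: attributes. The default namespace URI and the plain attribute are additions of mine, included only to show that an unprefixed attribute does not pick up the default namespace.

```python
import xml.etree.ElementTree as ET

# http://example.org/default and the "plain" attribute are hypothetical
# additions for this illustration; the a and b URIs come from the example.
SOURCE = """\
<foo xmlns:a="http://www.a.cd/ns1" xmlns:b="http://www.b.de/ns2"
     xmlns="http://example.org/default">
  <a:bar a:spam="ha" b:eggs="ho" plain="hm"/>
</foo>"""

root = ET.fromstring(SOURCE)
bar = root[0]

# The unprefixed element <foo> lands in the default namespace ...
print(root.tag)   # {http://example.org/default}foo

# ... but the unprefixed attribute stays outside any namespace.
for name, value in bar.attrib.items():
    print(name, "=", value)
```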

The Document Object Model (DOM)

You might be familiar with the DOM from JavaScript browser scripts. The W3C developed the spec to ease the headaches of Web programmers by allowing code to be generally portable from browser to browser. The DOM group recognized the need for a similar standardization of XML tools and incorporated XML support into the DOM's core. Many XML parsers, processors, and other applications use the DOM as a standard for object and method names in XML APIs.

The DOM reflects the natural tree structure of XML documents: most XML components are instances of the abstract class Node, which has attributes such as parentNode, firstChild, and so on.

The DOM defines more specialized interfaces for documents, elements, text, attributes, entities, and other abstractions. It also provides standard collection classes for nodes.
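To make the tree navigation concrete, here is a small sketch using xml.dom.minidom, a lightweight DOM implementation in the Python standard library. The choice of implementation and the shortened memo are mine; the attribute names are the ones the DOM itself defines.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<document><title>Re: Widget 404 Request</title>"
    "<paragraph>We need 5 widgets.</paragraph></document>"
)

root = doc.documentElement   # the <document> element
title = root.firstChild      # first child node of <document>
text = title.firstChild      # the text node inside <title>

print(title.nodeName)                   # title
print(title.parentNode is root)         # True
print(title.nextSibling.nodeName)       # paragraph
print(text.nodeType == text.TEXT_NODE)  # True
print(text.data)                        # Re: Widget 404 Request
```

Every object touched here, whether document, element, or text, is a Node, which is why the same handful of attributes works at each level.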

Several XML processing tools for Linux provide DOM interfaces. The most versatile libraries and parsers provide SAX for a straightforward, sequential processing of XML source, and DOM for cases where random access to document components is desired. libxml-perl, a collection of Perl XML tools that is very different from libxml, supports both SAX and the DOM, as does the Python XML package. IBM's XML4J and XML4C, powerful XML libraries for Java and C, support both interfaces as well. libxml has a standard SAX interface and bases its internal data structures on the DOM. A few important features haven't been standardized for the DOM, particularly support for namespaces, for information provided in DTDs, and for interfaces for reading and writing XML source. Those lacks are being addressed in DOM updates (called Levels by the W3C).
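For contrast with the DOM's random access, here is a minimal sketch of SAX-style sequential processing using the xml.sax module from the Python standard library. The handler class name and the sample fragment are illustrative. The parser pushes events to the handler as it reads and never builds a tree.

```python
import xml.sax

class ElementLogger(xml.sax.ContentHandler):
    """Record start and end events in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

handler = ElementLogger()
xml.sax.parseString(
    b"<product><code>00808</code><name>Widget 404</name></product>",
    handler,
)

for event in handler.events:
    print(event)
```

Because nothing is retained unless the handler keeps it, this style suits very large documents; when you need to jump around the document, the DOM is the better fit.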

Extensible Stylesheet Language Transformations (XSLT)

The most exciting thing about XML is the way it allows people to define their own sets of tags with their own meanings. The natural complication of this has been an explosion of languages, some standard and some proprietary, based on XML. The XML community recognized the need for a standard approach to transforming XML, one that could convert documents from one form to another -- and even allow the general processing of XML data. XSLT provides that facility.

But the development of XSLT wasn't in quite so straight a line as one might have thought. As its name implies, it is part of the Extensible Stylesheet Language (XSL) effort for rendering XML documents into various media. The W3C first developed a general XML vocabulary for expressing presentation elements (similar to the role of HTML) and specified XSLT as a way to process rich XML data into the pure presentation format, known as formatting objects (FOs).

Of course, the main problem people were trying to solve all the while was how to render XML documents to HTML-based Web browsers. They used the XSL transforms mostly to produce HTML rather than FOs. The W3C realized that the transformation language was really an entity all its own and began to develop it separately.

XSLT is a powerful language, but its syntax might be perplexing to C/C++ programmers. Really more of a functional language, XSLT will at first probably come more naturally to users of Lisp, Scheme, and the like. But XSLT doesn't take too long to figure out. As a basic example, let us look at a stylesheet we might use to render a memo from the Namespaces example as HTML.


<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/XSL/Transform/1.0">

 <xsl:template match="/">
  <HTML>
  <HEAD><TITLE><xsl:value-of select='/document/title'/></TITLE>
  </HEAD>
  <BODY>
  <xsl:apply-templates/>
  </BODY>
  </HTML>
 </xsl:template>

 <xsl:template match="title">
  <H1><xsl:apply-templates/></H1>
 </xsl:template>

 <xsl:template match="paragraph">
  <P>
   <xsl:apply-templates/>
  </P>
 </xsl:template>

 <xsl:template match="product">
  <xsl:element name='A'>
  <xsl:attribute name='href'><xsl:value-of select='code'/></xsl:attribute> <xsl:value-of select='name'/>
  </xsl:element>
 </xsl:template>

</xsl:stylesheet>

Don't worry if your head is swimming. I'll go over how the code works. But it might be useful to first view the output. If you run a stylesheet processor against the first version of the marketing memo, the one that does not use namespaces, and the stylesheet above, you should get the following result (or one very similar):


<HTML>
 <HEAD>
  <TITLE>Re: Widget 404 Request</TITLE>
 </HEAD>
 <BODY>Memo<H1>Re: Widget 404 Request</H1>
  <P>We need 5 of
   <A href = '00808'>Widget 404
   </A>
   to send out to reviewers this week.</P>
 </BODY>
</HTML>

First, several stylesheet processors are available for Linux. I used 4XSLT, a Python XSLT processor developed by my company, but there is also XT by James Clark, who deserves as much recognition as any other open source pioneer for the tremendous work he has put into freeware SGML and XML tools. Countless SGML and XML users were raised on his Expat, Jade, XT, and other tools. There is also IBM's LotusXSL. See "Related Sites" below for links to these tools, all of which are freeware. There are other processors that will work in Linux, but those should be a good start. In most cases you can just specify the files containing the XML source and the stylesheet, but there is also a standard processing instruction for specifying stylesheets -- which can be the older cascading stylesheets (CSS) or XSLT -- in XML documents.
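The stylesheet processing instruction mentioned above sits near the top of the XML document, before the root element. As a sketch of what it looks like and how a tool might find it, the following uses xml.dom.minidom from the Python standard library. The type value text/xsl and the file name are assumptions for illustration; check what your processor expects.

```python
from xml.dom import minidom

# The xml-stylesheet line is the processing instruction; its pseudo-attributes
# (type, href) are illustrative values, not requirements of any one tool.
SOURCE = """\
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="xslt-demo.xslt"?>
<document><title>Re: Widget 404 Request</title></document>"""

doc = minidom.parseString(SOURCE)

# Processing instructions before the root element are children of the document node.
stylesheet_pis = [
    node for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
    and node.target == "xml-stylesheet"
]

pi = stylesheet_pis[0]
print(pi.target)  # xml-stylesheet
print(pi.data)    # type="text/xsl" href="xslt-demo.xslt"
```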

To try the example using 4XSLT, install the software and copy the XML source to xslt-demo.xml, the stylesheet to xslt-demo.xslt, and then enter

python Process.py xslt-demo.xml xslt-demo.xslt

The result will be printed to standard output.

And now to explain the stylesheet. As you can see, it is regular XML. You might also notice that, as mandated by the standard, it declares a namespace, http://www.w3.org/XSL/Transform/1.0. The elements in the namespace are known as instructions; they direct the flow and output of processing. You'll notice the several template instructions, known as template rules. These indicate to the processor that we have rules to apply whenever we come across parts of the source XML that match the template. Examine the first template

<xsl:template match="/">...

The match attribute has a value of "/", which indicates that it should match the entire document. The notation in the match clauses is a special and somewhat complex pattern language in which slashes separate levels of tags in a manner analogous to slashes in the Unix directory hierarchy. The entire document contains one element with the name document, so to specify this element, we would say /document. The document contains a paragraph element, which we could indicate with /document/paragraph, and so on. There is much more to the patterns, but that will suffice for our example.
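If the directory analogy helps, you can experiment with slash-separated paths outside of any stylesheet. The sketch below assumes Python's standard ElementTree module, which supports only a small subset of this path notation and always works relative to an element you already hold, so the leading /document is implied by starting at the root element.

```python
import xml.etree.ElementTree as ET

# The first (non-namespace) version of the memo.
MEMO = """\
<document>
  <description>Memo</description>
  <title>Re: Widget 404 Request</title>
  <paragraph>We need 5 of
    <product>
      <code>00808</code>
      <name>Widget 404</name>
      <description>Gee-gaw and doo-dad</description>
    </product>
    to send out to reviewers this week.
  </paragraph>
</document>"""

document = ET.fromstring(MEMO)  # the <document> element

# Each path is relative to <document>, like a relative file path.
title = document.find("title")                     # /document/title
code = document.find("paragraph/product/code")     # /document/paragraph/product/code
descriptions = document.findall(".//description")  # description at any depth

print(title.text)         # Re: Widget 404 Request
print(code.text)          # 00808
print(len(descriptions))  # 2
```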

Once the stylesheet processor finds the match, it processes the instructions within the template. Following special, implicit instructions, it simply echoes the text and any elements that are not instructions within a template. Those elements are known as literals, and accordingly, the processor begins by putting out


<HTML>
 <HEAD>
  <TITLE>

Then it runs into the value-of instruction. This instruction evaluates the contents of its select attribute using an expression language that is a superset of the pattern language we mentioned. The expression language is a separate standard called XPath. You can see that this value would refer to the title element inside the document element. In XPath expressions, the value of an element is the concatenated value of all its contents. The content of the element in question is the text "Re: Widget 404 Request." The XSLT processor then writes that to the output and continues echoing the literals after the value-of instruction, so we now have


<HTML>
 <HEAD>
  <TITLE>Re: Widget 404 Request</TITLE>
 </HEAD>
 <BODY>
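The rule that the value of an element is the concatenation of all its contents is worth a quick experiment, because it matters as soon as an element has children. This is a sketch of that rule in Python using the standard ElementTree module; the helper name string_value is made up here, and the code imitates the rule rather than calling any XPath engine.

```python
import xml.etree.ElementTree as ET

product = ET.fromstring(
    "<product><code>00808</code><name>Widget 404</name></product>"
)

def string_value(element):
    """Concatenate all text inside the element, in document order."""
    return "".join(element.itertext())

print(string_value(product.find("name")))  # Widget 404
print(string_value(product))               # 00808Widget 404
```

For title, which contains only text, the value is simply that text; for product, the text of both children runs together.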

It's beginning to take shape, no? Now we come to the apply-templates instruction. This is the recursive heart of the processing language. The matching of a template establishes a context that has several aspects. Most important, the context marks where in the XML source we are. Since the first template matched the entire document, that is our context. An analogy to the Linux file system works here: The context is similar to the current working directory. Within the first template the current working directory is analogous to the root directory. Within a context, just as in Linux file systems, you don't need to always specify the entire path to a template match. In our current context, document is the same as /document. What apply-templates does is look through all the templates and process all those that match in the current context. Let's look at the second template:

<xsl:template match="title">...

The processor will look for an element named title at the top level of the XML source. Of course the search will fail, as will all the other searches. Where does the processor go from here? XSLT defines a few built-in or default template rules that match XML source that explicit rules don't address. One of these matches any element at the top level of the current context and simply executes apply-templates on its contents. There is only one element at this level, document. So the processor now calls apply-templates within this built-in rule, and the context is shifted to the document element.

Now several templates match. The first element in the current context is description. No template matches this, so the processor uses the built-in rule and calls apply-templates for the /document/description context. All there is at this level is the text "Memo." Another built-in rule simply echoes the text into the output. Next is the title element, matched by the second template we defined. This template puts out the literal <H1> tag, and calls apply-templates within it. Again the only XML source at this /document/title context is the text "Re: Widget 404 Request," which is echoed to output according to the built-in rule for text. So our output at this point is


<HTML>
 <HEAD>
  <TITLE>Re: Widget 404 Request</TITLE>
 </HEAD>
 <BODY>Memo<H1>Re: Widget 404 Request</H1>
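The recursion traced above can be imitated in a few lines of Python, which may make the interplay of explicit and built-in rules easier to see. To be clear, this is a toy model and not an XSLT processor: the function names and the RULES table are invented for the sketch, it handles only the four element types in the memo, and it skips details such as whitespace handling and escaping.

```python
import xml.etree.ElementTree as ET

MEMO = (
    "<document><description>Memo</description>"
    "<title>Re: Widget 404 Request</title>"
    "<paragraph>We need 5 of "
    "<product><code>00808</code><name>Widget 404</name></product>"
    " to send out to reviewers this week.</paragraph></document>"
)

def apply_templates(element):
    """Process an element's contents in order: text is echoed (the built-in
    text rule) and each child element is handed to process()."""
    output = element.text or ""
    for child in element:
        output += process(child)
        output += child.tail or ""  # text that follows the child element
    return output

def title_rule(element):
    return "<H1>" + apply_templates(element) + "</H1>"

def paragraph_rule(element):
    return "<P>" + apply_templates(element) + "</P>"

def product_rule(element):
    # Stands in for value-of select='code' and value-of select='name'.
    return '<A href="%s">%s</A>' % (element.findtext("code"), element.findtext("name"))

# Explicit template rules, keyed by element name.
RULES = {"title": title_rule, "paragraph": paragraph_rule, "product": product_rule}

def process(element):
    # Use an explicit rule if one matches; otherwise fall back to the
    # built-in element rule, which only recurses into the contents.
    rule = RULES.get(element.tag, apply_templates)
    return rule(element)

root = ET.fromstring(MEMO)
body = process(root)
print(body)
```

Neither document nor description has an explicit rule, so both fall through to the built-in behavior, which is why "Memo" appears in the output with no markup around it.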

Hopefully, this is enough explanation to get you started. Our example stylesheet uses features we haven't discussed, such as the element and attribute instructions, but I'm out of space to explain them. The concepts I have gone over -- templates, patterns, expressions, and context -- are central to XSLT, however, and if you follow them, you'll probably find the rest of XSLT pretty straightforward. James Tauber's XSLT tutorial, listed in the Related Sites section, is a good place to start, although as I write it is still a bit out of date. If you are lucky enough to have someone who can pay for instructional material, or have a few dollars to spare, you'll also find a link to inexpensive commercial training materials in Related Sites.

Conclusion

We have covered a lot of ground in this article. I have tried to give enough of an introduction to the core technologies that we can get down to the grit of using XML practically in Linux systems. The further tutorials and Linux software I have pointed out may get you experimenting in the meantime.

You may notice that I have not discussed two technologies that I claim are key: XLink and Schemas. The main reason is that little software implements those technologies at the moment, so there isn't anywhere to go with them in a practical view of XML under Linux. I don't think it will be long before suitable implementations emerge. My company is itself working on an open-source Python XLink processor and is also examining Schemas. As implementations emerge, I'll be happy to add them to the survey.

And of course, I would love to hear how you are already using XML in Linux, what tools you find useful, and what your impressions are of the general technology so far.

 

Uche Ogbuji is a consultant and cofounder of FourThought LLC, a consulting firm specializing in custom software development for enterprise applications, particularly Web-based integration platforms for small or medium-size business.


RELATED IDG.net STORIES:
There is more to XML than roll-your-own HTML
(LinuxWorld)
Net management with XML
(Network World Fusion)
FAQ: What is XML?
(Computerworld)
The XML wars: Vendors gird their loins with standards bodies
(Windows Tech Edge)
XML and Java: A powerful combination
(JavaWorld)
XML comes of age at Internet World
(InfoWorld)
XML finding a place in Web procurement
(Computerworld)
Year 2000 World
(IDG.net)

RELATED SITES:
The Cocoon project
Freshmeat
James Clark's tutorial on namespaces in XML
libxml, a general Linux/Unix library for XML
XML package for Python
libxml-perl, a collection of Perl tools for XML processing
FourThought's XSLT processor
James Clark's XT XSLT processor
XSLT processor from IBM's laudable alphaWorks project
James Tauber's XSLT tutorial
Commercial training materials for XSLT and XPath