
Practical XML with Linux
October 1, 1999
Web posted at: 12:30 p.m. EDT (1630 GMT)
by Uche Ogbuji
(IDG) -- In March, I wrote an article (see
"There is more to XML than roll-your-own HTML," link
below) about the Extensible Markup Language and its affinity to
Linux and the Linux way of doing things. Due to overwhelming reader
feedback, LinuxWorld has scheduled a series of follow-up
articles. In this article and others to follow, I'll take a closer
look at some of the practical things you can do with XML.
Luckily for this purpose, the Linux community has taken to XML as
well as I could have hoped. Many Linux development projects and
languages use XML processors and libraries. The Cocoon project is
building around Apache an XML-processing system that is, in at least
one area, ahead of most commercial equivalents. KDE, the K Desktop
Environment, uses XML as the native file format for its impressive
KOffice. GNOME has an entire menagerie of XML tools, libraries, and
applications, some part of the general releases and some strictly in
CVS, the Concurrent Versions System. It is also nice to see that a
trickle of apps -- mostly GNOME Toolkit apps using
libxml -- are moving to XML-based config files, as I
advocated in April.
In commercial space, Oracle 8i has come to Linux with an
aggressive adoption of XML and many tools for XML document
management. As I write, IBM's DB2 Universal Database 6.1 should be
winging its way by courier to those who signed up for development
copies; it, too, makes a strong commitment to XML. The other
Universal DBMS engines for Linux are not far behind.
But there's no need for me to trot out a long list of the XML
projects for Linux. A quick search of Freshmeat with keywords such
as XML, DOM, and XSL will yield riches for XML
newbies and gurus alike.
What's new in tag land?
The standards bodies have been
just as busy as the Linux hackers. It's pretty safe to say that
right now the key standards for XML are either at 1.0 or getting
there quickly. These are the standards that will help truly
establish XML's interoperability:
- XML: the core standard
- Namespaces: a way to resolve tag-name clashes within XML
documents
- DOM: the Document Object Model, a set of standard interfaces
for accessing XML and HTML document components
- XSLT: Extensible Stylesheet Language Transformations, a
general language that is the best approach for exporting XML to
an HTML-browser world
- XLink: the extended linking capabilities discussed in the
April article
- XML Schema: perhaps not as far along as the others, this
important spec addresses several shortcomings of DTD, the
document type definition, for expressing constraints on XML
documents
Other standards are complete or in process; they include XML
Fragments, XML Query Language, and XHTML, an XML-compliant and thus
thankfully strict dialect of HTML. But those are not as central as
the standards in the list.
This article briefly describes these standards. Later, I'll
discuss how all these technologies can come together under Linux to
lend a great deal of power to Linux applications.
First a few words on the font of all this activity: The XML 1.0
specification is still an amazing piece of work. It tackles the core
language and the normal behavior of XML processors, dealing with
such complex issues as character encodings along the way. If you
want an esoteric discussion of what the standard should or should
not have inherited from SGML, and other such minutiae, you could
always watch such expert mailing lists as xml-dev, but for most
purposes the core XML spec is a rock -- and a particularly good
example for other standards in every area except readability.
Namespaces in XML
One key area that the core XML didn't address is name clashes. Suppose
we are using tags from two specifications; perhaps one is for
general document formats (for emphasis, titles, paragraphs, etc.)
and another is for marketing terminology. In the former, the
element-type name code refers to text that should be
formatted as computer source code; in the latter, code is
part of a product specification.
<document>
  <description>Memo</description>
  <title>Re: Widget 404 Request</title>
  <paragraph>
    We need 5 of
    <product>
      <code>00808</code>
      <name>Widget 404</name>
      <description>Gee-gaw and doo-dad</description>
    </product>
    to send out to reviewers this week.
  </paragraph>
</document>
Although it might be clear to a person that code here
refers to a product code and not to computer source code, it might
not be clear to search engines and style processors, and the results
of their processing might reflect the confusion. Notice that the
element-type name description suffers from the same
problem. XML needs the capability, common in programming languages,
of specifying universally unique names. Enter the
Namespaces in XML recommendation from W3C, the World Wide Web
Consortium. Under this spec, the code could be rewritten as follows.
<document
    xmlns:mkt="http://our.industry.org/schema/product-info"
    xmlns="http://xml-typographers.org/typo-markup-standard/1.0">
  <description>Memo</description>
  <title>Re: Widget 404 Request</title>
  <paragraph>
    We need 5 of
    <mkt:product>
      <mkt:code>00808</mkt:code>
      <mkt:name>Widget 404</mkt:name>
      <mkt:description>Gee-gaw and doo-dad</mkt:description>
    </mkt:product>
    to send out to reviewers this week.
  </paragraph>
</document>
Now all the potentially ambiguous names are qualified in a
standard manner. The <document> tag defines two
namespaces. The first is marked by the prefix mkt, an
alias for the URL http://our.industry.org/schema/product-info.
A special attribute name starting with xmlns: indicates
that a namespace is being defined, and the rest of the attribute
name specifies the prefix to be used in the names of elements in the
new namespace. To an XML processor that handles namespaces, the tag
<mkt:product> is qualified by that URI. Note that a
URI (Uniform Resource Identifier) is a superset of URL. Also note
that a namespace name is a URI by format, but the spec
disavows any particular meaning for the URI. If it is a URL, there
is no guarantee you will find anything at that URL: it is merely a
unique string. There has been some debate about this lack of
namespace meaning, and about the choice of URIs for naming, but the
system does work well.
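As a rough illustration of what "qualified by that URI" means in practice, here is a minimal sketch using the ElementTree module from a present-day Python standard library (an assumption on my part; it is not one of the tools surveyed in this article). ElementTree replaces each prefix with the full namespace name in braces, so the two description elements no longer share a name. Unprefixed elements land in the default namespace, which is explained next.

```python
import xml.etree.ElementTree as ET

MEMO = """\
<document
    xmlns:mkt="http://our.industry.org/schema/product-info"
    xmlns="http://xml-typographers.org/typo-markup-standard/1.0">
  <description>Memo</description>
  <title>Re: Widget 404 Request</title>
  <paragraph>We need 5 of
    <mkt:product>
      <mkt:code>00808</mkt:code>
      <mkt:name>Widget 404</mkt:name>
      <mkt:description>Gee-gaw and doo-dad</mkt:description>
    </mkt:product>
    to send out to reviewers this week.
  </paragraph>
</document>
"""

MKT = "http://our.industry.org/schema/product-info"
TYPO = "http://xml-typographers.org/typo-markup-standard/1.0"

root = ET.fromstring(MEMO)

# ElementTree expands every name to "{namespace-URI}local-name",
# so the prefix itself (mkt) never appears in the parsed tree.
typo_descriptions = [e.text for e in root.iter("{%s}description" % TYPO)]
mkt_descriptions = [e.text for e in root.iter("{%s}description" % MKT)]

print(root.tag)
print(typo_descriptions)   # the memo-level description only
print(mkt_descriptions)    # the product description only
```

A search engine or style processor working from these expanded names can tell the two kinds of description apart without guessing from context.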
The second attribute of <document> defines the
default namespace marked by xmlns with no prefix
indicated. The XML processor assumes that all nonprefixed
element-type names are in that default namespace. Namespaces,
including default namespaces, can be overridden. For instance, the
code above is basically equivalent to the following:
<document
    xmlns:mkt="http://bogus.com/name-space-squatters"
    xmlns="http://xml-typographers.org/typo-markup-standard/1.0">
  <description>Memo</description>
  <title>Re: Widget 404 Request</title>
  <paragraph>
    We need 5 of
    <mkt:product
        xmlns:mkt="http://our.industry.org/schema/product-info">
      <mkt:code>00808</mkt:code>
      <mkt:name>Widget 404</mkt:name>
      <mkt:description>Gee-gaw and doo-dad</mkt:description>
    </mkt:product>
    to send out to reviewers this week.
  </paragraph>
</document>
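One way to check the claimed equivalence is to parse both forms and compare the expanded element names. The sketch below again assumes Python's standard ElementTree module, and it trims the memo down to the elements that matter; the helper name qualified_names is made up for the illustration.

```python
import xml.etree.ElementTree as ET

ORIGINAL = """<document xmlns:mkt="http://our.industry.org/schema/product-info"
 xmlns="http://xml-typographers.org/typo-markup-standard/1.0">
<paragraph><mkt:product><mkt:code>00808</mkt:code></mkt:product></paragraph>
</document>"""

OVERRIDDEN = """<document xmlns:mkt="http://bogus.com/name-space-squatters"
 xmlns="http://xml-typographers.org/typo-markup-standard/1.0">
<paragraph><mkt:product xmlns:mkt="http://our.industry.org/schema/product-info">
<mkt:code>00808</mkt:code></mkt:product></paragraph>
</document>"""


def qualified_names(source):
    """Return the expanded name of every element, in document order."""
    return [elem.tag for elem in ET.fromstring(source).iter()]


# The inner declaration on <mkt:product> overrides the outer one,
# so both documents yield identical expanded names.
same = qualified_names(ORIGINAL) == qualified_names(OVERRIDDEN)
print(same)
```

The bogus outer declaration never takes effect here because no element uses the mkt prefix outside the scope of the inner declaration.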
Attribute names can also be in a particular namespace.
<foo xmlns:a="http://www.a.cd/ns1"
     xmlns:b="http://www.b.de/ns2">
  <a:bar a:spam="ha" b:eggs="ho"/>
</foo>
The spam attribute name is in the namespace
http://www.a.cd/ns1, and the eggs name in
the namespace http://www.b.de/ns2. The rules for
default namespaces differ a bit between elements and attributes,
however. For more information, read James Clark's tutorial on
namespaces (see link below). Versions 1.4 and later of
libxml support XML namespaces, although some bug fixes
for attribute namespaces have been earmarked for the 1.6 release.
This is pretty good progress considering that not many XML
processors support namespaces yet. However, the popular Simple API
for XML (SAX), which many parsers use as a frontend, does not yet
support namespaces. Nor does the DOM support namespaces. And
speaking of the DOM ...
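Before leaving namespaces, the difference between elements and attributes can be seen in a short sketch, assuming Python's standard ElementTree module. The default namespace http://www.example.org/default and the unprefixed attribute named plain are hypothetical additions of mine, included only to show that a default namespace applies to unprefixed elements but not to unprefixed attributes.

```python
import xml.etree.ElementTree as ET

SOURCE = """<foo xmlns:a="http://www.a.cd/ns1"
     xmlns:b="http://www.b.de/ns2"
     xmlns="http://www.example.org/default">
  <a:bar a:spam="ha" b:eggs="ho" plain="hm"/>
</foo>"""

root = ET.fromstring(SOURCE)
bar = root[0]

# Prefixed attributes carry their namespace; the unprefixed one does not
# pick up the default namespace, unlike the unprefixed <foo> element.
attribute_names = sorted(bar.attrib)

print(root.tag)
print(attribute_names)
```

The unprefixed element foo ends up in the default namespace, while the unprefixed attribute plain stays in no namespace at all.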
The Document Object Model (DOM)
You might be familiar with the DOM from JavaScript browser
scripts. The W3C developed the spec to ease the headaches of Web
programmers by allowing code to be generally portable from browser
to browser. The DOM group recognized the need for a similar
standardization of XML tools and incorporated XML support into the
DOM's core. Many XML parsers, processors, and other applications use
the DOM as a standard for object and method names in XML APIs.
The DOM reflects the natural tree structure of XML documents:
most XML components are instances of the abstract class
Node, which has attributes such as
parentNode, firstChild, and so on.
The DOM defines more specialized interfaces for documents,
elements, text, attributes, entities, and other abstractions. It
also provides standard collection classes for nodes.
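To make the Node attributes concrete, here is a minimal sketch using xml.dom.minidom, a small DOM implementation in the present-day Python standard library (my choice for illustration, not one of the packages named in this article). The document is written without whitespace between tags so that every child node is an element or a piece of text we care about.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<document><title>Re: Widget 404 Request</title>"
    "<paragraph>We need 5 widgets.</paragraph></document>"
)

root = doc.documentElement      # the <document> element
title = root.firstChild         # first child node of <document>
text = title.firstChild         # the text node inside <title>

print(root.nodeName, title.nodeName, text.nodeValue)
print(title.parentNode is root)             # walk back up the tree
print(title.nextSibling.nodeName)           # walk sideways
print([child.nodeName for child in root.childNodes])  # a node collection
```

The same property names (parentNode, firstChild, nextSibling, childNodes) appear in DOM bindings for other languages, which is the portability the spec is after.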
Several XML processing tools for Linux provide DOM interfaces.
The most versatile libraries and parsers provide SAX for a
straightforward, sequential processing of XML source, and DOM for
cases where random access to document components is desired.
libxml-perl, a collection of Perl XML tools that is
very different from libxml, supports both SAX and the
DOM, as does the Python XML package. IBM's XML4J and XML4C, powerful
XML libraries for Java and C, support both interfaces as well.
libxml has a standard SAX interface and bases its
internal data structures on the DOM. A few important features
haven't been standardized for the DOM, particularly support for
namespaces, for information provided in DTDs, and for interfaces for
reading and writing XML source. Those gaps are being addressed in
DOM updates (called Levels by the W3C).
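The contrast between sequential SAX processing and random-access DOM processing can be sketched as follows. This assumes the xml.sax and xml.dom.minidom modules of a present-day Python standard library; the class name NameCollector is invented for the example.

```python
import xml.sax
from xml.dom import minidom

SOURCE = b"<product><code>00808</code><name>Widget 404</name></product>"


class NameCollector(xml.sax.ContentHandler):
    """SAX: receive events in document order and keep only what we need."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        # Called once per opening tag, as the parser reaches it.
        self.names.append(name)


handler = NameCollector()
xml.sax.parseString(SOURCE, handler)
sax_names = handler.names

# DOM: build the whole tree first, then jump straight to any part of it.
doc = minidom.parseString(SOURCE)
dom_name = doc.getElementsByTagName("name")[0].firstChild.nodeValue

print(sax_names)
print(dom_name)
```

SAX never holds the whole document in memory, which suits large inputs; the DOM costs more memory but lets the program revisit any node in any order.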
XML Stylesheet Language Transformations (XSLT)
The most exciting thing about XML is the way it allows people to
define their own sets of tags with their own meanings. The natural
complication of this has been an explosion of languages, some
standard and some proprietary, based on XML. The XML community
recognized the need for a standard approach to transforming XML, one
that could convert documents from one form to another -- and even
allow the general processing of XML data. XSLT provides that
facility.
But the development of XSLT wasn't in quite so straight a line as
one might have thought. As its name implies, it is part of the XML
Stylesheet Language effort for rendering XML documents into various
media. The W3C first developed a general XML vocabulary for
expressing presentation elements (similar to the role of HTML) and
specified XSLT as a way to process rich XML data into the pure
presentation format, known as formatting objects (FOs).
Of course, the main problem people were trying to solve all the
while was how to render XML documents to HTML-based Web browsers.
They used the XSL transforms mostly to produce HTML rather than FOs.
The W3C realized that the transformation language was really an
entity all its own and began to develop it separately.
XSLT is a powerful language, but its syntax might be perplexing
to C/C++ programmers. Really more of a functional language, XSLT
will at first probably come more naturally to users of Lisp, Scheme,
and the like. But XSLT doesn't take too long to figure out. As a
basic example, let us look at a stylesheet we might use to render a
memo from the Namespaces example as HTML.
<?xml version="1.0"?>
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/XSL/Transform/1.0">
  <xsl:template match="/">
    <HTML>
      <HEAD><TITLE><xsl:value-of select='/document/title'/></TITLE>
      </HEAD>
      <BODY>
        <xsl:apply-templates/>
      </BODY>
    </HTML>
  </xsl:template>
  <xsl:template match="title">
    <H1><xsl:apply-templates/></H1>
  </xsl:template>
  <xsl:template match="paragraph">
    <P>
      <xsl:apply-templates/>
    </P>
  </xsl:template>
  <xsl:template match="product">
    <xsl:element name='A'>
      <xsl:attribute name='href'><xsl:value-of
        select='code'/></xsl:attribute> <xsl:value-of
        select='name'/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
Don't worry if your head is swimming. I'll go over how the code
works. But it might be useful to first view the output. If you run a
stylesheet processor against the first version of the marketing
memo, the one that does not use namespaces, and the
stylesheet above, you should get the following result (or one very
similar):
<HTML>
<HEAD>
<TITLE>Re: Widget 404 Request</TITLE>
</HEAD>
<BODY>Memo<H1>Re: Widget 404
Request</H1>
<P>We need 5 of
<A href = '00808'>Widget 404
</A>
to send out to
reviewers this week.</P>
</BODY>
</HTML>
First, several stylesheet processors are available for Linux. I
used 4XSLT, a Python XSLT processor developed by my company, but
there is also XT by James Clark, who deserves as much recognition as
any other open source pioneer for the tremendous work he has put
into freeware SGML and XML tools. Countless SGML and XML users were
raised on his Expat, Jade, XT, and other tools. There is also IBM's
LotusXSL. See "Related Sites" below for links to these
tools, all of which are freeware. There are other processors that
will work in Linux, but those should be a good start. In most cases
you can just specify the files containing the XML source and the
stylesheet, but there is also a standard processing instruction for
specifying stylesheets -- which can be the older cascading
stylesheets (CSS) or XSLT -- in XML documents.
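For illustration, such a processing instruction sits near the top of the XML document, before the root element. The sketch below reads one back with xml.dom.minidom from a present-day Python standard library. The type value "text/xsl" is an assumption on my part (it is a widely used convention, not something this article specifies), and the href simply reuses the stylesheet file name from the 4XSLT example.

```python
import re
from xml.dom import minidom

SOURCE = """<?xml version="1.0"?>
<?xml-stylesheet href="xslt-demo.xslt" type="text/xsl"?>
<document><title>Re: Widget 404 Request</title></document>"""

doc = minidom.parseString(SOURCE)

stylesheets = []
for node in doc.childNodes:
    if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
            and node.target == "xml-stylesheet"):
        # The body of the instruction is plain text that only looks like
        # attributes, so it has to be picked apart by hand.
        stylesheets.append(dict(re.findall(r'(\w+)\s*=\s*"([^"]*)"', node.data)))

print(stylesheets)
```

A processor that honors the instruction would fetch the file named in href and apply it without being told about the stylesheet on the command line.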
To try the example using 4XSLT, install the software and copy the
XML source to xslt-demo.xml, the stylesheet to
xslt-demo.xslt, and then enter
python Process.py xslt-demo.xml xslt-demo.xslt
The result will be printed to standard output.
And now to explain the stylesheet. As you can see, it is regular
XML. You might also notice that, as mandated by the standard, it
declares a namespace,
http://www.w3.org/XSL/Transform/1.0. The elements in
the namespace are known as instructions; they direct
the flow and output of processing. You'll notice the several
template instructions, known as template rules. These indicate to
the processor that we have rules to apply whenever we come across
parts of the source XML that match the template. Examine the first
template:
<xsl:template match="/">...
The match attribute has a value of
"/", which indicates that it should match the
entire document. The notation in the match clauses is a special and
somewhat complex pattern language in which slashes separate levels
of tags in a manner analogous to slashes in the Unix directory
hierarchy. The entire document contains one element with the name
document, so to specify this element, we would say
/document. The document contains a paragraph element,
which we could indicate with /document/paragraph, and
so on. There is much more to the patterns, but that will suffice for
our example.
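The slash-separated paths can be tried out with Python's standard ElementTree module, which understands a small subset of this path notation. That module is my choice for illustration and is not an XSLT processor. One difference to keep in mind: ElementTree paths are written relative to the element you start from, so with the document element in hand, /document/paragraph becomes simply paragraph.

```python
import xml.etree.ElementTree as ET

MEMO = """<document>
<description>Memo</description>
<title>Re: Widget 404 Request</title>
<paragraph>We need 5 of
<product><code>00808</code><name>Widget 404</name>
<description>Gee-gaw and doo-dad</description></product>
to send out to reviewers this week.</paragraph>
</document>"""

root = ET.fromstring(MEMO)      # root is the <document> element

# Each slash descends one level, like a directory path.
title = root.findtext("title")
code = root.findtext("paragraph/product/code")

# "description" looks only one level down; ".//description" looks at any depth.
top_level_descriptions = [e.text for e in root.findall("description")]
all_descriptions = [e.text for e in root.findall(".//description")]

print(title, code)
print(top_level_descriptions)
print(all_descriptions)
```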
Once the stylesheet processor finds the match, it processes the
instructions within the template. Following special, implicit
instructions, it simply echoes the text and any elements that are
not instructions within a template. Those elements are known as
literals, and accordingly, the processor begins by putting out
<HTML>
<HEAD>
<TITLE>
Then it runs into the value-of instruction. This
instruction evaluates the contents of its select
attribute using an expression language that is a superset of the
pattern language we mentioned. The expression language is a separate
standard called XPath. You can see that this value would refer to
the title element inside the document
element. In XPath expressions, the value of an element is the
concatenated value of all its contents. The content of the element
in question is the text "Re: Widget 404 Request." The XSLT
processor then writes that to the output and continues echoing the
literals after the value-of instruction, so we now have
<HTML>
<HEAD>
<TITLE>Re: Widget 404 Request</TITLE>
</HEAD>
<BODY>
It's beginning to take shape, no? Now we come to the
apply-templates instruction. This is the recursive
heart of the processing language. The matching of a template
establishes a context that has several aspects. Most important, the
context marks where in the XML source we are. Since the first
template matched the entire document, that is our context. An
analogy to the Linux file system works here: The context is similar
to the current working directory. Within the first template the
current working directory is analogous to the root directory. Within
a context, just as in Linux file systems, you don't need to always
specify the entire path to a template match. In our current context,
document is the same as /document. What
apply-templates does is look through all the templates
and process all those that match in the current context. Let's look
at the second template:
<xsl:template match="title">...
The processor will look for an element named title
at the top level of the XML source. Of course the search will fail,
as will all the other searches. Where does the processor go from
here? XSLT defines a few built-in or default template rules that
match XML source that explicit rules don't address. One of these
matches any element at the top level of the current context and
simply executes apply-templates on its contents. There
is only one element at this level, document. So the
processor now calls apply-templates within this
built-in rule, and the context is shifted to the
document element.
Now several templates match. The first element in the current
context is description. No template matches this, so
the processor uses the built-in rule and calls
apply-templates for the
/document/description context. All there is at this
level is the text "Memo." Another built-in rule simply
echoes the text into the output. Next is the title
element, matched by the second template we defined. This template
puts out the literal <H1> tag, and calls
apply-templates within it. Again the only XML source at
this /document/title context is the text "Re:
Widget 404 Request," which is echoed to output according to the
built-in rule for text. So our output at this point is
<HTML>
<HEAD>
<TITLE>Re: Widget 404 Request</TITLE>
</HEAD>
<BODY>Memo<H1>Re: Widget 404
Request</H1>
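The walk described above can be imitated in a few lines of Python. To be clear, this is a toy emulation of the recursion, written against the standard ElementTree module; it is not XSLT, it is not how 4XSLT or any other processor is implemented, and all the function names are invented for the sketch. It hard-codes the four template rules from the example stylesheet and falls back to the two built-in rules (recurse into elements, echo text) for everything else.

```python
import xml.etree.ElementTree as ET

MEMO = ("<document><description>Memo</description>"
        "<title>Re: Widget 404 Request</title>"
        "<paragraph>We need 5 of <product><code>00808</code>"
        "<name>Widget 404</name>"
        "<description>Gee-gaw and doo-dad</description></product>"
        " to send out to reviewers this week.</paragraph></document>")


def value_of(element):
    """Like value-of: the concatenated text of everything inside."""
    return "".join(element.itertext())


def apply_templates(element):
    """Process an element's children in document order, text included."""
    out = [element.text or ""]                  # built-in rule: echo text
    for child in element:
        # Use an explicit rule if one matches, else the built-in recursion.
        rule = RULES.get(child.tag, apply_templates)
        out.append(rule(child))
        out.append(child.tail or "")            # text following the child
    return "".join(out)


def title_rule(element):
    return "<H1>" + apply_templates(element) + "</H1>"


def paragraph_rule(element):
    return "<P>" + apply_templates(element) + "</P>"


def product_rule(element):
    # Stands in for the element, attribute, and value-of instructions.
    return '<A href="%s">%s</A>' % (value_of(element.find("code")),
                                    value_of(element.find("name")))


RULES = {"title": title_rule, "paragraph": paragraph_rule,
         "product": product_rule}


def root_rule(document):
    head = "<HTML><HEAD><TITLE>%s</TITLE></HEAD>" % value_of(document.find("title"))
    # No explicit rule matches <document>, so the built-in rule recurses into it.
    return head + "<BODY>" + apply_templates(document) + "</BODY></HTML>"


result = root_rule(ET.fromstring(MEMO))
print(result)
```

Because no rule matches description, its text "Memo" falls through to the output untouched, which is exactly why it appears right after the BODY tag in the result above.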
Hopefully, this is enough explanation to get you started. Our
example stylesheet uses features we haven't discussed, such as the
element and attribute instructions, but
I'm out of space to explain them. The concepts I have gone over --
templates, patterns, expressions, and context -- are central to
XSLT, however, and if you follow them, you'll probably find the rest
of XSLT pretty straightforward. James Tauber's XSLT tutorial, listed
in the Related Sites section, is a good place to start, although as
I write it is still a bit out of date. If you are lucky enough to
have someone who can pay for instructional material, or have a few
dollars to spare, you'll also find a link to inexpensive commercial
training materials in Related Sites.
Conclusion
We have covered a lot
of ground in this article. I have tried to give enough of an
introduction to the core technologies that we can get down to the
grit of using XML practically in Linux systems. The further
tutorials and Linux software I have pointed out may get you
experimenting in the meantime.
You may notice that I have not discussed two technologies that I
claim are key: XLink and Schemas. The main reason is that little
software implements those technologies at the moment, so there isn't
anywhere to go with them in a practical view of XML under Linux. I
don't think it will be long before suitable implementations emerge.
My company is itself working on an open-source Python XLink
processor and is also examining Schemas. As implementations emerge,
I'll be happy to add them to the survey.
And of course, I would love to hear how you are already using XML
in Linux, what tools you find useful, and what your impressions are
of the general technology so far.
Uche Ogbuji is a consultant and cofounder of FourThought LLC, a consulting firm specializing in custom software development for enterprise applications, particularly Web-based integration platforms for small or medium-size business.
RELATED IDG.net STORIES:
There is more to XML than roll-your-own HTML (LinuxWorld)
Net management with XML (Network World Fusion)
FAQ: What is XML? (Computerworld)
The XML wars: Vendors gird their loins with standards bodies (Windows Tech Edge)
XML and Java: A powerful combination (JavaWorld)
XML comes of age at Internet World (InfoWorld)
XML finding a place in Web procurement (Computerworld)
Year 2000 World (IDG.net)
Note: Pages will open in a new browser window
External sites are not endorsed by CNN Interactive.
RELATED SITES:
The Cocoon project
Freshmeat
James Clark's tutorial on namespaces in XML
libxml, a general Linux/Unix library for XML
XML package for Python
libxml-perl, a collection of Perl tools for XML processing
FourThought's XSLT processor
James Clark's XT XSLT processor
XSLT processor from IBM's laudable alphaWorks project
James Tauber's XSLT tutorial
Commercial training materials for XSLT and XPath