COMPUTING

In search of XML

March 19, 1999
Web posted at: 7:19 p.m. EST (0019 GMT)

by Jason Meserve

From...
Network World Fusion

(IDG) -- The chicken or the egg?

When it comes to widespread use of XML on the Internet, that could be the question. None of the major search engines support XML, because so few Web sites use it. But without encouragement from the search engines, who's going to add XML to their pages?

Last year, the World Wide Web Consortium approved an XML specification that many thought would help spark a revolution in Web-based information retrieval. With XML tags more precisely defining the information contained in documents, the Web would become a far easier place to find just what you were looking for.

The problem, however, is the millions upon millions of legacy HTML documents out there.

"HTML is the name of the game," says Mark Sprague, co-founder and senior vice president of product design at Northern Light, a search company in Cambridge, Mass. "There would need to be a critical mass of XML or Dublin Core [a proposed metadata element set for use in the discovery of electronic resources] for us to support it. We'll be reactive to XML."

Separating the data from the presentation could make data interchange and retrieval across the Internet more efficient by letting authors and sites define just what's in their documents.

HTML describes how data should be displayed in a browser window, but does not define what that data is. The page could contain a conference agenda, a classified ad, a book review or the latest product news, but it's all basically the same to a browser.
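A rough illustration of that difference, written in Python: the same book review appears once with display-oriented HTML tags and once with descriptive XML tags. Both fragments and every tag name in the XML version are invented for this sketch; they do not come from any real site or standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical HTML fragment (wrapped in a <div> so it parses as one element).
# The tags say how to display the text, not what the text is.
HTML_PAGE = """<div>
  <h2>Networking Basics</h2>
  <p><b>Reviewed by:</b> J. Smith</p>
  <p><i>Four stars</i></p>
</div>"""

# Hypothetical XML version of the same content. The tag names are made up.
XML_PAGE = """<bookReview>
  <bookTitle>Networking Basics</bookTitle>
  <reviewer>J. Smith</reviewer>
  <rating>4</rating>
</bookReview>"""


def describe(xml_text):
    """Return the root tag and a mapping of child tag -> text."""
    root = ET.fromstring(xml_text)
    fields = {child.tag: (child.text or "").strip() for child in root}
    return root.tag, fields


# All a program learns from the HTML tags is the layout: a heading, two paragraphs.
html_tags = [child.tag for child in ET.fromstring(HTML_PAGE)]

# From the XML tags it learns what kind of document this is and what each field means.
doc_type, fields = describe(XML_PAGE)

print(html_tags)
print(doc_type, fields)
```

A crawler reading the second fragment could file the page as a book review and index the rating as a rating, without guessing from the surrounding words.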

Still, because of its ease of use and mostly standardized tags, HTML has become wildly popular.

XML, in contrast, has few defined tags: Two Web sites could have XML tags with the same name, but might define them in very different ways, making XML more difficult to implement than HTML.
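A small Python sketch of that ambiguity follows. The two documents are hypothetical: one site uses a title tag for a book's name, the other for a person's job. The indexer shown is a deliberately naive one written for this example, not how any search engine named here works.

```python
import xml.etree.ElementTree as ET

# Hypothetical site A: <title> means the name of a book.
SITE_A = "<item><title>Network Design</title><price>29.95</price></item>"

# Hypothetical site B: <title> means a person's job title.
SITE_B = "<person><name>Pat Lee</name><title>Network Engineer</title></person>"


def naive_index(docs):
    """Group element text by tag name alone, ignoring which site defined the tag."""
    index = {}
    for doc in docs:
        for element in ET.fromstring(doc).iter():
            text = (element.text or "").strip()
            if text:
                index.setdefault(element.tag, []).append(text)
    return index


index = naive_index([SITE_A, SITE_B])

# A book name and a job title end up in the same bucket.
print(index["title"])
```

Without an agreed definition for each tag, the index cannot tell the two meanings apart.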

In turn, search engines have had to develop their own language for interpreting, sorting and indexing the data returned from their Web crawlers and spiders. Because they already have their own metadata schemas, they are reluctant to develop systems to handle another one that few people currently use - no matter how cumbersome those existing schemas are.

Northern Light, which catalogs not just the Web, but a number of "special" information collections, uses a proprietary formatting system to handle data from its partners and Web crawlers.

For each new partner the company takes on, Northern Light engineers must develop a system for extracting information from the incoming data stream. "We have to take the third party's vocabulary and match it to ours," says Sheri Larsen, director of content processing at Northern Light. "Any new project is a big one, as we try to pull out as much as possible from documents."

XML could provide a standard metadata language for site developers, such as Northern Light, and for authors, such as its partners.

Web crawlers would have to travel no further than the XML tags to know exactly what is on any given page. The impetus to support XML in their crawler would have to come from people and companies developing sites with XML, says Sprague. At the moment, most sites are sticking with the tried and true HTML. "There is just so much HTML out there," Sprague says.

Northern Light is not alone in shunning XML when it comes to scouring the Web. Lycos, AltaVista and Excite also dismiss XML.

"We are not currently planning on using XML in the near future," says Ilene Quinn, spokeswoman for Compaq-owned AltaVista. "That does not rule out any future use, but at the moment there is nothing underway to incorporate XML into the AltaVista search engine."

So what's it good for?

This is not to say XML is a complete waste of time.

Excite and Lycos say they are looking to use XML behind the scenes to handle data from partners.

"Our intent is to use XML on the back end as a means of very structured access," says Graham Spencer, co-founder and chief technology officer at Excite. "That's really what XML is better for."

Spencer says that Excite is trying to convince new information partners to encode data streams being sent to Excite in XML. Many of Excite's existing data feeds have varying types of formats, meaning Excite's content team must develop different Perl scripts to sort all the incoming information.
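The burden Spencer describes can be sketched as follows. This is an illustration in Python rather than Perl, and all three feed formats, the partner names and the field names are hypothetical; nothing here reflects Excite's actual feeds or scripts.

```python
import xml.etree.ElementTree as ET


def parse_pipe_feed(line):
    """Hypothetical partner A sends: headline|source|date"""
    headline, source, date = line.split("|")
    return {"headline": headline, "source": source, "date": date}


def parse_keyvalue_feed(line):
    """Hypothetical partner B sends: date=...;src=...;hed=..."""
    raw = dict(pair.split("=", 1) for pair in line.split(";"))
    return {"headline": raw["hed"], "source": raw["src"], "date": raw["date"]}


def parse_xml_feed(text):
    """Any partner using one agreed (hypothetical) <story> format."""
    story = ET.fromstring(text)
    return {name: story.findtext(name) for name in ("headline", "source", "date")}


# Two ad hoc formats need two hand-written parsers...
a = parse_pipe_feed("Router sales climb|Partner A|1999-03-19")
b = parse_keyvalue_feed("date=1999-03-19;src=Partner B;hed=Router sales climb")

# ...while every partner that adopts the shared XML format reuses one parser.
c = parse_xml_feed(
    "<story><headline>Router sales climb</headline>"
    "<source>Partner C</source><date>1999-03-19</date></story>"
)

print(a, b, c, sep="\n")
```

Each new ad hoc format means another parser to write and maintain; a shared XML format moves that cost to a one-time agreement on tag names.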

Lycos developed an XML document type definition (DTD) for communications with certain partners, according to Lincoln Jackson, product manager for search and navigation. The DTD provides a standard format for data being streamed into the Lycos site.

But using XML to parse data from corporate partners, where there is a degree of control, is different from the Web, where anything goes.

Jackson frets that once unleashed on the Web, XML could mean the same type of "keyword spamming" that search engines now try to filter out of existing HTML meta tags. Unscrupulous developers could alter tag definitions for their own gain, defeating the purpose of a standard.

The Dublin Core initiative aims to bring a Dewey Decimal-like system to the Web that will give mainstream users and Web catalogers alike a standard means of tagging their information resources. According to the "Dublin Core Metadata Initiative" Web site, most of the specification's elements "have a commonly understood semantics of roughly the complexity of a library card catalog card."
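One common way to carry Dublin Core elements in an ordinary HTML page is through meta tags whose names begin with "DC." followed by the element name. The Python sketch below assumes that convention; the page itself is hypothetical, and the extractor class is written for this example only.

```python
from html.parser import HTMLParser


class DublinCoreExtractor(HTMLParser):
    """Collect <meta name="DC.xxx" content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        # Keep only names with the "DC." prefix; ignore ordinary meta tags.
        if name.lower().startswith("dc."):
            self.metadata[name[3:].lower()] = attributes.get("content") or ""


# Hypothetical page head using the title, creator and date elements.
PAGE = """<html><head>
<meta name="DC.title" content="In search of XML">
<meta name="DC.creator" content="Jason Meserve">
<meta name="DC.date" content="1999-03-19">
<meta name="keywords" content="xml, search, metadata">
</head><body></body></html>"""

extractor = DublinCoreExtractor()
extractor.feed(PAGE)
print(extractor.metadata)
```

Because the element names are fixed in advance, a cataloger can treat "creator" the same way on every page that uses it, much as a library card has a fixed place for the author.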

While such a system could aid in a search engine's ability to catalog the Web, it is far from becoming a specification and being put into mainstream use.

Sprague believes if and when XML begins to gain acceptance, many sites will split into two versions - one supporting XML and a sister HTML site.

For now, though, HTML remains king of the hill.

Jason Meserve is a staff writer for Network World.

