In search of XML
(IDG) -- The chicken or the egg? When it comes to widespread use of XML on the Internet, that could be the question. None of the major search engines support XML, because so few Web sites use it. But without encouragement from the search engines, who's going to add XML to their pages? Last year, the World Wide Web Consortium approved an XML specification that many thought would help spark a revolution in Web-based information retrieval. With XML tags more precisely defining the information contained in documents, the Web would become a far easier place to find just what you were looking for.
The problem, however, is the millions upon millions of legacy HTML documents out there. "HTML is the name of the game," says Mark Sprague, co-founder and senior vice president of product design at Northern Light, a search company in Cambridge, Mass. "There would need to be a critical mass of XML or Dublin Core [a proposed metadata element set for use in the discovery of electronic resources] for us to support it. We'll be reactive to XML."
Separating the data from the presentation could make data interchange and retrieval across the Internet more efficient by letting authors and sites define just what's in their documents. HTML describes how data should be displayed in a browser window, but does not say what that data is. The page could contain a conference agenda, a classified ad, a book review or the latest product news, but it's all basically the same to a browser.
Still, because of its ease of use and mostly standardized tags, HTML has become wildly popular. XML, in contrast, has few defined tags: two Web sites could have XML tags with the same name but define them in very different ways, making XML more difficult to implement than HTML.
In turn, search engines have had to develop their own language for interpreting, sorting and indexing the data returned from their Web crawlers and spiders. Because they already have their own metadata schemas, they are reluctant to develop systems to handle another one that few people currently use - no matter how cumbersome those existing schemas are.
Northern Light, which catalogs not just the Web but also a number of "special" information collections, uses a proprietary formatting system to handle data from its partners and Web crawlers. For each new partner the company takes on, Northern Light engineers must develop a system for extracting information from the incoming data stream.
"We have to take the third party's vocabulary and match it to ours," says Sheri Larsen, director of content processing at Northern Light. "Any new project is a big one, as we try to pull out as much as possible from documents."
XML could provide a standard metadata language for site developers, such as Northern Light, and authors, such as its partners. Web crawlers would have to travel no further than the XML tags to know exactly what is on any given page.
The impetus for Northern Light to support XML in its crawler would have to come from people and companies developing sites with XML, says Sprague. At the moment, most sites are sticking with the tried-and-true HTML. "There is just so much HTML out there," Sprague says.
Northern Light is not alone in shunning XML when it comes to scouring the Web. Lycos, AltaVista and Excite also dismiss XML.
"We are not currently planning on using XML in the near future," says Ilene Quinn, spokeswoman for Compaq-owned AltaVista. "That does not rule out any future use, but at the moment there is nothing underway to incorporate XML into the AltaVista search engine."
So what's it good for?
This is not to say XML is a complete waste of time. Excite and Lycos say they are looking to use XML behind the scenes to handle data from partners.
"Our intent is to use XML on the back end as a means of very structured access," says Graham Spencer, co-founder and chief technology officer at Excite. "That's really what XML is better for."
Spencer says that Excite is trying to convince new information partners to encode the data streams they send to Excite in XML. Many of Excite's existing data feeds arrive in varying formats, meaning Excite's content team must develop different Perl scripts to sort all the incoming information.
Lycos developed an XML document type definition (DTD) for communications with certain partners, according to Lincoln Jackson, product manager for search and navigation. The DTD provides a standard format for data being streamed into the Lycos site.
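The back-end use described above can be pictured with a short sketch. It is not how Excite, Lycos or Northern Light actually process feeds. The feed contents, element names, partner labels and the `normalize` helper are all invented for illustration, and Python's standard `xml.etree.ElementTree` module stands in for whatever tools a search company might use. The sketch shows that once partners deliver well-formed XML, one generic parser plus a small per-partner vocabulary table can replace a separate hand-written script for each feed format.

```python
import xml.etree.ElementTree as ET

# Hypothetical feeds from two partners. Both are well-formed XML, but each
# partner chose its own element names for the same kinds of information.
PARTNER_A_FEED = """
<feed>
  <item>
    <headline>Sample headline A</headline>
    <byline>Example Author A</byline>
    <body>Sample body text A.</body>
  </item>
</feed>
"""

PARTNER_B_FEED = """
<articles>
  <article>
    <title>Sample headline B</title>
    <author>Example Author B</author>
    <text>Sample body text B.</text>
  </article>
</articles>
"""

# Invented mapping from each partner's vocabulary to one internal vocabulary.
# "record" names the element that wraps one document; "fields" maps the
# partner's child element names to internal field names.
VOCABULARY = {
    "partner_a": {
        "record": "item",
        "fields": {"headline": "title", "byline": "author", "body": "text"},
    },
    "partner_b": {
        "record": "article",
        "fields": {"title": "title", "author": "author", "text": "text"},
    },
}


def normalize(feed_xml, partner):
    """Parse one partner feed and return records keyed by internal field names."""
    mapping = VOCABULARY[partner]
    root = ET.fromstring(feed_xml)
    records = []
    for node in root.iter(mapping["record"]):
        record = {}
        for partner_tag, internal_name in mapping["fields"].items():
            # findtext returns None when the child element is missing.
            record[internal_name] = (node.findtext(partner_tag) or "").strip()
        records.append(record)
    return records


if __name__ == "__main__":
    for partner, feed in (("partner_a", PARTNER_A_FEED), ("partner_b", PARTNER_B_FEED)):
        print(partner, normalize(feed, partner))
```

If every partner agreed to a single shared DTD, as Lycos is said to have arranged with some of its partners, even the per-partner table in this sketch would shrink to one entry.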
But using XML to parse data from corporate partners, where there is a degree of control, is different from using it on the Web, where anything goes. Jackson frets that once unleashed on the Web, XML could mean the same type of "keyword spamming" that search engines now try to filter out of existing HTML meta tags. Unscrupulous developers could alter tag definitions for their own gain, defeating the purpose of a standard.
The Dublin Core initiative aims to bring a Dewey Decimal-like system to the Web that will give mainstream users and Web catalogers alike a standard means of tagging their information resources. According to the "Dublin Core Metadata Initiative" Web site, most of the specification's elements "have a commonly understood semantics of roughly the complexity of a library card catalog card." While such a system could aid a search engine's ability to catalog the Web, it is far from becoming a specification and being put into mainstream use.
Sprague believes that if and when XML begins to gain acceptance, many sites will split into two versions - one supporting XML and a sister HTML site. For now, though, HTML remains king of the hill.
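To make the Dublin Core idea mentioned earlier more concrete, here is a minimal sketch of how a crawler might read such metadata from an ordinary HTML page. It assumes the elements are embedded as meta tags whose names carry a `DC.` prefix, a commonly used convention. The sample page, its values and the `DublinCoreExtractor` class are invented for illustration and do not describe any search engine's real crawler.

```python
from html.parser import HTMLParser

# Hypothetical page that carries Dublin Core elements in HTML meta tags.
SAMPLE_PAGE = """
<html>
  <head>
    <title>Conference agenda</title>
    <meta name="DC.title" content="Conference agenda">
    <meta name="DC.creator" content="Example Organizer">
    <meta name="DC.date" content="1999-06-01">
    <meta name="keywords" content="agenda, conference">
  </head>
  <body><h1>Agenda</h1></body>
</html>
"""


class DublinCoreExtractor(HTMLParser):
    """Collect meta tags whose name starts with 'DC.' into a dictionary."""

    def __init__(self):
        super().__init__()
        self.elements = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.lower().startswith("dc."):
            # Store under the element name without the prefix, e.g. "title".
            self.elements[name[3:].lower()] = attributes.get("content") or ""


def extract_dublin_core(html_text):
    """Return the Dublin Core elements found in an HTML document."""
    parser = DublinCoreExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.elements


if __name__ == "__main__":
    print(extract_dublin_core(SAMPLE_PAGE))
```

Nothing in this sketch checks whether the values are truthful, which is the "keyword spamming" worry Jackson raises: a standard vocabulary tells a crawler where to look, not whether to believe what it finds.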
Jason Meserve is a staff writer for Network World.
© 2001 Cable News Network. All Rights Reserved.