News websites seek more search control

NEW YORK - Leading news organizations and other publishers have proposed changing the rules that tell search engines what they can and can't collect when scouring the Web, saying the revisions would give site owners greater control over their content.

Google Inc., Yahoo Inc. and other top search companies now voluntarily respect a Web site's wishes as stated in a document known as "robots.txt," which a search engine's indexing software, called a crawler, knows to look for on a site.

Under the existing 13-year-old technology, a site can block indexing of individual Web pages, specific directories or the entire site. Some search engines have added their own commands to the rules, but they're not universally observed.
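For readers unfamiliar with the format, here is a minimal sketch of how a compliant crawler consults those rules, using Python's standard urllib.robotparser module; the example.com URLs and directory names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: block one directory and one page for all crawlers.
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/story.html
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant crawler asks before fetching each URL.
print(rp.can_fetch("AnyBot", "https://example.com/private/archive.html"))  # False
print(rp.can_fetch("AnyBot", "https://example.com/news/story.html"))       # True
```

Blocking an entire site amounts to a single "Disallow: /" rule.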

The Automated Content Access Protocol proposal, unveiled Thursday by a consortium of publishers at the global headquarters of The Associated Press, seeks to have those extra commands — and more — apply across the board.

With the ACAP commands, sites could try to limit how long search engines retain copies in their indexes, for instance, or tell the crawler not to follow any of the links that appear within a Web page.
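The announcement did not spell out the command syntax, but because ACAP piggybacks on the existing robots.txt file, an extended file might look roughly like the sketch below. The directive names (ACAP-time-limit, ACAP-no-follow) are illustrative guesses at the kinds of rules described here, not the published specification.

```python
# A minimal sketch of how a crawler might pick ACAP-style extensions out of the
# same robots.txt file it already reads. Directive names are hypothetical.
acap_rules = """\
User-agent: *
Disallow: /private/
ACAP-time-limit: /news/ 30   # drop cached copies of /news/ pages after 30 days
ACAP-no-follow: /partners/   # index these pages, but don't follow their links
"""

def parse_extensions(text):
    """Collect the non-standard directives a crawler would have to honor."""
    rules = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # strip comments and whitespace
        if line.lower().startswith("acap-"):
            field, _, value = line.partition(":")
            rules.append((field.strip(), value.strip()))
    return rules

for field, value in parse_extensions(acap_rules):
    print(field, "->", value)
```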

If search engines adopt the protocol, publishers say they would be willing to make more of their copyright-protected materials available online. But Web surfers also could find sites disappearing from search engines more quickly, or find the smaller versions of images known as thumbnails missing if sites ban such presentations.

"Robots.txt was created for a different age," said Gavin O'Reilly, president of the World Association of Newspapers, one of the organizations behind the proposal. "It works well for search engines but doesn't work for content creators."

As with the current robots.txt, ACAP's use would be voluntary, so search engines ultimately would have to agree to recognize the new commands. So far, none of the leading ones have.

Search engines also could ignore the new commands and leave it to courts to resolve any disputes.

Robots.txt was developed in 1994 following concerns that some crawlers were taxing Web sites by visiting them too many times too quickly. Although the system has never been sanctioned by any standards body, major search engines have voluntarily complied.

As search engines expanded to offer services for displaying news and scanning printed books, news organizations and book publishers began to complain that their content was being lifted from their sites and displayed on those of the search engines.

News publishers had complained that Google was posting their news summaries, headlines and photos without permission. Google claimed that "fair use" provisions of copyright laws applied, though it eventually settled a lawsuit with Agence France-Presse and agreed to pay the AP without a lawsuit being filed. Financial terms haven't been disclosed.

The proposed extensions partly grew out of those disputes. Leading the ACAP effort were groups representing publishers of newspapers, magazines, online databases, books and journals. The AP is one of dozens of organizations that have joined ACAP, and O'Reilly said those members collectively represent some 18,000 publications.

AP Chief Executive Tom Curley said the news cooperative spends hundreds of millions of dollars annually covering the world — and in many cases its employees risk their lives doing so. Technologies such as ACAP, he said, are important to protect the AP's original news reports from sites that distribute them without permission.

"The free riding deprives AP of economic returns on its investments," he said.

The new ACAP commands will use the same robots.txt file that search engines now recognize. ACAP developers tested their system with French search engine Exalead Inc. but had only informal discussions with others. Google, Yahoo and Microsoft Corp. sent representatives to Thursday's announcement but made no public promises to use ACAP.

Google spokeswoman Jessica Powell said the company supports all efforts to bring Web sites and search engines together but needed to evaluate ACAP to ensure it can meet the needs of millions of Web sites — not just those of a single community.

Joseph Siino, a senior vice president at Yahoo, said other technological initiatives already exist, including the Sitemaps system, which lets Web sites tell search engines which pages to index and how often they change.
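As a point of comparison, the Sitemaps format Siino refers to is an XML file a site publishes alongside robots.txt. The short sketch below, using Python's standard xml.etree.ElementTree module, builds a minimal sitemap for a couple of hypothetical pages.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Two hypothetical pages, each with a last-modified date and update frequency.
pages = [
    ("https://example.com/news/story.html", "2007-11-29", "hourly"),
    ("https://example.com/archive/2006.html", "2007-01-01", "yearly"),
]

urlset = ET.Element("urlset", xmlns=NS)
for loc, lastmod, changefreq in pages:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod
    ET.SubElement(url, "changefreq").text = changefreq

print(ET.tostring(urlset, encoding="unicode"))
```

Unlike the ACAP proposal, Sitemaps tells crawlers what to fetch rather than what to withhold.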

The marketplace — and not any one group — "is ultimately going to dictate what's the right solution," Siino said. "Our industry is a rapidly evolving industry. No one will want to endorse any particular solution prematurely."
