TN111 - Building Web Robots
111.1 Summary
A web robot is an application that browses the Internet as if it were a
person using a web browser. Web robots have numerous applications, such
as building indexes for a search engine, collecting pricing from a
competitor's web site, or continuously measuring the performance of a
web site. Other names for web robots include spiders, web crawlers,
and web agents.
111.2 Ethics of Web Robots
Martijn Koster published an informal standard that lets a web site tell
robots, via its /robots.txt file, which parts of the site it does not
want browsed by an automated process. See:
A Standard for Robot Exclusion, Martijn Koster
http://www.robotstxt.org/wc/norobots.html
Failure to follow the standard is, at the very least, considered rude.
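A minimal sketch of checking URLs against a /robots.txt file, using the
WWW::RobotRules module that accompanies LWP. The robot name ExampleBot,
the www.example.com host, and the rules themselves are assumptions made
up for illustration. The rules are parsed from a string so the sketch
runs without network access.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Hypothetical robot name, used only for illustration.
my $rules = WWW::RobotRules->new('ExampleBot/0.1');

# A made-up robots.txt, as a site might serve it.
my $robots_url = 'http://www.example.com/robots.txt';
my $robots_txt = <<'END';
User-agent: *
Disallow: /private/
END

# Associate the rules with the host they were (notionally) fetched from.
$rules->parse($robots_url, $robots_txt);

# Ask whether each URL may be visited under those rules.
for my $url ('http://www.example.com/index.html',
             'http://www.example.com/private/data.html') {
    my $verdict = $rules->allowed($url) ? 'allowed  ' : 'forbidden';
    print "$verdict $url\n";
}
```

In a real robot, LWP::RobotUA can be used in place of LWP::UserAgent; it
fetches each site's /robots.txt and applies the rules automatically.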
Some web sites have terms and conditions that do not permit automated
access. Failure to comply with the terms and conditions could result in
loss of access to the site. In the case of eBay vs. Bidder's Edge, eBay
sued Bidder's Edge to force them to follow the /robots.txt standard. For
details see:
eBay, Inc. vs. Bidder's Edge, Inc.
http://pub.bna.com/lw/21200.htm
111.3 LWP as a Robot Development Tool
libwww-perl (or LWP for short) is a library of Perl packages and modules
that provide the developer with the tools to interact with a web server
via HTTP and parse the returned HTML.
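As a rough illustration of that fetch-and-parse workflow, the sketch
below retrieves a page with LWP::UserAgent and lists its links with
HTML::LinkExtor. The extract_links helper and the ExampleBot agent name
are hypothetical, chosen for this example. A page is fetched only when a
URL is given on the command line.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;

# Return the absolute URLs of all <a href="..."> links found in $html.
# $base is the URL the page came from, used to resolve relative links.
sub extract_links {
    my ($html, $base) = @_;
    my $parser = HTML::LinkExtor->new(undef, $base);
    $parser->parse($html);
    $parser->eof;
    my @urls;
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;
        push @urls, "$attr{href}" if $tag eq 'a' && defined $attr{href};
    }
    return @urls;
}

# Usage: perl links.pl URL
if (@ARGV) {
    my $url = shift @ARGV;
    my $ua  = LWP::UserAgent->new(agent => 'ExampleBot/0.1', timeout => 30);
    my $response = $ua->get($url);
    die "GET $url failed: ", $response->status_line, "\n"
        unless $response->is_success;
    print "$_\n" for extract_links($response->decoded_content, $response->base);
}
```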
Roy Fielding produced the first version, based on Perl 4.036 and now
typically referred to as libwww-perl4. Gisle Aas and Martijn Koster were
the architects of the second generation of LWP, written for Perl 5.
Official LWP web site
http://www.linpro.no/lwp/
LWP is bundled with many Perl distributions, such as ActivePerl, but is
not part of the core Perl distribution.
It is also on CPAN at:
http://search.cpan.org/author/GAAS/libwww-perl-5.65/lib/Bundle/LWP.pm
ActiveState version of LWP for a PC
http://aspn.activestate.com/ASPN/Products/ActivePerl/site/lib/Bundle/LWP.html
111.4 LWP Documentation
O'Reilly has two good books on LWP:
"Web Client Programming with Perl"
http://www.amazon.com/exec/obidos/ASIN/B00005R09X/
"Perl and LWP"
http://www.amazon.com/exec/obidos/ASIN/0596001789/
Other helpful web pages:
http://www.perldoc.com/perl5.6.1/lib/LWP.html
http://www.dcs.rochester.edu/Documentation/perlnut/ch17_01.htm
http://www.devshed.com/Server_Side/Perl/DataMining/page1.html
111.5 Example Robot
This example robot queries a popular search engine.
http://www.marchansen.com/bin/txt.cgi/tn111/autoq
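The script linked above is not reproduced here. As a separate, minimal
sketch of the same idea, the code below builds a query URL and, when run
with arguments, fetches the results with LWP::RobotUA. The
build_query_url helper, the "q" parameter name, the ExampleBot agent
name, and the contact address are all assumptions; the endpoint is
supplied by the caller because each search engine uses its own.

```perl
use strict;
use warnings;
use URI;
use LWP::RobotUA;

# Build a search URL. The "q" parameter name is hypothetical; substitute
# whatever the search engine being queried expects.
sub build_query_url {
    my ($endpoint, $terms) = @_;
    my $uri = URI->new($endpoint);
    $uri->query_form(q => $terms);    # escapes the terms for use in a URL
    return $uri->as_string;
}

# Usage: perl query.pl ENDPOINT 'search terms'
if (@ARGV == 2) {
    my ($endpoint, $terms) = @ARGV;
    # LWP::RobotUA fetches and obeys the site's /robots.txt.
    my $ua = LWP::RobotUA->new('ExampleBot/0.1', 'webmaster@example.com');
    $ua->delay(1);                    # at least one minute between requests
    my $response = $ua->get(build_query_url($endpoint, $terms));
    print $response->is_success
        ? $response->decoded_content
        : "Request failed: " . $response->status_line . "\n";
}
```

Many search engines disallow automated queries in their /robots.txt or
terms and conditions, in which case LWP::RobotUA returns an error
response instead of the results page.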