Spidering Hacks

100 Industrial-Strength Tips and Tools

By Kevin Hemenway & Tara Calishain

Published by O'Reilly (7 Nov 2003)
Paperback, 424 pages
ISBN: 0-596-00577-6

Finding data on the internet is only the first step. Before it is truly useful, the data must be retrieved and extracted from the web pages in which it is embedded. Once you’ve got your hands on the raw data, you can re-purpose it to suit your needs or combine it with other data – often in surprising ways.

Spidering Hacks takes you to the next level in Internet data retrieval by showing you how to create and deploy spiders and scrapers to retrieve and work with information from your favourite sites and data sources.

Before unleashing your spiders on the world, this book will enable you to get a handle on the common idioms, tried-and-true methodologies, philosophies, and ethical considerations. It will show you how to:
• Assemble a toolbox of freely available modules and scripts, frameworks and templates.
• Comb, extract, and aggregate data from disparate sources, performing remarkable feats of recombination and analysis.
• Integrate third-party data into your own applications or web sites.
• Build a media library of audio, video, and images – from comic strips to old movies from the US Library of Congress.
• Build alternative interfaces to rich online databases – both on the Internet and within your organisation – and mine them almost as easily as your own.
• Keep your dataset current, mirroring collections to your local hard drive and carefully scheduling your spiders to run on a regular basis.
• Share your own data in a way that makes it easier to scrape and more usable to others downstream.

This is a technical book written for developers, researchers, technical assistants, librarians, webmasters, and power users. Spidering Hacks provides practical, ingenious, real-world solutions to data spidering, scraping and manipulation for a broad range of applications.