WebSPHINX:
A Personal, Customizable Web Crawler

WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for Web crawlers. A Web crawler (also called a robot or spider) is a program that browses and processes Web pages automatically.
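
To make the idea of a crawler concrete, here is a minimal sketch of the browse-and-process loop in plain Java. It does not use the websphinx package: the class name CrawlSketch, the method crawl, and the in-memory link graph are all assumptions made for illustration, and a real crawler would fetch pages over HTTP and parse their links instead of reading them from a map.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical illustration only; not part of the websphinx class library.
class CrawlSketch {

    // Breadth-first crawl over an in-memory link graph.
    // links maps each page name to the pages it links to (an assumed stand-in
    // for downloading a page and extracting its hyperlinks).
    // Returns the pages in the order they were visited, at most maxPages of them.
    static List<String> crawl(Map<String, List<String>> links, String root, int maxPages) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();        // pages already queued, to avoid revisits
        Queue<String> frontier = new ArrayDeque<>();
        frontier.add(root);
        seen.add(root);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String page = frontier.remove();
            visited.add(page);                     // "process" the page; here we only record it
            for (String target : links.getOrDefault(page, List.of())) {
                if (seen.add(target)) {            // add() returns false if already seen
                    frontier.add(target);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A tiny made-up site with a link cycle between index.html and a.html.
        Map<String, List<String>> site = Map.of(
            "index.html", List.of("a.html", "b.html"),
            "a.html", List.of("b.html", "index.html"),
            "b.html", List.of("c.html"));
        System.out.println(crawl(site, "index.html", 10));
    }
}
```

The seen set is what keeps the crawl from looping forever on sites whose pages link back to each other, and the maxPages limit is one simple way to keep a crawl bounded.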

WebSPHINX consists of two parts: the Crawler Workbench and the WebSPHINX class library.

Crawler Workbench

The Crawler Workbench is a Java applet that puts a customizable Web crawler right in your browser. Using the Crawler Workbench, you can:

WebSPHINX class library

The WebSPHINX class library provides support for writing Web crawlers in Java. The class library offers a number of features:

Getting Started

WebSPHINX is written in Java, so it runs on a variety of machine platforms. It has been tested in the following Java environments. Instructions for each environment are found below.

More Information

For more information about WebSPHINX, consult our paper:

Robert C. Miller and Krishna Bharat. SPHINX: A Framework for Creating Personal, Site-Specific Web Crawlers. In Proceedings of WWW7, Brisbane, Australia, April 1998.

For documentation of the WebSPHINX class library, see:

websphinx package documentation

To get the source code for WebSPHINX, download:

websphinx source code (note the Conditions of Use below)

Conditions of Use

WebSPHINX is Copyright © 1998, 1999 - Carnegie Mellon University. The WebSPHINX binaries (in websphinx.jar, websphinx.cab, or websphinx.zip) are released for free general use and redistribution with your programs.

The WebSPHINX source code is Copyright © 1998, 1999 - Carnegie Mellon University, released under the terms of the GNU Library General Public License.

CARNEGIE MELLON UNIVERSITY (CMU) MAKES NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED AS TO ANY MATTER INCLUDING, BUT NOT LIMITED TO, WARRANTY OF FITNESS FOR PURPOSE, OR MERCHANTABILITY, EXCLUSIVITY OR RESULTS OBTAINED FROM SPONSOR'S USE OF ANY INTELLECTUAL PROPERTY DEVELOPED UNDER THIS AGREEMENT, NOR SHALL EITHER PARTY HERETO BE LIABLE TO THE OTHER FOR INDIRECT, SPECIAL, OR CONSEQUENTIAL DAMAGES SUCH AS LOSS OF PROFITS OR INABILITY TO USE SAID INTELLECTUAL PROPERTY OR ANY APPLICATIONS AND DERIVATION THEREOF. CMU DOES NOT MAKE ANY WARRANTY OF ANY KIND WITH RESPECT TO FREEDOM FROM PATENT, TRADEMARK, OR COPYRIGHT INFRINGEMENT, OR THEFT OF TRADE SECRETS AND DOES NOT ASSUME ANY LIABILITY HEREUNDER FOR ANY INFRINGEMENT OF ANY PATENT, TRADEMARK, OR COPYRIGHT ARISING FROM THE USE OF THE PROGRAM, INFORMATION, INTELLECTUAL PROPERTY, OR OTHER PROPERTY OR RIGHTS GRANTED OR PROVIDED TO IT HEREUNDER. THE USER AGREES THAT IT WILL NOT MAKE ANY WARRANTY ON BEHALF OF CMU, EXPRESSED OR IMPLIED, TO ANY PERSON CONCERNING THE APPLICATION OF OR THE RESULTS TO BE OBTAINED WITH THE PROGRAM UNDER THIS AGREEMENT. USERS ACKNOWLEDGE THAT THE PROGRAM IS A RESEARCH TOOL STILL IN THE DEVELOPMENT STAGE, THAT IT IS BEING SUPPLIED "AS IS," WITHOUT ANY ACCOMPANYING SERVICES OR IMPROVEMENTS FROM CMU.


Send comments or questions to Rob Miller (rcm@cs.cmu.edu).