
Class websphinx.Crawler

java.lang.Object
   |
   +----websphinx.Crawler

public class Crawler
extends Object
implements Runnable, Serializable
Web crawler.

To write a crawler, extend this class and override shouldVisit() and visit().

To use a crawler:

  1. Initialize the crawler by calling setRoot() (or one of its variants) and setting other crawl parameters.
  2. Register any classifiers you need with addClassifier().
  3. Connect event listeners, such as websphinx.EventLog, websphinx.workbench.WebGraph, or websphinx.workbench.Statistics, to monitor the crawler.
  4. Call run() to start the crawler.
A running crawler consists of a priority queue of Links waiting to be visited and a set of threads retrieving pages in parallel. When a page is downloaded, it is processed as follows:
  1. classify(): The page is passed to the classify() method of every registered classifier, in increasing order of their priority values. Classifiers typically attach informative labels to the page and its links, such as "homepage" or "root page".
  2. visit(): The page is passed to the crawler's visit() method for user-defined processing.
  3. expand(): The page is passed to the crawler's expand() method. The default implementation tests every unvisited hyperlink on the page with shouldVisit(), and puts each approved link into the crawling queue.
By default, when expanding the links of a page, the crawler only considers hyperlinks (not applets or inline images, for instance) that point to Web pages (not mailto: links, for instance). If you want shouldVisit() to test every link on the page, use setLinkType(Crawler.ALL_LINKS).
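For concreteness, here is a minimal sketch of this pattern. Only methods documented on this page are used, except that Page.getURL() and Link.getURL() are assumed to exist (they are part of the websphinx API but are not documented here); the host name is hypothetical.

    import websphinx.Crawler;
    import websphinx.Link;
    import websphinx.Page;

    // A minimal crawler that prints the URL of every page it visits.
    public class PrintingCrawler extends Crawler {

        // Follow only links that stay on the (hypothetical) starting host.
        public boolean shouldVisit(Link link) {
            return link.getURL().getHost().equals("www.example.com");
        }

        // Print each page as it is retrieved.
        public void visit(Page page) {
            System.out.println("visited: " + page.getURL());
        }

        public static void main(String[] args) throws Exception {
            PrintingCrawler crawler = new PrintingCrawler();
            crawler.setRootHrefs("http://www.example.com/"); // starting point
            crawler.setMaxDepth(3);    // stay close to the root (default is 5)
            crawler.run();             // returns when the crawl is done
        }
    }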


Variable Index

 o ALL_LINKS
Specify ALL_LINKS as the link type to allow the crawler to visit any kind of link.
 o HYPERLINKS
Specify HYPERLINKS as the link type to allow the crawler to visit only hyperlinks (A, AREA, and FRAME tags which point to http:, ftp:, file:, or gopher: URLs).
 o HYPERLINKS_AND_IMAGES
Specify HYPERLINKS_AND_IMAGES as the link type to allow the crawler to visit only hyperlinks and inline images.
 o SERVER
Specify SERVER as the crawl domain to limit the crawler to visit only pages on the same Web server (hostname and port number) as the root link from which it started.
 o SUBTREE
Specify SUBTREE as the crawl domain to limit the crawler to visit only pages which are descendants of the root link from which it started.
 o WEB
Specify WEB as the crawl domain to allow the crawler to visit any page on the World Wide Web.

Constructor Index

 o Crawler()
Make a new Crawler.

Method Index

 o addClassifier(Classifier)
Adds a classifier to this crawler.
 o addCrawlListener(CrawlListener)
Adds a listener to the set of CrawlListeners for this crawler.
 o addLinkListener(LinkListener)
Adds a listener to the set of LinkListeners for this crawler.
 o addRoot(Link)
Add a root to the existing set of roots.
 o clear()
Initialize the crawler for a fresh crawl.
 o clearVisited()
Clear the set of visited links.
 o enumerateClassifiers()
Enumerates the set of classifiers.
 o enumerateQueue()
Enumerate crawling queue.
 o expand(Page)
Expand the crawl from a page.
 o getAction()
Get action.
 o getActiveThreads()
Get number of threads currently working.
 o getClassifiers()
Get the set of classifiers.
 o getCrawledRoots()
Get roots of last crawl.
 o getDepthFirst()
Get depth-first search flag.
 o getDomain()
Get crawl domain.
 o getDownloadParameters()
Get download parameters (such as number of threads, timeouts, maximum page size, etc.).
 o getIgnoreVisitedLinks()
Get ignore-visited-links flag.
 o getLinkPredicate()
Get link predicate.
 o getLinksTested()
Get number of links tested.
 o getLinkType()
Get legal link types to crawl.
 o getMaxDepth()
Get maximum depth.
 o getName()
Get human-readable name of crawler.
 o getPagePredicate()
Get page predicate.
 o getPagesLeft()
Get number of pages left to be visited.
 o getPagesVisited()
Get number of pages visited.
 o getRootHrefs()
Get starting points of crawl as a String of newline-delimited URLs.
 o getRoots()
Get starting points of crawl as an array of Link objects.
 o getState()
Get state of crawler.
 o getSynchronous()
Get synchronous flag.
 o main(String[])
 o markVisited(Link)
Register that a link has been visited.
 o pause()
Pause the crawl in progress.
 o removeAllClassifiers()
Clears the set of classifiers.
 o removeClassifier(Classifier)
Removes a classifier from the set of classifiers.
 o removeCrawlListener(CrawlListener)
Removes a listener from the set of CrawlListeners.
 o removeLinkListener(LinkListener)
Removes a listener from the set of LinkListeners.
 o run()
Start crawling.
 o sendCrawlEvent(int)
Send a CrawlEvent to all CrawlListeners registered with this crawler.
 o sendLinkEvent(Link, int)
Send a LinkEvent to all LinkListeners registered with this crawler.
 o sendLinkEvent(Link, int, Throwable)
Send an exceptional LinkEvent to all LinkListeners registered with this crawler.
 o setAction(Action)
Set the action.
 o setDepthFirst(boolean)
Set depth-first search flag.
 o setDomain(String[])
Set crawl domain.
 o setDownloadParameters(DownloadParameters)
Set download parameters (such as number of threads, timeouts, maximum page size, etc.).
 o setIgnoreVisitedLinks(boolean)
Set ignore-visited-links flag.
 o setLinkPredicate(LinkPredicate)
Set link predicate.
 o setLinkType(String[])
Set legal link types to crawl.
 o setMaxDepth(int)
Set maximum depth.
 o setName(String)
Set human-readable name of crawler.
 o setPagePredicate(PagePredicate)
Set page predicate.
 o setRoot(Link)
Set starting point of crawl as a single Link.
 o setRootHrefs(String)
Set starting points of crawl as a string of whitespace-delimited URLs.
 o setRoots(Link[])
Set starting points of crawl as an array of Links.
 o setSynchronous(boolean)
Set synchronous flag.
 o shouldVisit(Link)
Callback for testing whether a link should be traversed.
 o stop()
Stop the crawl in progress.
 o submit(Link)
Puts a link into the crawling queue.
 o submit(Link[])
Submit an array of Links for crawling.
 o toString()
Convert the crawler to a String.
 o visit(Page)
Callback for visiting a page.
 o visited(Link)
Test whether the page corresponding to a link has been visited (or queued for visiting).

Variables

 o WEB
 public static final String WEB[]
Specify WEB as the crawl domain to allow the crawler to visit any page on the World Wide Web.

 o SERVER
 public static final String SERVER[]
Specify SERVER as the crawl domain to limit the crawler to visit only pages on the same Web server (hostname and port number) as the root link from which it started.

 o SUBTREE
 public static final String SUBTREE[]
Specify SUBTREE as the crawl domain to limit the crawler to visit only pages which are descendants of the root link from which it started.

 o HYPERLINKS
 public static final String HYPERLINKS[]
Specify HYPERLINKS as the link type to allow the crawler to visit only hyperlinks (A, AREA, and FRAME tags which point to http:, ftp:, file:, or gopher: URLs).

 o HYPERLINKS_AND_IMAGES
 public static final String HYPERLINKS_AND_IMAGES[]
Specify HYPERLINKS_AND_IMAGES as the link type to allow the crawler to visit only hyperlinks and inline images.

 o ALL_LINKS
 public static final String ALL_LINKS[]
Specify ALL_LINKS as the link type to allow the crawler to visit any kind of link.

Constructors

 o Crawler
 public Crawler()
Make a new Crawler.

Methods

 o run
 public void run()
Start crawling. Returns either when the crawl is done, or when pause() or stop() is called. Because Crawler implements the java.lang.Runnable interface, a crawler can be run in a background thread.
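A sketch of the background-thread idiom, reusing the hypothetical PrintingCrawler subclass from the class overview:

    import websphinx.Crawler;

    public class BackgroundCrawl {
        public static void main(String[] args) throws Exception {
            // PrintingCrawler is the hypothetical subclass sketched in the
            // class overview.
            Crawler crawler = new PrintingCrawler();
            crawler.setRootHrefs("http://www.example.com/");

            // Crawler implements Runnable, so it can be handed to a Thread.
            Thread background = new Thread(crawler, "crawler");
            background.start();   // run() executes on the background thread

            // This thread stays free, e.g. to call crawler.pause() or
            // crawler.stop() in response to user input.
        }
    }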

 o clear
 public void clear()
Initialize the crawler for a fresh crawl. Clears the crawling queue and sets all crawling statistics to 0. Stops the crawler if it is currently running.

 o pause
 public void pause()
Pause the crawl in progress. If the crawler is running, it finishes processing the current page and then returns. The queues remain as-is, so calling run() again will resume the crawl exactly where it left off. pause() can be called from any thread.

 o stop
 public void stop()
Stop the crawl in progress. If the crawler is running, it finishes processing the current page and then returns. Empties the crawling queue.
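A sketch contrasting pause() with stop(), assuming the crawler was started on a background thread as shown under run():

    import websphinx.Crawler;

    public class CrawlControl {
        // pause() preserves the queue, so run() resumes the same crawl;
        // stop() empties the queue, so a later run() starts over from
        // the roots.
        static void pauseAndLaterResume(Crawler crawler) {
            crawler.pause();    // finishes the current page, keeps the queue

            // ... some time later ...

            new Thread(crawler).start();   // run() resumes where it left off
        }

        static void abandon(Crawler crawler) {
            crawler.stop();     // finishes the current page, empties the queue
        }
    }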

 o getState
 public int getState()
Get state of crawler.

Returns:
one of CrawlEvent.STARTED, CrawlEvent.PAUSED, CrawlEvent.STOPPED, or CrawlEvent.CLEARED.
 o visit
 public void visit(Page page)
Callback for visiting a page. Default version does nothing.

Parameters:
page - Page retrieved by the crawler
 o shouldVisit
 public boolean shouldVisit(Link l)
Callback for testing whether a link should be traversed. Default version returns true for all links. Override this method for more interesting behavior.

Parameters:
l - Link encountered by the crawler
Returns:
true if link should be followed, false if it should be ignored.
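For example, a subclass might skip links that appear to point at non-HTML resources. A sketch, assuming Link.getURL() (part of the websphinx API, not documented on this page):

    import websphinx.Crawler;
    import websphinx.Link;

    public class HtmlOnlyCrawler extends Crawler {
        // Ignore links whose URLs suggest binary or image content.
        public boolean shouldVisit(Link link) {
            String url = link.getURL().toString().toLowerCase();
            return !(url.endsWith(".gif") || url.endsWith(".jpg")
                     || url.endsWith(".zip") || url.endsWith(".ps"));
        }
    }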
 o expand
 public void expand(Page page)
Expand the crawl from a page. The default implementation of this method tests every link on the page using shouldVisit() and passes the approved links to submit(). A subclass may want to override this method if it is inconvenient to consider the links individually with shouldVisit().

Parameters:
page - Page to expand
 o getPagesVisited
 public int getPagesVisited()
Get number of pages visited.

Returns:
number of pages passed to visit() so far in this crawl
 o getLinksTested
 public int getLinksTested()
Get number of links tested.

Returns:
number of links passed to shouldVisit() so far in this crawl
 o getPagesLeft
 public int getPagesLeft()
Get number of pages left to be visited.

Returns:
number of links approved by shouldVisit() but not yet visited
 o getActiveThreads
 public int getActiveThreads()
Get number of threads currently working.

Returns:
number of threads downloading pages
 o getName
 public String getName()
Get human-readable name of crawler. Default value is the class name, e.g., "Crawler". Useful for identifying the crawler in a user interface; also used as the default User-agent for identifying the crawler to a remote Web server. (The User-agent can be changed independently of the crawler name with setDownloadParameters().)

Returns:
human-readable name of crawler
 o setName
 public void setName(String name)
Set human-readable name of crawler.

Parameters:
name - new name for crawler
 o toString
 public String toString()
Convert the crawler to a String.

Returns:
Human-readable name of crawler.
Overrides:
toString in class Object
 o getRoots
 public Link[] getRoots()
Get starting points of crawl as an array of Link objects.

Returns:
array of Links from which crawler will start its next crawl.
 o getCrawledRoots
 public Link[] getCrawledRoots()
Get roots of last crawl. May differ from getRoots() if new roots have been set.

Returns:
array of Links from which crawler started its last crawl, or null if the crawler was cleared.
 o getRootHrefs
 public String getRootHrefs()
Get starting points of crawl as a String of newline-delimited URLs.

Returns:
URLs where crawler will start, separated by newlines.
 o setRootHrefs
 public void setRootHrefs(String hrefs) throws MalformedURLException
Set starting points of crawl as a string of whitespace-delimited URLs.

Parameters:
hrefs - URLs of starting points, separated by spaces, tabs (\t), or newlines (\n)
Throws: MalformedURLException
if any of the URLs is invalid, leaving starting points unchanged
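A sketch of seeding a crawl with several roots at once; the URLs are hypothetical:

    import websphinx.Crawler;
    import java.net.MalformedURLException;

    public class Roots {
        static void seed(Crawler crawler) {
            try {
                // Spaces, tabs, and newlines all work as delimiters.
                crawler.setRootHrefs("http://www.example.com/\n"
                                   + "http://www.example.org/index.html");
            } catch (MalformedURLException e) {
                // Starting points are left unchanged if any URL is invalid.
                System.err.println("bad root URL: " + e.getMessage());
            }
        }
    }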
 o setRoot
 public void setRoot(Link link)
Set starting point of crawl as a single Link.

Parameters:
link - starting point
 o setRoots
 public void setRoots(Link links[])
Set starting points of crawl as an array of Links.

Parameters:
links - starting points
 o addRoot
 public void addRoot(Link link)
Add a root to the existing set of roots.

Parameters:
link - starting point to add
 o getDomain
 public String[] getDomain()
Get crawl domain. Default value is WEB.

Returns:
WEB, SERVER, or SUBTREE.
 o setDomain
 public void setDomain(String domain[])
Set crawl domain.

Parameters:
domain - one of WEB, SERVER, or SUBTREE.
 o getLinkType
 public String[] getLinkType()
Get legal link types to crawl. Default value is HYPERLINKS.

Returns:
HYPERLINKS, HYPERLINKS_AND_IMAGES, or ALL_LINKS.
 o setLinkType
 public void setLinkType(String type[])
Set legal link types to crawl.

Parameters:
type - one of HYPERLINKS, HYPERLINKS_AND_IMAGES, or ALL_LINKS.
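A sketch combining the scope-related settings documented on this page:

    import websphinx.Crawler;

    public class Scope {
        static void configure(Crawler crawler) {
            crawler.setDomain(Crawler.SERVER);    // stay on the root's server
            crawler.setLinkType(Crawler.HYPERLINKS_AND_IMAGES);
                                                  // also follow inline images
            crawler.setMaxDepth(10);              // widen the default depth of 5
        }
    }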
 o getDepthFirst
 public boolean getDepthFirst()
Get depth-first search flag. Default value is true.

Returns:
true if search is depth-first, false if search is breadth-first.
 o setDepthFirst
 public void setDepthFirst(boolean useDFS)
Set depth-first search flag. If neither depth-first nor breadth-first is desired, then override shouldVisit() to set a custom priority on each link.

Parameters:
useDFS - true if search should be depth-first, false if search should be breadth-first.
 o getSynchronous
 public boolean getSynchronous()
Get synchronous flag. Default value is false.

Returns:
true if crawler must visit the pages in priority order; false if crawler can visit pages in any order.
 o setSynchronous
 public void setSynchronous(boolean f)
Set synchronous flag.

Parameters:
f - true if crawler must visit the pages in priority order; false if crawler can visit pages in any order.
 o getIgnoreVisitedLinks
 public boolean getIgnoreVisitedLinks()
Get ignore-visited-links flag. Default value is true.

Returns:
true if search skips links whose URLs have already been visited (or queued for visiting).
 o setIgnoreVisitedLinks
 public void setIgnoreVisitedLinks(boolean f)
Set ignore-visited-links flag.

Parameters:
f - true if search skips links whose URLs have already been visited (or queued for visiting).
 o getMaxDepth
 public int getMaxDepth()
Get maximum depth. Default value is 5.

Returns:
maximum depth of crawl, in hops from starting point.
 o setMaxDepth
 public void setMaxDepth(int maxDepth)
Set maximum depth.

Parameters:
maxDepth - maximum depth of crawl, in hops from starting point
 o getDownloadParameters
 public DownloadParameters getDownloadParameters()
Get download parameters (such as number of threads, timeouts, maximum page size, etc.).

 o setDownloadParameters
 public void setDownloadParameters(DownloadParameters dp)
Set download parameters (such as number of threads, timeouts, maximum page size, etc.).

Parameters:
dp - Download parameters
 o setLinkPredicate
 public void setLinkPredicate(LinkPredicate pred)
Set link predicate. This is an alternative way to specify the links to walk. If the link predicate is non-null, then only links that satisfy the link predicate AND shouldVisit() are crawled.

Parameters:
pred - Link predicate
 o getLinkPredicate
 public LinkPredicate getLinkPredicate()
Get link predicate.

Returns:
current link predicate
 o setPagePredicate
 public void setPagePredicate(PagePredicate pred)
Set page predicate. This is a way to filter the pages passed to visit(). If the page predicate is non-null, then only pages that satisfy it are passed to visit().

Parameters:
pred - Page predicate
 o getPagePredicate
 public PagePredicate getPagePredicate()
Get page predicate.

Returns:
current page predicate
 o setAction
 public void setAction(Action act)
Set the action. This is an alternative way to specify an action performed on every page. If act is non-null, then every page passed to visit() is also passed to this action.

Parameters:
act - Action
 o getAction
 public Action getAction()
Get action.

Returns:
current action
 o submit
 public void submit(Link link)
Puts a link into the crawling queue. If the crawler is running, the link will eventually be retrieved and passed to visit().

Parameters:
link - Link to put in queue
 o submit
 public void submit(Link links[])
Submit an array of Links for crawling. If the crawler is running, these links will eventually be retrieved and passed to visit().

Parameters:
links - Links to put in queue
 o enumerateQueue
 public Enumeration enumerateQueue()
Enumerate crawling queue.

Returns:
an enumeration of Link objects which are waiting to be visited.
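A sketch of draining the enumeration; Link.getURL() is assumed as before:

    import websphinx.Crawler;
    import websphinx.Link;
    import java.util.Enumeration;

    public class QueueDump {
        // Print every link still waiting to be visited.
        static void dump(Crawler crawler) {
            for (Enumeration e = crawler.enumerateQueue(); e.hasMoreElements(); ) {
                Link link = (Link) e.nextElement();
                System.out.println("queued: " + link.getURL());
            }
        }
    }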
 o addClassifier
 public void addClassifier(Classifier c)
Adds a classifier to this crawler. If the classifier is already found in the set, does nothing.

Parameters:
c - a classifier
 o removeClassifier
 public void removeClassifier(Classifier c)
Removes a classifier from the set of classifiers. If c is not found in the set, does nothing.

Parameters:
c - a classifier
 o removeAllClassifiers
 public void removeAllClassifiers()
Clears the set of classifiers.

 o enumerateClassifiers
 public Enumeration enumerateClassifiers()
Enumerates the set of classifiers.

Returns:
An enumeration of the classifiers.
 o getClassifiers
 public Classifier[] getClassifiers()
Get the set of classifiers.

Returns:
An array containing the registered classifiers.
 o addCrawlListener
 public void addCrawlListener(CrawlListener listen)
Adds a listener to the set of CrawlListeners for this crawler. If the listener is already found in the set, does nothing.

Parameters:
listen - a listener
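A sketch using websphinx.EventLog, one of the listeners suggested in the class overview. The no-argument constructor (logging to the console) and the fact that EventLog implements both listener interfaces are assumptions; check the EventLog documentation.

    import websphinx.Crawler;
    import websphinx.EventLog;

    public class Monitored {
        static void attach(Crawler crawler) {
            EventLog log = new EventLog();   // assumed: logs to the console
            crawler.addCrawlListener(log);   // crawl started/paused/stopped events
            crawler.addLinkListener(log);    // per-link events (assumed supported)
        }
    }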
 o removeCrawlListener
 public void removeCrawlListener(CrawlListener listen)
Removes a listener from the set of CrawlListeners. If it is not found in the set, does nothing.

Parameters:
listen - a listener
 o addLinkListener
 public void addLinkListener(LinkListener listen)
Adds a listener to the set of LinkListeners for this crawler. If the listener is already found in the set, does nothing.

Parameters:
listen - a listener
 o removeLinkListener
 public void removeLinkListener(LinkListener listen)
Removes a listener from the set of LinkListeners. If it is not found in the set, does nothing.

Parameters:
listen - a listener
 o sendCrawlEvent
 protected void sendCrawlEvent(int id)
Send a CrawlEvent to all CrawlListeners registered with this crawler.

Parameters:
id - Event id
 o sendLinkEvent
 protected void sendLinkEvent(Link l,
                              int id)
Send a LinkEvent to all LinkListeners registered with this crawler.

Parameters:
l - Link related to event
id - Event id
 o sendLinkEvent
 protected void sendLinkEvent(Link l,
                              int id,
                              Throwable exception)
Send an exceptional LinkEvent to all LinkListeners registered with this crawler.

Parameters:
l - Link related to event
id - Event id
exception - Exception associated with event
 o visited
 public boolean visited(Link link)
Test whether the page corresponding to a link has been visited (or queued for visiting).

Parameters:
link - Link to test
Returns:
true if the link has been visited or queued for visiting during this crawl
 o markVisited
 protected void markVisited(Link link)
Register that a link has been visited.

Parameters:
link - Link that has been visited
 o clearVisited
 protected void clearVisited()
Clear the set of visited links.

 o main
 public static void main(String args[]) throws Exception
