
Class websphinx.Page

java.lang.Object
   |
   +----websphinx.Region
           |
           +----websphinx.Page

public class Page
extends Region
A Web page. Although a Page can represent any MIME type, it mainly supports HTML pages, which are automatically parsed. The parsing produces a list of tags, a list of words, an HTML parse tree, and a list of links.


Constructor Index

 o Page(Link)
Make a Page by downloading and parsing a Link.
 o Page(Link, HTMLParser)
Make a Page by downloading a Link.
 o Page(String)
Make a Page from a string of content.
 o Page(URL, String)
Make a Page from a URL and a string of HTML.
 o Page(URL, String, HTMLParser)
Make a Page from a URL and a string of HTML.

Method Index

 o discardContent()
Unlock the page's content (allowing it to be garbage-collected, to save space during a Web crawl).
 o download(HTMLParser)
 o getBase()
Get the base URL, relative to which the page's links were interpreted.
 o getContent()
Get the content of the page.
 o getContentEncoding()
Get content encoding of page.
 o getContentType()
Get MIME type of page.
 o getDepth()
Get depth of page in crawl.
 o getElements()
Get the HTML elements in the page.
 o getExpiration()
Get expiration date of page.
 o getLastModified()
Get last-modified date of page.
 o getLinks()
Get the links found in the page.
 o getOrigin()
Get the Link that points to this page.
 o getResponseCode()
Get response code returned by the Web server.
 o getResponseMessage()
Get response message returned by the Web server.
 o getRootElement()
Get the root HTML element of the page.
 o getTags()
Get the tag sequence of the page.
 o getTitle()
Get the title of the page.
 o getTokens()
Get the token sequence of the page.
 o getURL()
Get the URL.
 o getWords()
Get the words in the page.
 o hasContent()
Test if page content is available.
 o isHTML()
Test whether page is HTML.
 o isImage()
 o isParsed()
Test whether page has been parsed.
 o keepContent()
Lock the page's content (to prevent it from being discarded).
 o main(String[])
 o parse(HTMLParser)
Parse the page.
 o setContentEncoding(String)
Set content encoding of page.
 o setContentType(String)
Set MIME type of page.
 o setExpiration(long)
Set expiration date of page.
 o setLastModified(long)
Set last-modified date of page.
 o substringCanonicalTags(int, int)
Get canonicalized HTML tags found in a region.
 o substringContent(int, int)
Get raw content found in a region.
 o substringHTML(int, int)
Get HTML found in a region.
 o substringTags(int, int)
Get HTML tags found in a region.
 o substringText(int, int)
Get tagless text found in a region.
 o toDescription()
Generate a human-readable description of the page.
 o toString()
Get page containing the region.
 o toURL()
Convert the page's URL to a String.

Constructors

 o Page
 public Page(Link link) throws IOException
Make a Page by downloading and parsing a Link.

Parameters:
link - Link to download
 o Page
 public Page(Link link,
             HTMLParser parser) throws IOException
Make a Page by downloading a Link.

Parameters:
link - Link to download
parser - HTML parser to use
 o Page
 public Page(URL url,
             String html)
Make a Page from a URL and a string of HTML. The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail.

Parameters:
url - URL to use as a base for relative links on the page
html - the HTML content of the page
 o Page
 public Page(URL url,
             String html,
             HTMLParser parser)
Make a Page from a URL and a string of HTML. The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail.

Parameters:
url - URL to use as a base for relative links on the page
html - the HTML content of the page
parser - HTML parser to use
 o Page
 public Page(String content)
Make a Page from a string of content. The content is not parsed. The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail.

Parameters:
content - HTML content of the page

Methods

 o download
 public void download(HTMLParser parser) throws IOException
Throws: IOException
if an error occurs in downloading the page
 o parse
 public void parse(HTMLParser parser)
Parse the page. Assumes the page has already been downloaded.

Parameters:
parser - HTML parser to use
 o isParsed
 public boolean isParsed()
Test whether page has been parsed. A page is parsed during download only if its MIME type is HTML or unspecified.

Returns:
true if page was parsed, false if not
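
The parsing rule above can be sketched as a small predicate. This is only an illustration, not the WebSPHINX source: the class and method names are hypothetical, and treating a null type as "unspecified" and matching on the "text/html" prefix are assumptions.

```java
import java.util.Locale;

// Hypothetical helper illustrating the rule "parse only if the MIME type
// is HTML or unspecified". Not part of websphinx.
final class ParseDecisionSketch {
    static boolean shouldParse(String contentType) {
        if (contentType == null) {
            return true; // assumption: null means the type was not specified
        }
        String type = contentType.trim().toLowerCase(Locale.ROOT);
        // Prefix match tolerates parameters such as "; charset=UTF-8".
        return type.startsWith("text/html");
    }
}
```

Under these assumptions, shouldParse("image/gif") is false, while shouldParse(null) and shouldParse("text/html; charset=UTF-8") are true.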
 o isHTML
 public boolean isHTML()
Test whether page is HTML.

Returns:
true if page is HTML, false if not
 o isImage
 public boolean isImage()
 o keepContent
 public void keepContent()
Lock the page's content (to prevent it from being discarded). This method increments a lock counter, representing all the callers interested in preserving the content. The lock counter is set to 1 when the page is initially downloaded.

 o discardContent
 public void discardContent()
Unlock the page's content (allowing it to be garbage-collected, to save space during a Web crawl). This method decrements a lock counter. If the counter falls to 0 (meaning no callers are interested in the content), the content is released. At least the following fields are discarded: content, tokens, tags, words, elements, and root. After the content has been discarded, calling getContent() (or getTokens(), getTags(), etc.) forces the page to be downloaded again, ideally from the cache.

Links are not considered part of the content, and are not subject to discarding by this method. Also, if the page was created from a string (rather than by downloading), its content is not subject to discarding (since there would be no way to recover it).
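
The lock-counter behavior of keepContent() and discardContent() can be modeled with a minimal stand-in class. The sketch below is not the Page implementation: LockedContent and its recoverable flag are hypothetical names, introduced only to show how the counter and the "created from a string" exception interact.

```java
// Hypothetical model of the lock counter described above; not websphinx code.
final class LockedContent {
    private String content;
    private int locks;
    private final boolean recoverable; // false when the content came from a string

    LockedContent(String content, boolean recoverable) {
        this.content = content;
        this.recoverable = recoverable;
        this.locks = 1; // the counter starts at 1 when the content first arrives
    }

    // One more caller wants the content preserved.
    void keepContent() {
        locks++;
    }

    // One caller is done; release the content when nobody is left.
    void discardContent() {
        if (!recoverable) {
            return; // content that cannot be re-fetched is never discarded
        }
        if (locks > 0 && --locks == 0) {
            content = null; // now eligible for garbage collection
        }
    }

    boolean hasContent() {
        return content != null;
    }
}
```

In this model, a caller that invokes keepContent() once must be matched by two discardContent() calls (one for the initial lock, one for its own) before the content is released.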

 o hasContent
 public final boolean hasContent()
Test if page content is available.

Returns:
true if content is downloaded and available, false if content has not been downloaded or has been discarded.
 o getDepth
 public int getDepth()
Get depth of page in crawl.

Returns:
depth of page from root (depth of page is same as depth of its originating link)
 o getOrigin
 public Link getOrigin()
Get the Link that points to this page.

Returns:
the Link object that was used to download this page.
 o getBase
 public URL getBase()
Get the base URL, relative to which the page's links were interpreted. The base URL defaults to the URL of the Link that was used to download the page. If any redirects occur while downloading the page, the final location becomes the new base URL. Lastly, if a <BASE> element is found in the page, that becomes the new base URL.

Returns:
the page's base URL.
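
The precedence described above (link URL, then redirect target, then a BASE element) can be sketched with java.net.URI. The helper names are hypothetical, and resolving the BASE href against the current base is an assumption rather than documented WebSPHINX behavior.

```java
import java.net.URI;

// Hypothetical sketch of how a base URL might be chosen and then used
// to interpret relative links. Not websphinx code.
final class BaseUrlSketch {
    static URI effectiveBase(URI linkUrl, URI redirectTarget, String baseHref) {
        URI base = linkUrl;                  // default: URL of the originating link
        if (redirectTarget != null) {
            base = redirectTarget;           // a redirect's final location wins
        }
        if (baseHref != null) {
            base = base.resolve(baseHref);   // a BASE element in the page wins last
        }
        return base;
    }

    static URI resolveLink(URI base, String href) {
        return base.resolve(href);
    }
}
```

For example, with a base of http://a.com/x/index.html and no redirect or BASE element, the relative link img/p.gif resolves to http://a.com/x/img/p.gif.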
 o getURL
 public URL getURL()
Get the URL.

Returns:
the URL of the link that was used to download this page
 o getTitle
 public String getTitle()
Get the title of the page.

Returns:
the page's title, or null if the page hasn't been parsed.
 o getContent
 public String getContent()
Get the content of the page.

Returns:
the content of the page, or null if the page hasn't been downloaded.
 o getTokens
 public Region[] getTokens()
Get the token sequence of the page. Tokens are tags and whitespace-delimited text.

Returns:
token regions in the page, or null if the page hasn't been downloaded or parsed.
 o getTags
 public Tag[] getTags()
Get the tag sequence of the page.

Returns:
tags in the page, or null if the page hasn't been downloaded or parsed.
 o getWords
 public Text[] getWords()
Get the words in the page. Words are whitespace- and tag-delimited text.

Returns:
words in the page, or null if the page hasn't been downloaded or parsed.
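
To make the difference between tokens and words concrete, here is a rough regex-based sketch. It works on plain strings rather than Region, Tag, or Text objects, the class name is hypothetical, and the pattern is a simplification that does not handle a > character inside a quoted attribute value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration: tokens are tags plus whitespace-delimited text;
// words are the tokens that are not tags. Not websphinx code.
final class TokenSketch {
    // Either a whole tag, or a run of characters containing no whitespace and no '<'.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^<\\s]+");

    static List<String> tokens(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    static List<String> words(String html) {
        List<String> out = new ArrayList<>();
        for (String t : tokens(html)) {
            if (!t.startsWith("<")) {
                out.add(t); // keep text tokens, drop tags
            }
        }
        return out;
    }
}
```

For the input "Hi <b>there</b> you", tokens() yields Hi, <b>, there, </b>, you, and words() yields Hi, there, you.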
 o getElements
 public Element[] getElements()
Get the HTML elements in the page. All elements in the page are included in the list, in the order they would appear in an inorder traversal of the HTML parse tree.

Returns:
HTML elements in the page, in inorder-traversal order, or null if the page hasn't been downloaded or parsed.
 o getRootElement
 public Element getRootElement()
Get the root HTML element of the page.

Returns:
first top-level HTML element in the page, or null if the page hasn't been downloaded or parsed.
Overrides:
getRootElement in class Region
 o getLinks
 public Link[] getLinks()
Get the links found in the page.

Returns:
links in the page, or null if the page hasn't been downloaded or parsed.
 o toURL
 public String toURL()
Convert the page's URL to a String.

Returns:
the URL represented as a string
 o toDescription
 public String toDescription()
Generate a human-readable description of the page.

Returns:
a description of the page, in the form "title [url]".
 o toString
 public String toString()
Get page containing the region.

Returns:
page containing the region
Overrides:
toString in class Region
 o getLastModified
 public long getLastModified()
Get last-modified date of page.

Returns:
the date when the page was last modified, or 0 if not known. The value is the number of seconds since January 1, 1970 GMT.
 o setLastModified
 public void setLastModified(long last)
Set last-modified date of page.

Parameters:
last - the date when the page was last modified, or 0 if not known. The value is the number of seconds since January 1, 1970 GMT.
 o getExpiration
 public long getExpiration()
Get expiration date of page.

Returns:
the expiration date of the page, or 0 if not known. The value is the number of seconds since January 1, 1970 GMT.
 o setExpiration
 public void setExpiration(long expire)
Set expiration date of page.

Parameters:
expire - the expiration date of the page, or 0 if not known. The value is the number of seconds since January 1, 1970 GMT.
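
The raw long values used by getLastModified(), setLastModified(), getExpiration(), and setExpiration() can be wrapped in java.time types. The sketch below takes the documented unit (seconds since the epoch, 0 meaning "not known") at face value; the class and method names are hypothetical. If the values turn out to be milliseconds, as java.net.URLConnection reports its dates, Instant.ofEpochMilli would be the right call instead.

```java
import java.time.Instant;
import java.util.Optional;

// Hypothetical helpers for the date values documented above. Not websphinx code.
final class PageDatesSketch {
    // 0 is the documented "not known" marker, so map it to an empty Optional.
    static Optional<Instant> toInstant(long epochSeconds) {
        return epochSeconds == 0
                ? Optional.empty()
                : Optional.of(Instant.ofEpochSecond(epochSeconds));
    }

    // A page with an unknown expiration date is treated as not expired (an assumption).
    static boolean isExpired(long expirationSeconds, Instant now) {
        return toInstant(expirationSeconds).map(now::isAfter).orElse(false);
    }
}
```

A caller might pass page.getExpiration() and Instant.now() to isExpired to decide whether a cached copy is still usable.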
 o getContentType
 public String getContentType()
Get MIME type of page.

Returns:
the MIME type of page, such as "text/html", or null if not known.
 o setContentType
 public void setContentType(String type)
Set MIME type of page.

Parameters:
type - the MIME type of page, such as "text/html", or null if not known.
 o getContentEncoding
 public String getContentEncoding()
Get content encoding of page.

Returns:
the encoding type of page, such as "base-64", or null if not known.
 o setContentEncoding
 public void setContentEncoding(String encoding)
Set content encoding of page.

Parameters:
encoding - the encoding type of page, such as "base-64", or null if not known.
 o getResponseCode
 public int getResponseCode()
Get response code returned by the Web server. For list of possible values, see java.net.HttpURLConnection.

Returns:
response code, such as 200 (for OK) or 404 (not found). Code is -1 if unknown.
See Also:
HttpURLConnection
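
Since the response codes follow java.net.HttpURLConnection, its constants can be used when interpreting the value. The following sketch is only an illustration; the class name and the category labels are hypothetical choices.

```java
import java.net.HttpURLConnection;

// Hypothetical classifier for the value returned by getResponseCode().
// Not websphinx code.
final class ResponseCodeSketch {
    static String classify(int code) {
        if (code == -1) {
            return "unknown"; // documented marker for an unknown code
        }
        if (code == HttpURLConnection.HTTP_OK) {
            return "ok"; // 200
        }
        if (code == HttpURLConnection.HTTP_NOT_FOUND) {
            return "not found"; // 404
        }
        if (code >= 300 && code < 400) {
            return "redirect";
        }
        if (code >= 400) {
            return "error";
        }
        return "other";
    }
}
```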
 o getResponseMessage
 public String getResponseMessage()
Get response message returned by the Web server.

Returns:
response message, such as "OK" or "Not Found". The response message is null if the page could not be fetched or the message is not known.
 o substringContent
 public String substringContent(int start,
                                int end)
Get raw content found in a region.

Parameters:
start - starting offset of region
end - ending offset of region
Returns:
raw HTML contained in the region
 o substringHTML
 public String substringHTML(int start,
                             int end)
Get HTML found in a region.

Parameters:
start - starting offset of region
end - ending offset of region
Returns:
representation of region as HTML
 o substringText
 public String substringText(int start,
                             int end)
Get tagless text found in a region. Runs of whitespace and tags are reduced to a single space character.

Parameters:
start - starting offset of region
end - ending offset of region
Returns:
tagless text contained in the region
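
The whitespace-and-tag collapsing described for substringText can be approximated with a single regular expression. This is a rough sketch on plain strings, not the actual implementation: the class name is hypothetical, the pattern mishandles a > inside a quoted attribute value, and whether the real method trims a leading or trailing space is not stated in this documentation.

```java
// Hypothetical approximation of "tagless text": every run of whitespace
// and tags becomes one space. Not websphinx code.
final class TaglessTextSketch {
    static String taglessText(String html) {
        return html.replaceAll("(?:\\s|<[^>]*>)+", " ");
    }
}
```

For the input "Hello <b>big</b>   world", this yields "Hello big world".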
 o substringTags
 public String substringTags(int start,
                             int end)
Get HTML tags found in a region. Whitespace and text among the tags are deleted.

Parameters:
start - starting offset of region
end - ending offset of region
Returns:
tags contained in the region
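
The inverse operation, keeping only the tags, can be sketched in the same style. Again this is a simplified illustration with a hypothetical class name, and the pattern does not handle a > inside a quoted attribute value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical approximation of "tags only": text and whitespace between
// tags are dropped. Not websphinx code.
final class TagsOnlySketch {
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String tagsOnly(String html) {
        StringBuilder sb = new StringBuilder();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            sb.append(m.group()); // concatenate tags with nothing in between
        }
        return sb.toString();
    }
}
```

For the input "Hello <b>big</b> world", this yields "<b></b>".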
 o substringCanonicalTags
 public String substringCanonicalTags(int start,
                                      int end)
Get canonicalized HTML tags found in a region. A canonicalized tag looks like the following:
 <tagname#index attr=value attr=value attr=value ...>
 
 where tagname and attr are all lowercase, index is the tag's
 index in the page's tokens array.  Attributes are sorted in
 increasing order by attribute name. Attributes without values
 omit the entire "=value" portion.  Values are delimited by a 
 space.  All occurrences of <, >, space, and % characters 
 in a value are URL-encoded (e.g., space is converted to %20).  
 Thus the only occurrences of these characters in the canonical 
 tag are the tag delimiters.
 

For example, raw HTML that looks like:

 <IMG SRC="http://foo.com/map<>.gif" ISMAP>Image</IMG>
 
would be canonicalized to:
 <img ismap src=http://foo.com/map%3C%3E.gif></img>
 

Comment and declaration tags (whose tag name is !) are omitted from the canonicalization.

Parameters:
start - starting offset of region
end - ending offset of region
Returns:
canonicalized tags contained in the region
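
The canonicalization rules for a single start tag can be sketched as follows. This is an illustration under stated assumptions, not the WebSPHINX code: the class and method names are hypothetical, attributes are passed in as a map (null meaning "no value"), and the #index suffix is left out, matching the worked example above rather than the general pattern.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of canonicalizing one start tag. Not websphinx code.
final class CanonicalTagSketch {
    // URL-encode only the four characters named in the rules above.
    static String encode(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '<': sb.append("%3C"); break;
                case '>': sb.append("%3E"); break;
                case ' ': sb.append("%20"); break;
                case '%': sb.append("%25"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    static String canonicalize(String tagName, Map<String, String> attrs) {
        // TreeMap sorts attributes in increasing order by (lowercased) name.
        TreeMap<String, String> sorted = new TreeMap<>();
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            sorted.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        StringBuilder sb = new StringBuilder("<").append(tagName.toLowerCase(Locale.ROOT));
        for (Map.Entry<String, String> e : sorted.entrySet()) {
            sb.append(' ').append(e.getKey());
            if (e.getValue() != null) {
                sb.append('=').append(encode(e.getValue())); // valueless attributes omit "=value"
            }
        }
        return sb.append('>').toString();
    }
}
```

Given the tag name IMG with attributes SRC="http://foo.com/map<>.gif" and a valueless ISMAP, canonicalize returns <img ismap src=http://foo.com/map%3C%3E.gif>, which matches the start tag in the example above.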
 o main
 public static void main(String args[]) throws Exception
