Class websphinx.Page
java.lang.Object
|
+----websphinx.Region
|
+----websphinx.Page
- public class Page
- extends Region
A Web page. Although a Page can represent any MIME type, it mainly
supports HTML pages, which are automatically parsed. The parsing produces
a list of tags, a list of words, an HTML parse tree, and a list of links.
-
Page(Link)
- Make a Page by downloading and parsing a Link.
-
Page(Link, HTMLParser)
- Make a Page by downloading a Link.
-
Page(String)
- Make a Page from a string of content.
-
Page(URL, String)
- Make a Page from a URL and a string of HTML.
-
Page(URL, String, HTMLParser)
- Make a Page from a URL and a string of HTML.
-
discardContent()
- Unlock the page's content (allowing it to be garbage-collected, to
save space during a Web crawl).
-
download(HTMLParser)
-
-
getBase()
- Get the base URL, relative to which the page's links were interpreted.
-
getContent()
- Get the content of the page.
-
getContentEncoding()
- Get content encoding of page.
-
getContentType()
- Get MIME type of page.
-
getDepth()
- Get depth of page in crawl.
-
getElements()
- Get the HTML elements in the page.
-
getExpiration()
- Get expiration date of page.
-
getLastModified()
- Get last-modified date of page.
-
getLinks()
- Get the links found in the page.
-
getOrigin()
- Get the Link that points to this page.
-
getResponseCode()
- Get response code returned by the Web server.
-
getResponseMessage()
- Get response message returned by the Web server.
-
getRootElement()
- Get the root HTML element of the page.
-
getTags()
- Get the tag sequence of the page.
-
getTitle()
- Get the title of the page.
-
getTokens()
- Get the token sequence of the page.
-
getURL()
- Get the URL.
-
getWords()
- Get the words in the page.
-
hasContent()
- Test if page content is available.
-
isHTML()
- Test whether page is HTML.
-
isImage()
-
-
isParsed()
- Test whether page has been parsed.
-
keepContent()
- Lock the page's content (to prevent it from being discarded).
-
main(String[])
-
-
parse(HTMLParser)
- Parse the page.
-
setContentEncoding(String)
- Set content encoding of page.
-
setContentType(String)
- Set MIME type of page.
-
setExpiration(long)
- Set expiration date of page.
-
setLastModified(long)
- Set last-modified date of page.
-
substringCanonicalTags(int, int)
- Get canonicalized HTML tags found in a region.
-
substringContent(int, int)
- Get raw content found in a region.
-
substringHTML(int, int)
- Get HTML found in a region.
-
substringTags(int, int)
- Get HTML tags found in a region.
-
substringText(int, int)
- Get tagless text found in a region.
-
toDescription()
- Generate a human-readable description of the page.
-
toString()
- Get page containing the region.
-
toURL()
- Convert the link's URL to a String
Page
public Page(Link link) throws IOException
- Make a Page by downloading and parsing a Link.
- Parameters:
- link - Link to download
Page
public Page(Link link,
HTMLParser parser) throws IOException
- Make a Page by downloading a Link.
- Parameters:
- link - Link to download
- parser - HTML parser to use
Page
public Page(URL url,
String html)
- Make a Page from a URL and a string of HTML.
The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail.
- Parameters:
- url - URL to use as a base for relative links on the page
- html - the HTML content of the page
Page
public Page(URL url,
String html,
HTMLParser parser)
- Make a Page from a URL and a string of HTML.
The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail.
- Parameters:
- url - URL to use as a base for relative links on the page
- html - the HTML content of the page
- parser - HTML parser to use
Page
public Page(String content)
- Make a Page from a string of content. The content is not parsed.
The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail.
- Parameters:
- content - HTML content of the page
download
public void download(HTMLParser parser) throws IOException
parse
public void parse(HTMLParser parser)
- Parse the page. Assumes the page has already been downloaded.
- Parameters:
- parser - HTML parser to use
- Throws: IOException
- if an error occurs in downloading the page
isParsed
public boolean isParsed()
- Test whether page has been parsed. A page is parsed during
download only if its MIME type is HTML or unspecified.
- Returns:
- true if page was parsed, false if not
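The parsing rule above can be sketched as a small predicate. This is an illustration only, not WebSPHINX's own code: the class name ParseRule, the method shouldParse, and the decision to accept any type that starts with "text/html" are assumptions made for the example.

```java
import java.util.Locale;

class ParseRule {
    // Hypothetical helper: would a page with this MIME type be parsed
    // during download? A null type stands for "unspecified".
    static boolean shouldParse(String contentType) {
        if (contentType == null) {
            return true; // unspecified type: parsed as if it were HTML
        }
        // Assumption: parameters such as "; charset=utf-8" may follow the type.
        return contentType.trim().toLowerCase(Locale.ROOT).startsWith("text/html");
    }
}
```

With this rule, a page served as "image/gif" would be downloaded but left unparsed, so isParsed() would be expected to return false for it.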
isHTML
public boolean isHTML()
- Test whether page is HTML.
- Returns:
- true if page is HTML, false if not
isImage
public boolean isImage()
keepContent
public void keepContent()
- Lock the page's content (to prevent it from being discarded).
This method increments a lock counter, representing all the
callers interested in preserving the content. The lock
counter is set to 1 when the page is initially downloaded.
discardContent
public void discardContent()
- Unlock the page's content (allowing it to be garbage-collected, to
save space during a Web crawl). This method decrements a lock counter.
If the counter falls to 0 (meaning no callers are interested in the
content), the content is released. At least the following
fields are discarded: content, tokens, tags, words, elements, and
root. After the content has been discarded, calling getContent()
(or getTokens(), getTags(), etc.) forces the page to be downloaded
again, ideally from the cache.
Links are not considered part of the content, and are not subject to
discarding by this method. Also, if the page was created from a string
(rather than by downloading), its content is not subject to discarding
(since there would be no way to recover it).
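The lock counter shared by keepContent(), discardContent(), and hasContent() can be modeled in a few lines. The class below is a minimal sketch, assuming a plain integer counter; the name ContentLock is hypothetical, and the sketch leaves out the re-download step and the special case for pages created from a string.

```java
class ContentLock {
    private String content; // null once the content has been released
    private int locks;      // number of callers interested in the content

    ContentLock(String content) {
        this.content = content;
        this.locks = 1;     // the counter starts at 1 after the initial download
    }

    // Register one more caller that wants the content preserved.
    void keepContent() {
        locks++;
    }

    // Drop one caller; release the content when nobody is left.
    void discardContent() {
        if (locks > 0 && --locks == 0) {
            content = null; // now eligible for garbage collection
        }
    }

    boolean hasContent() {
        return content != null;
    }
}
```

Each keepContent() call should be paired with exactly one discardContent() call; the content survives until the last interested caller has discarded it.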
hasContent
public final boolean hasContent()
- Test if page content is available.
- Returns:
- true if content is downloaded and available, false if content has not been downloaded
or has been discarded.
getDepth
public int getDepth()
- Get depth of page in crawl.
- Returns:
- depth of page from root (depth of page is same as depth of its originating link)
getOrigin
public Link getOrigin()
- Get the Link that points to this page.
- Returns:
- the Link object that was used to download this page.
getBase
public URL getBase()
- Get the base URL, relative to which the page's links were interpreted.
The base URL defaults to the URL of the
Link that was used to download the page. If any redirects occur
while downloading the page, the final location becomes the new base
URL. Lastly, if a <BASE> element is found in the page, that
becomes the new base URL.
- Returns:
- the page's base URL.
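The order of precedence described above (link URL, then redirect target, then a base element in the page) can be sketched with java.net.URI. The class BaseUrl and its method effectiveBase are hypothetical names used only for this illustration, and the example URLs are made up.

```java
import java.net.URI;

class BaseUrl {
    // linkUrl:      URL of the Link used to download the page (never null)
    // redirectedTo: final location after redirects, or null if none occurred
    // baseHref:     href of a base element found in the page, or null
    static URI effectiveBase(URI linkUrl, URI redirectedTo, String baseHref) {
        URI base = linkUrl;
        if (redirectedTo != null) {
            base = redirectedTo;           // a redirect replaces the link URL
        }
        if (baseHref != null) {
            base = base.resolve(baseHref); // a base element overrides both
        }
        return base;
    }
}
```

Relative links on the page are then resolved against the result, for example effectiveBase(link, redirect, null).resolve("img/a.gif").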
getURL
public URL getURL()
- Get the URL.
- Returns:
- the URL of the link that was used to download this page
getTitle
public String getTitle()
- Get the title of the page.
- Returns:
- the page's title, or null if the page hasn't been parsed.
getContent
public String getContent()
- Get the content of the page.
- Returns:
- the content of the page as a String, or null if the page hasn't been downloaded.
getTokens
public Region[] getTokens()
- Get the token sequence of the page. Tokens are tags and whitespace-delimited text.
- Returns:
- token regions in the page, or null if the page hasn't been downloaded or parsed.
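To make the token definition concrete, the sketch below splits a string of HTML into tags and whitespace-delimited text using a regular expression. It is only an approximation for illustration: the class SimpleTokenizer is hypothetical, it returns plain strings rather than Region objects, and it ignores comments, scripts, and quoted ">" characters that a real HTML parser must handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleTokenizer {
    // A token is either a whole tag (<...>) or a run of characters
    // containing neither whitespace nor '<'.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^<\\s]+");

    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

Applied to "<p>Hello <b>big</b> world</p>", the sketch yields seven tokens: four tags and the three words between them.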
getTags
public Tag[] getTags()
- Get the tag sequence of the page.
- Returns:
- tags in the page, or null if the page hasn't been downloaded or parsed.
getWords
public Text[] getWords()
- Get the words in the page. Words are whitespace- and tag-delimited text.
- Returns:
- words in the page, or null if the page hasn't been downloaded or parsed.
getElements
public Element[] getElements()
- Get the HTML elements in the page. All elements in the page
are included in the list, in the order they would appear in
an inorder traversal of the HTML parse tree.
- Returns:
- HTML elements in the page, in inorder traversal order, or null if the page
hasn't been downloaded or parsed.
getRootElement
public Element getRootElement()
- Get the root HTML element of the page.
- Returns:
- first top-level HTML element in the page, or null
if the page hasn't been downloaded or parsed.
- Overrides:
- getRootElement in class Region
getLinks
public Link[] getLinks()
- Get the links found in the page.
- Returns:
- links in the page, or null
if the page hasn't been downloaded or parsed.
toURL
public String toURL()
- Convert the link's URL to a String
- Returns:
- the URL represented as a string
toDescription
public String toDescription()
- Generate a human-readable description of the page.
- Returns:
- a description of the page, in the form "title [url]".
toString
public String toString()
- Get page containing the region.
- Returns:
- page containing the region
- Overrides:
- toString in class Region
getLastModified
public long getLastModified()
- Get last-modified date of page.
- Returns:
- the date when the page was last modified, or 0 if not known.
The value is the number of seconds since January 1, 1970 GMT.
setLastModified
public void setLastModified(long last)
- Set last-modified date of page.
- Parameters:
- last - the date when the page was last modified, or 0 if not known.
The value is the number of seconds since January 1, 1970 GMT.
getExpiration
public long getExpiration()
- Get expiration date of page.
- Returns:
- the expiration date of the page, or 0 if not known.
The value is the number of seconds since January 1, 1970 GMT.
setExpiration
public void setExpiration(long expire)
- Set expiration date of page.
- Parameters:
- expire - the expiration date of the page, or 0 if not known.
The value is the number of seconds since January 1, 1970 GMT.
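The last-modified and expiration values are raw long timestamps in which 0 means "not known". A minimal sketch of how a caller might interpret them follows, taking the unit (seconds) from the descriptions above; note that java.util.Date counts milliseconds, so the unit is worth verifying against the actual library. The class PageTimes and its methods are hypothetical helpers, not part of the Page class.

```java
import java.time.Instant;
import java.util.Optional;

class PageTimes {
    // Converts a timestamp as documented above into an Instant.
    // 0 is the documented "not known" marker and maps to an empty Optional.
    static Optional<Instant> toInstant(long secondsSinceEpoch) {
        if (secondsSinceEpoch == 0) {
            return Optional.empty();
        }
        return Optional.of(Instant.ofEpochSecond(secondsSinceEpoch));
    }

    // A page with an unknown expiration date is treated as not expired.
    static boolean isExpired(long expirationSeconds, Instant now) {
        return toInstant(expirationSeconds)
                .map(now::isAfter)
                .orElse(false);
    }
}
```

A caller could pass the result of getExpiration() together with Instant.now() to decide whether a cached page should be fetched again.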
getContentType
public String getContentType()
- Get MIME type of page.
- Returns:
- the MIME type of page, such as "text/html", or null if not known.
setContentType
public void setContentType(String type)
- Set MIME type of page.
- Parameters:
- type - the MIME type of page, such as "text/html", or null if not known.
getContentEncoding
public String getContentEncoding()
- Get content encoding of page.
- Returns:
- the encoding type of page, such as "base-64", or null if not known.
setContentEncoding
public void setContentEncoding(String encoding)
- Set content encoding of page.
- Parameters:
- encoding - the encoding type of page, such as "base-64", or null if not known.
getResponseCode
public int getResponseCode()
- Get response code returned by the Web server. For list of
possible values, see java.net.HttpURLConnection.
- Returns:
- response code, such as 200 (for OK) or 404 (not found).
Code is -1 if unknown.
- See Also:
- HttpURLConnection
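Because the response code uses the same values as java.net.HttpURLConnection, its constants can be compared directly against the result of getResponseCode(). The sketch below shows one way to classify a code; the class ResponseCodes, the method describe, and the category labels are assumptions made for this example.

```java
import java.net.HttpURLConnection;

class ResponseCodes {
    // Maps a response code to a coarse, human-readable category.
    static String describe(int code) {
        if (code == -1) {
            return "unknown"; // documented marker for an unknown code
        }
        if (code == HttpURLConnection.HTTP_OK) {        // 200
            return "ok";
        }
        if (code == HttpURLConnection.HTTP_NOT_FOUND) { // 404
            return "not found";
        }
        if (code >= 300 && code < 400) {
            return "redirect";
        }
        if (code >= 400) {
            return "error";
        }
        return "other";
    }
}
```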
getResponseMessage
public String getResponseMessage()
- Get response message returned by the Web server.
- Returns:
- response message, such as "OK" or "Not Found". The response message is null if the page could not be fetched or the message is not known.
substringContent
public String substringContent(int start,
int end)
- Get raw content found in a region.
- Parameters:
- start - starting offset of region
- end - ending offset of region
- Returns:
- raw HTML contained in the region
substringHTML
public String substringHTML(int start,
int end)
- Get HTML found in a region.
- Parameters:
- start - starting offset of region
- end - ending offset of region
- Returns:
- representation of region as HTML
substringText
public String substringText(int start,
int end)
- Get tagless text found in a region.
Runs of whitespace and tags are reduced to a single space character.
- Parameters:
- start - starting offset of region
- end - ending offset of region
- Returns:
- tagless text contained in the region
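The whitespace rule above can be approximated with a single regular expression that treats any mix of whitespace and tags as one separator. This is a sketch under simplifying assumptions, not the library's implementation: the class TaglessText is hypothetical, it works on a whole string instead of a region of a page, and it does not decode character entities or trim the result.

```java
class TaglessText {
    // Replaces every run of whitespace and/or tags with a single space.
    static String strip(String html) {
        return html.replaceAll("(?:\\s|<[^>]*>)+", " ");
    }
}
```

For the input "Hello <b>big</b>\n  world", the sketch produces "Hello big world". A tag at the very start or end of the input leaves a single leading or trailing space.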
substringTags
public String substringTags(int start,
int end)
- Get HTML tags found in a region. Whitespace and text among the
tags are deleted.
- Parameters:
- start - starting offset of region
- end - ending offset of region
- Returns:
- tags contained in the region
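The complementary operation, keeping only the tags, can be sketched the same way. As before, this is an illustrative approximation with a hypothetical class name (TagsOnly) that operates on a plain string and uses a simple pattern for tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagsOnly {
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Concatenates the tags in document order, dropping text and whitespace.
    static String keepTags(String html) {
        StringBuilder sb = new StringBuilder();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }
}
```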
substringCanonicalTags
public String substringCanonicalTags(int start,
int end)
- Get canonicalized HTML tags found in a region.
A canonicalized tag looks like the following:
<tagname#index attr=value attr=value attr=value ...>
where tagname and attr are all lowercase, and index is the tag's
index in the page's tokens array. Attributes are sorted in
increasing order by attribute name. Attributes without values
omit the entire "=value" portion. Values are delimited by a
space. All occurrences of <, >, space, and % characters
in a value are URL-encoded (e.g., space is converted to %20).
Thus the only occurrences of these characters in the canonical
tag are the tag delimiters.
For example, raw HTML that looks like:
<IMG SRC="http://foo.com/map<>.gif" ISMAP>Image</IMG>
would be canonicalized to:
<img ismap src=http://foo.com/map%3C%3E.gif></img>
Comment and declaration tags (whose tag name is !) are omitted
from the canonicalization.
- Parameters:
- start - starting offset of region
- end - ending offset of region
- Returns:
- canonicalized tags contained in the region
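The canonicalization rules can be sketched for a single tag as follows. This is an illustration of the stated format, not the library's code: the class CanonicalTag, its methods, and the use of a Map for attributes (with a null value meaning "no value") are assumptions. The sketch emits the #index part from the format line, which the worked example above leaves out.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class CanonicalTag {
    // URL-encodes only the four characters named in the rules:
    // '<', '>', space, and '%'.
    static String encodeValue(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '<': sb.append("%3C"); break;
                case '>': sb.append("%3E"); break;
                case ' ': sb.append("%20"); break;
                case '%': sb.append("%25"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Builds "<tagname#index attr=value ...>" with lowercase names and
    // attributes sorted by name. A null attribute value omits "=value".
    static String canonicalize(String name, int index, Map<String, String> attrs) {
        TreeMap<String, String> sorted = new TreeMap<>();
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            sorted.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        StringBuilder sb = new StringBuilder("<")
                .append(name.toLowerCase(Locale.ROOT))
                .append('#')
                .append(index);
        for (Map.Entry<String, String> e : sorted.entrySet()) {
            sb.append(' ').append(e.getKey());
            if (e.getValue() != null) {
                sb.append('=').append(encodeValue(e.getValue()));
            }
        }
        return sb.append('>').toString();
    }
}
```

Because '%' is itself encoded, an encoded value never contains a raw '<', '>', or space, which is what lets a consumer split the canonical form on those characters safely.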
main
public static void main(String args[]) throws Exception