- HTMLTag
- TagConfig
  - TagCanOnlyHaveConfig
  - TagCannotHaveConfig
- exceptions.Exception(exceptions.BaseException)
  - HTMLTagError
    - HTMLNotAllowedError
    - HTMLTagAttrLookupError(HTMLTagError, exceptions.LookupError)
    - HTMLTagIncompleteError
    - HTMLTagProcessingInstructionError
    - HTMLTagUnbalancedError
- sgmllib.SGMLParser(markupbase.ParserBase)
  - HTMLReader
class HTMLReader(sgmllib.SGMLParser)
NOTES
* Special attention is required for tags like <p> and <li>, which are
sometimes closed and sometimes not. HTMLReader can deal with both situations
(closed and not) provided that:
  * the file doesn't change conventions for a given tag
  * the reader knows ahead of time what to expect
By default, HTMLReader assumes that <p> and <li> will be closed with </p> and
</li>, as the official HTML spec encourages and the upcoming XHTML requires.
But if your files don't close certain tags that are supposed to be closed,
you can do this:
    HTMLReader(extraEmptyTags=['p', 'li'])
or:
    reader.extendEmptyTags(['p', 'li'])
Or just set the whole list yourself:
    HTMLReader(emptyTags=['br', 'hr', 'p'])
    reader.setEmptyTags(['br', 'hr', 'p'])
Keep in mind that there are quite a few empty tags, though. The
DefaultEmptyTags global list (which is used to initialize the reader's tags)
contains about 16 tag names.
If an HTML file doesn't conform to the reader's expectation, you will get an
exception (see more below for details).
If your HTML file doesn't contain root <html> ... </html> tags wrapping
everything, a fake root tag will be constructed for you, unless you pass
in fakeRootTagIfNeeded=False.
Besides fixing your reader manually, you could conceivably loop through the
permutations of the various empty tags to see if one of them resulted in a
correct read.
Or you could fix the HTML.
* The reader ignores extra preceding and trailing whitespace by stripping it
from strings. I suppose this is a little harsher than reducing spans of
preceding and trailing whitespace down to one space, which is what really
happens in an HTML browser.
* The reader will not read past the closing </html> tag.
* The reader is picky about the correctness of the HTML you feed it. If tags
are not closed, overlap (instead of nest) or are left unfinished, an exception
is raised. These include HTMLTagUnbalancedError, HTMLTagIncompleteError and
HTMLNotAllowedError, which all inherit from HTMLTagError.
This pickiness can be quite useful for the validation of the HTML of your
own applications.
I believe it is possible that other kinds of HTML errors could raise
exceptions from sgmllib.SGMLParser (from which HTMLReader inherits),
although in practice, I have not seen them.
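The idea of looping through the empty-tag possibilities could be sketched
roughly as follows. This is only an illustration: find_working_empty_tags,
make_reader and FakeReader are hypothetical names, not part of this module, and
the sketch tries combinations (subsets) of candidate tags rather than
permutations, since the order of the empty tags should not matter. With the
real reader you would pass something like
lambda extra: HTMLReader(extraEmptyTags=extra).

```python
from itertools import combinations


def find_working_empty_tags(make_reader, html, candidates=('p', 'li')):
    """Try each subset of candidate tags as extra empty tags.

    make_reader is a hypothetical factory that takes a list of extra
    empty tags and returns an object with a readString() method.
    Returns (extra_tags, root_tag) for the first subset that reads
    without an exception, or None if every subset fails.
    """
    for size in range(len(candidates) + 1):
        for extra in combinations(candidates, size):
            reader = make_reader(list(extra))
            try:
                root = reader.readString(html)
            except Exception:
                # With the real module, catching HTMLTagError only
                # would probably be the more precise choice.
                continue
            return list(extra), root
    return None


class FakeReader(object):
    """Stand-in so the sketch runs without the real HTMLReader.

    It only accepts input when 'p' is among the extra empty tags.
    """

    def __init__(self, extraEmptyTags):
        self.extra = extraEmptyTags

    def readString(self, html):
        if 'p' not in self.extra:
            raise ValueError('unbalanced <p>')
        return 'root'


result = find_working_empty_tags(FakeReader, '<html><p>one<p>two</html>')
```

Smaller subsets are tried first, so the result is the least intrusive setting
that reads cleanly under these assumptions.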
TO DO
* Could the "empty" tag issue be dealt with in a more sophisticated way by
automatically closing <p> and <li> (e.g., popping them off the _tagStack) when
other major tags are encountered, such as <p>, <li>, <table>, <center>, etc.?
* Readers don't handle processing instructions: <? foobar ?>.
* The tagContainmentConfig class var can certainly be expanded for even better
validation.
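One possible reading of the auto-closing idea in the TO DO list is sketched
here. This is not how HTMLReader currently behaves. The names
IMPLICITLY_CLOSED, CLOSERS and push_start_tag are made up for illustration, the
tag stack is modeled as a plain list of tag names, and the set of "major" tags
is taken from the examples in the TO DO text.

```python
# Tags that may be left unclosed, and the start tags assumed to close them.
# Both tuples are illustrative assumptions, not values from the module.
IMPLICITLY_CLOSED = ('p', 'li')
CLOSERS = ('p', 'li', 'table', 'center')


def push_start_tag(tag_stack, name):
    """Pop any open <p>/<li> at the top of the stack, then push name.

    Returns the list of tag names that were closed implicitly.
    """
    closed = []
    if name in CLOSERS:
        while tag_stack and tag_stack[-1] in IMPLICITLY_CLOSED:
            closed.append(tag_stack.pop())
    tag_stack.append(name)
    return closed


stack = ['html', 'body', 'p']
closed = push_start_tag(stack, 'p')
```

A real implementation would have to decide case by case which tags close which
(a <p> inside an <li> is legal HTML, for instance), so treat the table above as
a starting point only.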
- Method resolution order:
- HTMLReader
- sgmllib.SGMLParser
- markupbase.ParserBase
Methods defined here:
- __init__(self, emptyTags=None, extraEmptyTags=None, fakeRootTagIfNeeded=True)
- close(self)
- computeTagContainmentConfig(self)
- emptyTags(self)
- Returns a list of empty tags. See also: class docs and setEmptyTags().
- extendEmptyTags(self, tagList)
- Extends the current list of empty tags with the given list.
- filename(self)
- Returns the filename that was read, or None if no file was processed.
- finish_endtag = unknown_endtag(self, name)
- finish_starttag = unknown_starttag(self, name, attrs)
- handle_data(self, data)
- handle_pi(self, data)
- main(self, args=None)
- The command line equivalent of readFileNamed().
Invoked when HTMLTag is run as a program.
- pprint(self, out=None)
- Pretty prints the root tag, its attributes and all its children.
Indentation is used for subtags.
Prints 'Empty.' if there is no root tag.
- printsStack(self)
- readFileNamed(self, filename, retainRootTag=True)
- Reads the given file. Relies on readString(). See that method for more
information.
- readString(self, string, retainRootTag=True)
- Reads the given string, storing the results and returning the root tag. You
can continue to use the HTMLReader object, or disregard it and simply use the
root tag.
- rootTag(self)
- Returns the root tag. May return None if no HTML has been read yet, or if the
last invocation of one of the read methods was passed retainRootTag=False.
- setEmptyTags(self, tagList)
- Sets the HTML tags that are considered empty such as <br> and <hr>.
The default is found in the global, DefaultEmptyTags, and is fairly thorough,
but does not include <p>, <li> and some other tags that HTML authors often use
as empty tags.
- setPrintsStack(self, flag)
- Sets the boolean value of the "prints stack" option. This is a debugging
option which will print the internal tag stack during HTML processing. The
default value is False.
- unknown_endtag(self, name)
- unknown_starttag(self, name, attrs)
- usage(self)
Data and other attributes defined here:
- tagContainmentConfig = {
    'body': 'cannotHave html head body',
    'head': 'cannotHave html head body',
    'html': 'canOnlyHave head body',
    'select': 'canOnlyHave option',
    'table': 'canOnlyHave tr thead tbody tfoot a',
    'td': 'cannotHave td tr',
    'tr': 'canOnlyHave th td',
  }
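The containment strings appear to follow a "rule name, then tag names" layout.
A minimal sketch of how such strings could be interpreted is given below. The
functions parse_containment and is_allowed are hypothetical and are not the
module's computeTagContainmentConfig() or TagConfig classes, whose actual
behavior is not shown here. In particular, the sketch assumes the rules apply
to direct children only.

```python
def parse_containment(config):
    """Split 'canOnlyHave a b' style strings into (rule, set of tag names)."""
    parsed = {}
    for tag, spec in config.items():
        words = spec.split()
        parsed[tag] = (words[0], set(words[1:]))
    return parsed


def is_allowed(parsed, parent, child):
    """Return True if child may appear directly inside parent."""
    if parent not in parsed:
        return True  # no rule recorded for this parent
    rule, names = parsed[parent]
    if rule == 'canOnlyHave':
        return child in names
    if rule == 'cannotHave':
        return child not in names
    raise ValueError('unknown rule: %r' % rule)


# Two entries copied from tagContainmentConfig for the demonstration.
config = {
    'html': 'canOnlyHave head body',
    'td': 'cannotHave td tr',
}
rules = parse_containment(config)
```

Under these assumptions, is_allowed(rules, 'html', 'body') is true while
is_allowed(rules, 'td', 'tr') is false, and parents without an entry accept
any child.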
Methods inherited from sgmllib.SGMLParser:
- convert_charref(self, name)
- Convert character reference, may be overridden.
- convert_codepoint(self, codepoint)
- convert_entityref(self, name)
- Convert entity references.
As an alternative to overriding this method, one can tailor the
results by setting up the self.entitydefs mapping appropriately.
- error(self, message)
- feed(self, data)
- Feed some data to the parser.
Call this as often as you want, with as little or as much text
as you want (may include '\n'). (This just saves the text,
all the processing is done by goahead().)
- finish_shorttag(self, tag, data)
- # Internal -- finish parsing of <tag/data/ (same as <tag>data</tag>)
- get_starttag_text(self)
- goahead(self, end)
- # Internal -- handle data as far as reasonable. May leave state
# and data to be processed by a subsequent call. If 'end' is
# true, force handling all data as if followed by EOF marker.
- handle_charref(self, name)
- Handle character reference, no need to override.
- handle_comment(self, data)
- # Example -- handle comment, could be overridden
- handle_decl(self, decl)
- # Example -- handle declaration, could be overridden
- handle_endtag(self, tag, method)
- # Overridable -- handle end tag
- handle_entityref(self, name)
- Handle entity references, no need to override.
- handle_starttag(self, tag, method, attrs)
- # Overridable -- handle start tag
- parse_endtag(self, i)
- # Internal -- parse endtag
- parse_pi(self, i)
- # Internal -- parse processing instr, return length or -1 if not terminated
- parse_starttag(self, i)
- # Internal -- handle starttag, return length or -1 if not terminated
- report_unbalanced(self, tag)
- # Example -- report an unbalanced </...> tag.
- reset(self)
- Reset this instance. Loses all unprocessed data.
- setliteral(self, *args)
- Enter literal mode (CDATA).
Intended for derived classes only.
- setnomoretags(self)
- Enter literal mode (CDATA) till EOF.
Intended for derived classes only.
- unknown_charref(self, ref)
- unknown_entityref(self, ref)
Data and other attributes inherited from sgmllib.SGMLParser:
- entity_or_charref = <_sre.SRE_Pattern object>
- entitydefs = {'amp': '&', 'apos': "'", 'gt': '>', 'lt': '<', 'quot': '"'}
Methods inherited from markupbase.ParserBase:
- getpos(self)
- Return current line number and offset.
- parse_comment(self, i, report=1)
- # Internal -- parse comment, return length or -1 if not terminated
- parse_declaration(self, i)
- # Internal -- parse declaration (for use by subclasses).
- parse_marked_section(self, i, report=1)
- # Internal -- parse a marked section
# Override this to handle MS-word extension syntax <![if word]>content<![endif]>
- unknown_decl(self, data)
- # To be overridden -- handlers for unknown objects
- updatepos(self, i, j)
- # Internal -- update line number and offset. This should be
# called for each piece of data exactly once, in order -- in other
# words the concatenation of all the input strings to this
# function should be exactly the entire input.
class HTMLTag
Tags essentially have 4 major attributes:
* name
* attributes
* children
* subtags
Name is simple:
print tag.name()
Attributes are dictionary-like in nature:
print tag.attr('color') # raises an exception if no color
print tag.attr('bgcolor', None) # returns None if no bgcolor
print tag.attrs()
Children are all the leaf parts of a tag, consisting of other tags and strings
of character data.
print tag.numChildren()
print tag.childAt(0)
print tag.children()
Subtags is a convenient list of only the tags in the children:
print tag.numSubtags()
print tag.subtagAt(0)
print tag.subtags()
You can search a tag and all the tags it contains for a tag with a particular
attribute matching a particular value:
print tag.tagWithMatchingAttr('width', '100%')
An HTMLTagAttrLookupError is raised if no matching tag is found. You can avoid
this by providing a default value:
print tag.tagWithMatchingAttr('width', '100%', None)
Looking for specific 'id' attributes is common in regression testing (it allows
you to zero in on logical portions of a page), so a convenience method is
provided:
tag = htmlTag.tagWithId('accountTable')
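The search behavior described above can be illustrated with a small stand-in
class. MiniTag and this NoDefault sentinel are hypothetical and only mimic the
documented interface. The real HTMLTag raises HTMLTagAttrLookupError, for which
the sketch substitutes the built-in LookupError, and whether the real search
checks the receiver before its subtags is an assumption here.

```python
class NoDefault(object):
    """Sentinel meaning 'no default value was supplied'."""


class MiniTag(object):
    """Stand-in for HTMLTag with just enough state for the search."""

    def __init__(self, name, attrs=None, children=None):
        self.name = name
        self.attrs = attrs or {}
        self.children = children or []  # MiniTag instances and strings

    def tagWithMatchingAttr(self, name, value, default=NoDefault):
        found = self._search(name, value)
        if found is not None:
            return found
        if default is NoDefault:
            raise LookupError('no tag with %s=%r' % (name, value))
        return default

    def tagWithId(self, id, default=NoDefault):
        # Convenience cover, mirroring the documented method.
        return self.tagWithMatchingAttr('id', id, default)

    def _search(self, name, value):
        # Depth-first: check this tag, then each child tag in order.
        if name in self.attrs and self.attrs[name] == value:
            return self
        for child in self.children:
            if isinstance(child, MiniTag):
                found = child._search(name, value)
                if found is not None:
                    return found
        return None


cell = MiniTag('td', {'id': 'accountTable'}, ['bar'])
page = MiniTag('html', children=[MiniTag('body', children=['text', cell])])
```

Passing an explicit default such as None turns the lookup failure into an
ordinary return value, which is often more convenient in test code.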
TO DO
* A walker() method for traversing the tag tree.
* Search for a subtag with a given name, recursive or not.
* Attribute traversal with dotted notation?
* Do we need to convert tag names and attribute names to lower case, or does
SGMLParser already do that?
* Should attribute values be strip()ed?
Probably not. SGMLParser probably strips them already unless they really do
have spaces as in " quoted ". But that's speculation.
Methods defined here:
- __init__(self, name, lineNumber=None)
- __repr__(self)
- addChild(self, child)
- Adds a child to the receiver. The child will be another tag or a string
(CDATA).
- attr(self, name, default=<class MiscUtils.NoDefault>)
- attrs(self)
- childAt(self, index)
- children(self)
- closedBy(self, name, lineNumber)
- hasAttr(self, name)
- name(self)
- numAttrs(self)
- numChildren(self)
- numSubtags(self)
- pprint(self, out=None, indent=0)
- readAttr(self, name, value)
- Sets an attribute of the tag with the given name and value. An assertion
fails if an attribute is set twice.
- subtagAt(self, index)
- subtags(self)
- tagWithId(self, id, default=<class MiscUtils.NoDefault>)
- Finds and returns the tag with the given id. As in:
<td id=foo> bar </td>
This is just a cover for:
tagWithMatchingAttr('id', id, default)
But searching for ids is so popular (at least in regression testing web
sites) that this convenience method is provided.
Why is it so popular? Because by attaching ids to logical portions of your
HTML, your regression test suite can quickly zero in on them for examination.
- tagWithMatchingAttr(self, name, value, default=<class MiscUtils.NoDefault>)
- Performs a depth-first search for a tag with an attribute that matches the
given value. If the tag cannot be found, an HTMLTagAttrLookupError (a
LookupError subclass) will be raised *unless* a default value was specified,
which is then returned.
tag = tag.tagWithMatchingAttr('bgcolor', '#FFFF', None)