tika-dev mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject Re: Question about XPath Matcher code & MatchingContentHandler
Date Tue, 04 Sep 2012 17:12:39 GMT

On Mon, Sep 3, 2012 at 7:50 PM, Ken Krugler <kkrugler_lists@transpac.com> wrote:
> No, the html _does_ match, which it needs to as it descends the DOM hierarchy.

Note that we're dealing with SAX events instead of DOM hierarchies
here. So the startElement() method does not (and is not meant to)
traverse the underlying subtree, but rather just decides whether to
pass on or filter out that specific SAX event.

The MatchingContentHandler class is essentially a state machine that
switches state by calling the descend() method of the current Matcher
object to get another Matcher object appropriate for matching or
filtering SAX events that occur at that position of the event stream.
In the endElement() method the stack of Matcher objects is rewound to
maintain the correct matching state at each level of the tree.
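
To make the state machine concrete, here is a minimal sketch of the idea
using only the JDK's SAX classes. It is not the actual Tika source: the
Matcher interface below is a simplified, hypothetical stand-in (no
namespaces, attributes or text matching), and the names child(), SUBTREE,
DESCENDANTS, FilteringHandler and filter() are made up for illustration.
Matched events are simply recorded in a list instead of being passed on
to a downstream handler.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class MatcherStackSketch {

    /** Hypothetical, simplified stand-in for a matcher (elements only). */
    interface Matcher {
        Matcher descend(String name);  // matcher for a child element with this name
        boolean matchesElement();      // pass on the element at this position?
    }

    /** Dead end: nothing at or below this position matches. */
    static final Matcher FAIL = new Matcher() {
        public Matcher descend(String name) { return this; }
        public boolean matchesElement() { return false; }
    };

    /** Matches the element at this position and everything below it. */
    static final Matcher SUBTREE = new Matcher() {
        public Matcher descend(String name) { return this; }
        public boolean matchesElement() { return true; }
    };

    /** Roughly "descendant::node()": not this element, but all elements below it. */
    static final Matcher DESCENDANTS = new Matcher() {
        public Matcher descend(String name) { return SUBTREE; }
        public boolean matchesElement() { return false; }
    };

    /** Roughly "/name...": matches nothing itself, continues with 'then' below 'expected'. */
    static Matcher child(String expected, Matcher then) {
        return new Matcher() {
            public Matcher descend(String name) { return expected.equals(name) ? then : FAIL; }
            public boolean matchesElement() { return false; }
        };
    }

    /** Filters SAX events; matched events are recorded rather than forwarded. */
    static class FilteringHandler extends DefaultHandler {
        private final Deque<Matcher> stack = new ArrayDeque<>();
        private Matcher current;
        final List<String> passed = new ArrayList<>();

        FilteringHandler(Matcher root) { current = root; }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            stack.push(current);               // remember the parent's state
            current = current.descend(qName);  // switch state for this element
            if (current.matchesElement()) passed.add("start:" + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current.matchesElement()) passed.add("end:" + qName);
            current = stack.pop();             // rewind to the parent's state
        }
    }

    /** Parses the XML and returns the events that survive /html/body/descendant::node(). */
    static List<String> filter(String xml) throws Exception {
        FilteringHandler handler = new FilteringHandler(child("html", child("body", DESCENDANTS)));
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.passed;
    }

    public static void main(String[] args) throws Exception {
        // prints [start:p, end:p]
        System.out.println(filter("<html><head/><body><p>text</p></body></html>"));
    }
}
```

Each startElement() call looks only at the current matcher and the element
name. The <html> and <body> elements are not passed on, but the matcher
returned by descend() still carries the state needed to recognize the
elements that follow inside <body>.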

> Note the pattern used by BodyContentHandler is:
>     private static final Matcher MATCHER =
>         PARSER.parse("/xhtml:html/xhtml:body/descendant::node()");
> The problem I'm seeing (details in my previous email) is that once the /xhtml:html portion
> of the path has been matched, the code decides that it doesn't have a match, and if there
> are no attributes then it bails out.

As explained above, that startElement() call simply decides whether
that specific <html> start element should be passed on or filtered out
from the event stream. Since the pattern only matches elements inside
the <body> element, it correctly infers that the <html> element should
be filtered out.

A simple example sequence of SAX events would be processed like this:

    startElement("html"); // no match, ignore
    startElement("head"); // no match, ignore
    endElement("head"); // no match, ignore
    startElement("body"); // no match, ignore
    startElement("p"); // match, call super.startElement("p")
    endElement("p"); // match, call super.endElement("p")
    endElement("body"); // no match, ignore
    endElement("html"); // no match, ignore


Jukka Zitting
