tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Question about XPath Matcher code & MatchingContentHandler
Date Thu, 30 Aug 2012 17:35:11 GMT
Hi Jukka,

I was looking into a failure in a Bixo test, when using BodyContentHandler (wrapped by XHTMLContentHandler).

The issue is that BodyContentHandler uses MatchingContentHandler to find only text in nodes
under the /html/body hierarchy.

And this in turn winds up not matching the <html> element.

Looking at the MatchingContentHandler's startElement() method (code below), the issue I see
is that the initial matcher.descend() works, in that it matches the html element, but the
Matcher it returns is a NamedElementMatcher. This in turn always returns false from its matchesElement()
method, so you only wind up actually descending if the html element has attributes - which
mine doesn't, in this case.
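
To make the behavior described above concrete, here is a toy model in plain Java. The classes `ToyMatcher`, `ToyNamedElementMatcher`, and `ToyDescendantMatcher` are simplified stand-ins invented for this illustration, not Tika's actual source. They assume only what is described in this message: descend() on a named step hands back the next matcher in the chain, and that matcher's matchesElement() is false.

```java
public class MatcherSketch {

    /** Hypothetical base matcher: matches nothing by default. */
    static class ToyMatcher {
        static final ToyMatcher FAIL = new ToyMatcher();

        ToyMatcher descend(String name) { return FAIL; }
        boolean matchesElement() { return false; }
    }

    /** One named step: on a name match, hands over to the next matcher. */
    static class ToyNamedElementMatcher extends ToyMatcher {
        private final String name;
        private final ToyMatcher then;

        ToyNamedElementMatcher(String name, ToyMatcher then) {
            this.name = name;
            this.then = then;
        }

        @Override
        ToyMatcher descend(String name) {
            return this.name.equals(name) ? then : FAIL;
        }
        // matchesElement() is inherited, so it always returns false here.
    }

    /** Final step: everything below the current element matches. */
    static class ToyDescendantMatcher extends ToyMatcher {
        private static final ToyMatcher ANY = new ToyMatcher() {
            @Override ToyMatcher descend(String name) { return this; }
            @Override boolean matchesElement() { return true; }
        };

        @Override
        ToyMatcher descend(String name) { return ANY; }
    }

    /** Chain for "html, then body, then any descendant" in this toy model. */
    static ToyMatcher htmlBodyMatcher() {
        return new ToyNamedElementMatcher("html",
                new ToyNamedElementMatcher("body", new ToyDescendantMatcher()));
    }

    public static void main(String[] args) {
        ToyMatcher afterHtml = htmlBodyMatcher().descend("html");
        ToyMatcher afterBody = afterHtml.descend("body");
        ToyMatcher afterP = afterBody.descend("p");

        System.out.println("html: " + afterHtml.matchesElement()); // false
        System.out.println("body: " + afterBody.matchesElement()); // false
        System.out.println("p:    " + afterP.matchesElement());    // true
    }
}
```

In this model the html step is followed correctly, but the matcher returned for it never reports an element match, so an attribute-free html element would not be emitted by a handler that checks matchesElement() on the returned matcher.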

It seems like NamedElementMatcher needs to set state (matched or not) when descend() is called,
that state needs to be returned by its matchesElement(), and the initial matcher should be
queried for matchesElement, not the matcher you descend into.
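
One possible reading of that proposal, sketched as a self-contained toy in Java. `ToyMatcher`, `StatefulNamedMatcher`, and `shouldEmit` are hypothetical names made up for this sketch and do not exist in Tika. The sketch only shows the idea of recording the match result in descend() and asking the matcher that was descended from.

```java
public class StatefulMatcherSketch {

    /** Hypothetical base matcher: matches nothing by default. */
    static class ToyMatcher {
        static final ToyMatcher FAIL = new ToyMatcher();

        ToyMatcher descend(String name) { return FAIL; }
        boolean matchesElement() { return false; }
    }

    /** Named step that remembers whether its last descend() matched. */
    static class StatefulNamedMatcher extends ToyMatcher {
        private final String name;
        private final ToyMatcher then;
        private boolean matched; // set by descend(), read by matchesElement()

        StatefulNamedMatcher(String name, ToyMatcher then) {
            this.name = name;
            this.then = then;
        }

        @Override
        ToyMatcher descend(String name) {
            matched = this.name.equals(name);
            return matched ? then : FAIL;
        }

        @Override
        boolean matchesElement() { return matched; }
    }

    /** Asks the matcher we descend from, not the one descend() returns. */
    static boolean shouldEmit(ToyMatcher current, String elementName) {
        current.descend(elementName);
        return current.matchesElement();
    }

    public static void main(String[] args) {
        ToyMatcher root = new StatefulNamedMatcher("html", new ToyMatcher());
        System.out.println("html: " + shouldEmit(root, "html")); // true
        System.out.println("head: " + shouldEmit(root, "head")); // false
    }
}
```

A trade-off of this approach is that the matcher becomes mutable, so a single instance could no longer be safely shared between handlers or concurrent parses.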

But this whole Matcher/XPath parser stuff is pretty convoluted, so I didn't want to file an
issue & try fixing it until I got some input from you.


-- Ken

    public void startElement(
            String uri, String localName, String name, Attributes attributes)
            throws SAXException {
        matcher = matcher.descend(uri, localName);

        AttributesImpl matches = new AttributesImpl();
        for (int i = 0; i < attributes.getLength(); i++) {
            String attributeURI = attributes.getURI(i);
            String attributeName = attributes.getLocalName(i);
            if (matcher.matchesAttribute(attributeURI, attributeName)) {
                matches.addAttribute(
                        attributeURI, attributeName, attributes.getQName(i),
                        attributes.getType(i), attributes.getValue(i));
            }
        }

        if (matcher.matchesElement() || matches.getLength() > 0) {
            super.startElement(uri, localName, name, matches);
            if (!matcher.matchesElement()) {
                // Force the matcher to match the current element, so the
                // endElement method knows to emit the correct event
                matcher =
                    new CompositeMatcher(matcher, ElementMatcher.INSTANCE);
            }
        }
    }

Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
