Hi
2008/9/24 Brian Levay <brian.levay@gmail.com>
> I'll submit the updates when I'm done (along with unit tests). I'm having a
> problem though. I sync'ed my tika baseline this morning and the Matcher
> stopped matching the <meta> tags. Any idea what might be causing this? I've
> tried many variations of the xpath expressions to match the <meta> tags.
> Right now my code in HTMLParser looks like this:
>
> Matcher body = xpath.parse("/HTML/BODY//node()");
> Matcher title = xpath.parse("/HTML/HEAD/TITLE//node()");
> Matcher meta = xpath.parse("/HTML/HEAD/META//node()");
> handler = new TeeContentHandler(
> new MatchingContentHandler(getBodyHandler(xhtml), body),
> new MatchingContentHandler(getTitleHandler(metadata), title),
> new MatchingContentHandler(getMetaHandler(metadata), meta));
>
> The <meta> handler isn't being called. If I use /HTML/HEAD//node() the
> handler will get called for the <head> and <title> tags but it will skip
> right past the <meta> tags. I know the tika code is seeing the META tags
> because I see the tags trying to be matched in the startElement method of
> MatchingContentHandler. Any ideas?
>
> --Brian
>
I am using effectively the same thing in a local copy and have just rebased
it against HEAD (shown in the diff below), and it appears to be working fine
for me.
What is your test XML like?
Cheers,
Dave
Index: src/main/java/org/apache/tika/parser/html/HtmlParser.java
===================================================================
--- src/main/java/org/apache/tika/parser/html/HtmlParser.java (revision 698705)
+++ src/main/java/org/apache/tika/parser/html/HtmlParser.java (working copy)
@@ -95,9 +95,11 @@
XPathParser xpath = new XPathParser(null, "");
Matcher body = xpath.parse("/HTML/BODY//node()");
Matcher title = xpath.parse("/HTML/HEAD/TITLE//node()");
+ Matcher meta = xpath.parse("/HTML/HEAD/META//node()");
handler = new TeeContentHandler(
new MatchingContentHandler(getBodyHandler(xhtml), body),
- new MatchingContentHandler(getTitleHandler(metadata), title));
+ new MatchingContentHandler(getTitleHandler(metadata), title),
+ new MatchingContentHandler(getMetaHandler(metadata), meta));
// Parse the HTML document
xhtml.startDocument();
@@ -116,6 +118,17 @@
};
}
+ private ContentHandler getMetaHandler(final Metadata metadata) {
+ return new WriteOutContentHandler() {
+ @Override
+ public void startElement(
+ String uri, String local, String name, Attributes atts)
+ throws SAXException {
+ metadata.set(atts.getValue(0), atts.getValue(1));
+ }
+ };
+ }
+
private ContentHandler getBodyHandler(final XHTMLContentHandler xhtml) {
return new TextContentHandler(xhtml) {
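
A side note on getMetaHandler in the patch above: it reads the META
attributes by position (getValue(0), getValue(1)), which only works when
name happens to come before content. Below is a minimal sketch of looking
the attributes up by name instead. It is not Tika code: it runs the JDK's
built-in SAX parser over a small well-formed sample, and MetaByName,
MetaCollector, collect and the sample document are hypothetical names made
up for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-alone example; not part of the Tika patch.
public class MetaByName {

    /** Collects name/content pairs from meta elements, looked up by attribute name. */
    static class MetaCollector extends DefaultHandler {
        final Map<String, String> found = new LinkedHashMap<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (!"meta".equalsIgnoreCase(qName)) {
                return; // only meta elements are of interest here
            }
            // Look up by attribute name so the order in the markup does not matter.
            String name = atts.getValue("name");
            String content = atts.getValue("content");
            if (name != null && content != null) {
                found.put(name, content);
            }
        }
    }

    /** Parses a well-formed XML/XHTML string and returns the collected meta pairs. */
    static Map<String, String> collect(String xml) throws Exception {
        MetaCollector collector = new MetaCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), collector);
        return collector.found;
    }

    public static void main(String[] args) throws Exception {
        // Invented sample: content is written before name on purpose,
        // and the second meta has no name attribute at all.
        String xml = "<html><head><title>t</title>"
                + "<meta content=\"Brian\" name=\"author\"/>"
                + "<meta http-equiv=\"refresh\" content=\"5\"/>"
                + "</head><body/></html>";
        System.out.println(collect(xml)); // prints {author=Brian}
    }
}
```

With positional access the first sample tag would have been stored as
Brian -> author, and the second as refresh -> 5. Whether http-equiv tags
should also be recorded is a separate decision that this sketch leaves out.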