tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Extracting dublin core metadata in HtmlParser?
Date Tue, 19 Jan 2010 14:01:44 GMT
Hi Nick,

On Jan 19, 2010, at 5:41am, Nick Burch wrote:

> Hi All
>
> I've been taking a look at the HtmlParser, and I can't spot anything  
> in there that extracts any of the dublin core metadata that could be  
> there. It seems that it's only things like location and encoding  
> that get set onto the metadata object. Nothing like description,  
> author etc seems to get set.

Only location and encoding are looked for explicitly, but every meta
tag that has a content attribute gets its value put into the metadata map.

See HtmlHandler.startElement(), where it has:

         if (bodyLevel == 0 && discardLevel == 0) {
             if ("META".equals(name) && atts.getValue("content") != null) {
                 if (atts.getValue("http-equiv") != null) {
                     metadata.set(
                             atts.getValue("http-equiv"),
                             atts.getValue("content"));
                 }
                 if (atts.getValue("name") != null) {
                     metadata.set(
                             atts.getValue("name"),
                             atts.getValue("content"));
                 }
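
To make the effect of that branch concrete, here is a minimal standalone sketch. It is not Tika code: a plain Map stands in for the Metadata object, and the collect() method and the {http-equiv, name, content} row layout are hypothetical names made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetaTagSketch {

    // Hypothetical stand-in for the META branch of startElement().
    // Each row is {http-equiv, name, content}; null means the attribute is absent.
    public static Map<String, String> collect(String[][] metaTags) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (String[] tag : metaTags) {
            String httpEquiv = tag[0];
            String name = tag[1];
            String content = tag[2];
            if (content == null) {
                continue; // no content attribute, so nothing is stored
            }
            if (httpEquiv != null) {
                metadata.put(httpEquiv, content);
            }
            if (name != null) {
                metadata.put(name, content);
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = collect(new String[][] {
            {"Content-Type", null, "text/html; charset=UTF-8"},
            {null, "description", "A sample page"},
            {null, "dc.creator", "Nick"},
            {null, "keywords", null} // skipped: no content
        });
        System.out.println(metadata);
    }
}
```

The point of the sketch is that the key is whatever string appears in the name or http-equiv attribute, stored as-is, so description and author tags do end up in the map even though nothing looks for them by name.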


Note, though, that the names defined in Tika's DublinCore enum seem to
be missing the "dc." prefix.
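
A rough sketch of what that mismatch could mean in practice, assuming the Dublin Core names are bare strings such as "title" (I haven't verified the actual constant values). The lookup() helper is hypothetical and not part of Tika, and a plain Map again stands in for the metadata object.

```java
import java.util.HashMap;
import java.util.Map;

public class DcPrefixSketch {

    // Hypothetical helper: try the bare name first, then the "dc."-prefixed form.
    // The match is case-sensitive, so "DC.title" would not be found.
    public static String lookup(Map<String, String> metadata, String bareName) {
        String value = metadata.get(bareName);
        return value != null ? value : metadata.get("dc." + bareName);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        // As it would be stored from <meta name="dc.title" content="Example page">
        metadata.put("dc.title", "Example page");

        System.out.println(metadata.get("title"));     // bare-name lookup misses
        System.out.println(lookup(metadata, "title")); // fallback finds the prefixed key
    }
}
```

Run as-is, this prints null and then Example page: the value is in the map, but only under the prefixed key.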

-- Ken



--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




