tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject HTML mime-types
Date Mon, 07 Dec 2009 02:47:56 GMT
Currently the tika-config.xml file maps three mime-types to the HtmlParser:

    <parser name="parse-html" class="org.apache.tika.parser.html.HtmlParser">
        <mime>text/html</mime>
        <mime>application/xhtml+xml</mime>
        <mime>application/x-asp</mime>
    </parser>

I notice that facebook.com, if you don't send an Accept header with the
request, returns this mime-type:

application/vnd.wap.xhtml+xml

I'm wondering whether this should be added to the set and, if so, what
other variants like this are floating around.

Or whether we need something like "application/*.xhtml.xml" so that
wildcards can be used in mime-type patterns.
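
To make the wildcard idea concrete, here is a minimal sketch of glob-style
matching on mime-types. It is only an illustration: the MimeGlob class and
its methods are hypothetical and not part of Tika, and the pattern
"application/*xhtml+xml" is my assumption about what a useful wildcard
would look like (chosen so that it covers both the plain and the
vendor-prefixed XHTML types).

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: matches a mime-type against a
// glob-style pattern in which '*' stands for any run of characters.
public class MimeGlob {

    // Translate the glob into a regex, quoting everything except '*'.
    static Pattern compile(String glob) {
        String[] literals = glob.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(literals[i]));
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    static boolean matches(String glob, String mimeType) {
        // Drop parameters such as "; charset=utf-8" before matching.
        int semi = mimeType.indexOf(';');
        String base = (semi >= 0 ? mimeType.substring(0, semi) : mimeType).trim();
        return compile(glob).matcher(base).matches();
    }

    public static void main(String[] args) {
        String glob = "application/*xhtml+xml"; // assumed pattern
        System.out.println(matches(glob, "application/xhtml+xml"));
        System.out.println(matches(glob, "application/vnd.wap.xhtml+xml"));
        System.out.println(matches(glob, "text/plain"));
    }
}
```

The trade-off is that a wildcard like this would also pick up any other
"...xhtml+xml" variants automatically, at the cost of possibly matching
types nobody has actually tested the HtmlParser against.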

-- Ken


--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




