tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Attributes in XHTML output
Date Tue, 11 May 2010 00:56:52 GMT
Hi all,

I was taking another look at TIKA-379, which is the issue of "Html
elements and attributes not available in XHTML representation".

In a comment on that issue, Jukka said:

> The reason for the default HTML mapping rules in Tika are to  
> simplify and normalize the input documents so that client  
> applications could easily process all sorts of input (HTML or not)  
> without needing type- or source-specific heuristics. The basic idea  
> has been that clients should directly use the underlying parser  
> libraries when it needs custom processing of specific content types.

It feels to me like the issue of elements is a bit different from that
of attributes. When processing the response, having a well-constrained
set of (XHTML-valid) elements would definitely make things easier.

But I don't see how restricting valid XHTML _attributes_ helps much.
During processing of the result, you care about the structure of the
DOM, and typically not about optional attributes.

Anybody care to weigh in on this?

My specific issue has to do with lang and rel attributes, which are  
very useful during crawling.

I know that the HtmlMapper support (with some improvements) could  
address my needs, but if there's a way to propagate safe attributes  
through to everybody, that seems like a superior solution.
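
To make the "safe attributes" idea concrete, here is a rough sketch of a
per-element allowlist that lets attributes such as lang and rel through
and drops everything else. This is only an illustration under my own
assumptions: the table contents and the names SAFE_ATTRIBUTES,
map_safe_attribute and filter_attributes are hypothetical, and none of
this is Tika's actual API or its mapping rules.

```python
# Hypothetical allowlist: element name -> attribute names considered safe
# to pass through. The entries are illustrative, not Tika's real rules.
SAFE_ATTRIBUTES = {
    "html": {"lang"},
    "a": {"href", "rel", "lang"},
    "link": {"href", "rel"},
    "p": {"lang"},
}


def map_safe_attribute(element, attribute):
    """Return the lowercased attribute name if it is allowlisted, else None."""
    element = element.lower()
    attribute = attribute.lower()
    if attribute in SAFE_ATTRIBUTES.get(element, set()):
        return attribute
    return None


def filter_attributes(element, attributes):
    """Keep only the allowlisted attributes of one element, dropping the rest."""
    kept = {}
    for name, value in attributes.items():
        mapped = map_safe_attribute(element, name)
        if mapped is not None:
            kept[mapped] = value
    return kept
```

With a table like this, a crawler would still see rel="nofollow" on a
link or lang on the root element, while things like onclick never reach
the output.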


-- Ken

Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g
