tika-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Attributes in XHTML output
Date Tue, 11 May 2010 09:40:13 GMT
On 2010-05-11 02:56, Ken Krugler wrote:
> Hi all,
> I was taking another look at TIKA-379, which is the issue of "Html
> elements and attributes not available in XHTML representation"
> In a comment on that issue, Jukka said:
>> The reason for the default HTML mapping rules in Tika are to simplify
>> and normalize the input documents so that client applications could
>> easily process all sorts of input (HTML or not) without needing type-
>> or source-specific heuristics. The basic idea has been that clients
>> should directly use the underlying parser libraries when it needs
>> custom processing of specific content types.
> It feels to me like the issue of elements is a bit different than
> attributes. When processing the response, having a well-constrained set
> of (XHTML-valid) elements would definitely make it easier for clients.
> But I don't see how restricting valid XHTML _attributes_ helps much.
> During processing of the result, you care about the structure of the
> DOM, not typically optional attributes.
> Anybody care to weigh in on this?
> My specific issue has to do with lang and rel attributes, which are very
> useful during crawling.


In my opinion this has to do with the level of knowledge that you expect
from the clients of this API, and the extent of a meaningful schema
mapping that you can perform by default.

If you pass through all valid attributes unchanged, then clients need to
be aware of "lang" and "rel" and their meaning, which poses a question:
what if some other format uses "language" and "function" instead? Your
client would then have to handle all such variants of the same
(semantically speaking) data. It's a natural expectation that such
details should be handled by the library, and the library should know
that for this particular format "language" is semantically equivalent to
a better-known "lang" attribute...

Such 1:1 mapping is often impossible to do, but in many useful cases it
is possible. I think this should be a configurable component in Tika.

E.g. in many Nutch plugins we map format-specific attributes to a
"standard set" of attributes that other Nutch plugins can rely upon.
This is currently hardcoded in plugin implementations.
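
To make the idea of a configurable mapping concrete, here is a rough
sketch of a per-format lookup from format-specific attribute names to a
canonical name. Everything in it is hypothetical: AttributeMapper,
addRule and map are names made up for illustration, they are not existing
Tika or Nutch classes, and the format names used with it are placeholders.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch only; not an existing Tika or Nutch class. */
final class AttributeMapper {
    // format name -> (format-specific attribute name -> canonical attribute name)
    private final Map<String, Map<String, String>> rules = new HashMap<>();

    /** Registers one rule, e.g. ("otherformat", "language") -> "lang". */
    AttributeMapper addRule(String format, String sourceName, String canonicalName) {
        rules.computeIfAbsent(key(format), f -> new HashMap<>())
             .put(key(sourceName), canonicalName);
        return this;
    }

    /**
     * Returns the canonical name for an attribute of the given format,
     * or an empty Optional when no rule is configured, so the caller
     * can decide whether to drop or pass through unmapped attributes.
     */
    Optional<String> map(String format, String sourceName) {
        Map<String, String> perFormat = rules.get(key(format));
        if (perFormat == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(perFormat.get(key(sourceName)));
    }

    // Format and attribute names are matched case-insensitively.
    private static String key(String s) {
        return s.toLowerCase(Locale.ROOT);
    }
}
```

With rules such as addRule("html", "rel", "rel") and
addRule("otherformat", "function", "rel"), a client would only ever need
to know about "rel", and the rules could be loaded from configuration
instead of being hardcoded.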

> I know that the HtmlMapper support (with some improvements) could
> address my needs, but if there's a way to propagate safe attributes
> through to everybody, that seems like a superior solution.

+1 for a component that knows how to map common format-specific
attributes to abstract attributes, e.g. Dublin Core, HTML, Office, etc.
The classes in o.a.nutch.metadata may be helpful.

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
