tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Attributes in XHTML output
Date Tue, 11 May 2010 13:22:49 GMT
Hi Andrzej,

Thanks for responding. See my comments/questions at the end...

On May 11, 2010, at 2:40am, Andrzej Bialecki wrote:

> On 2010-05-11 02:56, Ken Krugler wrote:
>> Hi all,
>>
>> I was taking another look at TIKA-379, which is the issue of "Html
>> elements and attributes not available in XHTML representation"
>>
>> In a comment on that issue, Jukka said:
>>
>>> The reason for the default HTML mapping rules in Tika are to
>>> simplify and normalize the input documents so that client
>>> applications could easily process all sorts of input (HTML or not)
>>> without needing type- or source-specific heuristics. The basic idea
>>> has been that clients should directly use the underlying parser
>>> libraries when it needs custom processing of specific content types.
>>
>> It feels to me like the issue of elements is a bit different than
>> attributes. When processing the response, having a well-constrained
>> set of (XHTML-valid) elements would definitely make it easier for
>> clients.
>>
>> But I don't see how restricting valid XHTML _attributes_ helps much.
>> During processing of the result, you care about the structure of the
>> DOM, not typically optional attributes.
>>
>> Anybody care to weigh in on this?
>>
>> My specific issue has to do with lang and rel attributes, which are
>> very useful during crawling.
>
> Hi,
>
> In my opinion this has to do with the level of knowledge that you
> expect from the clients of this API, and the extent of a meaningful
> schema mapping that you can perform by default.
>
> If you pass through all valid attributes unchanged, then clients
> need to be aware of "lang" and "rel" and their meaning, which poses a
> question: what if some other format uses "language" and "function"
> instead? your client then would have to handle all such variants of
> the same (semantically speaking) data. It's a natural expectation
> that such details should be handled by the library, and the library
> should know that for this particular format "language" is
> semantically equivalent to a better-known "lang" attribute...

If it's valid XHTML, and validates with (say) the XHTML 1.0 Strict  
DTD, then I don't think you would have this case of getting back a  
language (versus lang) attribute.

Or are you talking about ways to make it easier for parsers to return  
conformant attributes?

> Such 1:1 mapping is often impossible to do, but in many useful cases
> it is possible. I think this should be a configurable component in
> Tika.
>
> E.g. in many Nutch plugins we map format-specific attributes to a
> "standard set" of attributes that other Nutch plugins can rely upon.
> This is currently hardcoded in plugin implementations.
>
>> I know that the HtmlMapper support (with some improvements) could
>> address my needs, but if there's a way to propagate safe attributes
>> through to everybody, that seems like a superior solution.
>
> +1 for a component that knows how to map common format-specific
> attributes to abstract attributes e.g. Dublin Core, HTML, Office, etc.
> The classes in o.a.nutch.metadata may be helpful.

So if I understand this correctly, it's not a concern about passing  
through valid XHTML attributes, but rather their value to clients -  
specifically in the context of normalizing the meaning for a variety  
of input formats.
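
To check that I'm reading the normalization idea correctly, here is a rough sketch of what such a mapping might look like. Everything in it is hypothetical: AttributeNormalizer, addAlias, normalize and the "someformat" key are made-up names for illustration, not anything that exists in Tika or Nutch today.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not a Tika or Nutch class): maps format-specific
// attribute names onto one canonical name that clients can rely on.
final class AttributeNormalizer {
    // format name -> (format-specific attribute name -> canonical name)
    private final Map<String, Map<String, String>> aliases = new HashMap<>();

    // Register that, for the given format, formatSpecificName means canonicalName.
    void addAlias(String format, String formatSpecificName, String canonicalName) {
        aliases.computeIfAbsent(format, k -> new HashMap<>())
               .put(formatSpecificName.toLowerCase(Locale.ROOT), canonicalName);
    }

    // Return the canonical name, or the lower-cased input if no alias is known.
    String normalize(String format, String attributeName) {
        String key = attributeName.toLowerCase(Locale.ROOT);
        Map<String, String> forFormat = aliases.get(format);
        if (forFormat == null) {
            return key;
        }
        return forFormat.getOrDefault(key, key);
    }
}
```

With aliases registered for "language" -> "lang" and "function" -> "rel", a client would only ever see "lang" and "rel", whatever the input format called them.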

I think the initial idea was to use the metadata map to return these  
in a generic way, which works for document-wide things...but most of  
what's interesting to me, at least, is on a per-element basis.

If we said that XHTML 1.0 Strict specified allowable attributes, would  
this address your concern about clients needing to handle multiple  
attribute names?
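
As a sketch of what I mean, assuming a per-element whitelist: the class below is hypothetical (AttributeWhitelist and filter are invented names, not Tika API), and the tables cover only a small subset of the XHTML 1.0 Strict DTD. For simplicity it also treats id, class, title, lang and dir as allowed on every element, which the real DTD does not quite do.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical filter (not a Tika class): keeps only attributes that a
// small, hand-picked subset of XHTML 1.0 Strict allows on an element.
final class AttributeWhitelist {
    // Assumption: treated as valid on every element to keep the sketch short.
    private static final Set<String> COMMON =
            Set.of("id", "class", "title", "lang", "dir");

    // Illustrative subset only; the real DTD lists more elements and attributes.
    private static final Map<String, Set<String>> PER_ELEMENT = Map.of(
            "a", Set.of("href", "hreflang", "rel", "rev"),
            "link", Set.of("href", "hreflang", "rel", "rev", "type", "media"));

    // Return the attributes to pass through for this element, in input order.
    static Map<String, String> filter(String element, Map<String, String> attributes) {
        Set<String> specific = PER_ELEMENT.getOrDefault(element, Set.of());
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            if (COMMON.contains(e.getKey()) || specific.contains(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

So an <a> element would keep href, rel and lang but lose target (which Strict doesn't allow), while a <p> would keep only lang.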

Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




