tika-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Attributes in XHTML output
Date Tue, 11 May 2010 15:04:56 GMT
On 2010-05-11 15:22, Ken Krugler wrote:

>> If you pass through all valid attributes unchanged, then clients need to
>> be aware of "lang" and "rel" and their meaning, which poses a question:
>> what if some other format uses "language" and "function" instead? Your
>> client then would have to handle all such variants of the same
>> (semantically speaking) data. It's a natural expectation that such
>> details should be handled by the library, and the library should know
>> that for this particular format "language" is semantically equivalent to
>> a better-known "lang" attribute...
> 
> If it's valid XHTML, and validates with (say) the XHTML 1.0 Strict DTD,
> then I don't think you would have this case of getting back a language
> (versus lang) attribute.

No, of course not - but XHTML is not the original data that we have; we
generate it ourselves, and we have a choice of either dropping offending
attributes or converting them to something acceptable under XHTML.


> Or are you talking about ways to make it easier for parsers to return
> conformant attributes?

Yes.

>> +1 for a component that knows how to map common format-specific
>> attributes to abstract attributes e.g. Dublin Core, HTML, Office, etc.
>> The classes in o.a.nutch.metadata may be helpful.
> 
> So if I understand this correctly, it's not a concern about passing
> through valid XHTML attributes, but rather their value to clients -
> specifically in the context of normalizing the meaning for a variety of
> input formats.

The idea is to pass translated attributes when we can (according to a
mapping), and to pass the original attributes in a non-offending way when
we can't translate them.
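
To make that concrete, here is a rough sketch of such a mapping step using
plain SAX classes. It is not existing Tika code: the class name, the mapping
table and the "urn:example:raw-attributes" namespace are all made up for
illustration, and I'm assuming that "a non-offending way" means moving the
untranslated attribute into a separate namespace.

```java
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

final class AttributeMapper {
    // Hypothetical namespace for attributes we cannot translate.
    static final String RAW_NS = "urn:example:raw-attributes";

    // Illustrative mapping only: format-specific name -> XHTML name.
    private static final Map<String, String> TO_XHTML = Map.of(
            "language", "lang",
            "function", "rel");

    static Attributes map(Attributes in) {
        AttributesImpl out = new AttributesImpl();
        for (int i = 0; i < in.getLength(); i++) {
            // Fall back to the qualified name if the parser was not
            // namespace-aware and left the local name empty.
            String name = in.getLocalName(i).isEmpty()
                    ? in.getQName(i) : in.getLocalName(i);
            String mapped = TO_XHTML.get(name);
            if (mapped != null) {
                // Known attribute: emit it under its XHTML name.
                out.addAttribute("", mapped, mapped,
                        in.getType(i), in.getValue(i));
            } else {
                // Unknown attribute: keep it, but under the separate
                // namespace. A real handler would also have to declare
                // the "raw" prefix via startPrefixMapping().
                out.addAttribute(RAW_NS, name, "raw:" + name,
                        in.getType(i), in.getValue(i));
            }
        }
        return out;
    }
}
```

With this, a parser that sees language="pl" would hand clients lang="pl",
while an attribute it knows nothing about still comes through as raw:foo.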

> 
> I think the initial idea was to use the metadata map to return these in
> a generic way, which works for document-wide things...but most of what's
> interesting to me, at least, is on a per-element basis.
> 
> If we said that XHTML 1.0 Strict specified allowable attributes, would
> this address your concern about clients needing to handle multiple
> attribute names?

Can't we add any attributes we want, as long as they are under a different
namespace, and still be XHTML conformant? You are right that top-level
maps may not cut it - e.g. when parsing bilingual corpora (like Europarl),
every other line should get a different <p lang="">.
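
A minimal sketch of the per-element case, again with plain SAX and made-up
names (BilingualParagraphs is not an existing class). It assumes the input
lines strictly alternate between the given languages, which is a
simplification of what a real corpus parser would have to detect:

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

final class BilingualParagraphs {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Attributes for the paragraph at the given line index. Languages
    // cycle in order, so ("en", "fr") tags even lines "en", odd lines "fr".
    static AttributesImpl attributesForLine(int index, String... langs) {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "lang", "lang", "CDATA",
                langs[index % langs.length]);
        return atts;
    }

    // Emits one <p lang="..."> element per input line.
    static void emit(String[] lines, ContentHandler handler, String... langs)
            throws SAXException {
        for (int i = 0; i < lines.length; i++) {
            handler.startElement(XHTML, "p", "p", attributesForLine(i, langs));
            char[] text = lines[i].toCharArray();
            handler.characters(text, 0, text.length);
            handler.endElement(XHTML, "p", "p");
        }
    }
}
```

A document-wide metadata map has no place to record this, since the
language changes from one <p> to the next.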

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

