tika-dev mailing list archives

From Sami Siren <ssi...@gmail.com>
Subject Re: Extensible content type detection
Date Tue, 20 Jan 2009 10:07:44 GMT
Jukka Zitting wrote:
> Hi,
> On Mon, Jan 19, 2009 at 7:25 AM, Sami Siren <ssiren@gmail.com> wrote:
>> I like the idea; it allows us to use different strategies for detecting the
>> type for individual formats, or to change the whole strategy used. The only
>> thing I am wondering is whether we should introduce some kind of confidence
>> level for the guesses, perhaps as part of the metadata?
> Good question.
> I'm personally not that big a fan of confidence levels, as there's no
> clear definition of how they should be set and interpreted. I also
> haven't seen any real world cases where confidence levels really would
> have been needed to accurately determine the type of a document.
Yes, the only special case I had in mind was the various "text" formats out
there, but as you say there is no real use case at the moment, so let's
keep it out.

 Sami Siren
