tika-dev mailing list archives

From Jan Høydahl <jan....@cominvent.com>
Subject Re: Pluggable language detection
Date Sun, 08 Apr 2012 23:16:47 GMT
In Solr, we added support for pluggable language detectors, one of which is Tika's. See http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/langid/
The detectLanguage() method returns a list of DetectedLanguage objects, each with a normalized certainty
between 0.0 and 1.0. I think it's a step in the right direction.
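
To make the shape of that contract concrete, here is a minimal, self-contained Java sketch. Only the names detectLanguage() and DetectedLanguage come from the description above; the LanguageDetector interface, the StopwordDetector class, the bestLanguage() helper and the accessor names are assumptions made up for illustration, and should not be read as the actual Solr or Tika API. The stopword counter is a toy stand-in for a real detector.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// One detection result: a language code plus a certainty normalized to [0.0, 1.0].
final class DetectedLanguage {
    private final String langCode;
    private final double certainty;

    DetectedLanguage(String langCode, double certainty) {
        if (certainty < 0.0 || certainty > 1.0) {
            throw new IllegalArgumentException("certainty must be in [0.0, 1.0]");
        }
        this.langCode = langCode;
        this.certainty = certainty;
    }

    String getLangCode() { return langCode; }
    double getCertainty() { return certainty; }
}

// Hypothetical common interface that every pluggable detector would implement.
interface LanguageDetector {
    List<DetectedLanguage> detectLanguage(String content);
}

// Toy implementation: counts stopword hits per language and normalizes
// each count by the total number of hits across all languages.
final class StopwordDetector implements LanguageDetector {
    private static final Map<String, Set<String>> STOPWORDS = Map.of(
            "en", Set.of("the", "and", "is"),
            "no", Set.of("og", "er", "det"));

    @Override
    public List<DetectedLanguage> detectLanguage(String content) {
        String[] tokens = content.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<DetectedLanguage> result = new ArrayList<>();
        int total = 0;
        for (Set<String> words : STOPWORDS.values()) {
            for (String t : tokens) {
                if (words.contains(t)) total++;
            }
        }
        if (total == 0) {
            return result; // nothing recognized: empty list
        }
        for (Map.Entry<String, Set<String>> e : STOPWORDS.entrySet()) {
            int hits = 0;
            for (String t : tokens) {
                if (e.getValue().contains(t)) hits++;
            }
            if (hits > 0) {
                result.add(new DetectedLanguage(e.getKey(), (double) hits / total));
            }
        }
        // Most certain language first.
        result.sort(Comparator.comparingDouble(DetectedLanguage::getCertainty).reversed());
        return result;
    }
}

final class LangIdDemo {
    // Caller-side logic depends only on the interface, so detectors can be swapped.
    static String bestLanguage(LanguageDetector detector, String content,
                               double threshold, String fallback) {
        return detector.detectLanguage(content).stream()
                .filter(d -> d.getCertainty() >= threshold)
                .max(Comparator.comparingDouble(DetectedLanguage::getCertainty))
                .map(DetectedLanguage::getLangCode)
                .orElse(fallback);
    }

    public static void main(String[] args) {
        LanguageDetector detector = new StopwordDetector();
        System.out.println(bestLanguage(detector, "the cat and the dog", 0.5, "unknown"));
        System.out.println(bestLanguage(detector, "det er en katt og en hund", 0.5, "unknown"));
    }
}
```

Because the caller only sees a list of (language, certainty) pairs on a common scale, it can apply one threshold and one fallback regardless of which detector is plugged in.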

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 22. mars 2012, at 11:22, Julien Nioche wrote:

> If you mean integrating a better third-party detector - that's exactly my
> point. We don't develop and maintain our own parsers, why should we follow
> a different logic when it comes to language identification? There are other
> resources around; why don't we just use them? I assume that by default our
> existing detector (improved or not) could still be used, all we need is
> just a mechanism to be able to select an alternative implementation and a
> common interface. That's probably not a big deal to implement. Any thoughts
> on how to do it? Are there any things we should reuse from the way we deal
> with the parsers?
> 
> Thanks for your comments
> 
> Julien
> 
> 
> On 21 March 2012 16:55, Ken Krugler <kkrugler_lists@transpac.com> wrote:
> 
>> 
>> On Mar 21, 2012, at 8:51am, Julien Nioche wrote:
>> 
>>> Hi guys,
>>> 
>>> Just wondering about the best way to make the language detection pluggable
>>> instead of having it hard-wired as it is now. We know that the resources
>>> that are currently in Tika are both slow and inaccurate [1] and there are
>>> other libraries that we could leverage. Why not have the option to select
>>> a different implementation just like we do for parsers? Obviously we'd need
>>> a common interface for the parsers etc...
>>> 
>>> What do you think?
>> 
>> I'd be more in favor of using that time to integrate a better language
>> detector into Tika, so that everybody wins from the work :)
>> 
>> -- Ken
>> 
>> 
>>> [1] http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
>>> 
>>> --
>>> *
>>> *Open Source Solutions for Text Engineering
>>> 
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble
>> 
>> --------------------------
>> Ken Krugler
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Mahout & Solr
>> 
>> 
>> 
>> 
>> 
> 
> 
> -- 
> *
> *Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

