tika-dev mailing list archives

From Julien Nioche <lists.digitalpeb...@gmail.com>
Subject Re: Pluggable language detection
Date Thu, 22 Mar 2012 10:22:46 GMT
If you mean integrating a better third-party detector - that's exactly my
point. We don't develop and maintain our own parsers, so why should we follow
a different logic when it comes to language identification? There are other
resources around, so why don't we just use them? I assume that by default our
existing detector (improved or not) could still be used; all we need is a
common interface and a mechanism for selecting an alternative
implementation. That's probably not a big deal to implement. Any thoughts
on how to do it? Is there anything we should reuse from the way we deal
with the parsers?
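
To make the idea concrete, here is a minimal sketch of a common interface plus
a selection mechanism. Everything in it is an assumption for illustration: the
names PluggableDetection, LanguageDetector, DefaultDetector and select are
hypothetical and not existing Tika APIs, and the default detector is a
placeholder that does no real detection. Alternatives are discovered with the
standard java.util.ServiceLoader, i.e. via META-INF/services provider files.

```java
import java.util.ServiceLoader;

class PluggableDetection {

    /** Hypothetical common interface; not an existing Tika API. */
    public interface LanguageDetector {
        /** Returns a language code such as "en", or "unknown". */
        String detect(String text);
    }

    /** Stand-in for the existing built-in detector (placeholder logic only). */
    public static class DefaultDetector implements LanguageDetector {
        @Override
        public String detect(String text) {
            return "unknown";
        }
    }

    /**
     * Returns the registered implementation whose class name matches
     * className, or the default detector if none is registered under that name.
     */
    public static LanguageDetector select(String className) {
        // ServiceLoader reads META-INF/services/<interface name> from the classpath.
        for (LanguageDetector candidate : ServiceLoader.load(LanguageDetector.class)) {
            if (candidate.getClass().getName().equals(className)) {
                return candidate;
            }
        }
        return new DefaultDetector();
    }
}
```

With no provider files on the classpath, select(...) always falls back to the
default, so existing behaviour would be unchanged unless a third-party
detector is registered and explicitly requested.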

Thanks for your comments

Julien


On 21 March 2012 16:55, Ken Krugler <kkrugler_lists@transpac.com> wrote:

>
> On Mar 21, 2012, at 8:51am, Julien Nioche wrote:
>
> > Hi guys,
> >
> > Just wondering about the best way to make the language detection
> > pluggable instead of having it hard-wired as it is now. We know that the
> > resources that are currently in Tika are both slow and inaccurate [1] and
> > there are other libraries that we could leverage. Why not have the option
> > to select a different implementation just like we do for parsers?
> > Obviously we'd need a common interface for the parsers etc...
> >
> > What do you think?
>
> I'd be more in favor of using that time to integrate a better language
> detector into Tika, so that everybody wins from the work :)
>
> -- Ken
>
>
> > [1] http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
> >
> > --
> > *
> > *Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
>
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
>


-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
