tika-dev mailing list archives

From Oleg Tikhonov <o...@apache.org>
Subject Re: [jira] [Commented] (TIKA-546) Add ability to create language profiles to tika-app
Date Thu, 14 Apr 2011 14:16:02 GMT
Sami,
Chris and I did that some time ago for a developerWorks tutorial; the
"clean" code exists, although it may be out of date.
I wonder, is it a good idea to use Nutch code inside Tika? Maybe the Nutch
guys could extend it as an independent module?



On Thu, Apr 14, 2011 at 3:01 PM, Sami Siren (JIRA) <jira@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/TIKA-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019793#comment-13019793]
>
> Sami Siren commented on TIKA-546:
> ---------------------------------
>
> bq. Do we build the "LanguageProfilerBuilder" from Nutch code here locally
> and ship it as binary package/library or as part of mvn install task/ ant
> task?
>
> I would just do what Jan suggested = get the relevant source files from
> Nutch, modify them as needed (like remove dependencies etc) and commit this
> into Tika svn repository.
>
>
>
> > Add ability to create language profiles to tika-app
> > ---------------------------------------------------
> >
> >                 Key: TIKA-546
> >                 URL: https://issues.apache.org/jira/browse/TIKA-546
> >             Project: Tika
> >          Issue Type: New Feature
> >          Components: cli, languageidentifier
> >    Affects Versions: 0.7
> >            Reporter: Jan Høydahl
> >
> > Since TIKA-490 it is supposed to be easy adding new language profiles to
> > TIKA. However, currently the process involves using Nutch's NGramProfile
> > tool and editing the output.
> > We should port Nutch's profile builder to Tika and make it part of
> > tika-app.jar:
> > # See http://wiki.apache.org/nutch/LanguageIdentifier
> > # java -jar tika-app.jar --create-profile [--gramsizes=<n>,<n>,...]
> > [--maxlines=<max>] <profile-name> <filename> <encoding>
> > Using --gramsizes and --maxlines, we could support both Tika-style
> > profiles and Nutch-style profiles and thus deprecate the Nutch tool.
> > Defaults should be --gramsizes=3 --maxlines=1000
>
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
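
For readers unfamiliar with what the proposed --gramsizes and --maxlines
options would control, here is a minimal sketch of character n-gram
counting. It is not the Nutch NGramProfile code or any Tika class: the
class name NGramProfileSketch and the method countNGrams are hypothetical,
and real profile builders may normalize and pad the text differently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of Tika or Nutch.
public class NGramProfileSketch {

    // Counts character n-grams of each requested size over at most
    // maxLines input lines. Lower-casing is an assumption made here
    // for simplicity.
    static Map<String, Integer> countNGrams(List<String> lines, int[] gramSizes, int maxLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int limit = Math.min(maxLines, lines.size());
        for (int i = 0; i < limit; i++) {
            String text = lines.get(i).toLowerCase(Locale.ROOT);
            for (int n : gramSizes) {
                // Slide a window of length n across the line.
                for (int start = 0; start + n <= text.length(); start++) {
                    counts.merge(text.substring(start, start + n), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Mirrors the suggested defaults: trigrams only, up to 1000 lines.
        Map<String, Integer> profile =
                countNGrams(List.of("banana bandana"), new int[] {3}, 1000);
        profile.forEach((gram, count) -> System.out.println(gram + "\t" + count));
    }
}
```

Passing several sizes (for example new int[] {1, 2, 3}) would correspond to
a multi-size profile, while a single size of 3 matches the suggested
default.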
