tika-dev mailing list archives

From Jan Høydahl (JIRA) <j...@apache.org>
Subject [jira] Created: (TIKA-546) Add ability to create language profiles to tika-app
Date Mon, 08 Nov 2010 08:39:10 GMT
Add ability to create language profiles to tika-app

                 Key: TIKA-546
                 URL: https://issues.apache.org/jira/browse/TIKA-546
             Project: Tika
          Issue Type: New Feature
          Components: cli, languageidentifier
    Affects Versions: 0.7
            Reporter: Jan Høydahl

Since TIKA-490, adding new language profiles to Tika is supposed to be easy. Currently, however,
the process involves running Nutch's NGramProfile tool and editing its output.

We should port Nutch's profile builder to Tika and make it part of tika-app.jar:
# See http://wiki.apache.org/nutch/LanguageIdentifier
# java -jar tika-app.jar --create-profile [--gramsizes=<n>,<n>,...] [--maxlines=<max>] <profile-name> <filename> <encoding>

Using --gramsizes and --maxlines, we could support both Tika-style and Nutch-style profiles
and thus deprecate the Nutch tool. Defaults should be --gramsizes=3 --maxlines=1000.
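
To make the role of the two proposed options concrete, here is a minimal sketch of what a profile builder driven by gram sizes and a line limit might do. It is not the Nutch NGramProfile code and not a Tika API: the class name NGramProfileSketch, the method buildProfile, the underscore word padding, and the plain count map are all assumptions made for illustration, and the on-disk profile format is not covered.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; names and conventions here are illustrative assumptions.
class NGramProfileSketch {

    /**
     * Counts character n-grams of each requested size, reading at most
     * maxLines lines. Words are lowercased and padded with '_' on both
     * sides (an assumed convention) so that word boundaries are counted.
     */
    static Map<String, Integer> buildProfile(BufferedReader in, int[] gramSizes, int maxLines)
            throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        String line;
        int lines = 0;
        // Check the limit first so no line beyond maxLines is consumed.
        while (lines < maxLines && (line = in.readLine()) != null) {
            lines++;
            // Split on runs of non-letter characters.
            for (String word : line.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (word.isEmpty()) {
                    continue;
                }
                String padded = "_" + word + "_";
                for (int n : gramSizes) {
                    for (int i = 0; i + n <= padded.length(); i++) {
                        counts.merge(padded.substring(i, i + n), 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Mirrors the proposed defaults: --gramsizes=3 --maxlines=1000
        BufferedReader in = new BufferedReader(new StringReader("the cat\nthe dog\n"));
        Map<String, Integer> profile = buildProfile(in, new int[] {3}, 1000);
        profile.forEach((gram, count) -> System.out.println(gram + " " + count));
    }
}
```

Passing several sizes (for example new int[] {1, 2, 3, 4}) in place of the single size 3 is how one builder could cover profiles with different gram-size ranges, which is the point of the --gramsizes option.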

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
