Add ability to create language profiles to tika-app
---------------------------------------------------
Key: TIKA-546
URL: https://issues.apache.org/jira/browse/TIKA-546
Project: Tika
Issue Type: New Feature
Components: cli, languageidentifier
Affects Versions: 0.7
Reporter: Jan Høydahl
Since TIKA-490, adding new language profiles to Tika is supposed to be easy. However, the process
currently involves running Nutch's NGramProfile tool and then editing its output.
We should port Nutch's profile builder to Tika and make it part of tika-app.jar:
# See http://wiki.apache.org/nutch/LanguageIdentifier
# java -jar tika-app.jar --create-profile [--gramsizes=<n>,<n>,...] [--maxlines=<max>] <profile-name> <filename> <encoding>
With --gramsizes and --maxlines, we could support both Tika-style and Nutch-style profiles
and thus deprecate the Nutch tool. The defaults should be --gramsizes=3 and --maxlines=1000.
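To illustrate what the two proposed options would control, here is a minimal sketch of an n-gram counter. It is not the Nutch NGramProfile code or any existing Tika API: the class name NGramProfileSketch, the build method, and the choice to lower-case each line and pad it with '_' are all assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of Tika or Nutch.
class NGramProfileSketch {

    /**
     * Counts character n-grams of every size in gramSizes,
     * reading at most maxLines lines of the input.
     */
    static Map<String, Integer> build(List<String> lines, int[] gramSizes, int maxLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int limit = Math.min(maxLines, lines.size());
        for (int i = 0; i < limit; i++) {
            // Assumption: normalize case and mark line boundaries with '_'.
            String text = "_" + lines.get(i).toLowerCase(Locale.ROOT).trim() + "_";
            for (int n : gramSizes) {
                for (int start = 0; start + n <= text.length(); start++) {
                    counts.merge(text.substring(start, start + n), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Mirrors the proposed defaults: --gramsizes=3 --maxlines=1000
        List<String> sample = Arrays.asList("the cat", "the hat");
        Map<String, Integer> profile = build(sample, new int[] {3}, 1000);
        profile.forEach((gram, count) -> System.out.println(gram + " " + count));
    }
}
```

In this sketch, passing several sizes (for example {1, 2, 3, 4}) rather than the single size 3 is how one tool could emit profiles with different gram sizes; the actual file formats of Tika and Nutch profiles are not modeled here.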