From dev-return-12124-apmail-tika-dev-archive=tika.apache.org@tika.apache.org Sat Jul 5 00:40:36 2014
Mailing-List: contact dev-help@tika.apache.org; run by ezmlm
Reply-To: dev@tika.apache.org
Date: Sat, 5 Jul 2014 00:40:34 +0000 (UTC)
From: "Chris A. Mattmann (JIRA)"
To: dev@tika.apache.org
Subject: [jira] [Commented] (TIKA-1343) Create a Tika Translator implementation that uses JoshuaDecoder

    [ https://issues.apache.org/jira/browse/TIKA-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052738#comment-14052738 ]

Chris A. Mattmann commented on TIKA-1343:
-----------------------------------------

Hey Dave, did you get a chance to try this out?

> Create a Tika Translator implementation that uses JoshuaDecoder
> ---------------------------------------------------------------
>
>                 Key: TIKA-1343
>                 URL: https://issues.apache.org/jira/browse/TIKA-1343
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.6
>
>
> The Joshua Decoder toolkit is a BSD-licensed, Java-based statistical machine translation system hosted at GitHub:
> http://joshua-decoder.org/
> Joshua takes in corpora and trains models that can then be used to do language translation. Currently there is support for, e.g., Spanish->English, Indian dialects->English, Chinese->English, and a few others.
> https://github.com/joshua-decoder/joshua/
> It would be nice to build a Tika Translator on top of Joshua. There are of course several issues with this:
> * the models are huge - so we'll need a separate package or Maven module, maybe tika-translate-joshua or something, to release the models, and we'll need to build the models. I just went through the process of building the Spanish->English one, and it still needs to be rebuilt b/c I did it wrong, but it took over a day
> * there is a configuration for Joshua, and so we need some way of passing that config into the Translator. Not sure of the best way to do this.
> * Joshua isn't in the Central repository. I've started a discussion on the Joshua lists about this: https://groups.google.com/forum/#!topic/joshua_support/9Y04miboUj0
> Anyhoo, I've got a working patch right now with hard-coded stuff, and a manual install into my Maven repo for brave souls out there who want to try it.

--
This message was sent by Atlassian JIRA
(v6.2#6252)