tika-dev mailing list archives

From "Chris Mattmann" <mattm...@apache.org>
Subject Re: Review Request 22761: Create a Tika Translator implementation that uses JoshuaDecoder
Date Wed, 18 Jun 2014 22:04:21 GMT

This is an automatically generated e-mail. To reply, visit:

(Updated June 18, 2014, 10:04 p.m.)

Review request for tika.

Bugs: tika-1343

Repository: tika


The Joshua Decoder toolkit is a BSD-licensed, Java-based statistical machine translation system
hosted at GitHub:


Joshua takes in corpora and trains models that can then be used to do language translation.
Currently there is support for, e.g., Spanish->English, Indian dialects->English, Chinese->English,
and a few others.


It would be nice to build a Tika Translator on top of Joshua. There are of course several
issues with this:

* the models are huge, so we'll need a separate package or Maven module (maybe tika-translate-joshua
or something) to release the models, and we'll need to build the models. I just went through
the process of building the Spanish->English one; it took over a day, and it still needs to be
rebuilt because I did it wrong
* Joshua has its own configuration, so we need some way of passing that config into
the Translator. I'm not sure of the best way to do this.
* Joshua isn't in the Central repository. I've started a discussion on the Joshua lists about
this: https://groups.google.com/forum/#!topic/joshua_support/9Y04miboUj0
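
For the configuration bullet above, one possible approach is to read the Joshua settings from a properties file and hand them to the Translator as a small value object. The following is only a sketch under stated assumptions: the class name JoshuaTranslatorConfig and the property keys translator.joshua.config-file and translator.joshua.model-dir are hypothetical and do not come from the patch, from Tika, or from Joshua.

```java
import java.util.Properties;

// Hypothetical holder for Joshua settings; not part of Tika or Joshua.
class JoshuaTranslatorConfig {
    // Hypothetical property keys, chosen for illustration only.
    static final String CONFIG_FILE_KEY = "translator.joshua.config-file";
    static final String MODEL_DIR_KEY = "translator.joshua.model-dir";

    private final String configFile;
    private final String modelDir;

    private JoshuaTranslatorConfig(String configFile, String modelDir) {
        this.configFile = configFile;
        this.modelDir = modelDir;
    }

    // Build the config from already-loaded properties (e.g. a file on the classpath).
    static JoshuaTranslatorConfig fromProperties(Properties props) {
        return new JoshuaTranslatorConfig(
                props.getProperty(CONFIG_FILE_KEY),
                props.getProperty(MODEL_DIR_KEY));
    }

    // A translator could report itself unavailable when no config file is set,
    // instead of failing at translation time.
    boolean isAvailable() {
        return configFile != null && !configFile.isEmpty();
    }

    String getConfigFile() {
        return configFile;
    }

    String getModelDir() {
        return modelDir;
    }
}
```

A Translator built this way could check isAvailable() up front and fall back gracefully when the (large) models are not installed, rather than relying on hard-coded paths.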

Anyhoo, I've got a working patch right now with hard-coded values, and a manual install of Joshua
into my Maven repo, for brave souls out there who want to try it.

Diffs (updated)

  ./trunk/tika-translate/pom.xml 1603529 

Diff: https://reviews.apache.org/r/22761/diff/


Ran the patch against my locally built Spanish->English model, built using the corpus at
http://joshua-decoder.org/data/fisher-callhome-corpus/
My dataset isn't perfect, but it can do basic translations. I also wrote a unit test, which is
part of the patch.


Chris Mattmann
