tika-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1343) Create a Tika Translator implementation that uses JoshuaDecoder
Date Wed, 27 Apr 2016 22:05:12 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261039#comment-15261039 ]

ASF GitHub Bot commented on TIKA-1343:

GitHub user lewismc opened a pull request:


    TIKA-1343 Create a Tika Translator implementation that uses JoshuaDecoder

    This is this afternoon's first attempt at addressing the long-overdue https://issues.apache.org/jira/browse/TIKA-1343
    It also removes unused imports and material that is not required from within other Translation
    This has not been extensively tested; I will be testing it more tomorrow, in particular debugging
the JSON response message and the REST API request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lewismc/tika TIKA-1343

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #112

commit d4fb28f91d77458b15557942438f874b9f564e88
Author: Lewis John McGibbney <lewis.j.mcgibbney@jpl.nasa.gov>
Date:   2016-04-27T22:06:42Z

    TIKA-1343 Create a Tika Translator implementation that uses JoshuaDecoder


> Create a Tika Translator implementation that uses JoshuaDecoder
> ---------------------------------------------------------------
>                 Key: TIKA-1343
>                 URL: https://issues.apache.org/jira/browse/TIKA-1343
>             Project: Tika
>          Issue Type: New Feature
>          Components: translation
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.14
> The Joshua Decoder toolkit is a BSD-licensed, Java-based statistical machine translation
system hosted at GitHub:
> http://joshua-decoder.org/
> Joshua takes in corpora and trains models that can then be used to do language translation.
Currently there is support for, e.g., Spanish->English, Indian dialects->English, Chinese->English,
and a few others.
> https://github.com/joshua-decoder/joshua/
> It would be nice to build a Tika Translator on top of Joshua. There are of course several
issues with this:
> * the models are huge, so we'll need a separate package or Maven module (maybe tika-translate-joshua
or something) to release the models, and we'll need to build the models. I just went through
the process of building the Spanish->English one; it took over a day, and it still needs to be
rebuilt because I did it wrong
> * there is a configuration for Joshua, and so we need some way of passing that config
into the Translator. Not sure of the best way to do this.
> * Joshua isn't in the Central repository. I've started a discussion on the Joshua lists
about this: https://groups.google.com/forum/#!topic/joshua_support/9Y04miboUj0
> Anyhoo, I've got a working patch right now with hard-coded stuff, and a manual install
into my Maven repo, for brave souls out there who want to try it.
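
On the open question above of how to pass Joshua configuration into the Translator, one
possible approach is to read the settings from a properties file instead of hard-coding them.
The sketch below is only an illustration under stated assumptions: the class name
JoshuaTranslatorConfig and the keys joshua.server.url and joshua.model.dir are hypothetical
and are not taken from Tika, Joshua, or the patch in this pull request.

```java
import java.util.Properties;

/**
 * Hypothetical settings holder for a Joshua-backed translator.
 * Neither this class nor the property keys below come from Tika or Joshua.
 */
final class JoshuaTranslatorConfig {
    // Assumed property keys, chosen only for this sketch.
    static final String SERVER_URL_KEY = "joshua.server.url";
    static final String MODEL_DIR_KEY = "joshua.model.dir";

    private final String serverUrl;
    private final String modelDir;

    private JoshuaTranslatorConfig(String serverUrl, String modelDir) {
        this.serverUrl = serverUrl;
        this.modelDir = modelDir;
    }

    /** Builds a config from already-loaded properties; missing keys become empty strings. */
    static JoshuaTranslatorConfig fromProperties(Properties props) {
        return new JoshuaTranslatorConfig(
                props.getProperty(SERVER_URL_KEY, "").trim(),
                props.getProperty(MODEL_DIR_KEY, "").trim());
    }

    /** A translator could report itself unavailable when no server URL is configured. */
    boolean isAvailable() {
        return !serverUrl.isEmpty();
    }

    String getServerUrl() {
        return serverUrl;
    }

    String getModelDir() {
        return modelDir;
    }
}
```

A translator built this way could load the Properties object from a file on the classpath and
then decline to translate whenever isAvailable() returns false, so a missing configuration
degrades gracefully instead of failing at request time.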
