tika-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
Date Tue, 04 Aug 2015 02:25:04 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652967#comment-14652967 ]

Chris A. Mattmann commented on TIKA-1699:

Sujen, please update the PR with my 2 comments/updates, and please let me know when the rest
of the JAR files are on Maven Central; then I think we can integrate this. We should also make a
custom tika-config to override the default PDF parser, or better yet, to somehow combine the two.
That's one thing I wondered about too: would it make sense to combine them, or are they really
separate parsers? It seems like they should be separate, because they potentially have
overlapping keys, right?
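
To make the overlapping-keys concern concrete, here is a minimal sketch in plain Java (no Tika
dependency). It assumes the "keys" are metadata keys and models each parser's output as a simple
map; the class name, method names, key names and the "grobid:" prefix are all hypothetical and
are not taken from Tika or GROBID.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: models two parsers' metadata as plain maps.
// None of these names come from the Tika or GROBID APIs.
public class MetadataMergeSketch {

    /** Returns the keys that both parsers emit, in sorted order. */
    static Set<String> overlappingKeys(Map<String, String> a, Map<String, String> b) {
        Set<String> shared = new TreeSet<>(a.keySet());
        shared.retainAll(b.keySet());
        return shared;
    }

    /**
     * Combines two metadata maps. Values from the primary parser are kept as-is;
     * a secondary value whose key collides is stored under prefix + key instead
     * of overwriting the primary value.
     */
    static Map<String, String> merge(Map<String, String> primary,
                                     Map<String, String> secondary,
                                     String prefix) {
        Map<String, String> merged = new LinkedHashMap<>(primary);
        for (Map.Entry<String, String> e : secondary.entrySet()) {
            String key = merged.containsKey(e.getKey()) ? prefix + e.getKey() : e.getKey();
            merged.put(key, e.getValue());
        }
        return merged;
    }

    public static void main(String[] args) {
        // Assumed example values, for illustration only.
        Map<String, String> pdf = new LinkedHashMap<>();
        pdf.put("title", "Title from the PDF document info");
        pdf.put("Content-Type", "application/pdf");

        Map<String, String> journal = new LinkedHashMap<>();
        journal.put("title", "Title extracted from the paper header");
        journal.put("authors", "A. Author; B. Author");

        System.out.println("Overlapping: " + overlappingKeys(pdf, journal));
        System.out.println("Merged: " + merge(pdf, journal, "grobid:"));
    }
}
```

If the two parsers were combined, some policy like the prefixing above would be needed for shared
keys; keeping them as separate parsers sidesteps the question.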

We also need to create a page on the Tika wiki that describes how to install GROBID: http://wiki.apache.org/tika/GrobidParser

> Integrate the GROBID PDF extractor in Tika
> ------------------------------------------
>                 Key: TIKA-1699
>                 URL: https://issues.apache.org/jira/browse/TIKA-1699
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
> GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning library for extracting,
> parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents,
> with a particular focus on technical and scientific publications.
> It has a Java API which can be used to augment PDF parsing for journals and help extract
> extra metadata about the paper, such as authors, publication, citations, etc.
> It would be nice to have this integrated into Tika. I have tried it locally and will
> issue a pull request soon.

This message was sent by Atlassian JIRA
