tika-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
Date Sat, 15 Aug 2015 15:26:45 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698315#comment-14698315 ]

Chris A. Mattmann commented on TIKA-1699:

bq. I've tried to exclude the grobid transient dependencies to work around this problem, but
even an exclude of * still breaks the build on org.apache.maven.plugins:maven-remote-resources-plugin
with the broken repo definition. Unfortunately, I've therefore had to back out your r1695816,
in order to unbreak the build. Hopefully we can get the grobid community to sort that shortly,
and we can restore it!

Yeah, we're working with them to get this fixed.

bq. On other possible issue spotted while failing to work around the broken pom - the grobid-core
jar seems to be almost 15mb in size! Plus its dependencies themselves. That means we'll increase
the size of the tika-app, tika-server and tika-bundle jars by almost half! Is there perhaps
a smaller grobid jar we could depend on instead, which doesn't cause such a bump in our dependency
sizes and jars?

Looking at: http://repo1.maven.org/maven2/org/apache/tika/tika-app/1.10/

tika-app is ~48MB, it seems, so a ~15MB grobid-core jar would actually be closer to a 30%
size increase (15/48 ≈ 31%), not counting GROBID's own dependencies. As for depending on a
smaller core jar, I had an idea here. GROBID has a server; I wonder if we should just connect
to its REST server? [~sujenshah] That way we could avoid adding really any dependencies
beyond CXF and its WebClient. I'll investigate this.
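
To make the REST idea concrete, here is a rough sketch of a client that posts a PDF to a running GROBID server and reads back the response. It is an illustration only, not the integration that was committed. To stay self-contained it uses the JDK's built-in java.net.http.HttpClient (Java 11+) rather than the CXF WebClient mentioned above. The endpoint URL, port, and form field name are assumptions and should be checked against the GROBID server actually in use; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class GrobidRestSketch {

    // Assumptions: verify the URL and the form field name against your GROBID server.
    static final String ASSUMED_ENDPOINT = "http://localhost:8070/api/processHeaderDocument";
    static final String ASSUMED_FIELD = "input";

    /** Builds a single-part multipart/form-data body that carries the PDF bytes. */
    static byte[] multipartBody(String boundary, String field, String fileName, byte[] pdf)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + field
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n";
        out.write(head.getBytes(StandardCharsets.UTF_8));
        out.write(pdf);
        out.write(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    /** Posts the PDF to the given endpoint and returns the raw response body. */
    static String extractHeader(String endpoint, Path pdf)
            throws IOException, InterruptedException {
        String boundary = "tika-grobid-" + System.nanoTime();
        byte[] body = multipartBody(boundary, ASSUMED_FIELD,
                pdf.getFileName().toString(), Files.readAllBytes(pdf));
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .header("Accept", "application/xml")
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("GROBID returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: GrobidRestSketch <file.pdf>");
            return;
        }
        System.out.println(extractHeader(ASSUMED_ENDPOINT, Path.of(args[0])));
    }
}
```

With this shape the only thing Tika would need at runtime is an HTTP client and a reachable GROBID server, which is the point of the suggestion: the ~15MB grobid-core jar and its transitive dependencies stay out of tika-app, tika-server and tika-bundle.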

> Integrate the GROBID PDF extractor in Tika
> ------------------------------------------
>                 Key: TIKA-1699
>                 URL: https://issues.apache.org/jira/browse/TIKA-1699
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
> GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning library for extracting,
> parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents
> with a particular focus on technical and scientific publications.
> It has a Java API which can be used to augment PDF parsing for journals and help extract
> extra metadata about the paper like authors, publication, citations, etc.
> It would be nice to have this integrated into Tika. I have tried it locally and will
> issue a pull request soon.

This message was sent by Atlassian JIRA
