lucene-solr-user mailing list archives

From "Allison, Timothy B." <>
Subject RE: Integrating grobid with Tika in solr
Date Wed, 04 May 2016 14:28:43 GMT
I think Solr is using a version of Tika that predates the addition of the Grobid parser.
You'll have to add it manually somehow until Solr upgrades to Tika 1.13 (soon to be released,
I think). See SOLR-8981.
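
One way to confirm that the class is simply missing from a given set of jars is a small standalone check. The sketch below is a hypothetical helper (ParserClassCheck is not part of Solr or Tika); it only tries to locate the class named in the stack trace through the JVM class loader.

```java
// Hypothetical diagnostic helper; not part of Solr or Tika.
final class ParserClassCheck {

    /** Returns true if the named class is visible to this class's class loader. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ParserClassCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class, or a class whose own dependencies are missing.
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the stack trace from this thread.
        String name = args.length > 0
                ? args[0]
                : "org.apache.tika.parser.journal.JournalParser";
        System.out.println(name + (isLoadable(name)
                ? " is loadable"
                : " is NOT on the classpath"));
    }
}
```

Run it with the Tika jars you want to test on the classpath (java -cp with those jars plus the directory holding the compiled class). It only reports on the jars you pass to the JVM; Solr loads core-level jars through its own resource loader, so treat the result as a hint about the jar contents rather than about Solr's runtime classpath.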

-----Original Message-----
From: Betsey Benagh [] 
Sent: Wednesday, May 4, 2016 10:07 AM
Subject: Re: Integrating grobid with Tika in solr

Grobid runs as a service, and I'm (theoretically) configuring Tika to call it.

From the Grobid wiki, here are the instructions for integrating with the Tika application:

First we need to create the file that points to the Grobid REST
Service. My file looks like the following:


Now you can run GROBID via Tika-app with the following command on a sample PDF file.

java -classpath $HOME/src/grobidparser-resources/:tika-app-1.11-SNAPSHOT.jar org.apache.tika.cli.TikaCLI \
  --config=$HOME/src/grobidparser-resources/tika-config.xml -J $HOME/src/grobid/papers/ICSE06.pdf

Here's the stack trace.

<lst name="error"><lst name="metadata"><str name="error-class">org.apache.solr.common.SolrException</str><str
name="msg">org.apache.tika.exception.TikaException: Unable to find a parser class: org.apache.tika.parser.journal.JournalParser</str><str
name="trace">org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException:
Unable to find a parser class: org.apache.tika.parser.journal.JournalParser
at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(
at org.apache.solr.core.PluginBag$LazyPluginHolder.createInst(
at org.apache.solr.core.PluginBag$LazyPluginHolder.get(
at org.apache.solr.core.PluginBag.get(
at org.apache.solr.handler.RequestHandlerBase.getRequestHandler(
at org.apache.solr.core.SolrCore.getRequestHandler(
at org.apache.solr.servlet.HttpSolrCall.extractHandlerFromURLPath(
at org.apache.solr.servlet.HttpSolrCall.init(
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(
at org.eclipse.jetty.servlet.ServletHandler.doHandle(
at org.eclipse.jetty.server.handler.ScopedHandler.handle(
at org.eclipse.jetty.server.session.SessionHandler.doHandle(
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(
at org.eclipse.jetty.servlet.ServletHandler.doScope(
at org.eclipse.jetty.server.session.SessionHandler.doScope(
at org.eclipse.jetty.server.handler.ContextHandler.doScope(
at org.eclipse.jetty.server.handler.ScopedHandler.handle(
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(
at org.eclipse.jetty.server.handler.HandlerCollection.handle(
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(
at org.eclipse.jetty.server.Server.handle(
at org.eclipse.jetty.server.HttpChannel.handle(
at org.eclipse.jetty.server.HttpConnection.onFillable(
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(
at org.eclipse.jetty.util.thread.QueuedThreadPool$
Caused by: org.apache.tika.exception.TikaException: Unable to find a parser class: org.apache.tika.parser.journal.JournalParser
at org.apache.tika.config.TikaConfig.parserFromDomElement(
at org.apache.tika.config.TikaConfig.&lt;init&gt;(
at org.apache.tika.config.TikaConfig.&lt;init&gt;(
at org.apache.tika.config.TikaConfig.&lt;init&gt;(
at org.apache.tika.config.TikaConfig.&lt;init&gt;(
at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(
... 30 more
Caused by: java.lang.ClassNotFoundException: org.apache.tika.parser.journal.JournalParser
at java.lang.ClassLoader.loadClass(
at java.lang.ClassLoader.loadClass(
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(
at org.apache.tika.config.ServiceLoader.getServiceClass(
at org.apache.tika.config.TikaConfig.parserFromDomElement(
... 35 more
</str><int name="code">500</int></lst>

On 5/4/16, 10:00 AM, "Shawn Heisey" <<>>

On 5/4/2016 7:15 AM, Betsey Benagh wrote:
(Cross-posted from Stack Overflow)
This feels like a basic, dumb question, but my reading of the documentation has not led me
to an answer.
I'm using Solr to index journal articles. Using the out-of-the-box configuration, it indexed
the text of the documents, but I'm looking to use Grobid to pull out the authors, title, affiliations,
etc. I got Grobid up and running as a service.
I added
<str name="tika.config">/path/to/tika-config.xml</str>
to the requestHandler for /update/extract in solrconfig.xml. The tika-config looks like:
<?xml version="1.0" encoding="UTF-8" standalone="no"?> <properties>
     <parser class="org.apache.tika.parser.journal.JournalParser">
I'm getting a ClassNotFoundException when I try to import a document, but I can't figure out
where to set the classpath to fix it.

I do not know anything about grobid.

We'll need to see the exception -- the entire multi-line stacktrace, including any "caused
by" sections.

In general, you should create a lib directory in the Solr home and place all extra jars in
that directory.  Otherwise you need <lib> elements in solrconfig.xml to load jars, and the
jars will be loaded once for every core that uses that <lib> element.  Jars in
${solr.solr.home}/lib are loaded *once* when Solr starts and are available to all cores.
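
Before restarting Solr, it can help to verify that a jar in that lib directory actually contains the missing class. The following is a minimal sketch under that assumption; LibDirScan and jarsContaining are hypothetical names for illustration, and the default class name is the one from the stack trace in this thread.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

// Hypothetical helper; not part of Solr or Tika.
final class LibDirScan {

    /** Returns the jars directly inside libDir that contain the given class. */
    static List<Path> jarsContaining(Path libDir, String className) throws IOException {
        // A class a.b.C is stored in a jar as the entry a/b/C.class.
        String entry = className.replace('.', '/') + ".class";
        List<Path> hits = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : jars) {
                try (JarFile jarFile = new JarFile(jar.toFile())) {
                    if (jarFile.getJarEntry(entry) != null) {
                        hits.add(jar);
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java LibDirScan <lib-dir> [class-name]");
            return;
        }
        String className = args.length > 1
                ? args[1]
                : "org.apache.tika.parser.journal.JournalParser";
        List<Path> hits = jarsContaining(Paths.get(args[0]), className);
        if (hits.isEmpty()) {
            System.out.println("No jar in " + args[0] + " contains " + className);
        } else {
            for (Path hit : hits) {
                System.out.println(className + " found in " + hit);
            }
        }
    }
}
```

Pass the lib directory as the first argument. An empty result suggests the jars in that directory do not ship the parser at all, which would be consistent with a Tika version that predates it.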

