lucene-solr-user mailing list archives

From Betsey Benagh <betsey.ben...@stresearch.com>
Subject Re: Integrating grobid with Tika in solr
Date Wed, 04 May 2016 14:06:53 GMT
Grobid runs as a service, and I’m (theoretically) configuring Tika to call it.

From the Grobid wiki, here are the instructions for integrating with the Tika application:

First we need to create the GrobidExtractor.properties file that points to the Grobid REST
Service. My file looks like the following:

grobid.server.url=http://localhost:[port]

Now you can run GROBID via Tika-app with the following command on a sample PDF file.

java -classpath $HOME/src/grobidparser-resources/:tika-app-1.11-SNAPSHOT.jar org.apache.tika.cli.TikaCLI
--config=$HOME/src/grobidparser-resources/tika-config.xml -J $HOME/src/grobid/papers/ICSE06.pdf
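As a rough sanity check on the properties file described above, the following sketch loads the text of a GrobidExtractor.properties file and confirms that grobid.server.url parses as a host-and-port URL. The class name GrobidPropertiesCheck and the port 8080 are assumptions for illustration only; the wiki excerpt leaves the port as [port], and this is not how Tika itself reads the file.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Properties;

// Hypothetical helper, not part of Tika or Grobid.
public class GrobidPropertiesCheck {

    // Reads grobid.server.url from properties text and parses it as a URI.
    static URI readServerUrl(Reader source) throws IOException, URISyntaxException {
        Properties props = new Properties();
        props.load(source);
        String url = props.getProperty("grobid.server.url");
        if (url == null || url.trim().isEmpty()) {
            throw new IllegalArgumentException("grobid.server.url is not set");
        }
        URI uri = new URI(url.trim());
        // Require an explicit host and port, matching the http://localhost:[port] shape.
        if (uri.getHost() == null || uri.getPort() == -1) {
            throw new IllegalArgumentException("expected http://host:port, got: " + url);
        }
        return uri;
    }

    public static void main(String[] args) throws Exception {
        // Port 8080 is a placeholder, not a value taken from this thread.
        URI uri = readServerUrl(new StringReader("grobid.server.url=http://localhost:8080\n"));
        System.out.println(uri.getHost() + ":" + uri.getPort());
    }
}
```

A file that still contains the literal [port] placeholder should be rejected by this check, which is one quick way to catch an unedited template.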

Here’s the stack trace.

<lst name="error"><lst name="metadata"><str name="error-class">org.apache.solr.common.SolrException</str><str
name="root-error-class">java.lang.ClassNotFoundException</str></lst><str
name="msg">org.apache.tika.exception.TikaException: Unable to find a parser class: org.apache.tika.parser.journal.JournalParser</str><str
name="trace">org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException:
Unable to find a parser class: org.apache.tika.parser.journal.JournalParser
at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:82)
at org.apache.solr.core.PluginBag$LazyPluginHolder.createInst(PluginBag.java:367)
at org.apache.solr.core.PluginBag$LazyPluginHolder.get(PluginBag.java:348)
at org.apache.solr.core.PluginBag.get(PluginBag.java:148)
at org.apache.solr.handler.RequestHandlerBase.getRequestHandler(RequestHandlerBase.java:231)
at org.apache.solr.core.SolrCore.getRequestHandler(SolrCore.java:1362)
at org.apache.solr.servlet.HttpSolrCall.extractHandlerFromURLPath(HttpSolrCall.java:326)
at org.apache.solr.servlet.HttpSolrCall.init(HttpSolrCall.java:296)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:412)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:225)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:183)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:499)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.tika.exception.TikaException: Unable to find a parser class: org.apache.tika.parser.journal.JournalParser
at org.apache.tika.config.TikaConfig.parserFromDomElement(TikaConfig.java:362)
at org.apache.tika.config.TikaConfig.&lt;init&gt;(TikaConfig.java:127)
at org.apache.tika.config.TikaConfig.&lt;init&gt;(TikaConfig.java:115)
at org.apache.tika.config.TikaConfig.&lt;init&gt;(TikaConfig.java:111)
at org.apache.tika.config.TikaConfig.&lt;init&gt;(TikaConfig.java:92)
at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:80)
... 30 more
Caused by: java.lang.ClassNotFoundException: org.apache.tika.parser.journal.JournalParser
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:814)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.tika.config.ServiceLoader.getServiceClass(ServiceLoader.java:189)
at org.apache.tika.config.TikaConfig.parserFromDomElement(TikaConfig.java:338)
... 35 more
</str><int name="code">500</int></lst>
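The root cause in the trace above is a ClassNotFoundException raised from Class.forName. A minimal sketch of the same kind of lookup, assuming a standalone JVM rather than Solr's own class loader, is shown here. The class name ParserClassCheck is hypothetical, and the result only reflects the classpath this program is started with, not what Solr sees.

```java
// Hypothetical helper for checking whether a class is visible on a given classpath.
public class ParserClassCheck {

    // Returns true if the named class can be located by this class's loader.
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers.
            Class.forName(className, false, ParserClassCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the parser class named in the stack trace.
        String name = args.length > 0
                ? args[0]
                : "org.apache.tika.parser.journal.JournalParser";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

Running it with a candidate set of jars on -classpath gives a quick indication of whether those jars provide the parser class at all.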



On 5/4/16, 10:00 AM, "Shawn Heisey" <apache@elyograg.org> wrote:

On 5/4/2016 7:15 AM, Betsey Benagh wrote:
(X-posted from stack overflow)
This feels like a basic, dumb question, but my reading of the documentation has not led me
to an answer.
I'm using Solr to index journal articles. Using the out-of-the-box configuration, it indexed
the text of the documents, but I'm looking to use Grobid to pull out the authors, title, affiliations,
etc. I got Grobid up and running as a service.
I added
<str name="tika.config">/path/to/tika-config.xml</str>
to the requestHandler for /update/extract in solrconfig.xml
The tika-config looks like:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
   <parsers>
     <parser class="org.apache.tika.parser.journal.JournalParser">
       <mime>application/pdf</mime>
     </parser>
   </parsers>
</properties>
I'm getting a ClassNotFoundException when I try to import a document, but can't figure out
where to set the classpath to fix it.

I do not know anything about grobid.

We'll need to see the exception -- the entire multi-line stacktrace,
including any "caused by" sections.

In general, you should create a lib directory in the solr home and place
all extra jars in that directory.  Otherwise you need <lib> elements in
solrconfig.xml to load jars -- and they will be loaded once for every
core that uses that <lib> element.  ${solr.solr.home}/lib loads jars
*once* when Solr starts and makes them available to all cores.

Thanks,
Shawn
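
Before copying jars into a lib directory as described above, it can help to confirm which jar, if any, actually contains the missing class. The sketch below scans the *.jar files directly inside one directory for a given class entry. The class name JarClassFinder is hypothetical, the directory is whatever path you pass in, and it does not look inside nested jars or subdirectories.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

// Hypothetical helper, not part of Solr or Tika.
public class JarClassFinder {

    // Returns the jars directly inside libDir that contain the given class.
    static List<Path> findJarsContaining(Path libDir, String className) throws IOException {
        // A class a.b.C is stored in a jar as the entry a/b/C.class.
        String entryName = className.replace('.', '/') + ".class";
        List<Path> hits = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : jars) {
                try (JarFile jarFile = new JarFile(jar.toFile())) {
                    if (jarFile.getEntry(entryName) != null) {
                        hits.add(jar);
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java JarClassFinder <lib-dir> [class-name]");
            return;
        }
        String className = args.length > 1
                ? args[1]
                : "org.apache.tika.parser.journal.JournalParser";
        List<Path> hits = findJarsContaining(Paths.get(args[0]), className);
        System.out.println(hits.isEmpty() ? "not found in any jar" : "found in: " + hits);
    }
}
```

If the scan reports no match, none of the jars in that directory provide the class, and a different or additional jar is needed.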


