lucene-solr-user mailing list archives

From "Ma, Xiaohui (NIH/NLM/LHC) [C]" <xiao...@mail.nlm.nih.gov>
Subject RE: how to deal with virtual collection in solr?
Date Thu, 26 Aug 2010 18:27:22 GMT
Thanks so much for your help, Jan Høydahl!

I created multiple cores (aa public, aa private, bb public and bb private), and I know how to query
them individually. Please tell me if I can already query them in combination through the shards
parameter. I tried appending &shards=aapub,bbpub to the query string, but unfortunately it didn't
work.

Actually, all of the content is the same kind of data, but I don't have a "collection" field in my XML
files. Please tell me how I can add a "collection" field to the schema and select a collection through
a filter.

I use Solr 1.4.1 and index the PDF files with curl. I got the following error when indexing PDF
files of version 1.5 and 1.6.

*************************************
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 </title>
</head>
<body><h2>HTTP ERROR: 500</h2><pre>org.apache.tika.exception.TikaException:
Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@134ae32

org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected
RuntimeException from org.apache.tika.parser.pdf.PDFParser@134ae32
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
        at org.mortbay.jetty.Server.handle(Server.java:285)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@134ae32
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
        ... 22 more
Caused by: java.lang.NullPointerException
        at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
        at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
        at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
        ... 24 more
</pre>
<p>RequestURI=/solr/lhcpdf/update/extract</p><p><i><small><a
href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p><br/>
                                               
<br/>                          
***************************************


-----Original Message-----
From: Jan Høydahl / Cominvent [mailto:jan.asf@cominvent.com] 
Sent: Wednesday, August 25, 2010 4:34 PM
To: solr-user@lucene.apache.org
Subject: Re: how to deal with virtual collection in solr? 

> 1. Currently we use Verity and have more than 20 collections; each collection has an index
> for public items and an index for private items. So there are virtual collections which point
> to each collection and a virtual collection which points to all. For example, we have AA and
> BB collections.
> 
> AA virtual collection --> (AA index for public items and AA index for private items).
> BB virtual collection --> (BB index for public items and BB index for private items).
> All virtual collection --> (AA index for public items and AA index for private items,
> BB index for public items and BB index for private items).
> 
> Would you please tell me what I should do for this if I use Solr?

There are multiple ways to solve this, depending on the nature of your collections. If they
have somewhat different schemas, a natural choice would be to make multiple cores: AA-private,
AA-public, BB-private, BB-public. Now you can query them individually or in combinations through
the shards parameter. Starting with the next Solr version, you can use virtual collections in the
shards parameter, e.g. &shards=AA,BB etc. (See http://wiki.apache.org/solr/SolrCloud#Distributed_Requests)
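
For illustration, here is a minimal Python sketch of how such a combined query URL could be assembled. The host, port and core names (localhost:8983, aapub, bbpub) are assumptions, not taken from your setup. As far as I know, Solr 1.4 distributed search expects each shard as host:port/path without the http:// scheme, so bare core names would not be enough.

```python
from urllib.parse import urlencode

def distributed_query_url(core_url, shard_addresses, query):
    """Build a /select URL that fans one query out to several cores."""
    params = {
        "q": query,
        # Each shard is given as host:port/path (no "http://" prefix),
        # joined with commas into a single parameter value.
        "shards": ",".join(shard_addresses),
    }
    return core_url.rstrip("/") + "/select?" + urlencode(params)

# Hypothetical host and core names, for illustration only.
url = distributed_query_url(
    "http://localhost:8983/solr/aapub",
    ["localhost:8983/solr/aapub", "localhost:8983/solr/bbpub"],
    "heart",
)
print(url)
```

The request can be sent to any one of the cores; that core forwards the query to every address listed in shards and merges the results.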

If all your content is (roughly) the same kind of data, you could also solve your virtual
collection issue through a "collection" field in your schema, and simply select collection
through filters: &fq=collection:AA. You could even write a Search Component which translates
a &collection= parameter in the request into the correct filters if you want to hide this
implementation from the front ends.
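
A rough sketch of that translation step, done on the client side in Python rather than in a Search Component. The mapping table and the field values (AA, BB) are hypothetical; it only assumes that every document carries a "collection" field.

```python
from urllib.parse import urlencode

# Hypothetical mapping from a virtual collection name to the values
# stored in the "collection" field of the documents.
VIRTUAL_COLLECTIONS = {
    "AA": ["AA"],
    "BB": ["BB"],
    "ALL": ["AA", "BB"],
}

def collection_filter(virtual_name):
    """Return the fq value that restricts a search to one virtual collection."""
    members = VIRTUAL_COLLECTIONS[virtual_name]
    return "collection:(" + " OR ".join(members) + ")"

def select_params(query, virtual_name):
    """Return the URL-encoded query string for a filtered search."""
    return urlencode({"q": query, "fq": collection_filter(virtual_name)})

print(collection_filter("ALL"))
print(select_params("heart", "AA"))
```

Keeping the mapping in one place means the front ends only ever pass a virtual collection name and never need to know how the filter is built.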

> 2. Our project has files in different formats that I need to index, for example XML files,
> PDF files and text files. Is it possible for Solr to return a search result from all of them?

Sure. PDF and text files can be indexed through the ExtractingRequestHandler. XML can be indexed
through the XmlUpdateRequestHandler or the DataImportHandler. Solr uses Apache Tika internally to
extract text from PDFs and other rich document formats.
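
As a sketch of what an ExtractingRequestHandler request could look like, the Python snippet below builds the /update/extract URL and uses literal.<fieldname> parameters to set field values that are not in the file itself. The host, the document id and the "collection" field are assumptions; the field would have to be defined in schema.xml first.

```python
from urllib.parse import urlencode

def extract_url(core_url, doc_id, collection):
    """Build an /update/extract URL that tags the uploaded file with literal fields."""
    params = {
        "literal.id": doc_id,              # unique key of the new document
        "literal.collection": collection,  # assumed "collection" field in the schema
        "commit": "true",                  # make the document searchable right away
    }
    return core_url.rstrip("/") + "/update/extract?" + urlencode(params)

# Hypothetical host and document id, for illustration only.
url = extract_url("http://localhost:8983/solr/lhcpdf", "doc1", "AA")
print(url)
```

The file itself would then be posted to this URL as multipart form data, for example with curl's -F option.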

> 
> 3. I got an error when I index PDF files which are version 1.5 or 1.6. Would you please
> tell me if there is a patch to fix it?

How did you try to index these PDFs? What version of Solr are you using? Exactly what error
message did you get?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

