lucene-solr-user mailing list archives

From Jan Høydahl / Cominvent <jan....@cominvent.com>
Subject Re: how to deal with virtual collection in solr?
Date Fri, 27 Aug 2010 11:41:36 GMT
Hi,

Solr 1.4.1 does not support SolrCloud-style sharding. In 1.4.1, list each shard explicitly as host:port/path instead:
&shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/
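
To illustrate how such a request could be assembled, here is a minimal Python sketch. The helper name sharded_select_url, the host value and the query term are assumptions for illustration only; the core names are the ones from this thread. It only builds the URL string and does not contact Solr.

```python
from urllib.parse import urlencode


def sharded_select_url(host, entry_core, shard_cores, query):
    """Build a distributed-search URL in the explicit 1.4.1 style.

    Every shard is written as host/solr/<core>, comma-separated, and the
    request itself is sent to one core (entry_core) on the same host.
    """
    shards = ",".join(f"{host}/solr/{core}" for core in shard_cores)
    # Keep ':', '/' and ',' unescaped so the shards value stays readable.
    params = urlencode({"q": query, "shards": shards}, safe=":/,")
    return f"http://{host}/solr/{entry_core}/select?{params}"


url = sharded_select_url(
    "localhost:8983", "aaprivate", ["aaprivate", "aapublic"], "solr"
)
print(url)
```

The point of the sketch is that each shards entry carries the full host:port/path of a core; a bare core name such as aapub is not enough in 1.4.1.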


However, since the schema is the same, I'd opt for one index with a "collection" field used as
the filter.

You can add that field to your schema, and then inject it as metadata on the ExtractingRequestHandler
call:

curl "http://localhost:8983/solr/update/extract?literal.collection=aaprivate&literal.id=doc1&commit=true" \
  -F "file=@myfile.pdf"
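
Once documents carry a "collection" value, a virtual collection is just a filter query over one or more of those values. Here is a minimal Python sketch of building such a query URL. The helper name collection_filter_url, the select URL and the query term are assumptions for illustration; it also assumes the collection values contain no spaces or special query characters. It only builds the URL string and does not contact Solr.

```python
from urllib.parse import urlencode


def collection_filter_url(select_url, query, collections):
    """Build a select URL restricted to the given collection values via fq."""
    if len(collections) == 1:
        fq = f"collection:{collections[0]}"
    else:
        # Several values are OR-ed together inside one filter query.
        fq = "collection:(" + " OR ".join(collections) + ")"
    return select_url + "?" + urlencode({"q": query, "fq": fq})


# A hypothetical "AA" virtual collection covering both AA values.
aa_url = collection_filter_url(
    "http://localhost:8983/solr/select", "pdf", ["aaprivate", "aapublic"]
)
print(aa_url)
```

An "all" virtual collection then needs no filter at all, since every document lives in the same index.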

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 26. aug. 2010, at 20.41, Ma, Xiaohui (NIH/NLM/LHC) [C] wrote:

> Thanks so much for your help! I will try it.
> 
> 
> -----Original Message-----
> From: Thomas Joiner [mailto:thomas.b.joiner@gmail.com] 
> Sent: Thursday, August 26, 2010 2:36 PM
> To: solr-user@lucene.apache.org
> Subject: Re: how to deal with virtual collection in solr?
> 
> I don't know about the shards, etc.
> 
> However I recently encountered that exception while indexing pdfs as well.
> The way that I resolved it was to upgrade to a nightly build of Solr. (You
> can find them at https://hudson.apache.org/hudson/view/Solr/job/Solr-trunk/).
> 
> The problem is that the version of Tika that 1.4.1 uses is a very old
> version of Tika, which uses an old version of PDFBox to do its parsing.  (You
> might be able to fix the problem just by replacing the Tika jars...however I
> don't know if there have been any API changes so I can't really suggest
> that.)
> 
> We didn't upgrade to trunk for that functionality, but it was nice
> that it started working. (The PDFs we'll be indexing won't be of later
> versions, but a test file was).
> 
> On Thu, Aug 26, 2010 at 1:27 PM, Ma, Xiaohui (NIH/NLM/LHC) [C] <
> xiaohui@mail.nlm.nih.gov> wrote:
> 
>> Thanks so much for your help, Jan Høydahl!
>> 
>> I made multiple cores (aa public, aa private, bb public and bb private). I
>> know how to query them individually. Please tell me if I can query them in
>> combination through the shards parameter now. I tried appending
>> &shards=aapub,bbpub to the query string, but unfortunately it didn't work.
>> 
>> Actually all of the content is the same. I don't have a "collection" field in
>> the xml files. Please tell me how I can set a "collection" field in the schema
>> and simply select a collection through a filter.
>> 
>> I used curl to index pdf files. I use Solr 1.4.1. I got the following error
>> when I indexed PDFs of version 1.5 and 1.6.
>> 
>> *************************************
>> <html>
>> <head>
>> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
>> <title>Error 500 </title>
>> </head>
>> <body><h2>HTTP ERROR: 500</h2><pre>org.apache.tika.exception.TikaException:
>> Unexpected RuntimeException from
>> org.apache.tika.parser.pdf.PDFParser@134ae32
>> 
>> org.apache.solr.common.SolrException:
>> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
>> org.apache.tika.parser.pdf.PDFParser@134ae32
>>       at
>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
>>       at
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>>       at
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>>       at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>>       at
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>>       at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>>       at
>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>>       at
>> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>>       at
>> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>>       at
>> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>>       at
>> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>>       at
>> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>>       at
>> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>>       at
>> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>>       at
>> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>>       at org.mortbay.jetty.Server.handle(Server.java:285)
>>       at
>> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>>       at
>> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
>>       at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
>>       at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
>>       at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>>       at
>> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>>       at
>> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
>> Caused by: org.apache.tika.exception.TikaException: Unexpected
>> RuntimeException from org.apache.tika.parser.pdf.PDFParser@134ae32
>>       at
>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
>>       at
>> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
>>       at
>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
>>       ... 22 more
>> Caused by: java.lang.NullPointerException
>>       at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
>>       at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
>>       at
>> org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
>>       at
>> org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>>       at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
>>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
>>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
>>       at
>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
>>       ... 24 more
>> </pre>
>> <p>RequestURI=/solr/lhcpdf/update/extract</p><p><i><small><a
>> href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p><br/>
>> <br/>
>> ***************************************
>> 
>> 
>> -----Original Message-----
>> From: Jan Høydahl / Cominvent [mailto:jan.asf@cominvent.com]
>> Sent: Wednesday, August 25, 2010 4:34 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: how to deal with virtual collection in solr?
>> 
>>> 1. Currently we use Verity and have more than 20 collections; each
>>> collection has an index for public items and an index for private items. So
>>> there are virtual collections which point to each collection and a virtual
>>> collection which points to all. For example, we have AA and BB collections.
>>> 
>>> AA virtual collection --> (AA index for public items and AA index for
>>> private items).
>>> BB virtual collection --> (BB index for public items and BB index for
>>> private items).
>>> All virtual collection --> (AA index for public items and AA index for
>>> private items, BB index for public items and BB index for private items).
>>> 
>>> Would you please tell me what I should do for this if I use Solr?
>> 
>> There are multiple ways to solve this, depending on the nature of your
>> collections. If they have somewhat different schemas, a natural choice would
>> be to make multiple cores: AA-private, AA-public, BB-private, BB-public. Now
>> you can query them individually or in combinations through the shards
>> parameter. From the next Solr version you can use virtual collections for the
>> shards parameter, e.g. &shards=AA,BB etc. (See
>> http://wiki.apache.org/solr/SolrCloud#Distributed_Requests)
>> 
>> If all your content is (roughly) the same kind of data, you could also
>> solve your virtual collection issue through a "collection" field in your
>> schema, and simply select collection through filters: &fq=collection:AA. You
>> could even write a Search Component which translates a &collection=
>> parameter in the request into the correct filters if you want to hide this
>> implementation from the front ends.
>> 
>>> 2. Our project has files in different formats that I need to index, for
>>> example xml files, pdf files and text files. Is it possible for Solr to
>>> return a search result from all?
>> 
>> Sure. PDF and text files can be indexed through the
>> ExtractingRequestHandler. XML can be indexed through the XmlUpdateRequestHandler or
>> DataImportHandler. Solr uses Apache Tika internally to extract text from
>> PDFs and other rich document formats.
>> 
>>> 
>>> 3. I got an error when I index pdf files which are version 1.5 or 1.6.
>>> Would you please tell me if there is a patch to fix it?
>> 
>> How did you try to index these PDFs? What version of Solr are you using?
>> Exactly what error message did you get?
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Training in Europe - www.solrtraining.com
>> 
>> 

