lucene-solr-user mailing list archives

From d..@geschan.de
Subject Re: Get page number of searchresult of a pdf in solr
Date Fri, 01 Mar 2013 10:12:40 GMT
Is it possible to write a plugin that converts each page
separately with Tika and saves all pages in one document (maybe in a
dynamic field like "page_*")? I would like to have only one document
stored in SOLR for each PDF (it fits better with the way my web
application manages these documents, and I would like to use the
same id as the unique identifier).
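
A minimal sketch of the single-document idea described above, not a working plugin. It assumes the text of each page has already been extracted separately (for example with Tika), that the schema defines a stored dynamic field "page_*", and that the query requests highlighting on those fields (hl=true with hl.fl=page_*; whether the glob is accepted should be checked against the Solr version in use). The helper names are hypothetical.

```python
import re


def build_paged_document(doc_id, page_texts):
    """Return one Solr-style document dict with a page_<n> field per page."""
    doc = {"id": doc_id}
    for number, text in enumerate(page_texts, start=1):
        doc[f"page_{number}"] = text
    return doc


def pages_from_highlighting(highlighting, doc_id):
    """Map the highlighted page_<n> fields of one document to
    {page_number: [snippets]}, ignoring all other fields."""
    hits = {}
    for field, snippets in highlighting.get(doc_id, {}).items():
        match = re.fullmatch(r"page_(\d+)", field)
        if match:
            hits[int(match.group(1))] = snippets
    return dict(sorted(hits.items()))
```

With this layout the page number is not stored anywhere explicitly. It is read back from the name of the field that produced a highlight snippet, so the PDF keeps a single document and a single id.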


To be honest, I can't understand why SOLR is not able to report the
pages on which the search term was found. It's quite a common task, in
my opinion.

-Gesh

Quoting Michael Della Bitta <michael.della.bitta@appinions.com>:

> My guess is the best way to do this is to index each page separately
> and to store a link to the PDF/page in each document.
>
> That would probably require you to preprocess the PDFs to turn each
> one into a single page per PDF, or to extract the text per page
> another way.
>
> Michael Della Bitta
>
> ------------------------------------------------
> Appinions
> 18 East 41st Street, 2nd Floor
> New York, NY 10017-6271
>
> www.appinions.com
>
> Where Influence Isn't a Game
>
>
> On Thu, Feb 28, 2013 at 3:26 PM,  <dev@geschan.de> wrote:
>> Hello,
>>
>> I'm building a web application where users can search for pdf documents and
>> view them with pdf.js. I would like to display the search results with a
>> short snippet of the paragraph where the search term was found and a link
>> to open the document at the right page.
>>
>> So what I need is the page number and a short text snippet of every search
>> result.
>>
>> I'm using SOLR 4.1 for indexing pdf documents. The indexing itself works
>> fine but I don't know how to get the page number and paragraph of a search
>> result. I only get the document in which the search term was found.
>>
>> -Gesh
>>
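
For comparison, the per-page indexing suggested in the quoted reply could be sketched as follows. This is only an illustration: it assumes the page texts are already available as a list, and the field names ("pdf_id", "page", "link", "text"), the id scheme, and the "#page=N" link fragment for the viewer are assumptions rather than anything prescribed in the thread.

```python
def build_page_documents(pdf_id, pdf_url, page_texts):
    """Return one Solr-style document dict per page of a PDF.

    Each document carries the id of the parent PDF, its page number,
    and a link that is meant to open the PDF at that page.
    """
    docs = []
    for number, text in enumerate(page_texts, start=1):
        docs.append({
            "id": f"{pdf_id}_p{number}",         # unique per page
            "pdf_id": pdf_id,                    # shared by all pages of the PDF
            "page": number,
            "link": f"{pdf_url}#page={number}",  # assumed viewer link format
            "text": text,
        })
    return docs
```

Here every hit already knows its page number, at the cost of storing several documents per PDF; results can be grouped on the shared "pdf_id" field if one entry per PDF is wanted.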



