lucene-solr-user mailing list archives

From Upayavira ...@odoko.co.uk>
Subject Re: Get page number of searchresult of a pdf in solr
Date Sat, 02 Mar 2013 16:46:55 GMT
Can you index every page as a separate doc (they can share a docID
across all pages; the Solr ID field is just docID+pageno), then use
highlighting to get snippets, and use result grouping to group docs
based on their docID? That'll mean you'll have all pages from a single
document grouped together.
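
A minimal sketch of that layout, assuming Python on the client side. The
field names (doc_id, page_no, text) and the helper names are made up for
illustration and would have to match your own schema; group, group.field,
group.limit, hl and hl.fl are the standard Solr grouping and highlighting
request parameters.

```python
from urllib.parse import urlencode


def page_docs(doc_id, pages):
    """Build one Solr document per page; `pages` is a list of page texts."""
    return [
        {
            "id": f"{doc_id}_{page_no}",  # unique Solr ID: docID + page number
            "doc_id": doc_id,             # shared by every page of the same PDF
            "page_no": page_no,           # hypothetical field holding the page number
            "text": text,                 # hypothetical field holding the page text
        }
        for page_no, text in enumerate(pages, start=1)
    ]


def search_params(term):
    """Request parameters: group pages by doc_id and highlight the page text."""
    return {
        "q": f"text:{term}",
        "group": "true",
        "group.field": "doc_id",   # one group per source PDF
        "group.limit": 10,         # up to 10 matching pages per PDF
        "hl": "true",
        "hl.fl": "text",           # snippets come from the page text
        "fl": "id,doc_id,page_no",
        "wt": "json",
    }


docs = page_docs("report42", ["first page text", "second page text"])
query_string = urlencode(search_params("page"))
```

Each group in the response then corresponds to one PDF, and every hit
inside it carries the page number to open in the viewer.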

You could use the page number in a page_* dynamic field, but then you'd
have to query against page_1, page_2, page_3...page_n for every query,
which wouldn't work too well.
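
To illustrate why, here is a small sketch (Python; the helper name is
hypothetical) of the query you would have to generate when each page lives
in its own page_N field. The clause count grows with the page count of the
longest document.

```python
def all_pages_query(term, page_count):
    """Build one OR clause per page_N dynamic field."""
    return " OR ".join(f"page_{n}:{term}" for n in range(1, page_count + 1))


q = all_pages_query("solr", 3)
# q == "page_1:solr OR page_2:solr OR page_3:solr"
```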

Upayavira

On Sat, Mar 2, 2013, at 03:59 PM, Anirudha Jadhav wrote:
> If you increase the granularity of your documents in the index to a
> single page instead of an entire pdf, it becomes an easy problem.
> 
> Your description states that you are not searching for a term in a pdf;
> instead, you are searching for a term in a page from a pdf.
> 
> I assume you load the pdf externally for rendering.
> 
> Not sure why you need the combined doc. Search against the document
> pages, and use faceting on the filenameID to return unique docs matched
> per search.
> 
> 
> 
> 
> On Sat, Mar 2, 2013 at 1:46 AM, Aloke Ghoshal <alghoshal@gmail.com>
> wrote:
> 
> > Hi,
> >
> > We are going about solving this problem by splitting an N-page document
> > into N separate documents (one per page, type=Page) + 1 additional combined
> > document (that has all the pages, type=Combined). All the N+1 documents
> > have the same doc_id.
> >
> > The search is initially performed against the combined document
> > (type=Combined) to identify documents that match. For each search result a
> > second search is performed against the separate pages (type=Page AND
> > doc_id) to identify the pages from within that document that match.
> >
> > Keen to know how others have solved this.
> >
> > Regards,
> > Aloke
> >
> > On Fri, Mar 1, 2013 at 8:51 PM, Dyer, James <James.Dyer@ingramcontent.com>
> > wrote:
> >
> > > Is there an easy (enough) way to do this, storing the page number as a
> > > payload on each term?
> > >
> > > James Dyer
> > > Ingram Content Group
> > > (615) 213-4311
> > >
> > > -----Original Message-----
> > > From: Michael Della Bitta [mailto:michael.della.bitta@appinions.com]
> > > Sent: Thursday, February 28, 2013 3:33 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Get page number of searchresult of a pdf in solr
> > >
> > > My guess is the best way to do this is to index each page separately
> > > and to store a link to the PDF/page in each document.
> > >
> > > That would probably require you to preprocess the PDFs to turn each
> > > one into a single page per PDF, or to extract the text per page
> > > another way.
> > >
> > > Michael Della Bitta
> > >
> > > ------------------------------------------------
> > > Appinions
> > > 18 East 41st Street, 2nd Floor
> > > New York, NY 10017-6271
> > >
> > > www.appinions.com
> > >
> > > Where Influence Isn't a Game
> > >
> > >
> > > On Thu, Feb 28, 2013 at 3:26 PM,  <dev@geschan.de> wrote:
> > > > Hello,
> > > >
> > > > I'm building a web application where users can search for pdf documents
> > > > and view them with pdf.js. I would like to display the search results with
> > > > a short snippet of the paragraph where the search term was found and a
> > > > link to open the document at the right page.
> > > >
> > > > So what I need is the page number and a short text snippet of every
> > > > search result.
> > > >
> > > > I'm using SOLR 4.1 for indexing pdf documents. The indexing itself
> > > > works fine but I don't know how to get the page number and paragraph of a
> > > > search result. I only get the document in which the search term was found.
> > > >
> > > > -Gesh
> > > >
> > >
> > >
> > >
> >
> 
> 
> 
> -- 
> Anirudha P. Jadhav
