lucene-solr-user mailing list archives

From Anirudha Jadhav <aniru...@nyu.edu>
Subject Re: Get page number of searchresult of a pdf in solr
Date Sat, 02 Mar 2013 15:59:39 GMT
If you increase the granularity of the documents in your index to a single
page instead of an entire PDF, it becomes an easy problem.

Your description suggests that you are not searching for a term in a PDF;
you are searching for a term in a page of a PDF.

I assume you load the PDF externally for rendering.

Not sure why you need the combined doc. Search against the page documents,
and facet on the filenameID to return the unique docs matched per search.
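
A minimal sketch of the page-per-document approach, assuming each page is
indexed as its own Solr document. The field names `filename_id`,
`page_number`, and `text`, and the core URL in the usage note, are
hypothetical and not taken from this thread; only the standard Solr query,
facet, and highlighting parameters are used. The search term is not escaped
here, so real input would need query escaping first.

```python
from urllib.parse import urlencode


def build_page_search_params(term, file_field="filename_id"):
    """Build Solr query parameters for a search over per-page documents.

    Field names are assumptions for illustration; adjust to your schema.
    """
    return {
        "q": f"text:{term}",                  # search the page text
        "fl": f"{file_field},page_number",    # return file id and page number
        "facet": "true",
        "facet.field": file_field,            # one facet bucket per source PDF
        "facet.mincount": "1",                # hide PDFs with no matching page
        "hl": "true",
        "hl.fl": "text",                      # snippet from the matching page
        "wt": "json",
    }


def build_select_url(base_url, term):
    """Append the encoded parameters to a (hypothetical) /select endpoint."""
    return base_url + "?" + urlencode(build_page_search_params(term))


def pages_by_file(docs, file_field="filename_id"):
    """Group matched page documents into {file id: sorted page numbers}."""
    grouped = {}
    for doc in docs:
        grouped.setdefault(doc[file_field], []).append(doc["page_number"])
    return {name: sorted(pages) for name, pages in grouped.items()}
```

For example, `build_select_url("http://localhost:8983/solr/collection1/select", "solr")`
yields a URL whose facet counts give the number of matching pages per PDF,
and `pages_by_file` applied to the returned docs gives the pages to link to.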




On Sat, Mar 2, 2013 at 1:46 AM, Aloke Ghoshal <alghoshal@gmail.com> wrote:

> Hi,
>
> We are going about solving this problem by splitting an N-page document
> into N separate documents (one per page, type=Page) + 1 additional combined
> document (that has all the pages, type=Combined). All the N+1 documents
> have the same doc_id.
>
> The search is initially performed against the combined document
> (type=Combined) to identify documents that match. For each search result a
> second search is performed against the separate pages (type=Page AND
> doc_id) to identify the pages from within that document that match.
>
> Keen to know how others have solved this.
>
> Regards,
> Aloke
>
> On Fri, Mar 1, 2013 at 8:51 PM, Dyer, James <James.Dyer@ingramcontent.com> wrote:
>
> > Is there an easy (enough) way to do this, storing the page number as a
> > payload on each term?
> >
> > James Dyer
> > Ingram Content Group
> > (615) 213-4311
> >
> > -----Original Message-----
> > From: Michael Della Bitta [mailto:michael.della.bitta@appinions.com]
> > Sent: Thursday, February 28, 2013 3:33 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Get page number of searchresult of a pdf in solr
> >
> > My guess is the best way to do this is to index each page separately
> > and to store a link to the PDF/page in each document.
> >
> > That would probably require you to preprocess the PDFs to turn each
> > one into a single page per PDF, or to extract the text per page
> > another way.
> >
> > Michael Della Bitta
> >
> > ------------------------------------------------
> > Appinions
> > 18 East 41st Street, 2nd Floor
> > New York, NY 10017-6271
> >
> > www.appinions.com
> >
> > Where Influence Isn't a Game
> >
> >
> > On Thu, Feb 28, 2013 at 3:26 PM,  <dev@geschan.de> wrote:
> > > Hello,
> > >
> > > I'm building a web application where users can search for PDF documents
> > > and view them with pdf.js. I would like to display the search results
> > > with a short snippet of the paragraph where the search term was found
> > > and a link to open the document at the right page.
> > >
> > > So what I need is the page number and a short text snippet of every
> > > search result.
> > >
> > > I'm using Solr 4.1 for indexing PDF documents. The indexing itself works
> > > fine, but I don't know how to get the page number and paragraph of a
> > > search result. I only get the document in which the search term was found.
> > >
> > > -Gesh
> > >
> >
> >
> >
>



-- 
Anirudha P. Jadhav
