lucene-solr-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Not storing, but highlighting from document sentences
Date Tue, 18 Jan 2011 07:25:12 GMT
Hi Tarjei,

:)
Yeah, that is the solution we are going with, actually.


Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Tarjei Huse <tarjei@scanmine.com>
> To: solr-user@lucene.apache.org
> Sent: Tue, January 18, 2011 1:33:44 AM
> Subject: Re: Not storing, but highlighting from document sentences
> 
> On 01/12/2011 12:02 PM, Otis Gospodnetic wrote:
> > Hello,
> >
> > I'm indexing some content (articles) whose text I cannot store in its
> > original form for copyright reasons.  So I can index the content, but
> > cannot store it.  However, I need snippets and search term highlighting.
> >
> > Any way to accomplish this elegantly?  Or even not so elegantly?
> >
> > Here is one idea:
> >
> > * Create 2 indices: main index for indexing (but not storing) the original
> > content, the secondary index for storing individual sentences from the
> > original article.
> How about storing the sentences in the same index in a separate field
> but with random ordering, would that be ok?
> 
> Tarjei
> > * That is, before indexing an article, split it into sentences.  Then index
> > the article in the main index, and index+store each sentence in the
> > secondary index.  So for each doc in the main index there will be multiple
> > docs in the secondary index with individual sentences.  Each sentence doc
> > includes an ID of the "parent" document.
> >
> > * Then run queries against the main index, and pull individual sentences
> > from the secondary index for snippet+highlight purposes.
> >
> >
> > The problem I see with this approach (and there may be other ones that I
> > am not seeing yet) is with queries like foo AND bar.  In this case "foo"
> > may be a match from sentence #1, and "bar" may be a match from sentence #7.
> > Or maybe "foo" is a match in sentence #1, and "bar" is a match in multiple
> > sentences: #7 and #10 and #23.
> >
> > Regardless, when a query is run against the main index, you don't know
> > where the match was, so you don't know which sentences to go get from the
> > secondary index.
> >
> > Does anyone have any suggestions for how to handle this?
> >
> > Thanks,
> > Otis
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> >
> 
> 
> -- 
> Regards / Med vennlig hilsen
> Tarjei Huse
> Mobil: 920 63 413
> 
> 
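
A minimal sketch of the same-index variant discussed in this thread, in plain
Python rather than against any Solr API. It assumes a naive regex sentence
splitter and hypothetical field names ("body" for the text that would be
indexed but not stored, "sentences" for the shuffled values that would be
stored). Snippets are picked by matching the query terms against the stored
sentences directly, so for a query like foo AND bar each term can come from a
different sentence.

```python
import random
import re

# Naive splitter: break on whitespace that follows ., ! or ? (an assumption;
# real articles would need a proper sentence tokenizer).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into trimmed, non-empty sentences."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def build_doc(doc_id, text, seed=None):
    """Build one document: full text for indexing, shuffled sentences for storing.

    The field names are hypothetical. In a real schema "body" would be
    indexed but not stored, and "sentences" would be a stored multi-valued
    field whose order no longer reflects the original article.
    """
    sentences = split_sentences(text)
    random.Random(seed).shuffle(sentences)
    return {"id": doc_id, "body": text, "sentences": sentences}


def snippets(sentences, terms, max_snippets=3):
    """Return up to max_snippets sentences containing query terms, highlighted.

    Sentences matching more distinct terms come first.
    """
    if not terms:
        return []
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    scored = []
    for sentence in sentences:
        hits = {m.group(0).lower() for m in pattern.finditer(sentence)}
        if hits:
            scored.append((len(hits), sentence))
    # sorted() is stable, so ties keep their (shuffled) stored order.
    scored.sort(key=lambda pair: -pair[0])
    return [
        pattern.sub(lambda m: "<em>" + m.group(0) + "</em>", sentence)
        for _, sentence in scored[:max_snippets]
    ]
```

Because the sentences travel with the matching document, there is no second
lookup to decide which sentences to fetch: the caller already has the hit and
only has to filter its stored sentences by the query terms.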
