lucene-solr-user mailing list archives

From Tarjei Huse <tar...@scanmine.com>
Subject Re: Not storing, but highlighting from document sentences
Date Tue, 18 Jan 2011 06:33:44 GMT
On 01/12/2011 12:02 PM, Otis Gospodnetic wrote:
> Hello,
>
> I'm indexing some content (articles) whose text I cannot store in its original
> form for copyright reasons.  So I can index the content, but cannot store it.
> However, I need snippets and search term highlighting.
>
>
> Any way to accomplish this elegantly?  Or even not so elegantly?
>
> Here is one idea:
>
> * Create 2 indices: main index for indexing (but not storing) the original 
> content, the secondary index for storing individual sentences from the original 
> article.
How about storing the sentences in the same index, in a separate field
but in random order? Would that be OK?
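
A minimal sketch of that idea, in plain Python rather than actual
Lucene/Solr code. The field names `body` and `sentences`, the helper
names, and the naive sentence splitter are all hypothetical; a real setup
would declare `body` as indexed-only and `sentences` as a stored,
multi-valued field in the schema.

```python
import random
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def build_document(doc_id, text, rng=None):
    """Build one document carrying the full text plus shuffled sentences.

    Assumption: the indexer treats "body" as indexed but NOT stored, and
    "sentences" as indexed AND stored (multi-valued).
    """
    rng = rng or random.Random()
    sentences = split_sentences(text)
    rng.shuffle(sentences)  # discard the original sentence order
    return {
        "id": doc_id,
        "body": text,            # searchable only, never returned
        "sentences": sentences,  # returned for snippets and highlighting
    }
```

Since the sentences live on the same document as the searchable text,
no second lookup by parent ID is needed to fetch them.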

Tarjei
> * That is, before indexing an article, split it into sentences.  Then index the 
> article in the main index, and index+store each sentence in the secondary 
> index.  So for each doc in the main index there will be multiple docs in the 
> secondary index with individual sentences.  Each sentence doc includes an ID of 
> the "parent" document.
>
> * Then run queries against the main index, and pull individual sentences from 
> the secondary index for snippet+highlight purposes.
>
>
> The problem I see with this approach (and there may be other ones that I am not 
> seeing yet) is with queries like foo AND bar.  In this case "foo" may be a match 
> from sentence #1, and "bar" may be a match from sentence #7.  Or maybe "foo" is 
> a match in sentence #1, and "bar" is a match in multiple sentences: #7 and #10 
> and #23.
>
> Regardless, when a query is run against the main index, you don't know where the 
> match was, so you don't know which sentences to go get from the secondary index.
>
> Does anyone have any suggestions for how to handle this?
>
> Thanks,
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
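
For the foo AND bar case, one possible way to pick sentences once the
main query has matched a document is to look at each query term
separately and take a stored sentence for every term that is not already
covered. The sketch below is plain Python under stated assumptions, not
Solr's highlighter: the function names are hypothetical, and it matches
raw whole words only (no stemming or other analysis).

```python
import re


def sentences_for_terms(sentences, terms):
    """Map each query term to the indexes of the sentences containing it."""
    hits = {}
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        hits[term] = [i for i, s in enumerate(sentences) if pattern.search(s)]
    return hits


def highlight(sentence, terms):
    """Wrap every whole-word occurrence of any term in <em>...</em>."""
    alternatives = "|".join(re.escape(t) for t in terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(r"<em>\1</em>", sentence)


def snippets(sentences, terms, max_snippets=3):
    """Choose sentences so that each term shows up in at least one snippet."""
    hits = sentences_for_terms(sentences, terms)
    chosen = []
    for term in terms:
        # Skip terms already covered by a sentence picked for another term.
        if any(i in chosen for i in hits[term]):
            continue
        if hits[term]:
            chosen.append(hits[term][0])
    return [highlight(sentences[i], terms) for i in chosen[:max_snippets]]


stored = ["Foo came first.", "Nothing here.", "Then bar arrived.", "Bar again."]
print(snippets(stored, ["foo", "bar"]))
```

Because the selection only looks at the sentences of a document that has
already matched, the order in which they were stored does not matter.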


-- 
Regards / Med vennlig hilsen
Tarjei Huse
Mobil: 920 63 413

