lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jason Pump <jp...@mindspring.com>
Subject Re: Text storing design and performance question
Date Wed, 10 Jan 2007 21:49:08 GMT
Renaud, one optimization you can do on this is to try the first 10kb, 
see if it finds text worth highlighting, if not, with a slight overlap 
try the next 9.9kb - 19.9kb or just 9.9kb -> end if you're feeling lazy. 
This assumes that most good matches are at the start of the document, 
and that the files on disk are not compressed.

moraleslos wrote:
> Maybe keeping the data in the DB would make it quicker?  Seems like the I/O
> performance would cause most of the performance issues you're seeing.  
>
> -los
>
>
> Renaud Waldura-5 wrote:
>   
>> We used to store a big text field for highlighting purposes too, and it
>> proved a big pain. The index was gigantic, it took forever to build, and
>> the
>> search performance would sometimes suffer from it (just a hunch).
>>
>> Now we keep this big text field on disk (in a file), and feed it to the
>> highlighter. Unfortunately the highlighter has to read the file, parse it,
>> etc... It's slooow, sometimes over a second on a large document.
>>
>> I vaguely remember reading somewhere newer versions of the highlighter are
>> able to leverage term vectors to avoid re-parsing the field. (I could be
>> mistaken.) Maybe just storing term vectors would keep the index lean and
>> allow for fast highlighting?
>>
>> --Renaud
>>
>>  
>>
>> -----Original Message-----
>> From: Mark Miller [mailto:markrmiller@gmail.com] 
>> Sent: Wednesday, January 10, 2007 9:54 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: Text storing design and performance question
>>
>> Being stateless should not be much of an issue. As Erick mentioned, the
>> highlighter just expects you to pass it the query again and the text to be
>> highlighted. So when you show the pagination you just need to keep around
>> what query generated the current page...then shove each piece of relevant
>> text from the database through the highlighter (with the query) before
>> displaying it.
>>
>> - Mark
>>
>> On 1/10/07, moraleslos <moraleslos@hotmail.com> wrote:
>>     
>>> Hi Mark,
>>>
>>> Looks like I've got to implement some sort of pagination for my clients.
>>> Problem is everything is stateless so looks like there's some work I 
>>> need to do on my end.  Thanks.
>>>
>>> -los
>>>
>>>
>>>
>>> markrmiller wrote:
>>>       
>>>> Usually a user cannot easily browse 50,000 on a single display, and 
>>>> so you would only highlight the docs as they became visible to the
>>>>         
>>> user.
>>>       
>>>> This is generally a small amount...often one at a time.
>>>>
>>>> - Mark
>>>>
>>>> moraleslos wrote:
>>>>         
>>>>> Hi Erik,
>>>>>
>>>>> Would that slow performance a bit?  For example, say I receive 
>>>>> 50,000 hits from a search.  From your explanation, I have to 
>>>>> retrieve the DB id
>>>>>           
>>> from
>>>       
>>>>> each hit, perform a query to the DB using the id to retrieve the 
>>>>> full contents for each hit, run highlighter on each content, and 
>>>>> then
>>>>>           
>>> return?
>>>       
>>>>> Although I'll give this a shot, it will seem to slow performance on 
>>>>> the searching side of things, wouldn't it?  Thanks for the reply.
>>>>>
>>>>> -los
>>>>>
>>>>>
>>>>>
>>>>> Erik Hatcher wrote:
>>>>>
>>>>>           
>>>>>> You don't have to store a field to highlight text.  If you've got

>>>>>> it in your database, retrieve it from there and pass that string

>>>>>> to the highlighter instead.
>>>>>>
>>>>>>     Erik
>>>>>>
>>>>>>
>>>>>> On Jan 10, 2007, at 10:45 AM, moraleslos wrote:
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> I'm running into a little dilemma with Lucene highlighting and

>>>>>>> indexing.  I currently index anything and everything that gets

>>>>>>> inserted into a database.
>>>>>>> This database includes all the content that is searched.  Now

>>>>>>> I'll have lots and lots of content, thinking of the range of

>>>>>>> 50GB+, all stored in the DB.
>>>>>>> Using Lucene, I index all of this.  But since I'm using 
>>>>>>> highlighting features, I'll also need to store the content into

>>>>>>> the index.  Not sure what the performance implications are during

>>>>>>> a search but I know that indexing performance should be slower
as 
>>>>>>> well as the index size being
>>>>>>>               
>>> enormous.
>>>       
>>>>>>> Because I have duplicated data, one in the index and the other
in 
>>>>>>> the db, are there other ways of handling this situation in a
more 
>>>>>>> efficient and performant way?  Thanks in advance.
>>>>>>>
>>>>>>> -los
>>>>>>> --
>>>>>>> View this message in context: http://www.nabble.com/Text-storing-
>>>>>>> design-and-performance-question-tf2953201.html#a8259883
>>>>>>> Sent from the Lucene - Java Users mailing list archive at
>>>>>>>               
>>> Nabble.com.
>>>       
>>>>>>>
>>>>>>>               
>>> ---------------------------------------------------------------------
>>>       
>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>
>>>>>>>               
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>           
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>>>         
>>> --
>>> View this message in context:
>>>
>>>       
>> http://www.nabble.com/Text-storing-design-and-performance-question-tf2953201
>> .html#a8261739
>>     
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>       
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>>     
>
>   


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message