lucene-general mailing list archives

From Grant Ingersoll
Subject Re: Investigating Lucene for Applicability to [Unusual?] Use Case
Date Wed, 13 Jun 2007 19:23:47 GMT
You might get more responses on java-user@lucene.apache.org.

On the surface, I don't see any reason why Lucene couldn't handle
this. Essentially, you are splitting the stream into Lucene
Documents and indexing them. Keep in mind that Lucene doesn't care
where the text comes from (PDF, AFP, whatever); that is up to the
application to control.

So, basically, the answer is that Lucene can enable what you want, but
you will still need to write the application-level logic.
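
As a rough illustration of that application-level logic, here is a minimal sketch in plain Java. It assumes a hypothetical plain-text print stream in which every statement starts with a "%%STATEMENT <id>" line and pages within a statement are separated by form feeds; the marker, the StatementExtent and PrintStreamSplitter classes, and their field names are all invented for this example. Real AFP, PCL, PostScript, or PDF streams would each need a format-specific parser. The idea is that the application records each statement's byte offset, byte length, and page count, keeps those values with the corresponding Lucene Document (for example as stored fields next to the searchable text), and uses them on a hit to cut the statement back out of the large file.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Extent of one statement inside the larger print-stream file (hypothetical type). */
final class StatementExtent {
    final String statementId;
    final int byteOffset;
    final int byteLength;
    final int pageCount;

    StatementExtent(String statementId, int byteOffset, int byteLength, int pageCount) {
        this.statementId = statementId;
        this.byteOffset = byteOffset;
        this.byteLength = byteLength;
        this.pageCount = pageCount;
    }
}

final class PrintStreamSplitter {
    // Assumed marker for this sketch only; not part of any real print format.
    static final String STATEMENT_START = "%%STATEMENT ";
    static final char FORM_FEED = '\f';

    /** Splits the stream into per-statement extents (offset, length, page count). */
    static List<StatementExtent> split(byte[] stream) {
        // ISO-8859-1 maps each byte to exactly one char, so string indexes equal byte offsets.
        String text = new String(stream, StandardCharsets.ISO_8859_1);
        List<StatementExtent> extents = new ArrayList<>();
        int start = text.indexOf(STATEMENT_START);
        while (start >= 0) {
            int next = text.indexOf(STATEMENT_START, start + STATEMENT_START.length());
            int end = (next >= 0) ? next : text.length();
            int idEnd = text.indexOf('\n', start);
            if (idEnd < 0 || idEnd > end) {
                idEnd = end;
            }
            String id = text.substring(start + STATEMENT_START.length(), idEnd).trim();
            // One page to begin with, plus one more for every form feed inside the statement.
            int pages = 1;
            for (int i = start; i < end; i++) {
                if (text.charAt(i) == FORM_FEED) {
                    pages++;
                }
            }
            extents.add(new StatementExtent(id, start, end - start, pages));
            start = next;
        }
        return extents;
    }

    /** On a search hit, returns the bytes of the matching statement from the larger stream. */
    static byte[] extract(byte[] stream, StatementExtent extent) {
        return Arrays.copyOfRange(stream, extent.byteOffset, extent.byteOffset + extent.byteLength);
    }

    public static void main(String[] args) {
        byte[] stream = "%%STATEMENT A\npage1\fpage2\n%%STATEMENT B\nonly page\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        for (StatementExtent e : split(stream)) {
            System.out.println(e.statementId + " offset=" + e.byteOffset
                    + " length=" + e.byteLength + " pages=" + e.pageCount);
        }
    }
}
```

With this split, the index only has to return the stored offset, length, and page count for a hit; reading the bytes back out of the print-stream file stays entirely in the application.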

On Jun 13, 2007, at 3:02 PM, Brad Harper wrote:

> Hello:
> I'm investigating Lucene as a replacement for a special-purpose search
> technology that was developed long before Lucene (or any of the current IR
> libraries) became available.
> The use case involves so-called print streams. Imagine 20,000 statements
> concatenated into one large file suitable for delivery to a print system.
> The document formats vary, but include AFP (an IBM printer format), PCL (an
> HP format), Postscript, PDF, and even "plain-text".
> The indexing application must track the total page count of the embedded
> statements. On a hit, the search application must extract and return the
> [possibly multi-page] statement embedded within the larger print-stream
> file.
> How would the search application know (be informed by the Lucene/indexer)
> the extent of the internal document(s)?
> I'm not seeing this scenario discussed in forums or books. Does anyone have
> comments or thoughts on Lucene's applicability as a solution?
> Thanks.
> Brad

Grant Ingersoll
Center for Natural Language Processing

