lucene-java-user mailing list archives

From "Darren Hartford" <>
Subject RE: Solr newbe
Date Thu, 26 Jul 2007 12:48:13 GMT
One side note: various content management tools already handle a lot
of the data extraction (POI, PDFBox, etc.).

In the case of Jakarta Slide and Apache Jackrabbit, both use Lucene
under the covers to index this data.

Not sure if you want to take the approach of putting your documents as
'managed' under a content management tool, but that's another avenue to


> -----Original Message-----
> From: Arne Muller [] 
> Sent: Thursday, July 26, 2007 8:04 AM
> To: Lucene
> Subject: Solr newbe
> Hello,
> I've just started with Lucene to index a file server and am
> aiming to index Lotus Notes and some tables from relational databases.
> After some research, I have come (so far) to the conclusion that
> I'm re-inventing the wheel, and that it may be better to use
> Solr or Nutch as Lucene front-ends.
> I was wondering if the following workflow is kind of state of 
> the art, or whether you've something to comment or add:
> Since web crawling is not necessary (my documents are
> on a file server), I was thinking of using Solr and
> implementing a basic program that traverses the file system as a
> scheduled task (or cron job). The program will not "crawl",
> i.e. it will not follow any references within the documents
> to continue its search elsewhere. It just processes each
> file it finds, and if the file was modified within the last
> 24h it adds it to Solr.
> Most of the files are doc, xls, ppt and pdf. My "traverser"
> will therefore use Apache POI and PDFBox to extract the
> contents from the documents, creating an appropriate XML
> stream with the Solr "add" tag and sending it to Solr.
> I was wondering if there are already tools for this kind of 
> task available (this time I'd like to avoid reinventing the wheel ;-)?
>    thanks a lot for your help,
>    +kind regards,
>   Arne
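
For reference, the workflow described in the quoted message could be sketched roughly as below. This is only a minimal illustration under stated assumptions, not an existing tool: the class name RecentFileTraverser, the field names "id" and "text", and the placeholder used in place of real POI/PDFBox extraction are all hypothetical, and the field names would have to match whatever the Solr schema actually defines. The sketch walks a directory tree without following links, keeps files modified within the last 24 hours, and wraps each one in a Solr "add" XML message.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class RecentFileTraverser {

    /** Returns regular files under root whose modification time is after the cutoff. */
    static List<Path> modifiedSince(Path root, Instant cutoff) throws IOException {
        // Files.walk does not follow symbolic links by default, so nothing is "crawled".
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> {
                    try {
                        return Files.getLastModifiedTime(p).toInstant().isAfter(cutoff);
                    } catch (IOException e) {
                        return false; // skip files whose timestamp cannot be read
                    }
                })
                .sorted()
                .collect(Collectors.toList());
        }
    }

    /** Escapes the five characters that have special meaning in XML. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    /** Builds a Solr "add" message; the field names are assumptions about the schema. */
    static String toSolrAddXml(String id, String text) {
        return "<add><doc>"
            + "<field name=\"id\">" + escapeXml(id) + "</field>"
            + "<field name=\"text\">" + escapeXml(text) + "</field>"
            + "</doc></add>";
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        Instant cutoff = Instant.now().minus(Duration.ofHours(24));
        for (Path file : modifiedSince(root, cutoff)) {
            // Placeholder: real text extraction (e.g. via POI or PDFBox) would go here.
            String text = "(extracted text goes here)";
            System.out.println(toSolrAddXml(file.toAbsolutePath().toString(), text));
        }
    }
}
```

Run from cron or a scheduled task, this prints one "add" message per recently modified file; actually sending each message to Solr's update handler is left out, since the URL depends on the deployment.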
