lucene-general mailing list archives

From Veselin K <>
Subject Re: Indexing local PDFs: Lucene/Solr/Nutch ?
Date Wed, 17 Dec 2008 10:43:57 GMT
Thank you Erik, Hoss.

- If using either Solr's "stream.file" or Nutch's crawler,
  what is the procedure for adding new files?
  That is, if I did not know which files in a specific folder were
  new and therefore passed all of them to Solr/Nutch, would it
  skip the ones that have already been indexed?

- Also, if a file gets modified, would Solr/Nutch detect
  the change and re-index just that modified file?
  Or would some kind of cache need to be cleared and everything re-indexed?
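For context on the two questions above: Solr itself does not watch the filesystem, but a driver script can track what changed since its last run and re-submit only those files; a document posted with the same uniqueKey simply overwrites the old version, so nothing needs to be cleared. A rough sketch, in which the stamp-file path, the PDF directory, and the commented-out update URL are all assumptions:

```shell
# Re-index only files changed since the last run.  A stamp file records
# when we last ran; "find -newer" then picks out anything modified since.
# Re-posting a document with the same uniqueKey overwrites the old copy,
# so no cache clearing is needed.  Paths and the Solr endpoint below are
# placeholder assumptions, not a verified configuration.
STAMP=/var/tmp/last_index_run
PDFDIR=/data/pdfs
if [ -f "$STAMP" ]; then
    NEWER="-newer $STAMP"    # incremental run: only changed files
else
    NEWER=""                 # first run: index everything
fi
find "$PDFDIR" -name '*.pdf' $NEWER | while read -r f; do
    echo "re-indexing: $f"
    # curl "http://localhost:8983/solr/update/extract?stream.file=$f&literal.id=$f"
done
touch "$STAMP"
```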

- In order to give the user the option to search the indexes of
  two separate Solr/Nutch servers, do I need to link both servers
  somehow and join their indexes into one, or is it just a question of
  designing the web front-end so that it offers the choice to send the
  search query to one or more servers?
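For what it's worth, Solr 1.3's distributed search can answer this last question without merging indexes: the front-end passes a "shards" parameter listing both servers, and the server receiving the query fans it out and merges the results itself. A minimal sketch, where the host names and core paths are placeholders:

```shell
# One Solr server queries itself plus a second server and merges the
# results via the "shards" parameter (distributed search, Solr 1.3+).
# No index-level join is required.  Host names are placeholders.
SHARDS="solr1:8983/solr,solr2:8983/solr"
URL="http://solr1:8983/solr/select?q=annual+report&shards=${SHARDS}"
echo "$URL"        # preview the request
# curl "$URL"      # would return one merged result set
```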

Thank you,
Veselin K

On Sun, Dec 14, 2008 at 11:22:00AM -0800, Chris Hostetter wrote:
> : the easiest way to get rolling.  A simple script that recurses your folders
> : and issues a simple request posting each file in turn to Solr will give you a
> : full text searchable index in no time (well, ok, it'll take a little time, but
> : it'll be as fast as anything else out there).
> if all the files are "local" on the machine that Solr is running on, you 
> don't even need to POST them; Solr can be configured to read the files by 
> local filename using the "stream.file" param...
> that said: if your fileserver implementation already exposes all of the 
> files over HTTP, then using Nutch and its crawler might be an easier way 
> to get started on indexing all of them ... hard to say without being in 
> your shoes.  you may want to experiment with both.
> -Hoss
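The stream.file approach Hoss describes can be sketched as a loop over local files: instead of POSTing each file's bytes, the request just names the file and Solr reads it from disk (remote streaming must be enabled in solrconfig.xml). The directory and the exact update endpoint for PDFs depend on your setup and Solr version, so treat both as assumptions:

```shell
# Walk a local folder and hand each PDF to Solr by filename via the
# "stream.file" parameter, so Solr reads the file from disk itself
# (requires enableRemoteStreaming="true" in solrconfig.xml).
# The directory and endpoint path are placeholder assumptions.
find /data/pdfs -name '*.pdf' | while read -r f; do
    url="http://localhost:8983/solr/update/extract?stream.file=$f&commit=true"
    echo "$url"      # preview the request
    # curl "$url"    # uncomment to actually index
done
```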
