lucene-general mailing list archives

From Grant Ingersoll
Subject Re: Indexing local PDFs: Lucene/Solr/Nutch ?
Date Sun, 28 Dec 2008 02:29:05 GMT
Can you provide details about which parts of the examples weren't
clear?  Perhaps I can clean up the docs or help you figure it out.


On Dec 27, 2008, at 3:42 PM, Veselin Kantsev wrote:

> Hello,
> I am now using Solr 1.3 with tomcat6 on a Debian lenny box.
> Could you please advise on any other instructions/HowTos on
> integrating Tika or maybe RichDocumentHandler with Solr that I can
> find online, apart from the Solr Wiki? Following those examples did
> not help in my case.
> Thank you.
> Veselin K.
> On Wed, Dec 17, 2008 at 10:43:57AM +0000, Veselin K wrote:
>> Thank you Erik, Hoss.
>> - If using either Solr's "stream.file" or Nutch's crawler,
>>  what is the procedure for adding new files?
>>  That is to say, if I did not know which files in a specific folder
>>  are new and thus passed all files to Solr/Nutch, would it
>>  skip the ones that have already been indexed?
>> - Also, what if a file gets modified: would Solr/Nutch detect
>>  the change and re-index just the modified file?
>>  Or should some kind of cache be cleared and everything re-indexed?
>> - In order to provide the user with an option to search the indexes of
>>  two separate Solr/Nutch servers, do I need to link both servers
>>  somehow and join their indexes into one, or is it just a question of
>>  designing the web front-end so that it offers the choice to send your
>>  search query to one or multiple different servers?
>> Thank you,
>> Veselin K
>> On Sun, Dec 14, 2008 at 11:22:00AM -0800, Chris Hostetter wrote:
>>> : the easiest way to get rolling.  A simple script that recurses your
>>> : folders and issues a simple request posting each file in turn to Solr
>>> : will give you a full text searchable index in no time (well, ok, it'll
>>> : take a little time, but it'll be as fast as anything else out there).
>>> if all the files are "local" on the machine that Solr is running on, you
>>> don't even need to POST them; Solr can be configured to read the files by
>>> local filename using the "stream.file" param...
>>> that said: if your fileserver implementation already exposes all of the
>>> files over HTTP, then using Nutch and its crawler might be an easier way
>>> to get started on indexing all of them ... hard to say without being in
>>> your shoes.  you may want to experiment with both.
>>> -Hoss
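
For reference, the "simple script that recurses your folders" combined with
the "stream.file" param might look roughly like the sketch below. It only
builds the request URLs and does not send them. The handler path
(/solr/update/rich), the host and port, and the function name are
assumptions made for illustration, not something confirmed in this thread;
substitute whatever handler your Solr install actually exposes. Reading
local files this way also generally requires remote streaming to be
enabled in solrconfig.xml.

```python
import os
from urllib.parse import urlencode


def stream_file_requests(root, base_url, extensions=(".pdf",)):
    """Yield one request URL per matching file found under `root`.

    `base_url` is a hypothetical update handler URL; each yielded URL
    carries the absolute local path in the "stream.file" parameter so
    that Solr reads the file itself instead of receiving it via POST.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Case-insensitive extension match (".pdf" also matches ".PDF").
            if name.lower().endswith(extensions):
                path = os.path.abspath(os.path.join(dirpath, name))
                yield base_url + "?" + urlencode({"stream.file": path})


if __name__ == "__main__":
    # Hypothetical handler URL and document folder; adjust for your setup.
    handler = "http://localhost:8080/solr/update/rich"
    for url in stream_file_requests("/srv/docs", handler):
        print(url)
```

Each printed URL could then be requested in turn (for example with
urllib.request.urlopen). Note that the sketch passes every matching file
each time it runs; it makes no attempt to detect which files are new or
modified.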

Grant Ingersoll
