lucene-dev mailing list archives

From Dmitry Serebrennikov <>
Subject Re: Proposal for Lucene
Date Thu, 07 Feb 2002 21:39:19 GMT
I'd like to add my +1 to the proposal and my +1 to keeping Lucene as 
a library that can exist separately from the applications. Perhaps the 
applications should be separate targets in the Lucene project (and build 
process) or perhaps they can be separate projects. I think keeping them 
together would be good because Lucene's APIs may need to evolve to 
support these applications better and because this will help ensure that 
changes to the Lucene API are reflected in the applications as soon as they 
are made and not with a lag that can come about if the applications are 
treated as separate, dependent projects.

See below for some additional ideas for the crawler.

Mark Tucker wrote:

>I like what you included in your proposal and suggest doing all that (over time) and taking
>the following into consideration:
>	General Settings
>		SleeptimeBetweenCalls - can be used to avoid flooding a machine with too many requests
>		IndexerTimeout - kill this crawler thread after long period of inactivity
>		IncludeFilter - include only items matching filter
>		ExcludeFilter - exclude items matching filter (can be used with IncludeFilter)
I'm working on a crawler right now actually, but it is a derivative of 
WebSPHINX. The original WebSPHINX has not changed in a very long time, 
and it is licensed under the LGPL at the moment. Perhaps we can get 
permission from the copyright holders to transfer it to APL (or do we 
even need to?). I made a number of bug fixes to it, added support for 
cookies (rudimentary) and support for HTTP redirects. One thing that I 
like in WebSPHINX is that it has a forgiving HTML parser that can deal 
with many kinds of broken HTML. Also, it has a very interesting 
framework for analyzing parsed content, but this goes beyond the 
requirements for use with Lucene.

I use the crawler with Lucene, but there is a layer of application 
classes between the two, so the kind of integration that has been 
proposed here has not yet been done. Anyway, I found that in addition to 
the Include and Exclude filters, it is helpful to be able to say that 
you want some page "expanded" (i.e. parsed and links followed), but not 
"indexed" (i.e. added to Lucene's index). And vice versa, it sometimes 
seems useful to index a page but not expand it. Also, filters can 
be evaluated on links before they are followed, and then the second time 
on final URLs of pages retrieved. Normally the two are the same, but 
HTTP redirects can force the final URL to be something very different 
from the original link.

Perhaps one way to represent these conditions is to have the following 
"language" instead of include and exclude filters:

"include:" regex
"exclude:" regex
"noindex:" regex
"noexpand:" regex

The first two work as the include/exclude, but for things that pass 
these two, the others add handling properties that are used in 
processing the link and the page. Disclaimer: I'm experimenting with 
this now and these ideas are only about two days old, so please take 
them as such. Since we got into the discussion, I figured I'd put them 
on the table.
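
A rough sketch of how such rules might be evaluated, in Java. Everything here is hypothetical: the class and method names (CrawlRules, Decision, parse, decide) are made up for illustration, an empty include list is assumed to mean "include everything", and a regex is assumed to match if it is found anywhere in the URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical rule set for the include/exclude/noindex/noexpand keywords. */
class CrawlRules {

    /** What to do with one URL: fetch it at all, add it to the index, follow its links. */
    static final class Decision {
        final boolean accepted;
        final boolean index;
        final boolean expand;

        Decision(boolean accepted, boolean index, boolean expand) {
            this.accepted = accepted;
            this.index = index;
            this.expand = expand;
        }
    }

    private final List<Pattern> include = new ArrayList<>();
    private final List<Pattern> exclude = new ArrayList<>();
    private final List<Pattern> noindex = new ArrayList<>();
    private final List<Pattern> noexpand = new ArrayList<>();

    /** Parses lines of the form "keyword: regex"; only the first colon separates the two. */
    static CrawlRules parse(String... lines) {
        CrawlRules rules = new CrawlRules();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Missing ':' in rule: " + line);
            }
            String keyword = line.substring(0, colon).trim();
            Pattern pattern = Pattern.compile(line.substring(colon + 1).trim());
            switch (keyword) {
                case "include":  rules.include.add(pattern);  break;
                case "exclude":  rules.exclude.add(pattern);  break;
                case "noindex":  rules.noindex.add(pattern);  break;
                case "noexpand": rules.noexpand.add(pattern); break;
                default:
                    throw new IllegalArgumentException("Unknown keyword: " + keyword);
            }
        }
        return rules;
    }

    /** include/exclude decide acceptance; noindex/noexpand only refine accepted URLs. */
    Decision decide(String url) {
        boolean accepted = (include.isEmpty() || matchesAny(include, url))
                && !matchesAny(exclude, url);
        boolean index = accepted && !matchesAny(noindex, url);
        boolean expand = accepted && !matchesAny(noexpand, url);
        return new Decision(accepted, index, expand);
    }

    private static boolean matchesAny(List<Pattern> patterns, String url) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }
}
```

A crawler could call decide() once on a link before following it and again on the final URL after any redirects, since the two may differ.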

>		MaxItems - stops indexing after x items
>		MaxMegs - stops indexing after x MB of data
>	File System Indexer
>		URLReplacePrefix - can crawl c:\ but expose URL as http://mysever/docs/
Question: does this information really belong in the index? Perhaps the 
root path should be specified, and the documents tagged with a relative 
path to that path, but I think that, maybe, the URL to prefix the 
document paths with should be given once per entire index and be easy to 

>	Web Indexer
>		HTTPUser
>		HTTPPassword
>		HTTPUserAgent
>		ProxyServer
>		ProxyUser
>		ProxyPassword
>		HTTPSCertificate
>		HTTPSPrivateKey
Apache Commons has an HttpClient package that has some similar concepts and 
even implements them to some degree. I found it a bit rough still and 
dependent on JDK 1.3, but I believe it can be fixed more easily than a new 
one can be written. It uses a notion of an HttpState, which is a state container 
for an HTTP user agent, containing things like authentication 
credentials and cookies. HTTPS support is easy to add with JSSE (which 
is the approach taken by the HttpClient from the Commons).

>	Other Possible Indexers
>		Microsoft Exchange 5.5/2000
>		Lotus Notes
>		Newsgroup (NNTP)
>		Documentum
>		XML - index single XML that represents multiple documents
One idea that might prove useful is to add a "DocumentFetcher" in 
addition to the DocumentIndexer. The two would go hand in hand, and 
document entries created in Lucene by a particular Indexer can be 
understood by a corresponding Fetcher. The Fetcher would then 
encapsulate retrieval of source documents or creating useful pointers to 
them (like URLs).
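
A minimal sketch of the Indexer/Fetcher pairing, in Java. All names here are hypothetical, and a plain Map of field names to values stands in for a Lucene document so that the sketch does not depend on any particular Lucene API. The "source" field and the example.org URL prefix are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical: turns a source item into the stored fields of an index entry. */
interface DocumentIndexer {
    Map<String, String> toFields(String sourceId);
}

/** Hypothetical counterpart that understands entries written by its matching indexer. */
interface DocumentFetcher {
    boolean understands(Map<String, String> fields);

    /** Returns a pointer (here a URL) back to the source document. */
    String locate(Map<String, String> fields);
}

/** Tags each entry with its origin and a path relative to the crawl root. */
class FileSystemIndexer implements DocumentIndexer {
    static final String SOURCE = "filesystem";

    @Override
    public Map<String, String> toFields(String relativePath) {
        Map<String, String> fields = new HashMap<>();
        fields.put("source", SOURCE);
        fields.put("path", relativePath);
        return fields;
    }
}

/** Maps stored relative paths to URLs using one prefix for the whole index. */
class FileSystemFetcher implements DocumentFetcher {
    private final String urlPrefix;

    FileSystemFetcher(String urlPrefix) {
        this.urlPrefix = urlPrefix;
    }

    @Override
    public boolean understands(Map<String, String> fields) {
        return FileSystemIndexer.SOURCE.equals(fields.get("source"));
    }

    @Override
    public String locate(Map<String, String> fields) {
        if (!understands(fields)) {
            throw new IllegalArgumentException("Entry was not written by FileSystemIndexer");
        }
        // Normalize Windows separators so the stored path can be appended to the prefix.
        return urlPrefix + fields.get("path").replace('\\', '/');
    }
}
```

With this split, the URL prefix lives in the Fetcher's configuration rather than in every index entry, so it can be changed without reindexing.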

Another idea is to split the document storage and "envelope" from its 
content. The content is subject to a MIME type and can be handed to a 
parser, passed to a document factory, mapped to fields, etc. However, 
the logic of retrieving a PDF file from a Lotus Notes database (and 
creating a URL to point back to it) is different from getting the same 
PDF file from the file system. The same parser and a document factory 
can still be used though.
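
One way to picture the envelope/content split, as a Java sketch under stated assumptions. The types Envelope, ContentParser, ParserRegistry and PlainTextParser are hypothetical, as is the idea of keying parsers by an exact MIME type string. The point is only that the parser sees the bytes and the MIME type, never where the bytes came from.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical envelope: where an item came from, kept apart from what it contains. */
final class Envelope {
    final String sourceUrl;  // pointer back to the original, built by the storage-specific code
    final String mimeType;   // selects the parser
    final byte[] content;    // raw bytes, independent of the storage they came from

    Envelope(String sourceUrl, String mimeType, byte[] content) {
        this.sourceUrl = sourceUrl;
        this.mimeType = mimeType;
        this.content = content;
    }
}

/** Hypothetical content parser: extracts text from raw bytes of one MIME type. */
interface ContentParser {
    String extractText(byte[] content);
}

/** Trivial parser used for illustration; assumes UTF-8 plain text. */
class PlainTextParser implements ContentParser {
    @Override
    public String extractText(byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }
}

/** Chooses a parser by MIME type alone, ignoring the envelope's origin. */
class ParserRegistry {
    private final Map<String, ContentParser> parsersByMimeType = new HashMap<>();

    void register(String mimeType, ContentParser parser) {
        parsersByMimeType.put(mimeType, parser);
    }

    String parse(Envelope envelope) {
        ContentParser parser = parsersByMimeType.get(envelope.mimeType);
        if (parser == null) {
            throw new IllegalArgumentException("No parser for " + envelope.mimeType);
        }
        return parser.extractText(envelope.content);
    }
}
```

The same registered parser then serves an envelope built by a Lotus Notes crawler and one built by a file system crawler; only the code that fills in sourceUrl differs.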

>Document Factory		
>	General
>		The minimum properties for each document should be:
>			URL
>			Title
>			Abstract
>			Full Text
>			Score
>		Support for META tags including Dublin Core syntax
>	Other Possible Document Factories
>		Office Docs - DOC, XLS, PPT
>		PDF
>Thanks for the great proposal.
Yes! Absolutely! Great proposal!

