lucene-java-user mailing list archives

From Bill Janssen
Subject Re: Does Lucene save an offline version of web pages?
Date Sun, 27 Apr 2008 12:23:55 GMT
> - Fetch and index some pages (containing word and pdf documents) on
> daily basis.
> - Extract all pages that contain some provided keywords after fetching
> the pages.
> - Create bulletins from the fetched pages; the bulletins will be in PDF
> format and categorized by keyword.
> - Provide offline search capability (on the pages that it indexed; it
> should also allow users to browse the pages offline).
> Can you let me know whether any Lucene-based projects can help me
> with these requirements?
> Especially with the offline browsing feature?

Yes, the UpLib system from PARC does this.  It supports Word,
Powerpoint, PDF, Web pages, email, images, etc., as input documents.
It caches all documents given to it, in their original format, but
also allows access to them as HTML or PDF.  It indexes both the full
text of each document and its metadata with Lucene, and contains a
number of document analysis engines for calculating "indirect"
metadata from the document.  It includes
several off-line browsers, including a Web-browser tool and a Java
client (I tend to use the Java rich client), for searching, reading,
and annotating the gathered pages.  Screenshots are at

We're currently in beta test of our first public release (it's been
used internally at PARC for over four years now); to be added to the
beta-test list, just create an account on the blog at

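As a rough illustration of what indexing both the full text and the
metadata of a document means, here is a minimal, self-contained Java
sketch of a field-aware inverted index.  It is not UpLib code and it does
not use the Lucene API; the class name FieldIndexSketch and the field
names "contents", "title" and "format" are assumptions made up for this
example.  Lucene does the same kind of per-field term lookup, with far
more machinery around it.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy field-aware inverted index (hypothetical; a stand-in for a real Lucene index). */
public class FieldIndexSketch {
    // field name -> term -> ids of the documents containing that term in that field
    private final Map<String, Map<String, Set<String>>> postings = new HashMap<>();

    /** Index one document: each field's text is lowercased, tokenized and stored under the field name. */
    public void add(String docId, Map<String, String> fields) {
        for (Map.Entry<String, String> field : fields.entrySet()) {
            Map<String, Set<String>> terms =
                postings.computeIfAbsent(field.getKey(), k -> new HashMap<>());
            for (String token : field.getValue().toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    terms.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
                }
            }
        }
    }

    /** Return the ids of the documents whose given field contains the term (case-insensitive). */
    public Set<String> search(String field, String term) {
        return postings
            .getOrDefault(field, Collections.<String, Set<String>>emptyMap())
            .getOrDefault(term.toLowerCase(), Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        FieldIndexSketch index = new FieldIndexSketch();

        // Full content text and metadata go into separate, individually searchable fields.
        Map<String, String> doc = new HashMap<>();
        doc.put("contents", "Quarterly report on search quality");
        doc.put("title", "Search report");
        doc.put("format", "pdf");
        index.add("doc-1", doc);

        System.out.println(index.search("contents", "quality"));
        System.out.println(index.search("format", "pdf"));
    }
}
```

Keeping metadata in its own fields, rather than folding it into the body
text, is what lets a query be restricted to, say, only the title or only
the document format.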
