lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lukas Vlcek" <>
Subject Re: Does Lucene save an offline version of web pages?
Date Sun, 27 Apr 2008 18:54:48 GMT

this sounds like job for Nutch (one of Lucene family projects).

On Sun, Apr 27, 2008 at 8:26 PM, Legolas wood <> wrote:

> Hi
> Thank you for reading my post.
> I have to design a system with the following requirements, I think
> Lucene or one of the projects which are based on Lucene can help me as a
> base to continue on.
> Here is the requirements:
> - Fetch and index some pages (containing word and pdf documents) on
> daily basis.

Nutch ca do this for you. It uses plug-ins for parsing various content. PDF
should be no problem. As for MS falimy documents it uses POI I think. THis
means that it is not 100% reliable but I guess you can write your own
plug-in and utilize commercial third-party utility for Word processing.

> - Extract all pages that contain some provided keywords after fetching
> the pages.

I think you will need to look closer at Nutch tools. As far as I know Nutch
can drop all pages containing specific words. I am not sure if there is any
tool to do the opposite.

> - Create some bulletin from fetched pages, bulletin will be in pdf
> format and are categorized based on keywords.

Can you be more specific? I think you need to implement this yourself. (May
be carrot could be helpful)

> - provide offline search capability (on pages that it indexed and also
> it should allows the users to browse the pages  offline)

If I am not mistaken then Nutch can cache documents for you but they are
converted into plain text. But I am really not sure in this point.

> Can you let me know whether any of Lucene based projects can help me
> with this requirements?
> Specially with offline browsing feature?
> Thanks.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message