lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lukas Vlcek" <lukas.vl...@gmail.com>
Subject Re: Does Lucene save an offline version of web pages?
Date Sun, 27 Apr 2008 18:54:48 GMT
Hi,

this sounds like job for Nutch (one of Lucene family projects).

On Sun, Apr 27, 2008 at 8:26 PM, Legolas wood <legolas.w@gmail.com> wrote:

> Hi
> Thank you for reading my post.
> I have to design a system with the following requirements, I think
> Lucene or one of the projects which are based on Lucene can help me as a
> base to continue on.
> Here is the requirements:
>
> - Fetch and index some pages (containing word and pdf documents) on
> daily basis.


Nutch ca do this for you. It uses plug-ins for parsing various content. PDF
should be no problem. As for MS falimy documents it uses POI I think. THis
means that it is not 100% reliable but I guess you can write your own
plug-in and utilize commercial third-party utility for Word processing.


>
> - Extract all pages that contain some provided keywords after fetching
> the pages.


I think you will need to look closer at Nutch tools. As far as I know Nutch
can drop all pages containing specific words. I am not sure if there is any
tool to do the opposite.


>
> - Create some bulletin from fetched pages, bulletin will be in pdf
> format and are categorized based on keywords.


Can you be more specific? I think you need to implement this yourself. (May
be carrot could be helpful)


>
> - provide offline search capability (on pages that it indexed and also
> it should allows the users to browse the pages  offline)
>

If I am not mistaken then Nutch can cache documents for you but they are
converted into plain text. But I am really not sure in this point.


>
> Can you let me know whether any of Lucene based projects can help me
> with this requirements?
> Specially with offline browsing feature?
>
>
> Thanks.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


-- 
http://blog.lukas-vlcek.com/

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message