lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Martin O'Shea" <>
Subject RE: Use of Lucene to store data from RSS feeds
Date Fri, 15 Oct 2010 14:35:37 GMT
@Pulkit Singhal: Thanks for the reply. Just to clarify my post yesterday, I'm not sure if each
row in the database table would form a document or not because I do not know if Lucene works
in this manner. In my case, each row of the table represents a single polling of an RSS feed
to retrieve any new postings over a given number of hours. If Lucene allows a document to
have separate time-based entries, then I am happy to use it for indexing. But if a separate
document is needed per row of the table, then I'm uncertain. I always do have the option of
using Lucene for in-memory indexing of postings to calculate the keyword frequencies. This
I know how to do.

The individual columns of my table represent the only two elements of each RSS item that I'm
interested in retrieving text from, i.e. the title and description.

-----Original Message-----
From: Pulkit Singhal [] 
Sent: 15 Oct 2010 13 36
Subject: Re: Use of Lucene to store data from RSS feeds

When you ask:
a) will each feed would form a Lucene document, or
b) will each database row would form a lucene document
I'm inclined to say that really depends on what type of aggregation
tool or logic you are using.

I don't know if "Tika" does it but if there is a tool out there that
can be pointed to a feed and tweaked to spit out documents with each
field having the settings that you want then you can go with that
approach. But if you are already parsing the feed and storing the raw
data into a database table then there is no reason that you can't
leverage that. From a database row perspective you have already done a
good deal of work to collect the data and breaking it down into chunks
that Lucene can happily index as separate fields in a document.

By the way I think there are tools that read from the database
directly too but I won't try to make things too complicated.

The way I see it, if you were to use the row at this moment and index
the 4 columns as fields ... plus you could set the feed body to be
ANALYZED (why don't I see the feed body in your database table?) ...
then lucene range queries on the date/time field could possibly return
some results. I am not sure how to get keyword frequencies but if the
analyzed tokens that lucene is keeping in its index sort of represent
the keywords that you are talking about then i do know that lucene
keeps some sort of inverted index per token in terms of how many
occurrences of it are there ... may be someone else on the list can
comment on how to extract that info in a query.

Sounds doable.

On Thu, Oct 14, 2010 at 10:17 AM,  <> wrote:
> Hello
> I would like to store data retrieved hourly from RSS feeds in a database or in Lucene
so that the text can be easily
> indexed for word frequencies.
> I need to get the text from the title and description elements of RSS items.
> Ideally, for each hourly retrieval from a given feed, I would add a row to a table in
a dataset made up of the
> following columns:
> feed_url, title_element_text, description_element_text, polling_date_time
> From this, I can look up any element in a feed and calculate keyword frequencies based
upon the length of time required.
> This can be done as a database table and hashmaps used to calculate word frequencies.
But can I do this in Lucene to
> this degree of granularity at all? If so, would each feed form a Lucene document or would
each 'row' from the
> database table form one?
> Can anyone advise?
> Thanks
> Martin O'Shea.
> --
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message