nutch-dev mailing list archives

From Lewis John Mcgibbney
Subject Re: Nutch Author, Publication, and Religion Detection
Date Mon, 02 Jul 2012 12:32:39 GMT
OK so please let us know how you get on.

Although you seem to have a clear idea about how you're going to
progress with the issue, I would seriously consider taking on board
Julien's comments and grabbing the code that he's made available for
similar tasks.

All the best


On Fri, Jun 29, 2012 at 7:19 PM, JAB wrote:
> Hi Lewis;
> I'm looking at creating a Nutch plugin to determine whether a document is an
> article on religion and, if so, which religion it is primarily about. The
> plugin would then add an annotation called 'religion' to the document holding
> the primary category of the religion. Examples: Atheism, Buddhism, Christian,
> Hindu, Jewish, Muslim, or Unknown (if it can't be determined). No annotation
> will be added if it's not an article on religion. Next comes another
> annotation for the sub-category of the religion. For example, under Christian
> would be Catholic or Protestant. Then possibly a third annotation for the
> denomination. Examples of denominations: 'Baptist Bible Churches' or
> 'Christian Methodist Episcopal Church' (I have a list of 147 denominations).
> I'm not familiar with religious breakdowns, so I don't know if this is the
> appropriate way to categorize them.
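
A minimal sketch of the three-level annotation scheme described above, assuming the annotations are plain name/value pairs. Only the field name 'religion' comes from the post; the names 'religion_subcategory' and 'religion_denomination', the class name, and the method signature are hypothetical, and nothing here touches the Nutch API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Nutch: builds the annotations for one document. */
class ReligionAnnotations {

    /**
     * Returns field name -> value pairs for a document.
     * The map is empty when the document is not an article on religion,
     * so that no annotation is added in that case.
     */
    static Map<String, String> build(boolean isReligionArticle,
                                     String category,
                                     String subCategory,
                                     String denomination) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (!isReligionArticle) {
            return fields; // not about religion: add nothing
        }
        // 'religion' is the field name used in the post; fall back to Unknown.
        fields.put("religion", category == null ? "Unknown" : category);
        // The two field names below are assumptions made for illustration.
        if (subCategory != null) {
            fields.put("religion_subcategory", subCategory);
        }
        if (denomination != null) {
            fields.put("religion_denomination", denomination);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(build(true, "Christian", "Protestant",
                "Christian Methodist Episcopal Church"));
        System.out.println(build(true, null, null, null));
        System.out.println(build(false, null, null, null));
    }
}
```

Keeping the optional levels out of the map when they are unknown means a downstream consumer can tell "not determined" apart from "not a religion article" by checking whether 'religion' is present at all.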
> ******
> Design:
> I created a Java class for religion that implements the IndexingFilter
> interface. I first determine whether the document is an article on religion by
> counting the number of occurrences of certain keywords in the document. For
> example, if 'God' appears more than 10 times, it's an article on religion. If
> it mentions 'Christian' more than a certain number of times and more often
> than other religions, the category would be 'Christian'. The first match in
> the denomination search would be assumed to be the denomination. I'm also
> using a language-detection plugin to determine the language of the document
> so I can search for words in the document's native language. I don't know if
> this is the best approach to solving this issue.
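
The keyword-counting step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name, the English keyword list, and the category threshold of 3 are hypothetical (the post gives only the "more than 10 occurrences of 'God'" example), and the wiring into a Nutch indexing filter and the language detection are left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical keyword-count classifier; standalone, not part of Nutch. */
class ReligionKeywordClassifier {

    // From the post: more than 10 occurrences of "God" marks a religion article.
    static final int GENERAL_THRESHOLD = 10;
    // Assumed value for the post's "a certain number of times".
    static final int CATEGORY_THRESHOLD = 3;

    // Illustrative English keywords only; real lists would be chosen per language.
    static final Map<String, String> CATEGORY_KEYWORDS = new LinkedHashMap<>();
    static {
        CATEGORY_KEYWORDS.put("Atheism", "atheist");
        CATEGORY_KEYWORDS.put("Buddhism", "buddhist");
        CATEGORY_KEYWORDS.put("Christian", "christian");
        CATEGORY_KEYWORDS.put("Hindu", "hindu");
        CATEGORY_KEYWORDS.put("Jewish", "jewish");
        CATEGORY_KEYWORDS.put("Muslim", "muslim");
    }

    /** Counts case-insensitive whole-word occurrences of word in text. */
    static int count(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    /**
     * Returns empty when the text is not a religion article, otherwise the
     * category whose keyword count exceeds the threshold and strictly beats
     * every other category, or "Unknown" when no category qualifies.
     */
    static Optional<String> classify(String text) {
        if (count(text, "God") <= GENERAL_THRESHOLD) {
            return Optional.empty();
        }
        String best = "Unknown";
        int bestCount = CATEGORY_THRESHOLD; // a category must exceed this
        boolean tie = false;
        for (Map.Entry<String, String> e : CATEGORY_KEYWORDS.entrySet()) {
            int c = count(text, e.getValue());
            if (c > bestCount) {
                best = e.getKey();
                bestCount = c;
                tie = false;
            } else if (c == bestCount && !best.equals("Unknown")) {
                tie = true; // two categories share the top count
            }
        }
        return Optional.of(tie ? "Unknown" : best);
    }

    public static void main(String[] args) {
        String text = "God ".repeat(11) + "Christian ".repeat(5) + "Muslim";
        System.out.println(classify(text));
    }
}
```

Raw counts favour long documents, so a variant that divides each count by the document length may be worth comparing against this one.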

