nutch-dev mailing list archives

From Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Subject Re: Nutch Author, Publication, and Religion Detection
Date Mon, 02 Jul 2012 12:32:39 GMT
OK, so please let us know how you get on.

Although you seem to have a clear idea about how you're going to
progress with the issue, I would seriously consider taking on board
Julien's comments and grabbing the code that he's made available for
similar tasks.

All the best

Lewis

On Fri, Jun 29, 2012 at 7:19 PM, JAB <george.garnett@baesystems.com> wrote:
> Hi Lewis;
>
> I'm looking at creating a Nutch plugin to determine whether a document is an
> article on religion and which religion it is primarily talking about. It would
> then add an annotation called 'religion' to the document giving the primary
> category of the religion. Examples: Atheism, Buddhism, Christian, Hindu,
> Jewish, Muslim, or Unknown (if it can't be determined). No annotation will be
> added if it's not an article on religion. Next comes another annotation for
> the sub-category of the religion. For example, under Christian would be
> Catholic or Protestant. Then possibly a third annotation for the denomination.
> Examples of denominations: 'Baptist Bible Churches' or 'Christian Methodist
> Episcopal Church' (I have a list of 147 denominations). I'm not familiar with
> religious breakdowns, so I don't know if this is the appropriate way to
> categorize them.
>
> ******
> Design:
>
> I created a Java class on religion that implements the IndexingFilter
> interface. I next determine whether it's an article on religion. I do so by
> counting the number of occurrences of certain keywords in the document. For
> example, if 'God' appears more than 10 times, it's an article on religion. If
> it mentions 'Christian' more than a certain number of times and more often
> than other religions, the primary category would be 'Christian'. The first
> match in the denomination search would be assumed to be the denomination.
> I'm also using a language-detection plugin
> (http://developer.cybozu.co.jp/oss/2010/10/language-detect.html) to
> determine the language of the document so I can search for words in the
> document's native language. I don't know if this is the best approach to
> solving this issue.
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Nutch-Author-Publication-and-Religion-Detection-tp3991662p3992130.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
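
For reference, the keyword-counting heuristic described in the quoted design could be sketched roughly as below. This is a standalone illustration only, not JAB's actual plugin code and not wired into Nutch. The class name, the keyword list, and CATEGORY_THRESHOLD are hypothetical; only the "more than 10 occurrences of 'God'" rule comes from the message. Sub-category, denomination, and language detection are left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Standalone sketch of the keyword-count heuristic (hypothetical names throughout). */
class ReligionKeywordClassifier {

    // From the message: 'God' must appear more than 10 times.
    static final int TOPIC_THRESHOLD = 10;
    // Assumption: stand-in for the unspecified "certain number of times".
    static final int CATEGORY_THRESHOLD = 3;

    // Assumption: one English keyword per category; a real list would be
    // larger and would vary with the detected document language.
    static final Map<String, String> CATEGORY_KEYWORDS = new LinkedHashMap<>();
    static {
        CATEGORY_KEYWORDS.put("Atheism", "atheist");
        CATEGORY_KEYWORDS.put("Buddhism", "buddhist");
        CATEGORY_KEYWORDS.put("Christian", "christian");
        CATEGORY_KEYWORDS.put("Hindu", "hindu");
        CATEGORY_KEYWORDS.put("Jewish", "jewish");
        CATEGORY_KEYWORDS.put("Muslim", "muslim");
    }

    /** Counts case-insensitive whole-word occurrences of word in text. */
    static int count(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    /**
     * Returns empty if the text is not judged to be about religion (no
     * annotation), "Unknown" if it is but no single category clearly wins,
     * otherwise the winning primary category.
     */
    static Optional<String> classify(String text) {
        if (count(text, "god") <= TOPIC_THRESHOLD) {
            return Optional.empty();
        }
        String best = null;
        int bestCount = 0;
        boolean tied = false;
        for (Map.Entry<String, String> e : CATEGORY_KEYWORDS.entrySet()) {
            int c = count(text, e.getValue());
            if (c > bestCount) {
                best = e.getKey();
                bestCount = c;
                tied = false;
            } else if (c == bestCount && c > 0) {
                tied = true; // another category is mentioned equally often
            }
        }
        if (best == null || tied || bestCount <= CATEGORY_THRESHOLD) {
            return Optional.of("Unknown");
        }
        return Optional.of(best);
    }
}
```

Calling ReligionKeywordClassifier.classify(text) on the parsed document text would give the value for the 'religion' annotation, with an empty result meaning no annotation is added. Keeping the thresholds and keywords in one place should make them easy to tune, since fixed counts like these are sensitive to document length.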



-- 
Lewis
