nutch-dev mailing list archives

From Sujen Shah <sujen1...@gmail.com>
Subject Re: Team 18 : Similarity scoring: goldstandard.txt, stopwords.txt contents
Date Wed, 07 Oct 2015 16:15:57 GMT
Hi Mithun,
goldstandard.txt is the file against which the parsed text of each HTML
page fetched by Nutch is compared. It has no particular format; it is just
plain text.

For example, to score pages by how similar they are to the topic of
Robotics, you would want your goldstandard.txt to contain words like
Autonomous, Artificial Intelligence, Robots, etc. (or you could even paste
in the entire Wikipedia article on Robotics).

So you can basically just put all the relevant terms in your goldstandard.
But remember that pasting in a lot of loosely related text could introduce
noise.
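
To make the comparison concrete, here is a minimal Python sketch of
scoring page text against a goldstandard. It is only an illustration, not
the plugin's code: it assumes the score is a cosine similarity over raw
term counts, and the sample goldstandard and page texts are made up.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical goldstandard.txt contents for a Robotics-focused crawl.
gold = term_counts("autonomous robots artificial intelligence robotics sensors")

# Hypothetical parsed text of two fetched pages.
on_topic = term_counts("Autonomous robots use sensors and artificial intelligence")
off_topic = term_counts("Fresh pasta recipes with tomato and basil")

score_on = cosine_similarity(gold, on_topic)    # shares several terms with gold
score_off = cosine_similarity(gold, off_topic)  # shares no terms with gold
```

The page that shares terms with the goldstandard scores higher than the
one that shares none. Adding many unrelated words to the goldstandard
would lengthen its vector and lower the score of on-topic pages.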

The stopwords.txt file is used to filter irrelevant words such as and,
the, then, etc. out of the vocabulary. You could search online for a list
of common stop words, or add your own.
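
As a rough sketch of what that filtering amounts to (again in Python, and
not the plugin's actual code), assuming stopwords.txt holds one word per
line. The helper names and sample words here are hypothetical.

```python
import re


def load_stopwords(lines):
    """Build a stop-word set from lines, one word per line; blanks ignored."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(text, stopwords):
    """Tokenize the text and drop every token found in the stop-word set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords]


# Hypothetical stopwords.txt contents, one word per line.
stopwords_txt = "and\nthe\nthen\nof\na\n"
stopwords = load_stopwords(stopwords_txt.splitlines())

filtered = remove_stopwords("The robot and the sensor then moved", stopwords)
```

Only the content-bearing words are left in the vocabulary, so common
filler words cannot inflate the similarity between unrelated pages.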

I hope the above explanation helps you get started. I will update the
wiki with an example soon.

Best,
Sujen


Regards,
Sujen Shah
M.S - Computer Science (Class of 2016)
University of Southern California
http://www.linkedin.com/in/sujenshah

On Wed, Oct 7, 2015 at 6:52 AM, Christian Alan Mattmann <mattmann@usc.edu>
wrote:

> Sujen can you provide an example on the existing Scoring
> Similarity wiki page of what the gold standard file
> should have in it and how it should be formatted.
>
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Adjunct Associate Professor, Computer Science Department
> University of Southern California
> Los Angeles, CA 90089 USA
> Email: mattmann@usc.edu
> WWW: http://sunset.usc.edu/
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
> -----Original Message-----
> From: Mithun Maragiri <maragiri@usc.edu>
> Date: Wednesday, October 7, 2015 at 1:39 AM
> To: jpluser <mattmann@usc.edu>
> Subject: Team 18 : Similarity scoring: goldstandard.txt, stopwords.txt
> contents
>
> >Hello Professor,
> >
> >
> >I am trying to implement the 7th question of scoring similarity.
> >I read the material given in the link and understood how to enable it and
> >how to use it.
> >But I did not get what to fill in the goldstandard.txt and stopwords.txt.
> >The link mentions that stopwords.txt should have stop words one per line
> >but I dont know what are my stop words and what to write in
> >goldstandard.txt
> >
> >
> >Can you please tell what to write in the goldstandard and stopwords.txt
> >files? Is there any example or reference which we can use to implement?
> >
> >Thanks,
> >Mithun
