mahout-user mailing list archives

From Em <>
Subject Boundary Values for Training Data
Date Fri, 23 Sep 2011 10:48:08 GMT
Hello list,

Let's say I want to classify documents and there are two possible outcomes:
Yes, the document belongs to the topic I focus on, or No, it doesn't.

The topic is for example: Machine Learning.

Doc1: A sub-chapter of the book "Mahout in Action"
Doc2: A paper about clustering techniques
Doc3: A blog post by Ted Dunning, a machine learning expert, giving
his opinion on the relationship between Google and Oracle
Doc4: Ted Dunning talking about how to cook tasty spaghetti (Sorry
Ted, you are my guinea pig in this case)

The point is: Doc3 is not really about Machine Learning. However, it
might be relevant for people who are interested in Machine Learning,
since the author is a machine learning expert and his opinion might
reflect some thoughts regarding that domain.

Doc4 is completely irrelevant. It has to do with Ted Dunning, but not
with Machine Learning or software at all. The only exception would be
if Ted wrote a piece of Machine Learning software that creates a
recipe for cooking tasty spaghetti ;).

If I change the topic to something like "Star Trek":

Doc1: A review of a Star Trek movie
Doc2: A description of a Star Trek computer game
Doc3: A review of a PlayStation 3 Star Trek game
Doc4: The announcement that the gaming studio of the Star Trek games is
going to create a new Star Wars game
Doc5: A description of a Star Wars book
Doc6: The gaming studio of the Star Trek games is going to create a
Need for Speed clone

Docs 1, 2 and 3 are relevant for Trekkies. Doc 4 might be as well, because
the studio is an authority on creating good Star Trek games and they
noted that their experience with Star Trek will help them build a
good Star Wars game. Some fans might be interested in this.

However, Doc 5 is completely irrelevant, since it has nothing to do with
Star Trek.
Doc 6 is about an authority in the Star Trek merchandise industry, but it
corresponds to the Ted-cooks-spaghetti case from my first example:
Doc 6 is irrelevant.

Doc3 of my "Machine Learning" example and Doc 4 of my "Star Trek" one
are boundary cases for being relevant. They might interest people who
focus on the two named domains, but they sail very close to the wind.

Does it generally make sense to take such examples into account when
training a model? Even human readers could argue about whether those
examples really belong to the domain they want to focus on.
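
To make the question concrete, here is a minimal Python sketch of the two
options: training with or without the boundary cases. It is only an
illustration. The Example class, the borderline flag and the training_set
helper are hypothetical names made up for this sketch and are not part of
Mahout, and the labels are just my own hand-assigned guesses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One hand-labeled document (hypothetical structure, not a Mahout type)."""
    text: str
    relevant: bool            # the Yes/No label for the topic
    borderline: bool = False  # True if humans could reasonably disagree


# The "Machine Learning" documents from above, labeled by hand.
ML_EXAMPLES = [
    Example("Sub-chapter of 'Mahout in Action'", relevant=True),
    Example("Paper about clustering techniques", relevant=True),
    Example("Ted Dunning on Google and Oracle", relevant=True, borderline=True),
    Example("Ted Dunning on cooking spaghetti", relevant=False),
]


def training_set(examples, include_borderline):
    """Return (text, label) pairs, optionally dropping the boundary cases."""
    return [
        (ex.text, ex.relevant)
        for ex in examples
        if include_borderline or not ex.borderline
    ]


# Option A: the boundary case is part of the training data.
with_boundary = training_set(ML_EXAMPLES, include_borderline=True)

# Option B: only the clear-cut documents are used for training.
without_boundary = training_set(ML_EXAMPLES, include_borderline=False)
```

The question is whether option A or option B is the better input for the
classifier.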

Thank you for your advice.

