mahout-user mailing list archives

From Em <>
Subject Re: Boundary Values for Training Data
Date Mon, 26 Sep 2011 14:55:25 GMT
No experiences?


On 23.09.2011 12:48, Em wrote:
> Hello list,
> let's say I want to classify documents and there are two possible outcomes:
> Yes, the document belongs to the topic I focus on, or No, it doesn't.
> The topic is for example: Machine Learning.
> Doc1: A sub-chapter of the book "Mahout in Action"
> Doc2: A paper about clustering-techniques
> Doc3: A Blog-Post of Ted Dunning, Machine-Learning-Expert, talking about
> his opinion regarding the relationship between Google and Oracle
> Doc4: Ted Dunning is talking about how to cook tasty spaghetti (Sorry
> Ted, you are my guinea pig in this case)
> The point is: Doc3 is not really about Machine Learning, however it
> might be relevant for people that are interested in Machine Learning,
> since the author is a Machine-Learning-Expert and his opinion might
> reflect some thoughts regarding that domain.
> Doc4 is completely irrelevant. It has to do with Ted Dunning, but
> nothing to do with Machine Learning or software at all. The only
> exception would be if Ted wrote a piece of Machine Learning software
> that creates recipes for cooking tasty spaghetti ;).
> If I change the topic to something like "Star Trek":
> Doc1: A review of a Star Trek movie
> Doc2: A Star Trek computer game's description
> Doc3: A review regarding a PlayStation 3 Star Trek game
> Doc4: The announcement that the gaming studio of the Star Trek games is
> going to create a new Star Wars game
> Doc5: A Star Wars book's description
> Doc6: The gaming studio of the Star Trek games is going to create a
> Need for Speed clone
> Docs 1, 2 and 3 are relevant for Trekkies. Doc 4 might be as well, because
> the studio is an authority for creating good Star Trek games and they
> noted that their experiences with Star Trek will help them build a
> good Star Wars game. Some fans might be interested in this.
> However doc 5 is completely irrelevant, since it has nothing to do with
> Star Trek.
> Doc 6 is about an authority in the Star Trek merchandise industry, but it
> parallels the Ted-cooks-spaghetti case from my first example:
> Doc 6 is irrelevant.
> Doc3 of my "Machine Learning" example and Doc 4 of my "Star Trek" one
> are boundary values for being relevant. They might interest people that
> focus on the two named domains, but they sail very close to the wind.
> Does it generally make sense to take such examples into account when
> training a model? Even humans might disagree about whether those
> examples really belong to the domain they want to focus on.
> Thank you for your advice.
> Regards,
> Em
