mahout-user mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: Classification beginner questions
Date Fri, 10 Jun 2011 15:23:40 GMT
Hi Joscha,

If you have some money to spare, I'd recommend getting a copy of Mahout in
Action, which features a very readable, detailed introduction to
classification with Mahout, including strategies for feature selection.

--sebastian

On 10.06.2011 17:28, Hector Yee wrote:
> Oh, you have a very strange feature: you are using the label as a feature. My bad, I
> thought the words were the labels.
> Usually a feature is something like weight or height, something meaningful. If it's just
> the label, as in your code, you might as well use a hash map; there is no feature to learn!
> But if you want to try, make it an indicator vector: set FEATURES to the number of animals
> and, for each vector, set it to 1 at the index of the animal in the array and 0 everywhere
> else. E.g. for "ant" (index 1) the vector is 0, 1, 0, 0, 0, ...
>
> Sent from my iPad
>
> On Jun 10, 2011, at 12:54 AM, Joscha Feth<joscha@feth.com>  wrote:
>
>> Hello fellow Mahouts,
>>
>> I am trying to grasp Mahout and generated a very simple (but obviously
>> wrong) example which I hoped would help me understand how everything works:
>>
>> -- 8<  --
>> public class OLRTest {
>>
>>     private static final int FEATURES = 1;
>>     private static final int CATEGORIES = 2;
>>
>>     private static final WordValueEncoder ANIMAL_ENCODER =
>>             new AdaptiveWordValueEncoder("animal");
>>
>>     private static final String[] animals = new String[] {
>>             "alligator", "ant", "bear", "bee", "bird", "camel", "cat",
>>             "cheetah", "chicken", "chimpanzee", "cow", "crocodile", "deer",
>>             "dog", "dolphin", "duck", "eagle", "elephant", "fish", "fly",
>>             "fox", "frog", "giraffe", "goat", "goldfish", "hamster",
>>             "hippopotamus", "horse", "kangaroo", "kitten", "lion", "lobster",
>>             "monkey", "octopus", "owl", "panda", "pig", "puppy", "rabbit",
>>             "rat", "scorpion", "seal", "shark", "sheep", "snail", "snake",
>>             "spider", "squirrel", "tiger", "turtle", "wolf", "zebra" };
>>
>>     public static void main(String[] args) {
>>         final OnlineLogisticRegression algorithm =
>>                 new OnlineLogisticRegression(CATEGORIES, FEATURES, new L1());
>>
>>         for (String animal : animals) {
>>             algorithm.train(0, generateVector(animal));
>>         }
>>
>>         algorithm.close();
>>
>>         testClassify(algorithm, "lion");
>>         testClassify(algorithm, "rabbit");
>>         testClassify(algorithm, "xyz");
>>         testClassify(algorithm, "something");
>>     }
>>
>>     private static void testClassify(
>>             final OnlineLogisticRegression algorithm,
>>             final String allegedAnimal) {
>>         System.out.println(allegedAnimal
>>                 + " is an animal with a probability of "
>>                 + algorithm.classifyScalar(generateVector(allegedAnimal)) * 100
>>                 + "%");
>>     }
>>
>>     private static Vector generateVector(String animal) {
>>         final Vector v = new RandomAccessSparseVector(FEATURES);
>>         ANIMAL_ENCODER.addToVector(animal, v);
>>         return v;
>>     }
>> }
>> -- 8<  --
>>
>> The output of running this sample code is:
>> -- 8<  --
>> lion is an animal with a probability of 0.12008121418417145%
>> rabbit is an animal with a probability of 0.11720244687895641%
>> xyz is an animal with a probability of 0.04192879358244322%
>> something is an animal with a probability of 0.04047790610981663%
>> -- 8<  --
>>
>> There were multiple surprising things for me:
>> * I would have expected the probability of "lion" and "rabbit" to be close
>> to 100%
>> * I would have expected the probability of "xyz" and "something" to be close
>> to 0%
>> * I would have expected the probability of "lion" to be the same as the one
>> for "rabbit"
>> * I would have expected the probability of "xyz" to be the same as the one
>> for "something"
>>
>> I know that the animals sample provided is extremely small, but even when
>> training with multiple passes (100, 1000, 10000) the probabilities changed
>> only marginally.
>> What am I missing here?
>>
>> Thanks very much!
>> Joscha Feth
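
The indicator vector Hector describes above can be sketched in plain Java,
without any Mahout classes. This is only an illustration under stated
assumptions: the class name IndicatorVectorSketch, the method
indicatorVector, and the shortened animal list are hypothetical and do not
come from the thread. In the original program, the equivalent change would
presumably be to set FEATURES to animals.length and fill the Mahout vector
at the animal's index instead of calling the word encoder.

```java
import java.util.Arrays;
import java.util.List;

public class IndicatorVectorSketch {

    // Shortened, hypothetical vocabulary; the original post uses a longer list.
    public static final List<String> ANIMALS =
            Arrays.asList("alligator", "ant", "bear", "bee", "bird", "camel", "cat");

    /**
     * Builds a vector with one slot per known animal: 1.0 at the index of
     * the given word and 0.0 everywhere else. A word that is not in the
     * list yields an all-zero vector.
     */
    public static double[] indicatorVector(String word) {
        double[] v = new double[ANIMALS.size()];
        int index = ANIMALS.indexOf(word);
        if (index >= 0) {
            v[index] = 1.0;
        }
        return v;
    }

    public static void main(String[] args) {
        // "ant" sits at index 1, so only the second slot is set.
        System.out.println("ant -> " + Arrays.toString(indicatorVector("ant")));
        // "xyz" is unknown, so no slot is set.
        System.out.println("xyz -> " + Arrays.toString(indicatorVector("xyz")));
    }
}
```

With one slot per animal, each known word gets a weight of its own to learn,
whereas a single slot (FEATURES = 1) forces every word to share one weight.
As Hector notes, a plain lookup would do the same job here, since the
"feature" is just the label itself.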

