mahout-user mailing list archives

From Mike Burba <mike.bu...@gmail.com>
Subject Should I be using OnlineLogisticRegression?
Date Thu, 06 Sep 2012 23:42:35 GMT
This is a newbie question from someone who is just getting familiar with Mahout
and machine learning.

I bought and have read Mahout In Action, and I'm trying to apply the
concepts to some "real-world" data (i.e., not in the examples).

The problem I am trying to solve is a classification problem, so I started
with OnlineLogisticRegression.  I'm struggling to get good results out of
it, however, so I wonder if I am using the wrong algorithm.  Other notes
about my data:

- My target variable has 5 categories, although 1 of the 5 dominates and
accounts for 90%+ of the records in the training set.
- My 6 predictor variables are all numeric; some range from 0 to 5, others
range from 0 to 1,000,000.
- The training set has millions of records.
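
To make the class imbalance concrete: when one category covers 90%+ of the records, always guessing that category already scores 90%+ accuracy, so that figure is the baseline any model has to beat. A minimal sketch in plain Java (no Mahout dependency) that measures that share; the class name, method name, and sample labels are hypothetical and only for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Mahout.
class MajorityBaseline {
    // Returns the fraction of records that fall in the most frequent category,
    // i.e. the accuracy of always predicting that category.
    static double majorityShare(int[] labels) {
        if (labels.length == 0) {
            throw new IllegalArgumentException("no labels");
        }
        Map<Integer, Integer> counts = new HashMap<>();
        int best = 0;
        for (int label : labels) {
            int count = counts.merge(label, 1, Integer::sum);
            best = Math.max(best, count);
        }
        return (double) best / labels.length;
    }

    public static void main(String[] args) {
        // Made-up labels: 9 of 10 records are in category 0.
        int[] labels = {0, 0, 0, 0, 0, 0, 0, 0, 0, 3};
        System.out.println(majorityShare(labels));
    }
}
```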

I have modified the TrainLogistic / RunLogistic examples to use
classifyFull() instead of classifyScalar(), and output the resulting Vector
as probabilities for the selection of each category.
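
The step from that per-category probability vector to a single predicted category can be sketched roughly as follows. This is plain Java and assumes the Vector returned by classifyFull() has already been copied into a double[] with one entry per category; the class and method names are hypothetical and not part of Mahout:

```java
// Hypothetical helper, not part of Mahout.
class ArgMaxSketch {
    // Returns the index of the category with the highest probability.
    // Ties are resolved in favour of the lowest index.
    static int mostLikelyCategory(double[] probabilities) {
        if (probabilities.length == 0) {
            throw new IllegalArgumentException("empty probability vector");
        }
        int best = 0;
        for (int i = 1; i < probabilities.length; i++) {
            if (probabilities[i] > probabilities[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up scores for 5 categories.
        double[] probabilities = {0.05, 0.80, 0.05, 0.05, 0.05};
        System.out.println(mostLikelyCategory(probabilities));
    }
}
```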

So why do I think the results aren't very good?  When I run the model
against the validation set, I am not much better than random.  Also, if I
change the problem, so that the target variable just has 2 categories
instead of 5 (either in the 90% category or out), and then use Auc to
validate against the training set, my best score is 0.52.  I have also
tried many values for --rate and --features, but none seems to make a
difference.
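
For context on the 0.52 figure: AUC is the probability that a randomly chosen positive record gets a higher score than a randomly chosen negative one, so 0.5 is chance and 1.0 is a perfect ranking. A minimal sketch of that definition in plain Java follows. It is not Mahout's Auc class; the names and sample data are hypothetical, and the pairwise loop is quadratic, so it only suits a small sample rather than millions of records:

```java
// Hypothetical helper illustrating the definition of AUC; not Mahout's Auc class.
class AucSketch {
    // labels[i] is 1 for the positive category and 0 for the negative one;
    // scores[i] is the model's score for the positive category.
    static double auc(int[] labels, double[] scores) {
        if (labels.length != scores.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double wins = 0.0;
        long pairs = 0;
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] != 1) {
                continue;
            }
            for (int j = 0; j < labels.length; j++) {
                if (labels[j] != 0) {
                    continue;
                }
                pairs++;
                if (scores[i] > scores[j]) {
                    wins += 1.0;   // positive ranked above negative
                } else if (scores[i] == scores[j]) {
                    wins += 0.5;   // a tie counts as half a win
                }
            }
        }
        if (pairs == 0) {
            throw new IllegalArgumentException("need both positive and negative records");
        }
        return wins / pairs;
    }

    public static void main(String[] args) {
        // Made-up labels and scores.
        int[] labels = {1, 1, 0, 0};
        double[] scores = {0.9, 0.4, 0.5, 0.1};
        System.out.println(auc(labels, scores));
    }
}
```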

Does anyone have any advice on whether I am using a hammer on a screw?  Is it
more likely that I have not found predictors that are very relevant?  Or am
I using an algorithm that is a poor fit?

I really appreciate your help,
Mike