mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: SGD classifier demo app
Date Mon, 03 Feb 2014 23:57:46 GMT
Johannes,

Very good comments.

Frank,

As a benchmark, I just spent a few minutes building a logistic regression
model using R.  For this model AUC on 10% held-out data is about 0.9.

Here is a gist summarizing the results:

https://gist.github.com/tdunning/8794734
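For readers unfamiliar with the metric Ted reports: AUC is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. This is a minimal illustrative sketch (plain Java, not Ted's R code) that computes AUC by pairwise comparison of held-out scores:

```java
/** Toy AUC computation: probability that a random positive outranks a random negative. */
class AucSketch {
    static double auc(double[] posScores, double[] negScores) {
        long wins = 0, ties = 0;
        for (double p : posScores) {
            for (double n : negScores) {
                if (p > n) wins++;
                else if (p == n) ties++;
            }
        }
        long pairs = (long) posScores.length * negScores.length;
        return (wins + 0.5 * ties) / pairs;   // ties count half
    }

    public static void main(String[] args) {
        // Positives all score higher than negatives, so AUC is 1.0 here.
        double[] pos = {0.9, 0.8, 0.7};
        double[] neg = {0.4, 0.3, 0.6};
        System.out.println(auc(pos, neg));   // 1.0
    }
}
```

For large held-out sets a rank-based formula is cheaper than this O(n*m) loop, but the result is the same.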




On Mon, Feb 3, 2014 at 2:41 PM, Johannes Schulte <johannes.schulte@gmail.com> wrote:

> Hi Frank,
>
> you are using the feature vector encoders, which hash a combination of
> feature name and feature value to two (the default) locations in the vector.
> The vector size you configured is 11, which is imo very small relative to the
> possible combinations of values you have in your data (education, marital,
> campaign). You can do no harm by using a much larger cardinality (try
> 1000).
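To see why a cardinality of 11 is a problem, here is a toy illustration of hashed feature encoding (this is not Mahout's actual hash function or probing scheme, just the idea): each name:value pair hashes to a slot, and with only 11 slots distinct values frequently collide, while a larger vector keeps them apart.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy illustration of hash collisions in a too-small feature vector. */
class HashCollisionDemo {
    // Map a feature name:value pair to a slot, as a hashed encoder would.
    static int slot(String name, String value, int cardinality) {
        return Math.floorMod((name + ":" + value).hashCode(), cardinality);
    }

    // Count how many distinct slots the values actually land in.
    static int distinctSlots(String[] values, int cardinality) {
        Set<Integer> slots = new HashSet<>();
        for (String v : values) slots.add(slot("education", v, cardinality));
        return slots.size();
    }

    public static void main(String[] args) {
        String[] values = {"primary", "secondary", "tertiary", "unknown",
                           "single", "married", "divorced"};
        // With 11 slots, some of the 7 values will likely share a slot;
        // with 1000 slots, collisions are rare.
        System.out.println(distinctSlots(values, 11));
        System.out.println(distinctSlots(values, 1000));
    }
}
```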
>
> Second, you are using a continuous value encoder and passing in the weight
> you are using as a string (e.g. the variable "pDays"). I am not quite sure
> about the reasons in the Mahout code right now, but the way it is implemented,
> every unique value ends up in a different location because the continuous
> value is part of the hashing. Try adding the weight directly using a
> StaticWordValueEncoder: addToVector("pDays", pDays, v)
>
> Last, you are also putting in the variable "campaign" as a continuous
> variable when it should probably be a categorical variable, so just add it
> with a StaticWordValueEncoder.
>
> And finally, and probably most important after looking at your target
> variable: you are using a Dictionary to map "yes" or "no" to 0 or 1.
> This is bad. Depending on what comes first in the data set, either the
> positive or the negative examples might end up as 0 or 1, totally at random.
> Make a hard-coded mapping from the possible values (yes/no?) to zero and
> one, with "yes" as the 1 and "no" as the zero.
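A fixed mapping is a one-liner. A sketch, assuming the label strings in the CSV are "yes"/"no":

```java
/** Fixed label-to-target mapping, independent of the order labels appear in the data. */
class TargetMapping {
    static int target(String label) {
        return "yes".equals(label.trim()) ? 1 : 0;  // "yes" -> 1, everything else -> 0
    }

    public static void main(String[] args) {
        System.out.println(target("yes"));  // 1
        System.out.println(target("no"));   // 0
    }
}
```

With a Dictionary, two runs over differently ordered data can silently swap the meaning of the model's output, which makes any accuracy numbers incomparable between runs.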
>
>
>
>
>
> On Mon, Feb 3, 2014 at 9:33 PM, Frank Scholten <frank@frankscholten.nl> wrote:
>
> > Hi all,
> >
> > I am exploring Mahout's SGD classifier and would like some feedback,
> > because I don't think I properly configured things.
> >
> > I created an example app that trains an SGD classifier on the 'bank
> > marketing' dataset from UCI:
> > http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
> >
> > My app is at: https://github.com/frankscholten/mahout-sgd-bank-marketing
> >
> > The app reads a CSV file of telephone calls, encodes the features into a
> > vector and tries to predict whether a customer answers yes to a business
> > proposal.
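The encode/train/classify loop Frank describes has roughly the following shape. This is a toy plain-Java sketch rather than Mahout's OnlineLogisticRegression, with hypothetical names throughout, but the flow (encode a row into a vector, train on the target, then score) is the same:

```java
/** Toy SGD logistic regression: the shape of an encode/train/classify loop. */
class SgdSketch {
    final double[] w;
    final double learningRate;

    SgdSketch(int numFeatures, double learningRate) {
        this.w = new double[numFeatures];
        this.learningRate = learningRate;
    }

    // Probability that the target is 1 for this feature vector.
    double classifyScalar(double[] x) {
        double dot = 0;
        for (int i = 0; i < w.length; i++) dot += w[i] * x[i];
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    // One SGD step on the log-loss gradient for a single example.
    void train(int target, double[] x) {
        double gradient = target - classifyScalar(x);
        for (int i = 0; i < w.length; i++) w[i] += learningRate * gradient * x[i];
    }

    public static void main(String[] args) {
        // Tiny separable data: feature[0] is the intercept, high feature[1] means class 1.
        double[][] rows = {{1, 0.0}, {1, 0.1}, {1, 0.9}, {1, 1.0}};
        int[] targets = {0, 0, 1, 1};
        SgdSketch model = new SgdSketch(2, 0.5);
        for (int pass = 0; pass < 200; pass++)
            for (int i = 0; i < rows.length; i++) model.train(targets[i], rows[i]);
        System.out.println(model.classifyScalar(new double[]{1, 1.0}) > 0.5);  // true
        System.out.println(model.classifyScalar(new double[]{1, 0.0}) < 0.5);  // true
    }
}
```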
> >
> > I do a few runs and measure accuracy, but I don't trust the results.
> > When I only use an intercept term as a feature I get around 88% accuracy
> > and when I add all features it drops to around 85%. Is this perhaps
> > because the dataset is highly unbalanced? Most customers answer no. Or is
> > the classifier biased to predict 0 as the target code when it doesn't
> > have any data to go on?
> >
> > Any other comments about my code or improvements I can make in the app
> > are welcome! :)
> >
> > Cheers,
> >
> > Frank
> >
>
