mahout-user mailing list archives

From Frank Scholten <fr...@frankscholten.nl>
Subject Re: SGD classifier demo app
Date Tue, 04 Feb 2014 18:39:27 GMT
Thanks Ted!

Would indeed be a nice example to add.


On Tue, Feb 4, 2014 at 10:40 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Yes.
>
>
> On Tue, Feb 4, 2014 at 1:31 AM, Sebastian Schelter <ssc@apache.org> wrote:
>
> > Would be great to add this as an example to Mahout's codebase.
> >
> >
> > On 02/04/2014 10:27 AM, Ted Dunning wrote:
> >
> >> Frank,
> >>
> >> I just munched on your code and sent a pull request.
> >>
> >> In doing this, I made a bunch of changes.  Hope you liked them.
> >>
> >> These include massive simplification of the reading and vectorization.
> >>   This wasn't strictly necessary, but it seemed like a good idea.
> >>
> >> More important was the way that I changed the vectorization.  For the
> >> continuous values, I added log transforms.  For the categorical values, I
> >> encoded them as they are.  I also increased the feature vector size to 100 to
> >> avoid excessive collisions.
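For illustration, the encoding scheme described above (log-transformed continuous values, categorical values hashed as-is, a 100-slot vector with two probe locations) can be sketched roughly like this in Python. All names and the hash function here are illustrative stand-ins, not Mahout's actual API:

```python
import math

CARDINALITY = 100  # a larger vector reduces hash collisions

def hash_probe(name, value, probe, cardinality=CARDINALITY):
    # deterministic stand-in for Mahout's murmur-based hashing
    return hash((name, str(value), probe)) % cardinality

def encode_categorical(vector, name, value, probes=2):
    # each (name, value) pair is hashed to `probes` locations
    for p in range(probes):
        vector[hash_probe(name, value, p)] += 1.0

def encode_continuous(vector, name, value, probes=2):
    # log-transform the raw value, then add it as a weight at
    # locations keyed on the feature *name* only
    w = math.log1p(value)
    for p in range(probes):
        vector[hash_probe(name, "", p)] += w

v = [0.0] * CARDINALITY
encode_categorical(v, "education", "tertiary")
encode_continuous(v, "pdays", 30.0)
```

The total mass added to the vector is additive even when probes collide, which is why a bigger cardinality mostly just spreads the same information over more slots.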
> >>
> >> In the learning code itself, I got rid of the use of index arrays in favor
> >> of shuffling the training data itself.  I also tuned the learning
> >> parameters a lot.
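Shuffling the training examples themselves each pass, instead of maintaining a separate index array, can be sketched like this (hypothetical names, plain Python):

```python
import random

def train(model_update, examples, epochs=5, seed=42):
    rng = random.Random(seed)
    data = list(examples)       # local copy so the caller's list is untouched
    for _ in range(epochs):
        rng.shuffle(data)       # reorder the examples in place each epoch
        for target, features in data:
            model_update(target, features)

seen = []
train(lambda t, f: seen.append(t), [(1, "a"), (0, "b"), (0, "c")], epochs=2)
```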
> >>
> >> The result is that the AUC that results is just a tiny bit less than 0.9
> >> which is pretty close to what I got in R.
> >>
> >> For everybody else, see
> >> https://github.com/tdunning/mahout-sgd-bank-marketing for my version and
> >> https://github.com/tdunning/mahout-sgd-bank-marketing/compare/frankscholten:master...master
> >> for my pull request.
> >>
> >>
> >>
> >> On Mon, Feb 3, 2014 at 3:57 PM, Ted Dunning <ted.dunning@gmail.com>
> >> wrote:
> >>
> >>
> >>> Johannes,
> >>>
> >>> Very good comments.
> >>>
> >>> Frank,
> >>>
> >>> As a benchmark, I just spent a few minutes building a logistic regression
> >>> model using R.  For this model, AUC on 10% held-out data is about 0.9.
> >>>
> >>> Here is a gist summarizing the results:
> >>>
> >>> https://gist.github.com/tdunning/8794734
> >>>
> >>>
> >>>
> >>>
> >>> On Mon, Feb 3, 2014 at 2:41 PM, Johannes Schulte <
> >>> johannes.schulte@gmail.com> wrote:
> >>>
> >>>> Hi Frank,
> >>>>
> >>>> You are using the feature vector encoders, which hash a combination of
> >>>> feature name and feature value to 2 (default) locations in the vector.
> >>>> The vector size you configured is 11, which is imo very small relative
> >>>> to the possible combinations of values in your data (education, marital,
> >>>> campaign). You can do no harm by using a much bigger cardinality (try
> >>>> 1000).
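The collision problem is easy to see with a birthday-style estimate. A rough sketch (the figures are illustrative; exact behavior depends on Mahout's hash functions):

```python
def expected_distinct_slots(n_hashes, cardinality):
    # expected number of occupied slots when n_hashes values are
    # hashed uniformly at random into `cardinality` slots
    return cardinality * (1 - (1 - 1 / cardinality) ** n_hashes)

# e.g. 30 distinct (feature, value) pairs, 2 probes each = 60 hashes
hashes = 60
small = expected_distinct_slots(hashes, 11)    # every slot is shared many times
big = expected_distinct_slots(hashes, 1000)    # almost collision-free
```

With cardinality 11 essentially every slot carries several unrelated features, while with 1000 nearly all 60 hashes land in their own slot.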
> >>>>
> >>>> Second, you are using a continuous value encoder and passing in the
> >>>> weight as a string (e.g. the variable "pDays"). I am not quite sure
> >>>> about the reasons in the Mahout code right now, but the way it is
> >>>> implemented, every unique value ends up in a different location because
> >>>> the continuous value is part of the hashing. Try adding the weight
> >>>> directly using a static word value encoder: addToVector("pDays", v, pDays)
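The difference matters because hashing on the value string scatters every distinct pDays value into its own slots, while adding it as a weight keyed on the feature name alone keeps one stable location. A rough sketch of the two behaviors (hypothetical, not Mahout's actual implementation):

```python
CARD = 1000

def slot(key):
    # stand-in for a feature-hashing function
    return hash(key) % CARD

def encode_value_as_string(vector, name, value):
    # problematic for continuous data: the value itself is part of the
    # hash key, so pDays=3 and pDays=4 land in unrelated slots
    vector[slot((name, str(value)))] += 1.0

def encode_as_weight(vector, name, value):
    # better: one stable slot per feature name, value used as the weight
    vector[slot(name)] += float(value)

a, b = [0.0] * CARD, [0.0] * CARD
encode_value_as_string(a, "pDays", 3)
encode_value_as_string(a, "pDays", 4)
encode_as_weight(b, "pDays", 3)
encode_as_weight(b, "pDays", 4)
```

In the weight-based encoding, both observations accumulate in a single slot; in the string-based one they almost certainly do not.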
> >>>>
> >>>> Last, you are also putting in the variable "campaign" as a continuous
> >>>> variable when it should probably be a categorical variable, so just add
> >>>> it with a StaticWordValueEncoder.
> >>>>
> >>>> And finally, and probably most important after looking at your target
> >>>> variable: you are using a Dictionary for mapping either yes or no to 0
> >>>> or 1. This is bad. Depending on what comes first in the data set, either
> >>>> a positive or a negative example might be 0 or 1, totally at random.
> >>>> Make a hard mapping from the possible values (y/n?) to zero and one,
> >>>> with yes as the 1 and no as the zero.
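The order-dependence of a first-come-first-served dictionary is easy to demonstrate, along with the fixed-mapping fix (a minimal Python sketch, not the actual app code):

```python
def dictionary_codes(labels):
    # assigns codes in order of first appearance -- depends on data order
    codes = {}
    for label in labels:
        codes.setdefault(label, len(codes))
    return codes

# the code assigned to "yes" depends entirely on which label appears first
assert dictionary_codes(["no", "yes", "no"])["yes"] == 1
assert dictionary_codes(["yes", "no", "no"])["yes"] == 0

def hard_code(label):
    # fixed mapping: yes -> 1, no -> 0, regardless of data order
    return 1 if label == "yes" else 0
```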
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Feb 3, 2014 at 9:33 PM, Frank Scholten <frank@frankscholten.nl>
> >>>> wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> I am exploring Mahout's SGD classifier and would like some feedback,
> >>>>> because I think I didn't configure things properly.
> >>>>>
> >>>>> I created an example app that trains an SGD classifier on the 'bank
> >>>>> marketing' dataset from UCI:
> >>>>> http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
> >>>>>
> >>>>> My app is at:
> >>>>>
> >>>>> https://github.com/frankscholten/mahout-sgd-bank-marketing
> >>>>>
> >>>>> The app reads a CSV file of telephone calls, encodes the features into
> >>>>> a vector, and tries to predict whether a customer answers yes to a
> >>>>> business proposal.
> >>>>>
> >>>>> I do a few runs and measure accuracy, but I don't trust the results.
> >>>>> When I only use an intercept term as a feature I get around 88%
> >>>>> accuracy, and when I add all features it drops to around 85%. Is this
> >>>>> perhaps because the dataset is highly unbalanced? Most customers answer
> >>>>> no. Or is the classifier biased to predict 0 as the target code when it
> >>>>> doesn't have any data to go on?
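A majority-class baseline makes the imbalance effect concrete: if roughly 88% of customers answer no, a model that always predicts 0 already scores about 88% accuracy, so plain accuracy says little about whether the features help (the class balance below is illustrative, not the dataset's exact figure):

```python
def accuracy(predictions, targets):
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

# hypothetical class balance: 88 "no" (0) for every 12 "yes" (1)
targets = [0] * 88 + [1] * 12
majority = [0] * len(targets)       # intercept-only model: always predict "no"
base = accuracy(majority, targets)  # about 0.88
```

This is one reason AUC (as used later in the thread) or a confusion matrix is a more informative metric on unbalanced data.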
> >>>>>
> >>>>> Any other comments about my code, or improvements I can make in the
> >>>>> app, are welcome! :)
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Frank
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
>
