mahout-user mailing list archives

From Johannes Schulte <johannes.schu...@gmail.com>
Subject Re: SGD classifier demo app
Date Mon, 03 Feb 2014 22:41:23 GMT
Hi Frank,

You are using the feature vector encoders, which hash a combination of
feature name and feature value to 2 (default) locations in the vector. The
vector size you configured is 11, which is IMO very small compared to the
number of possible value combinations in your data (education, marital,
campaign). A much bigger cardinality does no harm (try 1000).
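
To make the collision problem concrete, here is a minimal sketch in plain Java. It is not Mahout's hashing code: the slot() function, the use of String.hashCode(), and the sample feature values are all assumptions chosen for illustration. It only counts how many distinct slots a handful of name:value pairs occupy when each pair is written to 2 locations.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: a stand-in for hashed feature encoding, not Mahout's implementation.
class HashedSlots {

    // Map a name:value pair plus a probe number to a slot in [0, cardinality).
    // String.hashCode() is an assumption; Mahout uses its own hash function.
    static int slot(String name, String value, int probe, int cardinality) {
        int h = (name + ":" + value + "#" + probe).hashCode();
        return Math.floorMod(h, cardinality);
    }

    // Count the distinct slots touched when every feature is hashed to 2 locations.
    static int distinctSlots(String[][] features, int cardinality) {
        Set<Integer> used = new HashSet<>();
        for (String[] f : features) {
            for (int probe = 0; probe < 2; probe++) {
                used.add(slot(f[0], f[1], probe, cardinality));
            }
        }
        return used.size();
    }

    // Hypothetical sample of categorical name:value pairs (17 in total).
    static String[][] sampleFeatures() {
        return new String[][] {
            {"education", "primary"}, {"education", "secondary"},
            {"education", "tertiary"}, {"education", "unknown"},
            {"marital", "married"}, {"marital", "single"}, {"marital", "divorced"},
            {"campaign", "1"}, {"campaign", "2"}, {"campaign", "3"},
            {"campaign", "4"}, {"campaign", "5"}, {"campaign", "6"},
            {"campaign", "7"}, {"campaign", "8"}, {"campaign", "9"},
            {"campaign", "10"}
        };
    }
}
```

With a cardinality of 11, the 17 pairs above cannot occupy more than 11 slots, so unrelated features necessarily share weights. Calling HashedSlots.distinctSlots(HashedSlots.sampleFeatures(), 1000) shows how much room a larger vector leaves.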

Second, you are using a continuous value encoder but passing in the weight
as a string (e.g. the variable "pDays"). I am not quite sure about
the reasons in the Mahout code right now, but the way it is implemented,
every unique value should end up in a different location because the
continuous value is part of the hashing. Try adding the weight directly
using a static word value encoder: addToVector("pDays",v,pDays)
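
The difference between the two approaches can be sketched in plain Java. Again, this is not Mahout code: the class, both helper methods, and the use of String.hashCode() are assumptions for illustration. One method folds the value into the hashed key, the other hashes only the name and uses the value as the weight.

```java
// Illustrative only: contrasts "value in the hash key" with "value as the weight".
class ContinuousEncoding {

    // Hypothetical slot function; String.hashCode() stands in for the real hash.
    static int slot(String key, int cardinality) {
        return Math.floorMod(key.hashCode(), cardinality);
    }

    // The value is part of the hashed key, so every distinct value
    // lands in its own slot and the model sees unrelated features.
    static void addValueInKey(String name, double value, double[] v) {
        v[slot(name + ":" + value, v.length)] += 1.0;
    }

    // Only the name is hashed; the value is added as the weight,
    // so all observations of the variable share one slot.
    static void addValueAsWeight(String name, double value, double[] v) {
        v[slot(name, v.length)] += value;
    }

    // Helper: number of non-zero entries in the vector.
    static int nonZero(double[] v) {
        int n = 0;
        for (double x : v) {
            if (x != 0.0) {
                n++;
            }
        }
        return n;
    }
}
```

Adding pDays values of 3.0 and 5.0 with addValueAsWeight touches a single slot, whereas addValueInKey spreads them over separate slots, so the classifier cannot learn one coefficient for the variable.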

Third, you are also putting in the variable "campaign" as a continuous
variable. It should probably be a categorical variable, so just add it
with a StaticWordValueEncoder.

Finally, and probably most important, after looking at your target
variable: you are using a Dictionary to map yes or no to 0 or 1.
This is bad. Depending on which label comes first in the data set, a
positive example might be coded as 0 or as 1, which is effectively random.
Make a hard mapping from the possible values (y/n?) to zero and one, with
yes as 1 and no as 0.
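
A minimal sketch of such a hard mapping in plain Java follows. The class and method names are hypothetical, and the accepted spellings ("yes"/"y", "no"/"n") are an assumption about the data. The firstSeenCodes() method only imitates an order-dependent dictionary for contrast.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: fixed target coding versus order-dependent coding.
class TargetMapping {

    // Fixed mapping: "yes" is always 1 and "no" is always 0,
    // regardless of the order of rows in the data set.
    static int toTarget(String label) {
        switch (label.trim().toLowerCase(Locale.ROOT)) {
            case "yes":
            case "y":
                return 1;
            case "no":
            case "n":
                return 0;
            default:
                throw new IllegalArgumentException("Unexpected label: " + label);
        }
    }

    // For contrast: codes are handed out in order of first appearance,
    // so the code for "yes" depends on which label is seen first.
    static Map<String, Integer> firstSeenCodes(String... labels) {
        Map<String, Integer> codes = new LinkedHashMap<>();
        for (String label : labels) {
            codes.putIfAbsent(label, codes.size());
        }
        return codes;
    }
}
```

With firstSeenCodes("no", "yes") the positive class gets code 1, but with firstSeenCodes("yes", "no") it gets code 0, which is why accuracy figures from separate runs may not be comparable.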

On Mon, Feb 3, 2014 at 9:33 PM, Frank Scholten <frank@frankscholten.nl> wrote:

> Hi all,
>
> I am exploring Mahout's SGD classifier and would like some feedback because
> I think I didn't configure things properly.
>
> I created an example app that trains an SGD classifier on the 'bank
> marketing' dataset from UCI:
> http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
>
> My app is at: https://github.com/frankscholten/mahout-sgd-bank-marketing
>
> The app reads a CSV file of telephone calls, encodes the features into a
> vector and tries to predict whether a customer answers yes to a business
> proposal.
>
> I do a few runs and measure accuracy, but I don't trust the results.
> When I only use an intercept term as a feature I get around 88% accuracy,
> and when I add all features it drops to around 85%. Is this perhaps because
> the dataset is highly unbalanced? Most customers answer no. Or is the
> classifier biased to predict 0 as the target code when it doesn't have any
> data to go with?
>
> Any other comments about my code or improvements I can make in the app are
> welcome! :)
>
> Cheers,
>
> Frank
>
