mahout-user mailing list archives

From Paul van Hoven <paul.van.ho...@gmail.com>
Subject RandomAccessSparseVector setting 1.0 in 2 dims for 1 feature value?
Date Fri, 29 Nov 2013 09:14:58 GMT
For an example program using Mahout, I use the donut.csv sample data
from the project
(https://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/resources/donut.csv).
My code looks like this:

    import java.util.ArrayList;

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
    import com.csvreader.CsvReader;

    public class Runner {

        // Set the path accordingly!
        public static final String csvInputDataPath = "/path/to/donut.csv";

        public static void main(String[] args) {

            FeatureVectorEncoder encoder = new StaticWordValueEncoder("features");
            ArrayList<RandomAccessSparseVector> featureVectors =
                new ArrayList<RandomAccessSparseVector>();
            try {
                CsvReader csvReader = new CsvReader(csvInputDataPath);
                csvReader.readHeaders();
                while (csvReader.readRecord()) {
                    // The numeric columns go into the fixed positions 0..3.
                    Vector featureVector = new RandomAccessSparseVector(30);
                    featureVector.set(0, Double.parseDouble(csvReader.get("x")));
                    featureVector.set(1, Double.parseDouble(csvReader.get("y")));
                    featureVector.set(2, Double.parseDouble(csvReader.get("c")));
                    featureVector.set(3, Integer.parseInt(csvReader.get("color")));
                    System.out.println("Before: " + featureVector.toString());
                    // The "shape" column is added through the encoder.
                    encoder.addToVector(csvReader.get("shape").getBytes(),
                        featureVector);
                    System.out.println(" After: " + featureVector.toString());
                    featureVectors.add((RandomAccessSparseVector) featureVector);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }

            System.out.println("Program is done.");
        }

    }


What confuses me is the following output (one sample):

    Before: {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0}
     After: {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0,29:1.0,25:1.0}

As you can see, I added just one value, "shape", to the vector. However,
two dimensions of this vector are set to 1.0. On the other hand,
for some other data I get the output

    Before: {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:2.0}
     After: {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:3.0,16:1.0}

Why? I would expect that _always_ exactly one dimension gets set to
1.0, as this is the standard case for categorical encoding. As it
stands, the result looks wrong to me.
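
To make the expectation concrete, here is a minimal sketch of the encoding
I have in mind. It does not use Mahout at all; the class OneHotSketch, the
encode method, and the category names are made up for illustration and are
not taken from donut.csv.

```java
import java.util.Arrays;
import java.util.List;

public class OneHotSketch {

    // Returns a vector with exactly one 1.0, at the position of `category`
    // in `vocabulary`; every other position stays 0.0.
    static double[] encode(String category, List<String> vocabulary) {
        int index = vocabulary.indexOf(category);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown category: " + category);
        }
        double[] vector = new double[vocabulary.size()];
        vector[index] = 1.0;
        return vector;
    }

    public static void main(String[] args) {
        // Hypothetical category values, chosen only for this example.
        List<String> shapes = Arrays.asList("circle", "square", "triangle");
        // prints [0.0, 1.0, 0.0]
        System.out.println(Arrays.toString(encode("square", shapes)));
    }
}
```

With such an encoding, each category value touches one dimension only, which
is what I expected to see in the "After" lines above.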

Thanks in advance,
Paul
