mahout-user mailing list archives

From Paul van Hoven <>
Subject RandomAccessSparseVector setting 1.0 in 2 dims for 1 feature value?
Date Fri, 29 Nov 2013 09:14:58 GMT
For an example program using Mahout, I use the donut.csv sample data
from the project (
). My code looks like this:

    import java.util.ArrayList;

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
    import com.csvreader.CsvReader;

    public class Runner {

        // Set the path accordingly!
        public static final String csvInputDataPath = "/path/to/donut.csv";

        public static void main(String[] args) {
            FeatureVectorEncoder encoder = new StaticWordValueEncoder("features");
            ArrayList<RandomAccessSparseVector> featureVectors =
                new ArrayList<RandomAccessSparseVector>();
            try {
                CsvReader csvReader = new CsvReader(csvInputDataPath);
                // Read the header row so columns can be looked up by name
                csvReader.readHeaders();
                while (csvReader.readRecord()) {
                    Vector featureVector = new RandomAccessSparseVector(30);
                    // Numeric columns are written directly to fixed indices
                    featureVector.set(0, Double.parseDouble(csvReader.get("x")));
                    featureVector.set(1, Double.parseDouble(csvReader.get("y")));
                    featureVector.set(2, Double.parseDouble(csvReader.get("c")));
                    featureVector.set(3, Integer.parseInt(csvReader.get("color")));
                    System.out.println("Before: " + featureVector.toString());
                    // Encode the categorical "shape" column with the hashed encoder
                    encoder.addToVector(csvReader.get("shape"), featureVector);
                    System.out.println(" After: " + featureVector.toString());
                    featureVectors.add((RandomAccessSparseVector) featureVector);
                }
                csvReader.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
            System.out.println("Program is done.");
        }
    }


What confuses me is the following output (one sample):

    Before: {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0}
     After: {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0,29:1.0,25:1.0}

As you can see, I added just one value, "shape", to the vector. However,
two dimensions of the vector are set to 1.0. On the other hand,
for some other data I get the output

    Before: {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:2.0}
     After: {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:3.0,16:1.0}

Why? I would expect that _always_ exactly one dimension gets set to
1.0, as this is the standard case for categorical encoding. As it is,
the result looks wrong to me.
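
One possible reading of the two samples, offered as an assumption and not as confirmed Mahout behaviour: StaticWordValueEncoder is a hashed encoder, and as far as I understand a hashed encoder may write each value to more than one hashed position ("probes"). With two probes, a single "shape" value would add 1.0 at two indices. If one of those indices is already occupied, the 1.0 is added on top of the existing value, which would match the second sample, where index 3 goes from 2.0 to 3.0 and only index 16 is new. The sketch below imitates that idea in plain Java. The class HashedProbeSketch, its methods, and the hash function are hypothetical and are not Mahout's implementation.

```java
import java.util.Map;
import java.util.TreeMap;

public class HashedProbeSketch {

    // Hypothetical hash: maps (feature name, value, probe number) to an index
    // in [0, size). This is NOT the hash function Mahout uses.
    static int probeIndex(String name, String value, int probe, int size) {
        int h = (name + ":" + value + ":" + probe).hashCode();
        return Math.floorMod(h, size);
    }

    // Adds 1.0 at each probe position. If a position is already occupied
    // (a collision), the weight is summed with the existing entry.
    static void addToVector(Map<Integer, Double> vector, String name,
                            String value, int probes, int size) {
        for (int p = 0; p < probes; p++) {
            vector.merge(probeIndex(name, value, p, size), 1.0, Double::sum);
        }
    }

    public static void main(String[] args) {
        // A sparse vector with 30 dimensions, modelled as index -> value
        Map<Integer, Double> vector = new TreeMap<>();
        vector.put(0, 0.92);
        vector.put(3, 2.0);
        System.out.println("Before: " + vector);

        // One categorical value, two probes: total weight grows by 2.0,
        // spread over at most two indices
        addToVector(vector, "features", "21", 2, 30);
        System.out.println(" After: " + vector);
    }
}
```

If this reading is right, then under this model encoding with a single probe would give the one-dimension-per-value result I expected. Whether and how the probe count can be changed on the Mahout encoder should be checked against the FeatureVectorEncoder Javadoc.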

Thanks in advance,
