mahout-user mailing list archives

From Paul van Hoven <paul.van.ho...@gmail.com>
Subject Re: RandomAccessSparseVector setting 1.0 in 2 dims for 1 feature value?
Date Fri, 29 Nov 2013 09:54:13 GMT
Hi, thanks for your quick reply. So multiple probes are a protection
against collisions? After experimenting with the length of a
RandomAccessSparseVector object, I noticed that (of course) collisions
occur when the length is too short. Is there a way to check whether a
collision occurred after encoding a new value into the vector? That
would tell the user that the chosen vector length is too short. So
far, I have not found any method in the API that checks for this.
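
One way to approach the check asked about above is to record, outside the vector, which feature names were hashed to which index. The following is a minimal sketch of that idea in plain Java, not Mahout code: `CollisionTracker`, `indexFor`, and the FNV-style hash are hypothetical stand-ins for illustration and will not reproduce the indices Mahout's encoders compute. Mahout's encoders appear to offer a trace dictionary that serves a similar purpose, but that should be verified against the version in use.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper: remembers which feature names were hashed to each index. */
final class CollisionTracker {
    private final int cardinality;
    private final int probes;
    // index -> names of all features that were hashed to that index
    private final Map<Integer, Set<String>> owners = new TreeMap<>();

    public CollisionTracker(int cardinality, int probes) {
        this.cardinality = cardinality;
        this.probes = probes;
    }

    // Stand-in hash (FNV-1a style, seeded by the probe number). This is an
    // assumption for illustration and is NOT the hash Mahout uses.
    public static int indexFor(String feature, int probe, int cardinality) {
        int h = 0x811C9DC5 ^ (probe * 0x9E3779B9);
        for (byte b : feature.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return Math.floorMod(h, cardinality);
    }

    /** Records the feature; returns true if any probe hit an index used by another feature. */
    public boolean add(String feature) {
        boolean collided = false;
        for (int p = 0; p < probes; p++) {
            int idx = indexFor(feature, p, cardinality);
            Set<String> names = owners.computeIfAbsent(idx, k -> new TreeSet<>());
            // Count owners other than this feature; two probes of the same
            // feature landing on one index are not treated as a collision here.
            int others = names.size() - (names.contains(feature) ? 1 : 0);
            if (others > 0) {
                collided = true;
            }
            names.add(feature);
        }
        return collided;
    }

    /** Returns every index that is shared by at least two different features. */
    public List<Integer> collidingIndices() {
        List<Integer> result = new ArrayList<>();
        for (Map.Entry<Integer, Set<String>> e : owners.entrySet()) {
            if (e.getValue().size() > 1) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        CollisionTracker tracker = new CollisionTracker(30, 2);
        for (String name : new String[] {"x", "y", "c", "color", "shape"}) {
            System.out.println(name + " collided: " + tracker.add(name));
        }
        System.out.println("Shared indices: " + tracker.collidingIndices());
    }
}
```

If many features report a collision, that is a hint that the vector length is too small for the number of distinct features being hashed.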

2013/11/29 Ted Dunning <ted.dunning@gmail.com>:
> The default with the Mahout encoders is two probes.  This is unnecessary
> with the intercept term, of course, if you protect the intercept term from
> other updates, possibly by encoding other data using a view of the original
> feature vector.
>
> For each probe, a different hash is used so each value is put into multiple
> locations.  Multiple probes are useful in general to decrease the effect of
> the reduced dimensionality of the hashed representation.
>
>
>
> On Fri, Nov 29, 2013 at 1:14 AM, Paul van Hoven <paul.van.hoven@gmail.com> wrote:
>
>> For an example program using Mahout, I use the donut.csv sample data
>> from the project
>> (https://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/resources/donut.csv).
>> My code looks like this:
>>
>>     import java.util.ArrayList;
>>     import java.util.List;
>>
>>     import org.apache.mahout.math.RandomAccessSparseVector;
>>     import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
>>     import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
>>     import com.csvreader.CsvReader;
>>
>>     public class Runner {
>>
>>         // Set the path accordingly!
>>         public static final String csvInputDataPath = "/path/to/donut.csv";
>>
>>         public static void main(String[] args) {
>>
>>             FeatureVectorEncoder encoder = new StaticWordValueEncoder("features");
>>             List<RandomAccessSparseVector> featureVectors =
>>                 new ArrayList<RandomAccessSparseVector>();
>>             try {
>>                 CsvReader csvReader = new CsvReader(csvInputDataPath);
>>                 csvReader.readHeaders();
>>                 while (csvReader.readRecord()) {
>>                     // One vector of length 30 per CSV record
>>                     RandomAccessSparseVector featureVector =
>>                         new RandomAccessSparseVector(30);
>>                     // Numeric columns are written to fixed indices 0..3
>>                     featureVector.set(0, Double.parseDouble(csvReader.get("x")));
>>                     featureVector.set(1, Double.parseDouble(csvReader.get("y")));
>>                     featureVector.set(2, Double.parseDouble(csvReader.get("c")));
>>                     featureVector.set(3, Integer.parseInt(csvReader.get("color")));
>>                     System.out.println("Before: " + featureVector.toString());
>>                     // The categorical column is hashed into the same vector
>>                     encoder.addToVector(csvReader.get("shape").getBytes(),
>>                         featureVector);
>>                     System.out.println(" After: " + featureVector.toString());
>>                     featureVectors.add(featureVector);
>>                 }
>>                 csvReader.close();
>>             } catch (Exception e) {
>>                 e.printStackTrace();
>>             }
>>
>>             System.out.println("Program is done.");
>>         }
>>
>>     }
>>
>>
>> What confuses me is the following output (one sample):
>>
>>     Before: {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0}
>>      After: {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0,29:1.0,25:1.0}
>>
>> As you can see, I added just one value "shape" to the vector. However
>> two dimensions of this vector are encoded with 1.0. On the other hand,
>> for some other data I get the output
>>
>>     Before: {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:2.0}
>>      After: {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:3.0,16:1.0}
>>
>> Why? I would expect that _always_ only one dimension gets set to
>> 1.0, as this is the standard case for categorical encoding. As it
>> stands, this behavior seems wrong to me.
>>
>> Thanks in advance,
>> Paul
>>
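
The two outputs quoted above are consistent with two-probe hashing: each probe adds 1.0 at its own hashed index, so one value normally occupies two indices. In the second sample only one new index (16) appears while index 3 changes from 2.0 to 3.0, which is what would happen if the other probe landed on the index already holding the color value. A minimal sketch of that mechanism in plain Java follows; `MultiProbeDemo`, `indexFor`, the hash, and the value "circle" are hypothetical stand-ins and will not reproduce the indices Mahout computes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical demo of multi-probe hashed encoding on a sparse map-based vector. */
final class MultiProbeDemo {

    // Stand-in hash (FNV-1a style, seeded by the probe number). This is an
    // assumption for illustration and is NOT the hash Mahout uses.
    public static int indexFor(String value, int probe, int cardinality) {
        int h = 0x811C9DC5 ^ (probe * 0x9E3779B9);
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return Math.floorMod(h, cardinality);
    }

    /** Adds weight 1.0 at one hashed index per probe, on top of whatever is already there. */
    public static void addToVector(String value, int probes, int cardinality,
                                   Map<Integer, Double> vector) {
        for (int p = 0; p < probes; p++) {
            vector.merge(indexFor(value, p, cardinality), 1.0, Double::sum);
        }
    }

    public static void main(String[] args) {
        Map<Integer, Double> vector = new TreeMap<>();
        // Fixed-index numeric features, as in the quoted program
        vector.put(0, 0.71);
        vector.put(1, 0.91);
        vector.put(2, 0.46);
        vector.put(3, 2.0);
        System.out.println("Before: " + vector);
        // "circle" is a made-up categorical value; two probes, vector length 30
        addToVector("circle", 2, 30, vector);
        System.out.println(" After: " + vector);
    }
}
```

Whatever indices the probes hit, the sum of the vector's entries grows by exactly the number of probes. A probe that lands on an occupied index therefore shows up as an increment of an existing entry rather than as a new one.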
