mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: RandomAccessSparseVector setting 1.0 in 2 dims for 1 feature value?
Date Fri, 29 Nov 2013 22:26:39 GMT
If you always insert 1's for each element, then you can detect collisions
by inserting all your elements (or all elements in each document
separately) and looking for the max value in the vector.  If you see
something >1, you have a collision.
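
A minimal sketch of that check in plain Java, for illustration only. It does
not use the Mahout encoders: the class name `CollisionCheck`, the `slot` hash
and the helper names are all made up for this example, and a plain `double[]`
stands in for the vector. Each distinct word adds 1 at each of its probe
locations, so any entry above 1 means two insertions landed on the same slot.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class CollisionCheck {

    // Hypothetical stand-in hash, NOT the one Mahout uses: mixes the probe
    // number into String.hashCode() and maps the result into [0, size).
    static int slot(String word, int probe, int size) {
        int h = word.hashCode() * 31 + probe * 0x9E3779B9;
        h ^= (h >>> 16);
        return Math.floorMod(h, size);
    }

    // Adds 1.0 at every probe location of every distinct word.
    static double[] encode(List<String> words, int size, int probes) {
        // Deduplicate first: a repeated word would otherwise look like a collision.
        Set<String> distinct = new LinkedHashSet<>(words);
        double[] v = new double[size];
        for (String w : distinct) {
            for (int p = 0; p < probes; p++) {
                v[slot(w, p, size)] += 1.0;
            }
        }
        return v;
    }

    // True if any slot received more than one insertion.
    static boolean hasCollision(double[] v) {
        for (double x : v) {
            if (x > 1.0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> words = List.of("circle", "square", "triangle");
        for (int size : new int[] {2, 30, 1000}) {
            double[] v = encode(words, size, 2);
            System.out.println("size=" + size + " collision=" + hasCollision(v));
        }
    }
}
```

Note that with more than one probe, two probes of the same word landing on
the same slot also show up as a value above 1. With size 2 a collision is
guaranteed here, since six insertions share two slots.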

But collisions are actually good.  The only way to completely avoid them is
to use a vector as large as your vocabulary, which is often painfully large.

You can also view multiple probes not so much as avoiding collisions as
making the linear transformation from the very high-dimensional
representation (one dimension per word) to the lower-dimensional hashed
representation more likely to be nearly invertible, in the sense that the
Euclidean metric will be approximately preserved.  Think
Johnson-Lindenstrauss random projections.
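
To make the distance-preservation view concrete, here is a rough sketch in
plain Java. Again, nothing here is Mahout API: the class `HashedDistance`,
the `slot` hash and the 1/sqrt(probes) scaling are assumptions chosen for the
example. The scaling is picked so that, when no two insertions share a slot,
the hashed distance equals the original one. Printing both distances for
different probe counts lets you see how close they stay on your own data.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class HashedDistance {

    // Hypothetical stand-in hash, NOT the one Mahout uses.
    static int slot(String word, int probe, int size) {
        int h = word.hashCode() * 31 + probe * 0x9E3779B9;
        h ^= (h >>> 16);
        return Math.floorMod(h, size);
    }

    // Linear map from word weights to a hashed vector of the given size.
    static double[] project(Map<String, Double> doc, int size, int probes) {
        double[] v = new double[size];
        double scale = 1.0 / Math.sqrt(probes); // assumption: keeps norms comparable
        for (Map.Entry<String, Double> e : doc.entrySet()) {
            for (int p = 0; p < probes; p++) {
                v[slot(e.getKey(), p, size)] += scale * e.getValue();
            }
        }
        return v;
    }

    // Euclidean distance between the two hashed representations.
    static double hashedDistance(Map<String, Double> a, Map<String, Double> b,
                                 int size, int probes) {
        double[] pa = project(a, size, probes);
        double[] pb = project(b, size, probes);
        double sum = 0.0;
        for (int i = 0; i < size; i++) {
            double d = pa[i] - pb[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Euclidean distance in the original one-dimension-per-word space.
    static double originalDistance(Map<String, Double> a, Map<String, Double> b) {
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        double sum = 0.0;
        for (String k : keys) {
            double d = a.getOrDefault(k, 0.0) - b.getOrDefault(k, 0.0);
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Double> a = Map.of("red", 1.0, "round", 2.0, "small", 1.0);
        Map<String, Double> b = Map.of("red", 1.0, "square", 1.0, "large", 3.0);
        System.out.println("original: " + originalDistance(a, b));
        for (int probes : new int[] {1, 2, 4}) {
            System.out.println("probes=" + probes + " hashed: "
                    + hashedDistance(a, b, 64, probes));
        }
    }
}
```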



On Fri, Nov 29, 2013 at 1:54 AM, Paul van Hoven <paul.van.hoven@gmail.com> wrote:

> Hi, thanks for your quick reply. So multiple probes are a protection
> against collisions? After playing a little with the default length of
> a RandomAccessSparseVector object I noticed that (of course)
> collisions occur when the length is too short. Therefore, I'm asking
> myself if there is a possibility to check if a collision occurred
> after encoding a new value in the vector? This would give a user the
> information that the length of the chosen vector is too short. So far,
> I did not find any method in the API to check for that.
>
> 2013/11/29 Ted Dunning <ted.dunning@gmail.com>:
> > The default with the Mahout encoders is two probes.  This is unnecessary
> > with the intercept term, of course, if you protect the intercept term
> > from other updates, possibly by encoding other data using a view of the
> > original feature vector.
> >
> > For each probe, a different hash is used so each value is put into
> > multiple locations.  Multiple probes are useful in general to decrease
> > the effect of the reduced dimensionality of the hashed representation.
> >
> >
> >
> > On Fri, Nov 29, 2013 at 1:14 AM, Paul van Hoven <paul.van.hoven@gmail.com> wrote:
> >
> >> For an example program using Mahout I use the donut.csv sample data
> >> from the project (
> >> https://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/resources/donut.csv
> >> ). My code looks like this:
> >>
> >>     import java.util.ArrayList;
> >>
> >>     import org.apache.mahout.math.RandomAccessSparseVector;
> >>     import org.apache.mahout.math.Vector;
> >>     import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
> >>     import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
> >>     import com.csvreader.CsvReader;
> >>
> >>     public class Runner {
> >>
> >>     //Set the path accordingly!
> >>     public static final String csvInputDataPath = "/path/to/donut.csv";
> >>
> >>     public static void main(String[] args) {
> >>
> >>     FeatureVectorEncoder encoder = new StaticWordValueEncoder("features");
> >>     ArrayList<RandomAccessSparseVector> featureVectors =
> >>      new ArrayList<RandomAccessSparseVector>();
> >>     try {
> >>     CsvReader csvReader = new CsvReader(csvInputDataPath);
> >>     csvReader.readHeaders();
> >>     while( csvReader.readRecord() ) {
> >>     Vector featureVector = new RandomAccessSparseVector(30);
> >>     featureVector.set(0, new Double(csvReader.get("x")));
> >>     featureVector.set(1, new Double(csvReader.get("y")));
> >>     featureVector.set(2, new Double(csvReader.get("c")));
> >>     featureVector.set(3, new Integer(csvReader.get("color")));
> >>     System.out.println("Before: " + featureVector.toString());
> >>     encoder.addToVector(csvReader.get("shape").getBytes(),
> >>     featureVector);
> >>     System.out.println(" After: " + featureVector.toString());
> >>     featureVectors.add((RandomAccessSparseVector) featureVector);
> >>     }
> >>     } catch(Exception e) {
> >>     e.printStackTrace();
> >>     }
> >>
> >>     System.out.println("Program is done.");
> >>     }
> >>
> >>     }
> >>
> >>
> >> What confuses me is the following output (one sample):
> >>
> >>     Before: {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0}
> >>      After: {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0,29:1.0,25:1.0}
> >>
> >> As you can see, I added just one value "shape" to the vector. However
> >> two dimensions of this vector are encoded with 1.0. On the other hand,
> >> for some other data I get the output
> >>
> >>     Before: {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:2.0}
> >>      After: {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:3.0,16:1.0}
> >>
> >> Why? I would expect that _always_ only one dimension gets occupied by
> >> 1.0 as this is the standard case for categorical encoding. As it
> >> stands, this seems to be wrong.
> >>
> >> Thanks in advance,
> >> Paul
> >>
>
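
On the quoted point about protecting some slots by encoding other data
through a view: the second sample in the quoted output, where index 3 goes
from 2.0 to 3.0, is consistent with one probe landing on a slot that already
held a raw value. The sketch below shows the general idea in plain Java. It
is not Mahout code: the class `ReservedSlots`, the `slot` hash and the
parameter names are invented for illustration, and a `double[]` stands in for
the vector. In Mahout itself something like a `viewPart` of the vector would
presumably play the role of the offset used here, but check that against your
version.

```java
final class ReservedSlots {

    // Hypothetical stand-in hash, NOT the one Mahout uses.
    static int slot(String word, int probe, int size) {
        int h = word.hashCode() * 31 + probe * 0x9E3779B9;
        h ^= (h >>> 16);
        return Math.floorMod(h, size);
    }

    // Adds 1.0 per probe, but only into slots [reserved, v.length), so the
    // first `reserved` entries (raw values, intercept) are never touched.
    static void addHashed(double[] v, int reserved, String word, int probes) {
        int hashedSize = v.length - reserved;
        if (hashedSize <= 0) {
            throw new IllegalArgumentException("no room left for hashed features");
        }
        for (int p = 0; p < probes; p++) {
            v[reserved + slot(word, p, hashedSize)] += 1.0;
        }
    }

    public static void main(String[] args) {
        double[] v = new double[30];
        // Raw values copied from the first quoted sample.
        v[0] = 0.923307513352484;
        v[1] = 0.0135197141207755;
        v[2] = 0.644866125183976;
        v[3] = 2.0;
        addHashed(v, 4, "some-shape", 2); // "some-shape" is a placeholder value
        for (int i = 0; i < v.length; i++) {
            if (v[i] != 0.0) {
                System.out.println(i + ":" + v[i]);
            }
        }
    }
}
```

The trade-off is that the hashed part has fewer slots to work with, so
collisions among the hashed features themselves become slightly more likely.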
