From: Ted Dunning
Date: Fri, 29 Nov 2013 14:26:39 -0800
Subject: Re: RandomAccessSparseVector setting 1.0 in 2 dims for 1 feature value?
To: "user@mahout.apache.org"

If you always insert 1's for each element, then you can detect collisions
by inserting all your elements (or all elements in each document
separately) and looking for the max value in the vector. If you see
something >1, you have a collision.

But collisions are actually good. The only way to completely avoid them is
to use a vector as large as your vocabulary which is often painfully large.

You can also view multiple probes not so much as avoiding collisions, but
as making the linear transformation from the very large dimensional
representation of one dimension per word to the lower hashed representation
more likely to be nearly invertible in the sense that the Euclidean metric
will be approximately preserved. Think Johnson-Lindenstrauss random
projections.


On Fri, Nov 29, 2013 at 1:54 AM, Paul van Hoven wrote:

> Hi, thanks for your quick reply. So multiple probes are a protection
> against collisions? After playing a little with the default length of
> a RandomAccessSparseVector object I noticed that (of course)
> collisions occur when the length is too short. Therefore, I'm asking
> myself if there is a possibility to check if a collision occurred
> after encoding a new value in the vector?
> This would give a user the
> information that the length of the chosen vector is too short. So far,
> I did not find any method in the API to check for that.
>
> 2013/11/29 Ted Dunning:
> > The default with the Mahout encoders is two probes. This is unnecessary
> > with the intercept term, of course, if you protect the intercept term from
> > other updates, possibly by encoding other data using a view of the original
> > feature vector.
> >
> > For each probe, a different hash is used so each value is put into multiple
> > locations. Multiple probes are useful in general to decrease the effect of
> > the reduced dimensionality of the hashed representation.
> >
> >
> >
> > On Fri, Nov 29, 2013 at 1:14 AM, Paul van Hoven <paul.van.hoven@gmail.com> wrote:
> >
> >> For an example program using mahout I use the donut.csv sample data
> >> from the project (
> >> https://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/resources/donut.csv
> >> ). My code looks like this:
> >>
> >> import java.util.ArrayList;
> >>
> >> import org.apache.mahout.math.RandomAccessSparseVector;
> >> import org.apache.mahout.math.Vector;
> >> import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
> >> import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
> >> import com.csvreader.CsvReader;
> >>
> >> public class Runner {
> >>
> >>   //Set the path accordingly!
> >>   public static final String csvInputDataPath = "/path/to/donut.csv";
> >>
> >>   public static void main(String[] args) {
> >>
> >>     FeatureVectorEncoder encoder = new StaticWordValueEncoder("features");
> >>     ArrayList<RandomAccessSparseVector> featureVectors =
> >>         new ArrayList<RandomAccessSparseVector>();
> >>     try {
> >>       CsvReader csvReader = new CsvReader(csvInputDataPath);
> >>       csvReader.readHeaders();
> >>       while( csvReader.readRecord() ) {
> >>         Vector featureVector = new RandomAccessSparseVector(30);
> >>         featureVector.set(0, new Double(csvReader.get("x")));
> >>         featureVector.set(1, new Double(csvReader.get("y")));
> >>         featureVector.set(2, new Double(csvReader.get("c")));
> >>         featureVector.set(3, new Integer(csvReader.get("color")));
> >>         System.out.println("Before: " + featureVector.toString());
> >>         encoder.addToVector(csvReader.get("shape").getBytes(),
> >>             featureVector);
> >>         System.out.println(" After: " + featureVector.toString());
> >>         featureVectors.add((RandomAccessSparseVector) featureVector);
> >>       }
> >>     } catch(Exception e) {
> >>       e.printStackTrace();
> >>     }
> >>
> >>     System.out.println("Program is done.");
> >>   }
> >>
> >> }
> >>
> >>
> >> What confuses me is the following output (one sample):
> >>
> >> Before:
> >> {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0}
> >> After:
> >> {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0,29:1.0,25:1.0}
> >>
> >> As you can see, I added just one value "shape" to the vector. However
> >> two dimensions of this vector are encoded with 1.0. On the other hand,
> >> for some other data I get the output
> >>
> >> Before:
> >> {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:2.0}
> >> After:
> >> {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:3.0,16:1.0}
> >>
> >> Why? I would expect that _always_ only one dimension gets occupied by
> >> 1.0 as this is the standard case for categorical encoding. This way
> >> this seems to be wrong.
> >>
> >> Thanks in advance,
> >> Paul
> >>
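
The collision check described in the reply at the top of this message (insert 1's, then look for a maximum value above 1) can be sketched in plain Java without Mahout. This is only a toy model under stated assumptions: the class name HashedEncodingSketch and its methods are hypothetical, and CRC32 is a stand-in hash, not the hash function the Mahout encoders use. Each word is written to two locations (two probes, each with a different hash input), mirroring the behaviour discussed in the thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

/** Toy model of hashed feature encoding. NOT Mahout's implementation. */
class HashedEncodingSketch {

    /** Slot for one probe: hash the word together with the probe number (assumed scheme). */
    static int probeIndex(String word, int probe, int size) {
        CRC32 crc = new CRC32();
        crc.update((probe + ":" + word).getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative long below 2^32, so the remainder is a valid index.
        return (int) (crc.getValue() % size);
    }

    /** Adds 1.0 at every probe location of every word. */
    static double[] encode(String[] words, int size, int probes) {
        double[] vector = new double[size];
        for (String word : words) {
            for (int p = 0; p < probes; p++) {
                vector[probeIndex(word, p, size)] += 1.0;
            }
        }
        return vector;
    }

    /**
     * If only 1's were inserted, any entry above 1 means two increments shared a slot.
     * This includes the case where both probes of the same word land in one slot.
     */
    static boolean hasCollision(double[] vector) {
        return Arrays.stream(vector).max().orElse(0.0) > 1.0;
    }

    public static void main(String[] args) {
        String[] shapes = {"circle", "square"};

        // 4 increments into 3 slots: a collision is guaranteed by the pigeonhole principle.
        double[] tiny = encode(shapes, 3, 2);
        System.out.println(Arrays.toString(tiny) + " collision=" + hasCollision(tiny));

        // A longer vector makes collisions less likely, but never impossible.
        double[] larger = encode(shapes, 1000, 2);
        System.out.println("size 1000 collision=" + hasCollision(larger));
    }
}
```

Note that the check only works on a vector that holds nothing but hashed 1's. In the program quoted above, indices 0 to 3 are set by hand before encoding, so a probe that lands on one of them (as in the second sample, where index 3 goes from 2.0 to 3.0) would have to be detected by comparing the vector before and after encoding instead.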