Some minor testing, based on previous mail threads (thanks Ted, I mean everyone).
A LocalitySensitiveHash space is defined by 50 random vectors with values
in (0,1). A sample vector's position in the LSH space is a bitset: each
bit is the sign of the cosine similarity between the sample and one of
the LSH vectors. Thus every LSH signature has 50 bits. This space, of
course, has no negative vector components.
Test data: 1000 random vectors as samples, all values uniformly
distributed in (0,1). This test data gives no negative cosine
similarities, and so all bits are 0. This is expected (from previous
mail threads).
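A minimal sketch of that setup in pure Python (hypothetical names, not the actual patch code):

```python
import random

def lsh_signature(sample, planes):
    # One bit per random plane: the bit is 1 when the dot product with
    # the plane is negative. The dot product has the same sign as the
    # cosine similarity, since vector norms are positive.
    return [1 if sum(s * p for s, p in zip(sample, plane)) < 0 else 0
            for plane in planes]

random.seed(1)
DIM, BITS = 20, 50
planes = [[random.random() for _ in range(DIM)] for _ in range(BITS)]
samples = [[random.random() for _ in range(DIM)] for _ in range(100)]

# Every component of every plane and sample is positive, so every dot
# product is positive and every bit of every signature is 0.
sigs = [lsh_signature(s, planes) for s in samples]
```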
Now, define each random vector in the space as the dimension-wise
delta between two random vectors. This gives vector values ranging
from -0.5 to 0.5. The same test of 1000 random vectors gives Hamming
distances of 935 to 950 (non-matching) bits. Here is the
OnlineSummarizer output:
[(count=1000.0),(sd=4.15466),(mean=938.856),(min=928.0),(25%=935.935),(median=938.932),(75%=941.725),(max=951.0),]
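For comparison, the same sketch with delta planes (again hypothetical code, and a much smaller run than the 1000-vector test above):

```python
import random

def lsh_signature(sample, planes):
    # Bit = 1 when the dot product (same sign as cosine similarity)
    # is negative.
    return [1 if sum(s * p for s, p in zip(sample, plane)) < 0 else 0
            for plane in planes]

def hamming(a, b):
    # Count of positions where two signatures differ.
    return sum(x != y for x, y in zip(a, b))

random.seed(2)
DIM, BITS = 20, 50
# Each plane is the dimension-wise delta of two uniform random vectors,
# so its components can be negative.
planes = [[random.random() - random.random() for _ in range(DIM)]
          for _ in range(BITS)]
samples = [[random.random() for _ in range(DIM)] for _ in range(50)]
sigs = [lsh_signature(s, planes) for s in samples]

# Consecutive-pair Hamming distances now spread out between 0 and 50
# instead of being identically 0.
dists = [hamming(sigs[i], sigs[i + 1]) for i in range(len(sigs) - 1)]
```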
Changing the sample vectors from uniform to Gaussian, the distances
range from 916 to 961, and the middle quartiles are farther from the
median:
[(count=1000.0),(sd=5.52686),(mean=935.912),(min=916.0),(25%=931.871),(median=935.810),(75%=939.730),(max=961.0),]
Given the central limit theorem blah blah blah, this seems appropriate.
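To make the hand-waving a little more concrete: with delta planes, the dot product of a positive sample against a plane is a sum of many independent, zero-mean terms, so it is symmetric around zero and each bit behaves like a fair coin flip. A quick sanity check (hypothetical code):

```python
import random

random.seed(3)
DIM, TRIALS = 100, 10000
sample = [random.random() for _ in range(DIM)]

negatives = 0
for _ in range(TRIALS):
    # Delta plane: dimension-wise difference of two uniform vectors.
    plane = [random.random() - random.random() for _ in range(DIM)]
    dot = sum(s * p for s, p in zip(sample, plane))
    if dot < 0:
        negatives += 1

# (u - v) is symmetric around 0, so P(dot < 0) is exactly 1/2 and the
# observed fraction sits near 0.5.
fraction_negative = negatives / TRIALS
```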
Should I spin a patch? Where would this go? I have dense and sparse
implementations, a Vector wrapper, and some unit tests.
Next week in Jerusalem,
Lance Norskog
goksron@gmail.com
