[ https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=15889180#comment15889180 ]
Mingjie Tang edited comment on SPARK-19771 at 3/1/17 12:30 AM:

If we follow the AND-OR framework, one optimization is possible here.
Suppose we use BucketedRandomProjectionLSH with setNumHashTables(3) and setNumHashFunctions(5).
For one input vector in sparse format, e.g., [(6,[0,1,2],[1.0,1.0,1.0])],
its mapped hash vectors are
WrappedArray([1.0,1.0,1.0,0.0,0.0], [1.0,0.0,0.0,1.0,1.0], [1.0,1.0,1.0,1.0,0.0])
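For reference, a minimal plain-Python sketch (not the actual Spark internals) of how one entry of such a hash vector is produced by random projection; the projection vector v and bucket width w below are made up for illustration:

```python
import math

def rp_hash(x, v, w):
    """One random-projection hash value: floor(dot(x, v) / w).

    Mirrors the BucketedRandomProjectionLSH idea; v (random projection
    vector) and w (bucket width) are hypothetical values for illustration.
    """
    return math.floor(sum(a * b for a, b in zip(x, v)) / w)

# Dense view of the sparse vector (6,[0,1,2],[1.0,1.0,1.0])
x = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
v = [0.4, 0.3, 0.5, 0.1, 0.2, 0.6]  # hypothetical random projection vector
print(rp_hash(x, v, 1.0))  # -> 1, one entry of one hash vector
```

With setNumHashFunctions(5), five such values form one hash vector, and setNumHashTables(3) produces three such vectors per input.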
For the similarity join, we map this computed vector into
+--------------------+-----+--------------------+
|            datasetA|entry|           hashValue|
+--------------------+-----+--------------------+
|[0,(6,[0,1,2],[1....|    0|    [1.0,1.0,1.0,0...|
|[0,(6,[0,1,2],[1....|    1|   [1.0,0.0,0.0,1....|
|[0,(6,[0,1,2],[1....|    2|     [1.0,1.0,1.0,...|
+--------------------+-----+--------------------+
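The structure of this join can be sketched in plain Python (a hypothetical stand-in for the Spark SQL equi-join; `explode` and `similarity_join` are illustrative helpers, not Spark APIs): each record becomes one row per hash table, and two records are candidates if some exploded row agrees on both entry and the full hashValue, i.e., AND within a table, OR across tables:

```python
def explode(dataset):
    """Explode {id: [hash vector per table]} into (id, entry, hashValue) rows."""
    return [(rid, entry, tuple(hv))
            for rid, tables in dataset.items()
            for entry, hv in enumerate(tables)]

def similarity_join(a, b):
    """Candidate pairs whose hash vectors agree in at least one table."""
    index = {}
    for rid, entry, hv in explode(b):
        index.setdefault((entry, hv), []).append(rid)
    return sorted({(ra, rb)
                   for ra, entry, hv in explode(a)
                   for rb in index.get((entry, hv), [])})

# Toy data: 2 hash tables, 2 hash functions each
a = {0: [[1.0, 1.0], [1.0, 0.0]]}
b = {7: [[1.0, 1.0], [0.0, 0.0]], 8: [[0.0, 0.0], [0.0, 1.0]]}
print(similarity_join(a, b))  # -> [(0, 7)]: they agree in table 0
```

Note that the join key here is the whole hash vector; that per-row vector comparison is exactly the cost the optimization below removes.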
Then, based on the AND-OR principle, we use entry and hashValue for an equi-join with the other
table. Looking at the SQL, it needs to iterate through the hashValue vector for the equality
join, so the computation and memory cost is very high. For example, if we apply
a nested loop to compare two vectors of length NumHashFunctions, the computation cost
is (NumHashFunctions * NumHashFunctions) and the memory overhead is (N * NumHashFunctions).
Therefore, we can apply one optimization here: we do not need to store
the hash value for each hash table like [1.0,1.0,1.0,0.0,0.0]; instead, we use a simple
hash function.
Suppose h_l = {h_0, h_1, ..., h_k}, where k is the value of setNumHashFunctions.
Then the mapped hash function is
H(h_l) = (sum_{i=0..k} h_i * r_i) mod prime,
where each r_i is a random integer and the prime can be set to 2^32 - 5 for fewer hash collisions.
Then, we only need to store the single hash value H(h_l) for the AND operation.
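A minimal sketch of that per-table hash, with hypothetical random coefficients r_i and the prime 2^32 - 5:

```python
import random

PRIME = 2**32 - 5  # a large prime; 2^32 - 5 is a common choice in LSH code

def table_hash(h, r, prime=PRIME):
    """Collapse one table's hash vector h into a single value:
    H(h) = (sum_i h_i * r_i) mod prime, so the AND step becomes
    one integer comparison instead of a vector comparison."""
    return int(sum(hi * ri for hi, ri in zip(h, r))) % prime

random.seed(0)  # fixed seed so the hypothetical coefficients are reproducible
r = [random.randrange(PRIME) for _ in range(5)]  # one r_i per hash function

h = [1.0, 1.0, 1.0, 0.0, 0.0]  # first hash vector from the example above
print(table_hash(h, r))        # one integer replaces the whole vector
```

Equal hash vectors always map to equal H values, so the equi-join stays correct up to the (rare) collisions the large prime keeps unlikely.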
A similar approach can also be applied to approxNearestNeighbors search.
If the multi-probe approach does not need to read the hash value of an individual hash function
(e.g., h_l mentioned above), we can apply this H(h_l) to the preprocessed data to save memory.
I will double-check the multi-probe approach and see whether it works. Then, I can submit
a patch to improve the current approach. Note that using this hash function to avoid vector-to-vector
comparison is widely used in different places, for example in the LSH implementation by Andoni.
[~yunn] [~mlnick]
> Support OR-AND amplification in Locality Sensitive Hashing (LSH)
> ----------------------------------------------------------------
>
> Key: SPARK-19771
> URL: https://issues.apache.org/jira/browse/SPARK-19771
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.1.0
> Reporter: Yun Ni
>
> The current LSH implementation only supports AND-OR amplification. We need to discuss
> the following questions before we go to implementation:
> (1) Whether we should support OR-AND amplification
> (2) What API changes we need for OR-AND amplification
> (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally.

This message was sent by Atlassian JIRA
(v6.3.15#6346)

To unsubscribe, email: issues-unsubscribe@spark.apache.org
For additional commands, email: issues-help@spark.apache.org
