mahout-user mailing list archives

From Sebastian <...@apache.org>
Subject Re: Scaling up spark item similarity on big data sets
Date Thu, 23 Jun 2016 14:01:55 GMT
Hi,

Pairwise similarity is a quadratic problem, and it is very easy to run into
a problem size that no longer scales, especially with so many items.
Our code downsamples the input data to help with this.

One thing you can do is set the maxNumInteractions argument to a lower
value to increase the amount of downsampling. Another thing you can do is
remove the items with the highest number of interactions from the dataset;
they are usually not very interesting (everybody knows the topsellers
already) and they heavily impact the computation.

Best,
Sebastian


On 23.06.2016 15:47, jelmer wrote:
> Hi,
>
> I am trying to build a simple recommendation engine using spark item
> similarity (e.g. with
> org.apache.mahout.math.cf.SimilarityAnalysis.cooccurrencesIDSs)
>
> Things work fine on a comparatively small dataset, but I am having
> difficulty scaling it up.
>
> The input I am using is CSV data containing 19.988.422 view-item events
> produced by 1.384.107 users, covering 5.135.845 distinct products.
>
> The CSV data is stored on HDFS and is split over 15 files; consequently
> the resultant RDD will have 15 partitions.
>
> After tweaking some parameters I did manage to get the job to run without
> running out of memory, but it takes a very, very long time.
>
> After running for 15 hours it is still stuck on:
>
> org.apache.spark.rdd.RDD.flatMap(RDD.scala:332)
> org.apache.mahout.sparkbindings.blas.AtA$.at_a_nongraph_mmul(AtA.scala:254)
> org.apache.mahout.sparkbindings.blas.AtA$.at_a(AtA.scala:61)
> org.apache.mahout.sparkbindings.SparkEngine$.tr2phys(SparkEngine.scala:325)
> org.apache.mahout.sparkbindings.SparkEngine$.tr2phys(SparkEngine.scala:339)
> org.apache.mahout.sparkbindings.SparkEngine$.toPhysical(SparkEngine.scala:123)
> org.apache.mahout.math.drm.logical.CheckpointAction.checkpoint(CheckpointAction.scala:41)
> org.apache.mahout.math.drm.package$.drm2Checkpointed(package.scala:95)
> org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$3.apply(SimilarityAnalysis.scala:145)
> org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$3.apply(SimilarityAnalysis.scala:143)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
> scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> scala.collection.AbstractIterator.to(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:257)
> scala.collection.AbstractIterator.toList(Iterator.scala:1157)
>
>
> I am using Spark on YARN, and containers cannot use more than 16 GB.
>
> I figured I would be able to speed things up by throwing a larger number of
> executors at the problem, but so far that is not working out very well.
>
> I tried assigning 500 executors and repartitioning the input data to 500
> partitions; even changing spark.yarn.driver.memoryOverhead to crazy values
> (half of the heap) did not resolve this.
>
> Could someone offer any guidance on how best to speed up item similarity
> jobs?
>
