spark-user mailing list archives

From "Abhishek R. Singh" <abhis...@tetrationanalytics.com>
Subject Re: How to increase parallelism of a Spark cluster?
Date Mon, 03 Aug 2015 01:13:11 GMT
I don't know if your expectation that workers will process multiple partitions in parallel
is really valid, or whether having more partitions than workers will necessarily help
(unless you are memory bound, in which case more partitions mainly shrink each unit of work
rather than increase execution parallelism).
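
For what it's worth, the mental model I use is that the number of tasks running at any instant
is capped by the cores granted to the application (divided by spark.task.cpus, which defaults
to 1), not by the partition count; extra partitions just wait in the queue. A quick check along
those lines, purely as an illustration and assuming the SparkContext (sc) and myRDD from your post:

// Illustration only: compares the cores actually available with the partition count.
// Assumes an existing SparkContext named sc and the myRDD from the original post.
val grantedCores  = sc.defaultParallelism                    // usually the total cores granted to the app
val numPartitions = myRDD.repartition(10).partitions.length

// If grantedCores reports 4, only 4 of the 10 partitions run at a time,
// which would line up with seeing a constant 4 connections on the Solr side.
println(s"cores available: $grantedCores, partitions: $numPartitions")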

[Disclaimer: I am no authority on Spark, but wanted to throw in my spin based on my own understanding.]


Nothing official about it :)

-abhishek-

> On Jul 31, 2015, at 1:03 PM, Sujit Pal <sujitatgtalk@gmail.com> wrote:
> 
> Hello,
> 
> I am trying to run a Spark job that hits an external webservice to get back some information.
> The cluster is 1 master + 4 workers, each worker has 60GB RAM and 4 CPUs. The external
> webservice is a standalone Solr server, and is accessed using code similar to that shown below.
> 
>> def getResults(keyValues: Iterator[(String, Array[String])]):
>>         Iterator[(String, String)] = {
>>     val solr = new HttpSolrClient()
>>     initializeSolrParameters(solr)
>>     keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
>> }
>> myRDD.repartition(10)     
>>              .mapPartitions(keyValues => getResults(keyValues))
>  
> The mapPartitions call does some initialization of the SolrJ client per partition and then
> hits it for each record in the partition via the getResults() call.
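
A minimal, self-contained version of that pattern might look like the sketch below; the Solr
URL, the query construction, and the "process" step are placeholders rather than the code from
the post, and the client is closed once the partition has been consumed (note that SolrJ 5.x
constructors expect the base URL up front).

import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient

// Hypothetical sketch, not the poster's actual code: one client per partition,
// queried once per record, closed after the partition's records are processed.
def getResults(keyValues: Iterator[(String, Array[String])]): Iterator[(String, String)] = {
  val solr = new HttpSolrClient("http://solrhost:8983/solr/collection1")  // placeholder URL
  val results = keyValues.map { case (key, values) =>
    val resp = solr.query(new SolrQuery(values.mkString(" ")))  // placeholder query construction
    (key, resp.getResults.getNumFound.toString)                 // placeholder "process" step
  }.toList               // force evaluation before closing the client
  solr.close()           // SolrJ 5.x; use shutdown() on SolrJ 4.x
  results.iterator
}

// As in the original post (assumes myRDD: RDD[(String, Array[String])]):
myRDD.repartition(10).mapPartitions(getResults)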
> 
> I repartitioned in the hope that this will result in 10 clients hitting Solr simultaneously
> (I would like to go up to maybe 30-40 simultaneous clients if I can). However, I counted the
> number of open connections by running netstat -anp | grep ":8983.*ESTABLISHED" in a loop on
> the Solr box and observed that Solr has a constant 4 clients (i.e., equal to the number of
> workers) over the lifetime of the run.
> 
> My observation leads me to believe that each worker processes a single stream of work
> sequentially. However, from what I understand about how Spark works, each worker should be
> able to process a number of tasks in parallel, and repartition() is a hint for it to do so.
> 
> Is there some SparkConf environment variable I should set to increase parallelism in these
> workers, or should I just configure a cluster with multiple workers per machine? Or is there
> something I am doing wrong?
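
A sketch of the configuration knobs that usually govern this, with the caveat that which one
actually matters depends on the cluster manager and Spark version (the values below are
placeholders, not a recommendation):

import org.apache.spark.SparkConf

// Sketch only: standalone-mode settings that bound task-level parallelism.
val conf = new SparkConf()
  .setAppName("solr-lookup")                    // placeholder app name
  .set("spark.cores.max", "16")                 // total cores the app may use across the cluster
  .set("spark.executor.cores", "4")             // cores (= concurrent tasks) per executor
  .set("spark.default.parallelism", "16")       // default partition count for shuffles/parallelize

// Running more than one worker per machine is a spark-env.sh setting rather than SparkConf:
//   SPARK_WORKER_INSTANCES=2  (with SPARK_WORKER_CORES / SPARK_WORKER_MEMORY sized to match)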
> 
> Thank you in advance for any pointers you can provide.
> 
> -sujit
> 
