spark-user mailing list archives

From Igor Berman <igor.ber...@gmail.com>
Subject Re: How to increase parallelism of a Spark cluster?
Date Sun, 02 Aug 2015 17:52:11 GMT
What kind of cluster is it? How many cores does each worker have? Is there any
configuration for the Solr HTTP client? I remember that the standard HttpClient
has a connection limit per route/host.
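
The reason for asking about cores: as far as I understand, the number of tasks running at the same time is bounded both by the number of partitions and by the number of cores the workers actually offer to the job. Here is a rough sketch of that arithmetic. It is only an illustration, not Spark API: `ParallelismEstimate` and `maxConcurrentTasks` are names made up for this example, and `cpusPerTask` stands in for Spark's `spark.task.cpus` setting (default 1).

```scala
// Hypothetical helper, not part of Spark: estimates an upper bound on
// how many tasks (and so how many per-partition clients) can run at once.
object ParallelismEstimate {
  def maxConcurrentTasks(workers: Int,
                         coresPerWorker: Int,
                         partitions: Int,
                         cpusPerTask: Int = 1): Int = {
    require(cpusPerTask > 0, "cpusPerTask must be positive")
    // Each worker can run one task per group of cpusPerTask cores it offers.
    val slots = workers * (coresPerWorker / cpusPerTask)
    // There are never more running tasks than partitions.
    math.min(slots, partitions)
  }
}
```

With 4 workers offering 4 cores each and 10 partitions, this gives 10. If each worker only offers 1 core to the job, it gives 4, which would match the 4 connections seen on the Solr box. So it is worth checking in the Spark UI how many cores the application was actually granted.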
On Aug 2, 2015 8:17 PM, "Sujit Pal" <sujitatgtalk@gmail.com> wrote:

> No one has any ideas?
>
> Is there some more information I should provide?
>
> I am looking for ways to increase the parallelism among workers. Currently
> the number of simultaneous connections to Solr is equal to the number of
> workers. My number of partitions is 2.5x the number of workers, and the
> workers seem to be large enough to handle more than one task at a time.
>
> I am creating a single client per partition in my mapPartitions call. I am
> not sure whether that is what is gating the concurrency. Perhaps I should
> use a pool of clients instead?
>
> Would really appreciate some pointers.
>
> Thanks in advance for any help you can provide.
>
> -sujit
>
>
> On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal <sujitatgtalk@gmail.com> wrote:
>
>> Hello,
>>
>> I am trying to run a Spark job that hits an external webservice to get
>> back some information. The cluster is 1 master + 4 workers, each worker has
>> 60GB RAM and 4 CPUs. The external webservice is a standalone Solr server,
>> and is accessed using code similar to that shown below.
>>
>> def getResults(keyValues: Iterator[(String, Array[String])]):
>>         Iterator[(String, String)] = {
>>     val solr = new HttpSolrClient()
>>     initializeSolrParameters(solr)
>>     keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
>> }
>> myRDD.repartition(10)
>>      .mapPartitions(keyValues => getResults(keyValues))
>>
>> The getResults() function passed to mapPartitions initializes one SolrJ
>> client per partition and then queries Solr with it for each record in the
>> partition.
>>
>> I repartitioned in the hope that this would result in 10 clients hitting
>> Solr simultaneously (I would like to go up to maybe 30-40 simultaneous
>> clients if I can). However, I counted the number of open connections using
>> netstat -anp | grep ":8983.*ESTABLISHED" in a loop on the Solr box and
>> observed that Solr has a constant 4 clients (i.e., equal to the number of
>> workers) over the lifetime of the run.
>>
>> My observation leads me to believe that each worker processes a single
>> stream of work sequentially. However, from what I understand about how
>> Spark works, each worker should be able to process a number of tasks in
>> parallel, and repartition() is a hint for it to do so.
>>
>> Is there some SparkConf environment variable I should set to increase
>> parallelism in these workers, or should I just configure a cluster with
>> multiple workers per machine? Or is there something I am doing wrong?
>>
>> Thank you in advance for any pointers you can provide.
>>
>> -sujit
>>
>>
>
