spark-user mailing list archives

From Igor Berman <igor.ber...@gmail.com>
Subject Re: How to increase parallelism of a Spark cluster?
Date Sun, 02 Aug 2015 21:41:27 GMT
So how many cores did you configure per node?
Do you have something like --total-executor-cores or maybe --num-executors
configured? (I'm not sure what kind of cluster the Databricks platform
provides; if it's standalone, then the first option should be used.) If you
have 4 cores in total, then even though you have 4 cores per machine, only 1
is working on each machine... which could be a cause.
Another option: you are hitting some default config limiting the number of
concurrent routes or the max total connections from the JVM. Look at
https://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html
(assuming you are using HttpClient from the 4.x and not the 3.x version).
Not sure what the defaults are...
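
(For reference, a minimal Scala sketch of the standalone option mentioned
above; the app name and core count are illustrative, not from this thread:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("solr-enrichment")  // hypothetical app name
  .set("spark.cores.max", "16")   // total cores this app may use across the cluster

This is the programmatic equivalent of passing --total-executor-cores 16 to
spark-submit against a standalone master; with 4 workers x 4 CPUs, a cap of
16 would let every core run a task at once.)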



On 2 August 2015 at 23:42, Sujit Pal <sujitatgtalk@gmail.com> wrote:

> Hi Igor,
>
> The cluster is a Databricks Spark cluster. It consists of 1 master + 4
> workers; each worker has 60GB RAM and 4 CPUs. The original mail has some
> more details (also, the reference to HttpSolrClient in there should be
> HttpSolrServer; sorry about that, a mistake made while writing the email).
>
> There is no additional configuration on the external Solr host from my
> code; I am using the default HttpClient provided by HttpSolrServer.
> According to the Javadocs, you can pass in an HttpClient object as well. Is
> there some specific configuration you would suggest to get past any limits?
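>
> (A minimal sketch of what passing a configured HttpClient might look like,
> assuming SolrJ 4.x and HttpClient 4.x; the URL and the limits are
> placeholders, not from this thread:
>
> import org.apache.http.impl.client.HttpClients
> import org.apache.http.impl.conn.PoolingHttpClientConnectionManager
> import org.apache.solr.client.solrj.impl.HttpSolrServer
>
> val cm = new PoolingHttpClientConnectionManager()
> cm.setMaxTotal(40)           // max connections overall from this JVM
> cm.setDefaultMaxPerRoute(40) // max connections to any single host, e.g. the Solr box
> val httpClient = HttpClients.custom().setConnectionManager(cm).build()
> val solr = new HttpSolrServer("http://solr-host:8983/solr", httpClient)  // placeholder URL
>
> The HttpClient 4.x pooling manager defaults to 20 connections in total and
> 2 per route, which could easily be the limit being hit here.)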
>
> On another project, I faced a similar problem, but I had more leeway (I was
> using a Spark cluster on EC2) and less time. My workaround was to use Python
> multiprocessing to create a program that started up 30 Python JSON/HTTP
> clients and wrote output into 30 output files, which were then processed by
> Spark. I mention this because I was using default configurations there as
> well; I just needed to increase the number of connections against Solr.
>
> This time round, I would like to do this through Spark because it makes
> the pipeline less complex.
>
> -sujit
>
>
> On Sun, Aug 2, 2015 at 10:52 AM, Igor Berman <igor.berman@gmail.com>
> wrote:
>
>> What kind of cluster? How many cores on each worker? Is there a config for
>> the HTTP Solr client? I remember the standard HttpClient has a limit per route/host.
>> On Aug 2, 2015 8:17 PM, "Sujit Pal" <sujitatgtalk@gmail.com> wrote:
>>
>>> No one has any ideas?
>>>
>>> Is there some more information I should provide?
>>>
>>> I am looking for ways to increase the parallelism among workers.
>>> Currently I see a number of simultaneous connections to Solr equal to the
>>> number of workers. My number of partitions is 2.5x the number of workers,
>>> and the workers seem to be large enough to handle more than one task at a
>>> time.
>>>
>>> I am creating a single client per partition in my mapPartitions call. Not
>>> sure if that is creating the gating situation? Perhaps I should use a pool
>>> of clients instead?
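>>>
>>> (One common pattern, not from this thread, goes the other direction: a
>>> per-JVM lazy singleton, so that all tasks in an executor share one client.
>>> A sketch, with a placeholder object name and URL:
>>>
>>> import org.apache.solr.client.solrj.impl.HttpSolrServer
>>>
>>> object SolrHolder {
>>>   // initialized once per executor JVM, then shared by every task in it
>>>   // (safe for concurrent requests as long as its setters are left alone)
>>>   lazy val solr = new HttpSolrServer("http://solr-host:8983/solr")
>>> }
>>>
>>> myRDD.mapPartitions { keyValues =>
>>>   keyValues.map(kv => (kv._1, process(SolrHolder.solr, kv)))
>>> }
>>>
>>> This keeps connection pooling inside the one shared client instead of
>>> creating a fresh client per partition.)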
>>>
>>> Would really appreciate some pointers.
>>>
>>> Thanks in advance for any help you can provide.
>>>
>>> -sujit
>>>
>>>
>>> On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal <sujitatgtalk@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am trying to run a Spark job that hits an external webservice to get
>>>> back some information. The cluster is 1 master + 4 workers; each worker has
>>>> 60GB RAM and 4 CPUs. The external webservice is a standalone Solr server,
>>>> and it is accessed using code similar to that shown below.
>>>>
>>>>> def getResults(keyValues: Iterator[(String, Array[String])]): Iterator[(String, String)] = {
>>>>>   val solr = new HttpSolrClient()
>>>>>   initializeSolrParameters(solr)
>>>>>   keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
>>>>> }
>>>>>
>>>>> myRDD.repartition(10)
>>>>>   .mapPartitions(keyValues => getResults(keyValues))
>>>>
>>>> The mapPartitions call does some per-partition initialization of the
>>>> SolrJ client, which is then hit for each record in the partition inside
>>>> the getResults() call.
>>>>
>>>> I repartitioned in the hope that this would result in 10 clients hitting
>>>> Solr simultaneously (I would like to go up to maybe 30-40 simultaneous
>>>> clients if I can). However, I counted the number of open connections using
>>>> netstat -anp | grep ":8983.*ESTABLISHED" in a loop on the Solr box and
>>>> observed that Solr has a constant 4 clients (i.e., equal to the number of
>>>> workers) over the lifetime of the run.
>>>>
>>>> My observation leads me to believe that each worker processes a single
>>>> stream of work sequentially. However, from what I understand about how
>>>> Spark works, each worker should be able to process a number of tasks in
>>>> parallel, and repartition() should give it enough tasks to do so.
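>>>>
>>>> (Two quick sanity checks from the driver; nothing below is actual output
>>>> from the run described above:
>>>>
>>>> println(sc.defaultParallelism)                    // default task count Spark aims for
>>>> println(myRDD.repartition(10).partitions.length)  // should print 10
>>>>
>>>> If both look right and tasks still run one at a time per worker, the
>>>> number of cores offered to each executor is the likely limit.)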
>>>>
>>>> Is there some SparkConf setting I should use to increase parallelism in
>>>> these workers, or should I just configure a cluster with multiple workers
>>>> per machine? Or is there something I am doing wrong?
>>>>
>>>> Thank you in advance for any pointers you can provide.
>>>>
>>>> -sujit
>>>>
>>>>
>>>
>
