spark-user mailing list archives

From Igor Berman
Subject Re: bulk upload to Elasticsearch and shuffle behavior
Date Tue, 01 Sep 2015 08:50:27 GMT
Hi Eric,
I see that you solved your problem. IMHO, when you repartition you split
your work into two stages: the HBase lookup happens in the first stage, and
the upload to ES happens after the shuffle, in the next stage. Without the
repartition, both run in the same stage, so it's hard to tell how much of
the time is the ES upload and how much is the HBase lookup.

If you don't mind my asking, do you reduce the number of partitions
before uploading to ES? Do you have a rule of thumb for how many
partitions there should be at that point?
We have a similar pipeline, and we reduce the number of partitions to 8 or
so before uploading to ES (this probably depends on the strength of the ES
cluster).
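
To make the idea concrete, here is a rough sketch in PySpark style of
narrowing to a handful of partitions and sending each partition's documents
in bulk batches. It is only an illustration, not our actual code: the names
`batched`, `upload_partition`, `upload_rdd` and `send_bulk`, and the batch
size of 500, are made up for the example. `send_bulk` stands in for whatever
ES client call you use and would have to be serializable by Spark.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(docs: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most batch_size documents from any iterable."""
    it = iter(docs)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def upload_partition(docs: Iterable[str],
                     send_bulk: Callable[[List[str]], None],
                     batch_size: int = 500) -> int:
    """Send one partition's JSON documents in batches; return how many were sent."""
    sent = 0
    for batch in batched(docs, batch_size):
        send_bulk(batch)  # hypothetical: one bulk request to ES per batch
        sent += len(batch)
    return sent


def upload_rdd(json_rdd, send_bulk, num_partitions: int = 8) -> None:
    """Narrow an RDD of JSON strings to num_partitions, then upload each partition."""
    def _upload(docs):
        upload_partition(docs, send_bulk)

    # coalesce() merges existing partitions without a shuffle by default,
    # whereas repartition() always shuffles. 8 is just the number that works
    # for us; it is an assumption, not a general recommendation.
    json_rdd.coalesce(num_partitions).foreachPartition(_upload)
```

One caveat: because `coalesce` does not add a stage boundary, the work that
feeds it (the HBase lookup, in your case) would run with the same reduced
number of tasks, so this only helps if that step does not need more
parallelism than the upload.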

On 1 September 2015 at 06:05, Eric Walker wrote:

> I think I have found out what was causing me difficulties.  It seems I was
> reading too much into the stage description shown in the "Stages" tab of
> the Spark application UI.  While it said "repartition at
>", I can infer from the network traffic and
> from its response to changes I subsequently made that the actual code that
> was running was the code doing the HBase lookups.  I suspect the actual
> shuffle, once it occurred, required network IO on the same order as the
> upload to Elasticsearch that followed.
> Eric
> On Mon, Aug 31, 2015 at 6:09 PM, Eric Walker wrote:
>> Hi,
>> I am working on a pipeline that carries out a number of stages, the last
>> of which is to build some large JSON objects from information in the
>> preceding stages.  The JSON objects are then uploaded to Elasticsearch in
>> bulk.
>> If I carry out a shuffle via a `repartition` call after the JSON
>> documents have been created, the upload to ES is fast.  But the shuffle
>> itself takes many tens of minutes and is IO-bound.
>> If I omit the repartition, the upload to ES takes a long time due to a
>> complete lack of parallelism.
>> Currently, the step that precedes the assembling of the JSON documents,
>> which goes into the final repartition call, is the querying of pairs of
>> object ids.  In a mapper the ids are resolved to documents by querying
>> HBase.  The initial pairs of ids are obtained via a query against the SQL
>> context, and the query result is repartitioned before going into the mapper
>> that resolves the ids into documents.
>> It's not clear to me why the final repartition preceding the upload to ES
>> is required.  I would like to omit it, since it is so expensive and
>> involves so much network IO, but have not found a way to do this yet.  If I
>> omit the repartition, the job takes much longer.
>> Does anyone know what might be going on here, and what I might be able to
>> do to get rid of the last `repartition` call before the upload to ES?
>> Eric
