I am working on a pipeline with several stages, the last of which builds large JSON objects from information produced by the preceding stages. The JSON objects are then uploaded to Elasticsearch in bulk.
If I carry out a shuffle via a `repartition` call after the JSON documents have been created, the upload to ES is fast, but the shuffle itself takes many tens of minutes and is IO-bound.
If I omit the repartition, the upload to ES takes a long time due to a complete lack of parallelism.
Currently, the step that feeds the JSON-assembly stage (and hence the final `repartition` call) is a query for pairs of object ids. The initial id pairs are obtained via a query against the SQL context; that result is repartitioned and then passed into a mapper, which resolves each pair of ids into documents by querying HBase.
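To make the question concrete, here is a minimal sketch in plain Python (no Spark) of what I suspect is happening: the mapper that resolves ids and assembles JSON is a narrow transformation, so whatever partitioning (and skew) the id-pair query produces is carried straight through to the upload stage, and only a shuffle rebalances it. The function names `map_partitions` and `repartition` here are illustrative stand-ins, not the actual pipeline code.

```python
from itertools import cycle

def map_partitions(partitions, fn):
    """A narrow transformation: applies fn per element, keeping partition boundaries."""
    return [[fn(x) for x in part] for part in partitions]

def repartition(partitions, n):
    """A shuffle: redistributes all elements round-robin across n new partitions."""
    out = [[] for _ in range(n)]
    targets = cycle(range(n))
    for part in partitions:
        for x in part:
            out[next(targets)].append(x)
    return out

# Skewed output of the upstream id-pair query: almost everything in one partition.
id_pairs = [[(i, i + 1) for i in range(95)], [(100, 101)], [(200, 201)], []]

# "Resolve ids to documents" and "build JSON" are maps, so the skew is preserved.
docs = map_partitions(id_pairs, lambda pair: {"ids": pair})
print([len(p) for p in docs])      # [95, 1, 1, 0] -> one uploader does nearly all the work

balanced = repartition(docs, 4)
print([len(p) for p in balanced])  # [25, 24, 24, 24] -> upload can proceed in parallel
```

If this model matches reality, the slowness without the final `repartition` would be a symptom of skew inherited from the id-pair query rather than anything in the JSON-assembly step itself.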
It's not clear to me why this final repartition before the upload to ES is needed. I would like to drop it, since it is so expensive and involves so much network IO, but as noted above, omitting it makes the job take much longer and I have not found a way around that yet.
Does anyone know what might be going on here, and what I might be able to do to get rid of the last `repartition` call before the upload to ES?