spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Importing large file with SparkContext.textFile
Date Sat, 03 Sep 2016 15:10:42 GMT
Hi,

Your point:

*When processing a large file with Apache Spark, with, say,
sc.textFile("somefile.xml"), does it split it for parallel processing
across executors or, will it be processed as a single chunk in a single
executor?*

OK, we should remember that Spark is all about clustering, meaning spreading
the workload among the cluster members/nodes.

Let us look at this with Spark running in Standalone mode (resources
managed by Spark itself).

You start your master (on the master node) and workers (slaves) on each
node.

Then you specify your executors and the resources for each executor (memory and
cores).
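
As a rough illustration (my own sketch, not part of the original thread; the
master URL and numbers are made up), these per-executor resources can be set
through configuration properties such as spark.executor.memory,
spark.executor.cores and spark.cores.max when the SparkContext is created:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("textFileDemo")
  .setMaster("spark://master-host:7077")  // hypothetical Standalone master URL
  .set("spark.executor.memory", "2g")     // memory per executor
  .set("spark.executor.cores", "2")       // cores per executor
  .set("spark.cores.max", "12")           // total cores this application may take from the cluster

val sc = new SparkContext(conf)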

In my setup I have 6 executors active. First I read a text file in.

scala> val textFile = sc.textFile("/tmp/ASE15UpgradeGuide.doc")
textFile: org.apache.spark.rdd.RDD[String] = /tmp/ASE15UpgradeGuide.doc
MapPartitionsRDD[3] at textFile at <console>:24


As you see, Spark creates an RDD for it. An RDD is basically an abstraction for
processing unstructured and structured data in parallel. That is all. Because
RDDs are evaluated lazily, no data has been read yet. So what happens among the
executors?
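
One way to see the planned parallelism without the UI (a small sketch of mine,
using the same file) is to ask the RDD how many partitions it has; sc.textFile
also accepts an optional minPartitions argument if you want more splits:

scala> textFile.partitions.length   // number of splits Spark will process in parallel

scala> val textFile8 = sc.textFile("/tmp/ASE15UpgradeGuide.doc", 8)   // request at least 8 partitions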


Let us have a look at the Spark UI Executors page

[image: screenshot of the Spark UI Executors page]


As you can see, there are 5 executors plus the driver. The only storage memory
consumed is by the driver (70.7 KB).

Let me collect that text file, which triggers the actual read:

scala> textFile.collect

Now you can see the spread of storage memory and input among the executors:


[image: screenshot of the Spark UI Executors page after the collect]
Note that in this case Executor ID 4 and Executor ID 1 have storage memory
consumed plus data input. So Spark spreads the work among the executors. In the
case above Spark used one executor on each of the two hosts (there are four
executors on host ...217 and one executor on host ...216). So in summary the
executors share the work; in this run the Spark driver allocated the work to
two executors.
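
If you prefer to check this programmatically rather than through the UI, here
is a small sketch of my own that tags each partition with the host that
processed it:

scala> textFile.mapPartitionsWithIndex { (i, iter) =>
     |   Iterator(s"partition $i processed on ${java.net.InetAddress.getLocalHost.getHostName}, ${iter.size} lines")
     | }.collect().foreach(println)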

With regard to the second question,

*When using dataframes, with implicit XMLContext from Databricks is there
any optimization prebuilt for such large file processing?*

I believe it should. However, if the CSV file (or any other file) is compressed
with a non-splittable codec (gzip, Snappy, etc.), it cannot be split, so it will
end up in one executor only.
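
I assume the question refers to the Databricks spark-xml package. If so, a
minimal sketch (my own example; the path and rowTag are made up) looks like
this, and the read is split across partitions as long as the underlying file is
splittable:

scala> val df = sqlContext.read
     |   .format("com.databricks.spark.xml")
     |   .option("rowTag", "record")        // hypothetical element that delimits each row
     |   .load("hdfs:///tmp/somefile.xml")  // made-up path for illustration

scala> df.rdd.partitions.length             // how many partitions the read produced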

This is my understanding: assuming you allocate n cores to each executor, you
enable each executor to run up to n parallel tasks on subsets of the data using
the same code. If Spark cannot split the data into blocks, the job will run
serially on one executor only, which is not what you want.
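
As a quick illustration (again my own sketch, with a made-up gzipped file), you
can confirm this effect and then redistribute the data after reading it:

scala> val gz = sc.textFile("/tmp/somefile.csv.gz")   // gzip is not splittable
scala> gz.partitions.length                           // returns 1: a single task reads the whole file

scala> val spread = gz.repartition(8)                 // shuffle into 8 partitions after the read
scala> spread.partitions.length                       // later stages can now run in parallel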

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 3 September 2016 at 11:08, Somasundaram Sekar <somasundar.sekar@tigeranalytics.com> wrote:

> Hi All,
>
>
>
> Would like to gain some understanding on the questions listed below,
>
>
>
> 1.       When processing a large file with Apache Spark, with, say,
> sc.textFile("somefile.xml"), does it split it for parallel processing
> across executors or, will it be processed as a single chunk in a single
> executor?
>
> 2.       When using dataframes, with implicit XMLContext from Databricks
> is there any optimization prebuilt for such large file processing?
>
>
>
> Please help!!!
>
>
>
> http://stackoverflow.com/questions/39305310/does-spark-process-large-file-in-the-single-worker
>
>
>
> Regards,
>
> Somasundaram S
>
