spark-user mailing list archives

From Somasundaram Sekar <somasundar.se...@tigeranalytics.com>
Subject Re: Importing large file with SparkContext.textFile
Date Sat, 03 Sep 2016 17:15:33 GMT
What would be the best practice for handling a large compressed file that
cannot be split?

Regards,
Somasundaram S

On 3 Sep 2016 8:40 pm, "Mich Talebzadeh" <mich.talebzadeh@gmail.com> wrote:

> Hi,
>
> Your point:
>
> *When processing a large file with Apache Spark, with, say,
> sc.textFile("somefile.xml"), does it split it for parallel processing
> across executors, or will it be processed as a single chunk in a single
> executor?*
>
> OK, we should remember that Spark is all about clustering, meaning spreading
> the workload among the cluster members/nodes.
>
> Let us look at this with Spark running in Standalone mode (resources
> managed by Spark itself).
>
> You start your master (on the master node) and workers (slaves) on each
> node.
>
> Then you specify your executors and the resources for each executor (memory
> and cores).
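>
> For illustration only, a minimal sketch of how a standalone application would
> set those resources (in spark-shell the SparkContext sc is already created for
> you; the property names are standard Spark configuration, but the application
> name and values below are placeholders, not the settings used here):
>
> import org.apache.spark.{SparkConf, SparkContext}
>
> val conf = new SparkConf()
>   .setAppName("textFileExample")       // hypothetical application name
>   .set("spark.executor.memory", "4g")  // memory per executor (example value)
>   .set("spark.executor.cores", "2")    // cores per executor (example value)
> val sc = new SparkContext(conf)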
>
> In my setup I have 6 executors active. First I read a text file in.
>
> scala> val textFile = sc.textFile("/tmp/ASE15UpgradeGuide.doc")
> textFile: org.apache.spark.rdd.RDD[String] = /tmp/ASE15UpgradeGuide.doc
> MapPartitionsRDD[3] at textFile at <console>:24
>
>
> As you see, Spark creates an RDD for it. An RDD is basically a framework to
> process unstructured and structured data. That is all. There is no data
> there yet. So what happens among the executors?
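>
> As a quick check (a sketch; the count shown is only an example and depends on
> file size and defaults), you can ask the RDD how many partitions it has,
> still without reading the file contents:
>
> scala> textFile.partitions.length
> res0: Int = 2    // example output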
>
>
> Let us have a look at the Spark UI Executors page:
>
> [image: Inline images 2]
>
>
> As you can see there are 5 executors plus the driver. The only storage
> memory consumed is by the driver (70.7 KB).
>
> Let me collect that text file into Spark:
>
> scala> textFile.collect
>
> Now you can see the spread of storage memory and input among the executors:
>
>
> [image: Inline images 1]
> Note that in this case Executor ID = 4 and Executor ID = 1 have storage
> memory consumed plus data inputs, so Spark spreads the work among the
> executors. In the case above Spark used one executor on each of the two
> hosts (the cluster has 4 executors on host ...217 and one executor on host
> ...216). So, in summary, the work is spread across executors; in this run
> the Spark driver allocated work to two of them.
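>
> As a side note (a hedged sketch, not what was run above): an action like
> count also makes each executor read its own partitions, without shipping all
> the lines back to the driver the way collect does:
>
> scala> textFile.count
> res1: Long = 1234    // example output; the actual number depends on the file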
>
> With regard to the second question,
>
> *When using dataframes, with the implicit XMLContext from Databricks, is
> there any optimization prebuilt for such large file processing?*
>
> I believe it should. However, if the CSV file, or any other file, is
> compressed with a non-splittable codec (e.g. gzip or Snappy), it cannot be
> split, so it will end up in one executor only.
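>
> As a sketch of what that read typically looks like with the Databricks
> spark-xml package (the rowTag value and path are placeholders, and the
> package has to be on the classpath, e.g. added with --packages when starting
> the shell):
>
> val df = sqlContext.read
>   .format("com.databricks.spark.xml")
>   .option("rowTag", "record")   // XML element that maps to one row (placeholder)
>   .load("/tmp/somefile.xml")    // placeholder path
>
> df.rdd.partitions.length        // how many partitions the read produced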
>
> This is my understanding. Assuming you allocate n cores to each executor,
> you are enabling each executor to run parallel tasks on subsets of the data
> using the same code. If Spark cannot split the data (the blocks), then the
> code will run serially in one executor only, which is not what you want.
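>
> One common workaround for a non-splittable compressed file (a sketch; whether
> it pays off depends on how heavy the downstream processing is, and the file
> name and partition count here are just examples) is to accept the single-task
> read and then repartition, so the later stages run in parallel:
>
> scala> val lines = sc.textFile("/tmp/large.csv.gz").repartition(8)
> scala> lines.count   // the gzip read runs as one task; work after the shuffle uses 8 partitions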
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 3 September 2016 at 11:08, Somasundaram Sekar <somasundar.sekar@
> tigeranalytics.com> wrote:
>
>> Hi All,
>>
>>
>>
>> Would like to gain some understanding on the questions listed below,
>>
>>
>>
>> 1. When processing a large file with Apache Spark, with, say,
>> sc.textFile("somefile.xml"), does it split it for parallel processing
>> across executors, or will it be processed as a single chunk in a single
>> executor?
>>
>> 2. When using dataframes, with the implicit XMLContext from Databricks,
>> is there any optimization prebuilt for such large file processing?
>>
>>
>>
>> Please help!!!
>>
>>
>>
>> http://stackoverflow.com/questions/39305310/does-spark-process-large-file-in-the-single-worker
>>
>>
>>
>> Regards,
>>
>> Somasundaram S
>>
>
>
