spark-user mailing list archives

From Somasundaram Sekar <somasundar.se...@tigeranalytics.com>
Subject Re: Importing large file with SparkContext.textFile
Date Sat, 03 Sep 2016 17:44:50 GMT
Can I assume that there is no point in running the decompression as a Spark
job, since it will run on only one executor? In that case we would decompress
the file outside Spark, put it in some Spark-supported storage, and then start the application.
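
For what it is worth, a minimal sketch of that workflow (the HDFS path and the partition count are purely illustrative assumptions, not from this thread):

// file decompressed outside Spark and copied to a Spark-supported store, here HDFS
val lines = sc.textFile("hdfs:///data/somefile.xml", 48)  // 48 is an illustrative minPartitions hint
lines.getNumPartitions  // an uncompressed file can now be split across executors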

/Soma

On 3 Sep 2016 10:56 pm, "Mich Talebzadeh" <mich.talebzadeh@gmail.com> wrote:

> Yes, but they can be uncompressed first, right? In that case they can be split.
> If left compressed, it depends on the size of the file and the block size.
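>
> As a rough illustration (the path and file size here are assumptions, not from this thread), once a file sits uncompressed on HDFS the number of partitions roughly follows file size / block size:
>
> scala> val t = sc.textFile("hdfs:///data/somefile.txt")
> scala> t.getNumPartitions  // roughly fileSize / block size, e.g. a 1 GB file with 128 MB blocks gives about 8 partitions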
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 3 September 2016 at 18:15, Somasundaram Sekar <somasundar.sekar@tigeranalytics.com> wrote:
>
>> What would be the best practice for handling a large compressed file that
>> cannot be split?
>>
>> Regards,
>> Somasundaram S
>>
>> On 3 Sep 2016 8:40 pm, "Mich Talebzadeh" <mich.talebzadeh@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Your point:
>>>
>>> *When processing a large file with Apache Spark, with, say,
>>> sc.textFile("somefile.xml"), does it split it for parallel processing
>>> across executors, or will it be processed as a single chunk in a single
>>> executor?*
>>>
>>> OK, we should remember that Spark is all about clustering, meaning spreading
>>> the workload among the cluster members/nodes.
>>>
>>> Let us look at this with Spark running in Standalone mode (resources
>>> managed by Spark itself).
>>>
>>> You start your master (on the master node) and workers (slaves) on each
>>> node.
>>>
>>> Then you specify your executors and the resources for each executor (memory
>>> and cores).
>>>
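>>> For example, in standalone mode these can be set when the SparkContext is created; the master URL and the values below are illustrative assumptions, not my actual settings:
>>>
>>> import org.apache.spark.{SparkConf, SparkContext}
>>> val conf = new SparkConf()
>>>   .setAppName("textFileDemo")               // hypothetical application name
>>>   .setMaster("spark://masterhost:7077")     // hypothetical standalone master URL
>>>   .set("spark.executor.memory", "4g")       // memory per executor
>>>   .set("spark.executor.cores", "2")         // cores per executor
>>> val sc = new SparkContext(conf)
>>>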
>>> In my setup I have 6 executors active. First, I read a text file in.
>>>
>>> scala> val textFile = sc.textFile("/tmp/ASE15UpgradeGuide.doc")
>>> textFile: org.apache.spark.rdd.RDD[String] = /tmp/ASE15UpgradeGuide.doc
>>> MapPartitionsRDD[3] at textFile at <console>:24
>>>
>>>
>>> As you see, Spark creates an RDD for it. An RDD is basically a framework for
>>> processing unstructured and structured data; that is all. No data has been read
>>> yet. So what happens among the executors?
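>>>
>>> A quick way to see this laziness from the shell (an illustrative aside, not part of the original session):
>>>
>>> scala> textFile.getNumPartitions  // how many splits Spark plans to use; still no data read
>>> // only an action (count, collect, etc.) makes the executors actually read the file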
>>>
>>>
>>> Let us have a look at the Spark UI Executors page
>>>
>>> [image: Spark UI Executors page: 5 executors plus the driver, storage memory used only by the driver]
>>>
>>>
>>> As you can see, there are 5 executors plus the driver. The only storage memory
>>> consumed is by the driver (70.7 KB).
>>>
>>> Let me collect that text file into Spark
>>>
>>> scala> textFile.collect
>>>
>>> Now you can see the spread of storage memory and input among the
>>> executors
>>>
>>>
>>> [image: Spark UI Executors page after the collect: storage memory and input spread among the executors]
>>> Note that in this case Executor ID = 4 and Executor ID = 1 show storage memory
>>> consumed plus data input, so Spark spreads the work among executors. In the case
>>> above Spark used one executor on each host (there are two hosts, with 4 executors
>>> on host ...217 and one executor on host ...216). So in summary the executors do
>>> the work; in this run the Spark driver allocated work to two executors.
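>>>
>>> As a side note (not part of the original session), caching before the action makes the per-executor storage even easier to see on that page:
>>>
>>> scala> textFile.cache   // mark the RDD to be kept in executor storage memory
>>> scala> textFile.count   // the action populates the cached blocks, which then show up per executor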
>>>
>>> With regard to the second question,
>>>
>>> *When using DataFrames with the implicit XMLContext from Databricks, is
>>> there any optimization prebuilt for such large file processing?*
>>>
>>> I believe it should. However, if the CSV file, or any file, is compressed with
>>> a non-splittable codec (such as gzip or Snappy), it cannot be split, so it will
>>> end up in one executor only.
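>>>
>>> For reference, a minimal sketch of reading XML through the Databricks spark-xml package (the rowTag value and the file name are assumptions, and the package must be on the classpath):
>>>
>>> val df = sqlContext.read
>>>   .format("com.databricks.spark.xml")
>>>   .option("rowTag", "record")   // hypothetical repeating element in somefile.xml
>>>   .load("somefile.xml")
>>> df.printSchema()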
>>>
>>> This is my understanding. Assuming you allocate n cores to each executor, you
>>> are enabling each executor to run parallel tasks on subsets of the data using
>>> the same code. If Spark cannot split the data into blocks, then the code will
>>> run serially in one executor only, which is not what you want.
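>>>
>>> One common workaround (not from this thread) is to read the non-splittable file and repartition it immediately, so that downstream transformations still run in parallel; the path and partition count are illustrative:
>>>
>>> scala> val raw = sc.textFile("/tmp/somefile.xml.gz")  // non-splittable, so read by a single task
>>> scala> val spread = raw.repartition(24)               // shuffle into 24 partitions for parallel downstream work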
>>>
>>> HTH
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 3 September 2016 at 11:08, Somasundaram Sekar <somasundar.sekar@tigeranalytics.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>>
>>>>
>>>> I would like to gain some understanding of the questions listed below:
>>>>
>>>>
>>>>
>>>> 1. When processing a large file with Apache Spark, with, say,
>>>> sc.textFile("somefile.xml"), does it split it for parallel processing
>>>> across executors, or will it be processed as a single chunk in a single
>>>> executor?
>>>>
>>>> 2. When using DataFrames with the implicit XMLContext from Databricks,
>>>> is there any optimization prebuilt for such large file processing?
>>>>
>>>>
>>>>
>>>> Please help!!!
>>>>
>>>>
>>>>
>>>> http://stackoverflow.com/questions/39305310/does-spark-process-large-file-in-the-single-worker
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Somasundaram S
>>>>
>>>
>>>
>
