If the file is not splittable (though can I assume a log file is splittable?), can you advise on how Spark handles such a case? If Spark can't handle it, what is the widely used practice?

On 3 Sep 2016 7:29 pm, "Raghavendra Pandey" <raghavendra.pandey@gmail.com> wrote:
If your file format is splittable (say TSV, CSV, etc.), it will be split into partitions and distributed across the executors.
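
To illustrate why line-delimited formats such as TSV and CSV are splittable, here is a minimal stand-alone sketch in plain Python. It is not Spark code. It only mimics the byte-range convention used by Hadoop's line-oriented text input, which sc.textFile relies on: each split skips its first (possibly partial) line unless it starts at byte 0, and keeps reading any line that begins at or before its end offset. The names read_split and split_ranges are hypothetical helpers made up for this illustration.

```python
import io
from typing import List, Tuple


def split_ranges(size: int, chunk: int) -> List[Tuple[int, int]]:
    """Cut a file of `size` bytes into consecutive [start, end) byte ranges."""
    return [(s, min(s + chunk, size)) for s in range(0, size, chunk)]


def read_split(data: bytes, start: int, end: int) -> List[bytes]:
    """Return the lines owned by the byte range [start, end)."""
    stream = io.BytesIO(data)
    stream.seek(start)
    if start != 0:
        # Skip the first (possibly partial) line; the previous split reads it.
        stream.readline()
    lines = []
    # A line belongs to this split if it begins at or before `end`.
    while stream.tell() <= end:
        line = stream.readline()
        if not line:
            break  # end of file
        lines.append(line.rstrip(b"\n"))
    return lines


if __name__ == "__main__":
    data = b"a,1\nb,2\nc,3\nd,4\n"
    # Each split can be read independently, e.g. by a different executor.
    for start, end in split_ranges(len(data), 6):
        print((start, end), read_split(data, start, end))
```

Because every record ends at a newline, any reader can find the next record boundary from an arbitrary byte offset, so each line is read exactly once no matter where the split boundaries fall. A whole-file gzip stream or a single large XML document has no such boundary that can be found from the middle of the file, which is why those are generally not splittable this way.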

On Sat, Sep 3, 2016 at 3:38 PM, Somasundaram Sekar <somasundar.sekar@tigeranalytics.com> wrote:

Hi All,


I would like to gain some understanding of the questions listed below:


1. When processing a large file with Apache Spark using, say, sc.textFile("somefile.xml"), does Spark split it for parallel processing across executors, or is it processed as a single chunk in a single executor?

2. When using DataFrames with the implicit XMLContext from Databricks, is there any built-in optimization for processing such large files?


Please help!!!





Somasundaram S