spark-user mailing list archives

From Mich Talebzadeh <>
Subject Re: Spark Streaming Advice
Date Mon, 10 Oct 2016 16:38:13 GMT
Hi Kevin,

What is the streaming interval (batch interval) above?

I do analytics on streaming trade data, but after manipulating the individual
messages I store the selected ones in HBase. Very fast.


Dr Mich Talebzadeh

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

On 10 October 2016 at 15:25, Kevin Mellott wrote:

> Whilst working on this application, I found a setting that drastically
> improved the performance of my particular Spark Streaming application. I'm
> sharing the details in hopes that it may help somebody in a similar
> situation.
> As my program ingested information into HDFS (as parquet files), I noticed
> that the time to process each batch was significantly greater than I
> anticipated. Whether I was writing a single parquet file (around 8KB) or
> around 10-15 files (8KB each), that step of the processing was taking
> around 30 seconds. Once I set the configuration below, this operation
> reduced from 30 seconds to around 1 second.
> // ssc = instance of StreamingContext
> ssc.sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
> I've also verified that the parquet files being generated are usable by
> both Hive and Impala.
> Hope that helps!
> Kevin
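
For reference, the same setting can probably also be applied when the context is built, since Spark copies properties prefixed with "spark.hadoop." into the Hadoop configuration. This is only a sketch: the application name and the 30 second batch interval are made-up placeholders, not values from this thread.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "log-ingest" and Seconds(30) are hypothetical placeholders.
val conf = new SparkConf()
  .setAppName("log-ingest")
  // Forwarded to the Hadoop configuration as parquet.enable.summary-metadata=false
  .set("spark.hadoop.parquet.enable.summary-metadata", "false")

val ssc = new StreamingContext(conf, Seconds(30))
```

The master URL is assumed to come from spark-submit. Setting the property on ssc.sparkContext.hadoopConfiguration, as in the quoted message, should have the same effect.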
> On Thu, Oct 6, 2016 at 4:22 PM, Kevin Mellott <>
> wrote:
>> I'm attempting to implement a Spark Streaming application that will
>> consume application log messages from a message broker and store the
>> information in HDFS. During the data ingestion, we apply a custom schema to
>> the logs, partition by application name and log date, and then save the
>> information as parquet files.
>> All of this works great, except we end up having a large number of
>> parquet files created. It's my understanding that Spark Streaming is unable
>> to control the number of files that get generated in each partition; can
>> anybody confirm that is true?
>> Also, has anybody else run into a similar situation regarding data
>> ingestion with Spark Streaming and do you have any tips to share? Our end
>> goal is to store the information in a way that makes it efficient to query,
>> using a tool like Hive or Impala.
>> Thanks,
>> Kevin
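
On the number of files per partition: one common way to reduce it is to repartition each batch by the same columns used in partitionBy before writing, so that all rows for a given output directory are written by a single task. A minimal sketch, assuming the batch is already a DataFrame; the column names "application_name" and "log_date" and the LogWriter object are hypothetical, chosen to match the description above.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.col

object LogWriter {
  // Column names are placeholders for the "application name" and "log date"
  // partition keys described in the thread.
  def writeBatch(logs: DataFrame, outputPath: String): Unit = {
    logs
      // Shuffle so rows sharing the same partition key land in the same task.
      .repartition(col("application_name"), col("log_date"))
      .write
      .mode(SaveMode.Append)
      // One directory per (application_name, log_date) combination.
      .partitionBy("application_name", "log_date")
      .parquet(outputPath)
  }
}
```

In a streaming job this would be called once per batch, for example from foreachRDD after converting the RDD to a DataFrame. Each batch still appends at least one new file to every directory it touches, so small files continue to accumulate over time and a periodic compaction job may still be needed. The extra shuffle also has a cost.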
