spark-user mailing list archives

From Jörn Franke <>
Subject Re: Spark Streaming Advice
Date Mon, 10 Oct 2016 21:55:33 GMT
Your file size is too small; this has a significant impact on the NameNode. Use HBase or maybe
HAWQ to store small writes.
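
A minimal sketch of what storing such small writes in HBase could look like; the table name
app_logs, column family cf, and row-key layout here are hypothetical, not from this thread:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// Picks up hbase-site.xml from the classpath.
val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("app_logs"))

// One Put per record; HBase compacts these into its own storage files,
// so many small writes do not become many small HDFS files.
val put = new Put(Bytes.toBytes("appName|2016-10-10|eventId"))
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("message"), Bytes.toBytes("the raw log line"))
table.put(put)

table.close()
connection.close()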

> On 10 Oct 2016, at 16:25, Kevin Mellott <> wrote:
> Whilst working on this, I found a setting that drastically improved the performance of my
particular Spark Streaming application. I'm sharing the details in hopes that it may
help somebody in a similar situation.
> As my program ingested information into HDFS (as parquet files), I noticed that the time
to process each batch was significantly greater than I anticipated. Whether I was writing
a single parquet file (around 8KB) or around 10-15 files (8KB each), that step of the processing
took around 30 seconds. Once I set the configuration below, this operation dropped from
30 seconds to around 1 second.
> // ssc = instance of org.apache.spark.streaming.StreamingContext
> ssc.sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
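>
> For context, a sketch of the kind of per-batch parquet write this setting speeds up; the
DStream logStream, its case-class elements, and the output path are hypothetical:
>
> import org.apache.spark.sql.SparkSession
>
> logStream.foreachRDD { rdd =>
>   val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
>   import spark.implicits._
>   // With summary metadata disabled, the job commit no longer gathers
>   // footers from every parquet file on the driver.
>   rdd.toDF().write
>     .mode("append")
>     .partitionBy("appName", "logDate")
>     .parquet("hdfs:///data/app_logs")
> }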
> I've also verified that the parquet files being generated are usable by both Hive and Impala.
> Hope that helps!
> Kevin
>> On Thu, Oct 6, 2016 at 4:22 PM, Kevin Mellott <> wrote:
>> I'm attempting to implement a Spark Streaming application that will consume application
log messages from a message broker and store the information in HDFS. During the data ingestion,
we apply a custom schema to the logs, partition by application name and log date, and then
save the information as parquet files.
>> All of this works great, except that we end up with a large number of parquet files
being created. It's my understanding that Spark Streaming is unable to control the number of files
that get generated in each partition; can anybody confirm whether that is true?
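>>
>> A sketch of one common mitigation, coalescing each micro-batch so that every batch produces
at most one file per output partition; logStream and the path are hypothetical:
>>
>> import org.apache.spark.sql.SparkSession
>>
>> logStream.foreachRDD { rdd =>
>>   val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
>>   import spark.implicits._
>>   // coalesce(1) funnels the batch through a single task, trading write
>>   // parallelism for fewer output files.
>>   rdd.coalesce(1).toDF().write
>>     .mode("append")
>>     .partitionBy("appName", "logDate")
>>     .parquet("hdfs:///data/app_logs")
>> }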
>> Also, has anybody else run into a similar situation regarding data ingestion with
Spark Streaming and do you have any tips to share? Our end goal is to store the information
in a way that makes it efficient to query, using a tool like Hive or Impala.
>> Thanks,
>> Kevin
