Your file size is too small, and this has a significant impact on the NameNode. Consider using HBase, or maybe HAWQ, to store small writes.

On 10 Oct 2016, at 16:25, Kevin Mellott <kevin.r.mellott@gmail.com> wrote:

Whilst working on this application, I found a setting that drastically improved the performance of my Spark Streaming job. I'm sharing the details in the hope that it may help somebody in a similar situation.

As my program ingested information into HDFS (as parquet files), I noticed that the time to process each batch was significantly greater than I anticipated. Whether I was writing a single parquet file (around 8KB) or around 10-15 files (8KB each), that step of the processing was taking around 30 seconds. Once I set the configuration below, the same operation dropped from around 30 seconds to around 1 second.

// ssc = instance of StreamingContext
ssc.sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
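
For context, here is roughly where that call sits in the job; this is a minimal sketch, and the app name and batch interval are just placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("LogIngest")   // illustrative app name
val ssc = new StreamingContext(sparkConf, Seconds(30))    // illustrative batch interval

// Stops Spark from writing and merging the _metadata / _common_metadata
// summary files under the parquet output directory on every batch.
ssc.sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")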

I've also verified that the parquet files being generated are usable by both Hive and Impala.

Hope that helps!
Kevin

On Thu, Oct 6, 2016 at 4:22 PM, Kevin Mellott <kevin.r.mellott@gmail.com> wrote:
I'm attempting to implement a Spark Streaming application that will consume application log messages from a message broker and store the information in HDFS. During the data ingestion, we apply a custom schema to the logs, partition by application name and log date, and then save the information as parquet files.
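
For reference, the write step looks roughly like the sketch below; it assumes each micro-batch is converted to a DataFrame inside foreachRDD, and the case class, column names, and output path are placeholders rather than our real ones:

import org.apache.spark.sql.SparkSession

// Illustrative schema for the log messages
case class LogEvent(application_name: String, log_date: String, message: String)

// logs is a DStream[LogEvent] coming off the message-broker receiver (not shown)
logs.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._

  // Apply the schema via the case class, then write one directory tree per app/date
  rdd.toDF()
    .write
    .mode("append")
    .partitionBy("application_name", "log_date")
    .parquet("hdfs:///data/application_logs")   // placeholder output path
}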

All of this works great, except that we end up with a large number of small parquet files. It's my understanding that Spark Streaming cannot control the number of files that get generated in each partition; can anybody confirm whether that is true?

Also, has anybody else run into a similar situation with data ingestion in Spark Streaming, and do you have any tips to share? Our end goal is to store the information in a way that makes it efficient to query with a tool like Hive or Impala.

Thanks,
Kevin