Your file size is too small; that has a significant impact on the NameNode. Use HBase, or perhaps HAWQ, to store small writes.
Whilst working on this application, I found a setting that drastically improved the performance of my particular Spark Streaming application. I'm sharing the details in the hope that it helps somebody in a similar situation.
As my program ingested data into HDFS (as parquet files), I noticed that each batch took far longer to process than I anticipated. Whether I was writing a single parquet file (around 8 KB) or 10-15 files (8 KB each), that step alone took around 30 seconds. Once I applied the configuration below, the same operation dropped from roughly 30 seconds to about 1 second.
// ssc = instance of SparkStreamingContext
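The configuration line itself didn't make it into the post above, only the comment. For illustration only, here is a sketch of how such a setting would be applied through the streaming context's underlying SparkContext. The specific key shown (disabling parquet summary-metadata files) is my assumption, not taken from the original post; it is a commonly reported fix for slow small-file parquet writes, since generating `_metadata`/`_common_metadata` forces a scan of the destination directory on every write.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("parquet-ingest")
// ssc = instance of StreamingContext (the "SparkStreamingContext" referred to above)
val ssc = new StreamingContext(conf, Seconds(10))

// ASSUMPTION: the original post omits the actual key; this is one setting
// frequently cited for this symptom. It disables the parquet summary-metadata
// files, so each small write no longer rescans the output directory.
ssc.sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
```

Any Hadoop configuration key can be set this way before the streaming job starts; the change applies to every executor writing output for that application.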
I've also verified that the parquet files being generated are usable by both Hive and Impala.
Hope that helps!