spark-dev mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Structured Streaming partition logic with respect to storage and fileformat
Date Tue, 21 Jun 2016 11:06:34 GMT
It is based on the underlying Hadoop FileFormat, which does the split mostly based on block size. You can change this, though.
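For example, for the file-based sources in Spark 2.x the split size can be tuned through a couple of SQL options rather than only the HDFS block size. A minimal sketch (the spark.sql.files.* settings exist since Spark 2.0; the 64 MB / 4 MB values are illustrative, not recommendations):

    // Cap the bytes packed into a single input partition (default 128 MB).
    sqlContext.setConf("spark.sql.files.maxPartitionBytes", (64 * 1024 * 1024).toString)
    // Estimated cost, in bytes, of opening one more file; drives how many
    // small files get coalesced into the same partition.
    sqlContext.setConf("spark.sql.files.openCostInBytes", (4 * 1024 * 1024).toString)

Lowering maxPartitionBytes gives you more, smaller partitions per micro-batch; raising openCostInBytes packs fewer small files into each partition.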

> On 21 Jun 2016, at 12:19, Sachin Aggarwal <different.sachin@gmail.com> wrote:
> 
> 
> When we use readStream to read data as a stream, how does Spark decide the number of RDDs, and the partitions within each RDD, with respect to storage and file format?
> 
> val dsJson = sqlContext.readStream.json("/Users/sachin/testSpark/inputJson")
> 
> val dsCsv = sqlContext.readStream.option("header","true").csv("/Users/sachin/testSpark/inputCsv")
> val ds = sqlContext.readStream.text("/Users/sachin/testSpark/inputText")
> val dsText = ds.as[String].map(x =>(x.split(" ")(0),x.split(" ")(1))).toDF("name","age")
> 
> val dsParquet = sqlContext.readStream.parquet("/Users/sachin/testSpark/inputParquet")
> 
> 
> -- 
> 
> Thanks & Regards
> 
> Sachin Aggarwal
> 7760502772
