spark-user mailing list archives

From Ajay Srivastava
Subject Creating RDD from only few columns of a Parquet file
Date Tue, 13 Jan 2015 06:20:16 GMT
Hi,

I am trying to read a Parquet file using:

    val parquetFile = sqlContext.parquetFile("people.parquet")

There is no way to specify that I am interested in reading only some of the columns from disk.
For example, if the Parquet file has 10 columns, I want to read only 3 of them from disk.

We ran an experiment:
Table1 - Parquet file containing 10 columns
Table2 - Parquet file containing only the 3 columns that were used in the query

The query times on Table1 and Table2 differ hugely. The query on Table1 takes more than
twice as long as the same query on Table2, which makes me think that Spark is reading all
the columns from disk for Table1 even though it needs only 3 of them.
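
For reference, a comparison like the one described above could be set up roughly as follows. This is only a sketch against the Spark 1.2-era Scala API (the same sqlContext.parquetFile call as above). The paths table1.parquet and table2.parquet, the column names c1, c2, c3, the local master setting, and the timed helper are hypothetical placeholders chosen for illustration; they are not taken from the actual experiment.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetColumnTiming {
  // Hypothetical helper: runs a block and returns elapsed wall-clock milliseconds.
  def timed(body: => Unit): Long = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1000000L
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, used here only to keep the sketch self-contained.
    val conf = new SparkConf().setAppName("parquet-column-timing").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Hypothetical paths: table1 has 10 columns, table2 has only the 3 queried columns.
    sqlContext.parquetFile("table1.parquet").registerTempTable("table1")
    sqlContext.parquetFile("table2.parquet").registerTempTable("table2")

    // The same 3-column query against both tables (c1, c2, c3 are placeholder names).
    // foreach forces every selected row to be materialized without collecting to the driver.
    val ms1 = timed(sqlContext.sql("SELECT c1, c2, c3 FROM table1").foreach(_ => ()))
    val ms2 = timed(sqlContext.sql("SELECT c1, c2, c3 FROM table2").foreach(_ => ()))

    println(s"table1 (10 columns on disk): $ms1 ms")
    println(s"table2 (3 columns on disk):  $ms2 ms")

    sc.stop()
  }
}
```

Running each query several times and ignoring the first run would reduce the effect of JVM and file-system cache warm-up on the measured times.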

How can I make sure that it reads only 3 of the 10 columns from disk?

