spark-dev mailing list archives

From Priyanka Gomatam <Priyanka.Goma...@microsoft.com.INVALID>
Subject Is there a way to read a Parquet File as ColumnarBatch?
Date Mon, 22 Apr 2019 15:22:19 GMT
Hi,
I am new to Spark and have been playing around with the Parquet reader code. I have two questions:

  1.  I traced the code that starts in the DataSourceScanExec class, moves on to the ParquetFileFormat
class, and creates a VectorizedParquetRecordReader. I ran spark.read.parquet(...) and
debugged through the code, but it never hit the breakpoints I placed in these
classes. Perhaps I am doing something wrong, but is there some versioning of the Parquet
readers that I am missing? How do I make the code take the DataSourceScanExec ->
... -> ParquetReader ... -> VectorizedParquetRecordRead ... route?
  2.  If I do manage to make it take the above path, I see there is a point at which the data
is filled into ColumnarBatch objects. Has anyone tried returning all the data as ColumnarBatch?
Is there any reading material you can point me to?
Thanks in advance, this will be super helpful for me!
