spark-dev mailing list archives

From laserson <...@git.apache.org>
Subject [GitHub] incubator-spark pull request: Added parquetFileAsJSON to read Parq...
Date Tue, 11 Feb 2014 01:33:49 GMT
Github user laserson commented on the pull request:

    https://github.com/apache/incubator-spark/pull/576#issuecomment-34718389
  
    No, this actually constructs Avro `GenericRecord` objects in memory. The problem is
that if you want access to the Parquet data through PySpark, there is no obvious, general
way to convert from the Java in-memory representation (which can be Thrift or Avro) to a
Python-friendly object. In principle, you could serialize the records as Thrift or Avro and
have the Python workers read that byte stream. However, since PySpark currently serializes
its data through text, you might as well use a text representation of the Thrift/Avro
records, which is JSON.
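
    For concreteness, here is a minimal sketch of the Avro-to-JSON step, assuming Avro's
built-in JSON encoder; the `RecordToJsonSketch` object and `recordToJson` helper are
hypothetical names for illustration, not code from this PR:

```scala
import java.io.ByteArrayOutputStream

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

object RecordToJsonSketch {
  // Render one in-memory GenericRecord as a JSON string using Avro's
  // built-in JSON encoder, so the bytes handed to the Python workers
  // are plain text rather than Thrift/Avro binary.
  def recordToJson(record: GenericRecord): String = {
    val schema = record.getSchema
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().jsonEncoder(schema, out)
    new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
    encoder.flush()
    out.toString("UTF-8")
  }

  def main(args: Array[String]): Unit = {
    // Toy record to show the round trip; in the PR context the records
    // would come out of a Parquet input format instead.
    val schema = new Schema.Parser().parse(
      """{"type":"record","name":"User","fields":[
        |  {"name":"name","type":"string"},
        |  {"name":"age","type":"int"}]}""".stripMargin)
    val rec = new GenericData.Record(schema)
    rec.put("name", "ada")
    rec.put("age", 36)
    println(recordToJson(rec)) // {"name":"ada","age":36}
  }
}
```

    On an RDD of Parquet-sourced records this would amount to something like
`records.map { case (_, rec) => recordToJson(rec) }` before the JSON strings cross over
to the Python workers.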
    
    You're right that this function is not meant for fast OLAP-style processing; rather,
it gives PySpark users an easy way to access Parquet data.

