spark-user mailing list archives

From Tanveer Ahmad - EWI <T.Ah...@tudelft.nl>
Subject Arrow RecordBatches to Spark Dataframe
Date Thu, 25 Jun 2020 03:35:01 GMT
Hi all,

I have a small question, if you people can help me.

In this code snippet<https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5>,
Jether converts an RDD of pd.DataFrame objects (prdd) to Arrow RecordBatches (slices)
and finally to a Spark DataFrame. Similarly, the code in Scala<https://github.com/apache/spark/blob/65a189c7a1ddceb8ab482ccc60af5350b8da5ea5/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L192-L206>
converts a JavaRDD to a Spark DataFrame.

If I already have an RDD of pa.RecordBatch objects (an "ardd" of Arrow RecordBatches), how can I
convert it to a Spark DataFrame directly in PySpark, without going through Pandas? Thanks.


Regards,
Tanveer Ahmad

