spark-user mailing list archives

From "Kelvin Qin" <>
Subject Re: Question about how parquet files are read and processed
Date Thu, 16 Apr 2020 03:29:30 GMT
The advantage of Parquet is that it is a columnar storage format, so it scans only the columns you actually request.
The fewer columns you select, the less memory is required.
Developers do not need to care about the details of loading data; this behaviour is well designed and
transparent to users.
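As an illustrative sketch of this in PySpark (the path and field names below are hypothetical placeholders, not from the original thread), projecting only the needed columns before any action lets Spark push the projection down to the Parquet reader, so only those column chunks are scanned:

```python
from pyspark.sql import SparkSession

# Hypothetical session; in practice you would reuse your cluster's session.
spark = SparkSession.builder.appName("column-pruning-demo").getOrCreate()

# Hypothetical input path, for illustration only.
df = spark.read.parquet("/path/to/data.parquet")

# Select only the fields you need *before* any action: Spark pushes this
# projection down to the Parquet reader, so only those column chunks are
# read from disk, regardless of how many columns the file contains.
subset = df.select("field_a", "field_b")

# The ReadSchema entry in the physical plan confirms that only the
# selected columns are scanned.
subset.explain()
```

This is why the resource cost of the read is driven mainly by the columns you select, not by the total width of the file.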

At 2020-04-16 11:00:32, "Yeikel" <> wrote:
>I have a parquet file with millions of records and hundreds of fields that I
>will be extracting from a cluster with more resources. I need to take that
>data, derive a set of tables from only some of the fields, and import them
>using a smaller cluster.
>The smaller cluster cannot load the entire parquet file in memory, but it
>can load the derived tables.
>If I am reading a parquet file and I only select a few fields, how much
>computing power do I need compared to selecting all the columns? Is it different? Do
>I need more or less computing power depending on the number of columns I
>select, or does it depend more on the raw source itself and the number of
>columns it contains?
>One suggestion I received from a colleague was to derive the tables using the
>larger cluster and just import them in the smaller cluster, but I was
>wondering if that's really necessary considering that after the import, I
>won't be using the dumps anymore.
>I hope my question makes sense. 
>Thanks for your help!