spark-user mailing list archives

From "Kelvin Qin" <qykx2...@126.com>
Subject Re:Question about how parquet files are read and processed
Date Thu, 16 Apr 2020 03:29:30 GMT
Hi,
The advantage of Parquet is that it is a columnar storage format: it only scans the columns that are required.
The fewer columns you select, the less memory is required.
Developers do not need to care about the details of how the data is loaded; the readers are well designed, and column pruning is transparent to users.
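
For illustration only, here is a minimal Spark (Scala) sketch of what that column pruning looks like in practice. The path and column names are placeholders, not taken from the original question.

import org.apache.spark.sql.SparkSession

// Placeholder session; in spark-shell a SparkSession already exists as `spark`.
val spark = SparkSession.builder().appName("parquet-column-pruning").getOrCreate()

// Reading is lazy; no data is loaded yet.
val wide = spark.read.parquet("/data/wide_table.parquet")

// Select only the fields needed for the derived table; because Parquet is
// columnar, only these columns are actually scanned from the file.
val derived = wide.select("id", "field_a", "field_b")

// The ReadSchema entry in the physical plan shows the scan is pruned
// to the selected columns.
derived.explain()

// Persist the much smaller derived table for the smaller cluster to load.
derived.write.parquet("/data/derived_table.parquet")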

At 2020-04-16 11:00:32, "Yeikel" <email@yeikel.com> wrote:
>I have a Parquet file with millions of records and hundreds of fields that I
>will be extracting from a cluster with more resources. I need to take that
>data, derive a set of tables from only some of the fields, and import them
>using a smaller cluster.
>
>The smaller cluster cannot load the entire Parquet file in memory, but it
>can load the derived tables.
>
>If I am reading a Parquet file and only select a few fields, how much
>computing power do I need compared to selecting all the columns? Is it
>different? Do I need more or less computing power depending on the number of
>columns I select, or does it depend more on the raw source itself and the
>number of columns it contains?
>
>One suggestion I received from a colleague was to derive the tables using the
>larger cluster and just import them into the smaller cluster, but I was
>wondering if that's really necessary considering that after the import, I
>won't be using the dumps anymore.
>
>I hope my question makes sense. 
>
>Thanks for your help!
>
>--
>Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
>---------------------------------------------------------------------
>To unsubscribe e-mail: user-unsubscribe@spark.apache.org