spark-user mailing list archives

From Rishi Shah <rishishah.s...@gmail.com>
Subject Re: [PySpark 2.3+] Reading parquet entire path vs a set of file paths
Date Wed, 03 Jun 2020 18:15:21 GMT
Hi All,

Just following up on my message below to see if anyone has any suggestions.
I appreciate your help in advance.
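
For reference, here is a minimal sketch of one workaround that could be tried, assuming (unverified) that the slowdown comes from per-path handling of a very long path list rather than from the scan itself. The helper name collapse_to_parent_dirs and the example paths are hypothetical. It collapses individual file paths to their distinct parent directories so that far fewer paths are passed to spark.read.parquet. Note that this reads every parquet file in those directories, so it only matches the original read if the directories contain nothing else.

```python
import posixpath


def collapse_to_parent_dirs(paths):
    """Return the distinct parent directories of the given file paths, sorted.

    A path with no parent component is kept unchanged.
    """
    parents = {posixpath.dirname(p.rstrip("/")) or p for p in paths}
    return sorted(parents)


# Hypothetical example paths, not taken from the original message.
example_paths = [
    "data/day=1/part-0.parquet",
    "data/day=1/part-1.parquet",
    "data/day=2/part-0.parquet",
]
dirs = collapse_to_parent_dirs(example_paths)

# With an active SparkSession named `spark`, the read would then be:
# df = spark.read.parquet(*dirs)
```

Whether this actually helps would need to be measured, for example by timing both reads followed by the same action on the resulting DataFrame.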

Thanks,
Rishi

On Mon, Jun 1, 2020 at 9:33 AM Rishi Shah <rishishah.star@gmail.com> wrote:

> Hi All,
>
> I use the following to read a set of parquet file paths when the files are
> scattered across a very large number of partitions.
>
> paths = ['p1', 'p2', ... 'p10000']
> df = spark.read.parquet(*paths)
>
> The above method feels like it is reading those files sequentially and not
> really parallelizing the read operation. Is that correct?
>
> If I put all these files under a single path and read them as below, it
> works faster:
>
> path = 'consolidated_path'
> df = spark.read.parquet(path)
>
> Is my observation correct? If so, is there a way to optimize reads from
> multiple/specific paths?
>
> --
> Regards,
>
> Rishi Shah
>


-- 
Regards,

Rishi Shah
