spark-user mailing list archives

From Rishi Shah <rishishah.s...@gmail.com>
Subject [PySpark 2.3+] Reading parquet entire path vs a set of file paths
Date Mon, 01 Jun 2020 13:33:20 GMT
Hi All,

I use the following to read a set of parquet file paths when the files are
scattered across a large number of partitions.

paths = ['p1', 'p2', ... 'p10000']
df = spark.read.parquet(*paths)

The above method feels like it reads those files sequentially rather than
parallelizing the read operation. Is that correct?
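
If the bottleneck is the driver-side listing of the 10000 root paths (my
assumption, not verified), the parallel partition discovery settings might be
relevant. A minimal sketch of what I mean; the app name is hypothetical and
the values shown are the documented defaults:

from pyspark.sql import SparkSession

# Sketch: assume the slow part is the driver listing many root paths,
# not the scan itself. These two Spark SQL settings control when listing
# becomes a distributed job and how parallel that job is.
spark = (
    SparkSession.builder
    .appName("parquet-multi-path-read")  # hypothetical name
    # Switch to a distributed listing job once the number of paths to
    # list exceeds this threshold (default 32).
    .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
    # Cap the parallelism of that distributed listing job (default 10000).
    .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "10000")
    .getOrCreate()
)

paths = ['p1', 'p2']  # stand-in for the full list of paths
df = spark.read.parquet(*paths)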

If I put all these files under a single path and read as below, it works
faster:

path = 'consolidated_path'
df = spark.read.parquet(path)
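
For comparison, a glob over a common base directory also covers many
sub-directories in a single read. A sketch, assuming a hypothetical
partition-style layout like consolidated_path/date=.../:

# Hadoop glob patterns are resolved in one listing pass over the base path;
# 'consolidated_path' and the 'date=' layout are assumptions for illustration.
df = spark.read.parquet('consolidated_path/date=*')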

Is my observation correct? If so, is there a way to optimize reads from
multiple/specific paths?

-- 
Regards,

Rishi Shah
