spark-user mailing list archives

From Joel D <>
Subject Process Million Binary Files
Date Wed, 10 Oct 2018 21:56:04 GMT

I need to process millions of PDFs in HDFS using Spark. As a first step I’m
trying with about 40k files. I’m using the binaryFiles API, and I’m running
into a couple of issues:

1. It creates only 4 tasks, and I can’t seem to increase the parallelism.
2. It took 2276 seconds for the 40k files, which means millions of files
will take ages to complete. I also expect it to fail at that scale with a
timeout or a GC overhead exception.

val files = sparkSession.sparkContext.binaryFiles(filePath, 200).cache()

val fileContentRdd = files.map(file => myFunc(file))

Do you have any guidance on how I can process millions of files using the
binaryFiles API?

How can I increase the number of tasks (the parallelism) when the files RDD
is created?
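
For reference, here is a minimal sketch of one thing that might be tried,
not a confirmed fix: the second argument to binaryFiles is documented as
only a suggested minimum number of partitions, so the sketch calls
repartition explicitly before the map. The object name, the process method
and the body of myFunc (which just returns the file size) are hypothetical
stand-ins; the real myFunc would parse the PDF. The .cache() call is left
out on the assumption that each file is read only once.

```scala
import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object BinaryFilesSketch {

  // Hypothetical stand-in for myFunc: reads the whole file and returns
  // its size in bytes. A real implementation would parse the PDF here.
  def myFunc(file: (String, PortableDataStream)): Long = {
    val (_, stream) = file
    stream.toArray().length.toLong
  }

  // Hypothetical helper: lists the files, then spreads them across
  // numPartitions tasks before any file content is read.
  def process(spark: SparkSession,
              filePath: String,
              numPartitions: Int): RDD[Long] = {
    // The second argument is only a suggested minimum partition count.
    val files = spark.sparkContext.binaryFiles(filePath, numPartitions)

    // repartition shuffles (path, PortableDataStream) pairs. The stream
    // is a serializable handle, so the file bytes are read only when
    // myFunc calls toArray() on the executor.
    files.repartition(numPartitions).map(myFunc)
  }
}
```

With this shape, fileContentRdd.getNumPartitions should equal the value
passed to repartition, so the map stage runs with that many tasks.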

