spark-user mailing list archives

From gy8 <>
Subject sqlCtx.load a single big csv file from s3 in parallel
Date Thu, 04 Jun 2015 22:54:29 GMT
Hi there!

I'm trying to read a large .csv file (14 GB) from S3 into a DataFrame via the
spark-csv package. I want to load the data in parallel, using all 20 executors
I have; by default, however, only 3 executors are being used (they downloaded
5 GB, 5 GB, and 4 GB respectively).

Here is my script (I'm using PySpark):

lol_file = sqlCtx.load(source="com.databricks.spark.csv",
                       path="s3n://my-bucket/lol_file.csv")  # placeholder bucket/path
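
For reference, here is how I check how many partitions the load actually
produces (lol_file.rdd exposes the DataFrame's underlying RDD):

print lol_file.rdd.getNumPartitions()  # number of input partitions created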

I have tried adding the option flags 1) minSplits=120 and 2) minPartitions=120,
but neither worked. I tried reading the source code, but I'm a noob at Scala
and could not figure out how the options are actually used :(
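
In case it clarifies what I mean, here is roughly what I attempted, plus the
only workaround I've found so far (a sketch under my assumptions: minSplits /
minPartitions are names I borrowed from the sc.textFile() API and may simply
be ignored by spark-csv, and the bucket/path is a placeholder):

# attempt: pass a split hint as a load() option; the name is borrowed
# from sc.textFile(), so spark-csv may just ignore it
lol_file = sqlCtx.load(source="com.databricks.spark.csv",
                       path="s3n://my-bucket/lol_file.csv",
                       minSplits="120")  # also tried minPartitions="120"

# workaround: repartition after the load; this spreads downstream
# processing over all 20 executors, but the initial S3 read itself
# still runs on only a few of them
lol_file = lol_file.repartition(120)

The repartition helps later stages, but it's the initial download I want to
happen in parallel.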

Thank you for reading and any help is much appreciated!

