spark-user mailing list archives

From ag007 <>
Subject Parquet files are only 6-20MB in size?
Date Mon, 03 Nov 2014 07:42:40 GMT
Hi there,

I have a pySpark job that simply reads a tab-separated CSV and writes it out
as a Parquet file.  The code is based on the SQL write-parquet example
(using a different inferred schema, with only 35 columns).  The input files
range from 100 MB to 12 GB.

I have tried different block sizes from 10 MB through to 1 GB, and I have
tried different levels of parallelism.  The part files come out to about 1:5
of the input size.

I am trying to get large Parquet files.  Having this many small files will
cause problems for my NameNode; I have over 500,000 of these files.
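(In case it helps anyone hitting the same thing: the number of Parquet part
files written equals the RDD's partition count, so one workaround is to
coalesce down to a partition count derived from the input size before the
write.  A minimal sketch, assuming a 256 MB target part-file size and
Spark 1.1-era paths/names, all of which are illustrative:)

```python
import math

def target_partitions(input_bytes, target_file_bytes=256 * 1024 * 1024):
    """Pick a partition count so each Parquet part file lands near the
    target size, using the uncompressed input size as a rough proxy."""
    return max(1, int(math.ceil(input_bytes / float(target_file_bytes))))

# Hypothetical pySpark usage (Spark 1.1-style API, paths are made up):
# from pyspark.sql import SQLContext
# sqlContext = SQLContext(sc)
# rows = sc.textFile("hdfs:///data/input.tsv").map(lambda l: l.split("\t"))
# schemaRDD = sqlContext.inferSchema(rows)
# n = target_partitions(12 * 1024 ** 3)   # e.g. a 12 GB input -> 48 parts
# schemaRDD.coalesce(n).saveAsParquetFile("hdfs:///data/out.parquet")
```

The helper is only a heuristic: Parquet's encoding and compression will make
the actual part files smaller than the target, so tune the target upward if
the output is still too fragmented.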

Your assistance would be greatly appreciated.


PS Another solution may be a Parquet concat tool, if one exists; I couldn't
find one.  I understand that such a tool would have to adjust the footer
metadata.
