spark-user mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: Merging Parquet Files
Date Tue, 25 Nov 2014 17:47:43 GMT
You'll need to be running a very recent version of Spark SQL as this
feature was just added.

On Tue, Nov 25, 2014 at 1:01 AM, Daniel Haviv <danielrulez@gmail.com> wrote:

> Hi,
> Thanks for your reply. I'm trying to do what you suggested, but I get:
> scala> sqlContext.sql("CREATE TEMPORARY TABLE data USING
> org.apache.spark.sql.parquet OPTIONS (path '/requests_parquet.toomany')")
>
> *java.lang.RuntimeException: Failed to load class for data source:
> org.apache.spark.sql.parquet*
> *        at scala.sys.package$.error(package.scala:27)*
>
> Any idea why?
>
> Thanks,
> Daniel
>
> On Mon, Nov 24, 2014 at 11:30 PM, Michael Armbrust <michael@databricks.com
> > wrote:
>
>> Parquet does a lot of serial metadata operations on the driver, which
>> makes it really slow when you have a very large number of files (especially
>> if you are reading from something like S3).  This is something we are aware
>> of and that I'd really like to improve in 1.3.
>>
>> You might try the (brand new and very experimental) parquet support that
>> I added to 1.2 at the last minute in an attempt to make our metadata
>> handling more efficient.
>>
>> Basically you load the parquet files using the new data source API
>> instead of using parquetFile:
>>
>> CREATE TEMPORARY TABLE data
>> USING org.apache.spark.sql.parquet
>> OPTIONS (
>>   path 'path/to/parquet'
>> )
>>
>> This will at least parallelize the retrieval of file status objects, but
>> there is a lot more optimization that I hope to do.
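For intuition on why parallelizing helps, here is an illustrative sketch (not Spark's actual implementation; `fileStatus` is a hypothetical stand-in for a slow per-file filesystem round-trip, such as a FileStatus fetch from S3) of serial versus parallel metadata retrieval:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Stand-in for a per-file metadata call; a real one would block on a
// network round-trip, so latency dominates with thousands of files.
def fileStatus(path: String): Int = path.length

val paths = (1 to 8).map(i => s"/requests_parquet/part-$i")

// Serial: one round-trip at a time -- cost grows linearly with file count.
val serial = paths.map(fileStatus)

// Parallel: overlap the round-trips with futures.
val parallel = Await.result(
  Future.traverse(paths)(p => Future(fileStatus(p))), 10.seconds)

assert(serial == parallel)  // same results; the difference is wall-clock time
```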
>>
>> On Sat, Nov 22, 2014 at 1:53 PM, Daniel Haviv <danielrulez@gmail.com>
>> wrote:
>>
>>> Hi,
>>> I'm ingesting a lot of small JSON files and converting them to unified
>>> parquet files, but even the unified files are fairly small (~10MB).
>>> I want to run a merge operation every hour on the existing files, but it
>>> takes a lot of time for such a small amount of data: about 3 GB spread
>>> over 3,000 parquet files.
>>>
>>> Basically what I'm doing is loading the files in the existing directory,
>>> coalescing them, and saving to the new dir:
>>> val parquetFiles = sqlContext.parquetFile("/requests_merged/inproc")
>>>
>>> parquetFiles.coalesce(2).saveAsParquetFile(s"/requests_merged/$currday")
>>>
>>> Doing this takes over an hour on my 3-node cluster...
>>>
>>> Is there a better way to achieve this?
>>> Any ideas what can cause such a simple operation to take so long?
>>>
>>> Thanks,
>>> Daniel
>>>
>>
>>
>
