spark-user mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: Broadcasting a parquet file using spark and python
Date Wed, 01 Apr 2015 18:00:32 GMT
You will need to create a Hive Parquet table that points to the data and
run "ANALYZE TABLE tableName COMPUTE STATISTICS noscan" so that we have
statistics on the size.
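
A minimal sketch of that workaround in Python, not a tested recipe for any particular cluster. It assumes a HiveContext is available, that the Hive version in use accepts STORED AS PARQUET (Hive 0.13 or later), and that the full statistics command is "ANALYZE TABLE ... COMPUTE STATISTICS noscan". The table name, columns, and path are hypothetical, as are the helper names.

```python
def broadcast_setup_statements(table, columns, location):
    """Build the HiveQL for an external Parquet table plus size statistics.

    table, columns, and location are placeholders supplied by the caller;
    columns is a list of (name, hive_type) pairs.
    """
    cols = ", ".join("{0} {1}".format(name, hive_type) for name, hive_type in columns)
    return [
        # Assumption: STORED AS PARQUET is understood by the Hive version in use.
        "CREATE EXTERNAL TABLE IF NOT EXISTS {0} ({1}) "
        "STORED AS PARQUET LOCATION '{2}'".format(table, cols, location),
        # noscan records the on-disk size without reading the data.
        "ANALYZE TABLE {0} COMPUTE STATISTICS noscan".format(table),
    ]


def run_statements(sql_context, statements):
    """Run each statement through any object exposing .sql(), e.g. a HiveContext."""
    for statement in statements:
        sql_context.sql(statement)


if __name__ == "__main__":
    # Requires a Spark installation; imported here so the helpers above
    # can be used and inspected without one.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="broadcast-join-setup")
    hive_ctx = HiveContext(sc)
    run_statements(hive_ctx, broadcast_setup_statements(
        "small_dim",                            # hypothetical table name
        [("id", "INT"), ("name", "STRING")],    # hypothetical schema
        "/data/small_dim"))                     # hypothetical Parquet directory
```

Once the statistics are in the metastore, joins against the table through the same HiveContext can be planned with the recorded size in mind.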

On Tue, Mar 31, 2015 at 9:36 PM, Jitesh chandra Mishra <jitesh129@gmail.com>
wrote:

> Hi Michael,
>
> Thanks for your response. I am running 1.2.1.
>
> Is there any workaround to achieve the same with 1.2.1?
>
> Thanks,
> Jitesh
>
> On Wed, Apr 1, 2015 at 12:25 AM, Michael Armbrust <michael@databricks.com>
> wrote:
>
>> In Spark 1.3 I would expect this to happen automatically when the parquet
>> table is small (< 10 MB, configurable with spark.sql.autoBroadcastJoinThreshold).
>> If you are running 1.3 and not seeing this, can you show the code you are
>> using to create the table?
>>
>> On Tue, Mar 31, 2015 at 3:25 AM, jitesh129 <jitesh129@gmail.com> wrote:
>>
>>> How can we get a BroadcastHashJoin in Spark with Python?
>>>
>>> My Spark SQL inner joins are taking a long time because they are executed
>>> as ShuffledHashJoin.
>>>
>>> The tables being joined are stored as Parquet files.
>>>
>>> Please help.
>>>
>>> Thanks and regards,
>>> Jitesh
>>>
>>
>
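
The threshold mentioned in the quoted reply can be adjusted from Python. The following is a rough sketch, assuming the value is given in bytes and that the context accepts a "SET key=value" statement through .sql(); the helper name and the 50 MB figure are illustrative only.

```python
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024  # the "< 10 MB" default from the thread


def set_broadcast_threshold(sql_context, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Set spark.sql.autoBroadcastJoinThreshold on a SQLContext or HiveContext.

    sql_context is any object exposing .sql(); returns the statement that was
    issued so it can be logged or inspected.
    """
    statement = "SET spark.sql.autoBroadcastJoinThreshold={0}".format(int(threshold_bytes))
    sql_context.sql(statement)
    return statement


if __name__ == "__main__":
    # Requires a Spark installation.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="broadcast-threshold")
    hive_ctx = HiveContext(sc)
    # Illustrative value: allow tables up to 50 MB to be broadcast.
    set_broadcast_threshold(hive_ctx, 50 * 1024 * 1024)
```

To check the effect, look at the physical plan of the join query: it should show BroadcastHashJoin where it previously showed ShuffledHashJoin.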
