Date: Thu, 12 Mar 2015 09:22:45 -0700 (MST)
From: gtanguy
To: user@spark.apache.org
Subject: SPARKQL Join partitioner

Hello,

I am wondering how "join" works in Spark SQL. Does it co-partition the two tables, or does it use a wide dependency?

I have two big tables to join, and the query creates more than 150 GB of temporary data, so it fails because there is no space left on my disk.

I guess I could use a HashPartitioner so that the join runs with co-partitioned inputs, like this (rough sketch below):

1/ Read my two tables into two SchemaRDDs
2/ Transform the two SchemaRDDs into two RDD[(Key, Value)]
3/ Repartition both RDDs with my partitioner: rdd.partitionBy(new HashPartitioner(100))
4/ Join the two RDDs
5/ Transform the result back into a SchemaRDD
6/ Reconstruct my Hive table

Is there an easier way to do this via Spark SQL (HiveContext)?

Thanks for your help.
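For reference, here is a rough sketch of the six steps above against the Spark 1.2-era API (SchemaRDD / HiveContext). The table names (table_a, table_b), column names, and string-typed keys are made-up placeholders, and I have not tested this:

    import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("CoPartitionedJoin"))
    val hc = new HiveContext(sc)

    // 1/ Read the two tables into two SchemaRDDs.
    val a = hc.sql("SELECT key, a_value FROM table_a")
    val b = hc.sql("SELECT key, b_value FROM table_b")

    // 2/ A SchemaRDD is an RDD[Row]; turn each into an RDD[(Key, Value)].
    //    (Assumes both key and value columns are strings.)
    val aPairs = a.map(row => (row.getString(0), row.getString(1)))
    val bPairs = b.map(row => (row.getString(0), row.getString(1)))

    // 3/ Repartition both RDDs with the SAME partitioner instance, so the
    //    subsequent join sees co-partitioned inputs (narrow dependency).
    val partitioner = new HashPartitioner(100)
    val aPart = aPairs.partitionBy(partitioner)
    val bPart = bPairs.partitionBy(partitioner)

    // 4/ Join: rows with the same key already live on the same partition,
    //    so no extra shuffle happens during the join itself.
    val joined = aPart.join(bPart)  // RDD[(String, (String, String))]

    // 5/ Convert the result back to a SchemaRDD via a case class and the
    //    implicit RDD[Product] => SchemaRDD conversion.
    case class JoinedRow(key: String, aValue: String, bValue: String)
    import hc.createSchemaRDD
    val result = joined.map { case (k, (av, bv)) => JoinedRow(k, av, bv) }

    // 6/ Register the result and materialize it as a Hive table.
    result.registerTempTable("joined_tmp")
    hc.sql("CREATE TABLE joined_result AS SELECT * FROM joined_tmp")

Note that partitionBy itself still shuffles both tables once, so this mainly helps if you reuse the partitioned RDDs across several joins; it may not reduce the temporary disk usage of a single one-off join.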