From: titli batali
Date: Fri, 30 Dec 2016 21:15:20 +0530
Subject: Broadcast Join and Inner Join giving different result on same DataFrame
To: user

Hi,

I have two DataFrames that share a common column, Product_Id, on which I have to perform a join.

    // readCSVToDataFrame is my own helper; its arguments are (SQLContext, String, StructType)
    val transactionDF = readCSVToDataFrame(sqlCtx, pathToReadTransactions, transactionSchema)
    val productDF = readCSVToDataFrame(sqlCtx, pathToReadProduct, productSchema)

Since the transaction data is very large but the product data is small, I would ideally do a broadcast join in which I broadcast productDF.

    import org.apache.spark.sql.functions.broadcast

    val productBroadcastDF = broadcast(productDF)
    val broadcastJoin = transactionDF.join(productBroadcastDF, "productId")

Or simply,

    val innerJoin = transactionDF.join(productDF, "productId")

which should give the same result as above.

But if I use the simple inner join, I get a DataFrame with the joined values, whereas if I use the broadcast join, I get an empty DataFrame with no values. I am not able to explain this behavior. Ideally both should give the same result.

What could have gone wrong? Has anyone faced a similar issue?

Thanks,
Prateek