spark-user mailing list archives

From Yin Huai <huaiyin....@gmail.com>
Subject Re: Spark SQL Join returns less rows that expected
Date Tue, 25 Nov 2014 20:35:11 GMT
I guess you want to use split("\\|") instead of split("|").
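
For context, here is a minimal sketch of the difference in plain Scala (no Spark needed). String.split takes a regular expression, and an unescaped | is the alternation operator with two empty alternatives, so it matches the empty string and the line gets split between every character. The sample record below is made up for illustration; the real files have many more columns.

```scala
// Hypothetical pipe-delimited record (not taken from the original files).
val line = "k1|foo|3.5"

// Unescaped "|" is an empty regex alternation: it matches between characters,
// so every character (including the pipes) ends up as its own element.
val wrong = line.split("|")

// Escaping the pipe makes it a literal delimiter.
val right = line.split("\\|")

// Scala's Char overload of split also treats the pipe literally.
val alsoRight = line.split('|')

// Pattern.quote is another way to build a literal delimiter.
val quoted = line.split(java.util.regex.Pattern.quote("|"))

println(wrong.mkString("[", ", ", "]"))
println(right.mkString("[", ", ", "]"))
```

With the unescaped version, f(44) in the code below is a single character of the line rather than the 45th column, so the join keys would rarely match. This would explain the low row count. Note also that split drops trailing empty fields by default; split("\\|", -1) keeps them, which may matter when indexing a late column such as f(101).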

On Tue, Nov 25, 2014 at 4:51 AM, Cheng Lian <lian.cs.zju@gmail.com> wrote:

> Which version are you using? Or if you are using the most recent master or
> branch-1.2, which commit are you using?
>
>
> On 11/25/14 4:08 PM, david wrote:
>
>> Hi,
>>
>>   I have 2 files which come from csv import of 2 Oracle tables.
>>
>>   F1 has 46730613 rows
>>   F2 has   3386740 rows
>>
>> I build 2 tables with Spark.
>>
>> Table F1 is joined with table F2 on c1 = d1.
>>
>>
>> All keys in F2.d1 exist in F1.c1, so I expect to retrieve 46730613 rows,
>> but it returns only 3437 rows.
>>
>> // --- begin code ---
>>
>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>> import sqlContext.createSchemaRDD
>>
>>
>> val rddFile = sc.textFile("hdfs://referential/F1/part-*")
>> case class F1(c1: String, c2: String, c3: Double, c4: String, c5: String)
>> val stkrdd = rddFile.map(x => x.split("|")).map(f =>
>> F1(f(44),f(3),f(10).toDouble, "",f(2)))
>> stkrdd.registerAsTable("F1")
>> sqlContext.cacheTable("F1")
>>
>>
>> val prdfile = sc.textFile("hdfs://referential/F2/part-*")
>> case class F2(d1: String, d2:String, d3:String,d4:String)
>> val productrdd = prdfile.map(x => x.split("|")).map(f =>
>> F2(f(0),f(2),f(101),f(3)))
>> productrdd.registerAsTable("F2")
>> sqlContext.cacheTable("F2")
>>
>> val resrdd = sqlContext.sql(
>>   "Select count(*) from F1, F2 where F1.c1 = F2.d1").count()
>>
>> // --- end of code ---
>>
>>
>> Does anybody know what I missed?
>>
>> Thanks
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Spark-SQL-Join-returns-less-rows-
>> that-expected-tp19731.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>>
>>
>
