spark-user mailing list archives

From parameshr <parame...@gmail.com>
Subject Dataframe and corresponding RDD return different rows (PySpark)
Date Sat, 30 Jul 2016 21:54:42 GMT
Hi,

I am seeing odd behavior: a DataFrame, and the list and map generated
downstream from its RDD equivalent, seem to return different rows.
What could possibly be going wrong? Any help is appreciated.

Below is a snippet of the code along with its output.

NOTE:

[1] samples is a DataFrame with 10 rows and three columns. In the first
statement, I concatenate the first two columns into user_id.
[2] The output of the highlighted statements is shown below, and the outputs
differ. I would understand a different order (calling .collect() on an RDD
can produce a different ordering), but some of the returned rows are
completely different. This seems really weird!

CODE:

# Assumed import: the alias func is taken to refer to pyspark.sql.functions.
from pyspark.sql import functions as func

# Build user_id as "<post_visid_low>-<post_visid_high>" and keep the page URL.
tmp = samples.select(
    func.concat(func.col("post_visid_low"), func.lit("-"),
                func.col("post_visid_high")).alias("user_id"),
    "post_page_url")
print("tmp show:")
tmp.show(10, False)  # highlighted

# term freq computation
vocab = tmp.select("post_page_url").groupBy("post_page_url").count().rdd.collectAsMap()
for k, v in vocab.items():  # highlighted
    print(k, v)

# group by user_ids
user_id_urls = tmp.rdd.reduceByKey(
    lambda x, y: x + "," + y)
num_users = user_id_urls.count()
print("user_id_urls:")
user_id_urls.collect()  # highlighted

OUTPUT:


tmp dataframe show():
+---------------------------------------+--------------------------------------------------------------------------------------------+
|user_id                                |post_page_url                                                                               |
+---------------------------------------+--------------------------------------------------------------------------------------------+
|6917530152391623611-2707424459370863148|http://www.backcountry.com/Store/catalog/shopAllBrands.jsp                                  |
|6917530609264617841-2788188800375174579|http://www.backcountry.com/Store/catalog/shopAllBrands.jsp                                  |
|6917530818644021208-2821777435347267515|http://www.backcountry.com                                                                  |
|6917530818644021208-2821777435347267515|http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets                              |
|6917530818644021208-2821777435347267515|http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets                              |
|6917530818644021208-2821777435347267515|http://www.backcountry.com/dakine-washburn-jacket-mens                                      |
|1657310128-1262694438                  |http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016|
|4611687717086954899-2907911088913069555|http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys                            |
|2023386797-562458996                   |http://www.backcountry.com                                                                  |
|6917530783747871522-2923626095076314968|http://www.backcountry.com/pikolinos-verona-boot-womens                                     |
+---------------------------------------+--------------------------------------------------------------------------------------------+

vocab map:
http://www.backcountry.com/boys-jackets 2
http://www.backcountry.com/dakine-titan-mittens 1
https://www.backcountry.com/Store/account/account.jsp 1
http://www.backcountry.com/ski-clothing 1
http://www.backcountry.com/the-north-face-runners-1-etip-glove 1
http://www.backcountry.com/patagonia 1
http://www.backcountry.com/burton-boys-clothing 1
http://www.backcountry.com/mens-shorts 1
https://www.backcountry.com/Store/account/login.jsp 1
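
For comparison, here is a minimal plain-Python sketch (no Spark involved) of
the term frequencies that the ten rows printed by tmp.show() would give. The
names shown_urls, printed_vocab and expected_vocab are hypothetical and exist
only for this illustration; the data is copied from the output in this
message. The sketch only makes the mismatch explicit, it does not explain it.

```python
from collections import Counter

# URLs of the ten rows printed by tmp.show(), in the order shown.
shown_urls = [
    "http://www.backcountry.com/Store/catalog/shopAllBrands.jsp",
    "http://www.backcountry.com/Store/catalog/shopAllBrands.jsp",
    "http://www.backcountry.com",
    "http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets",
    "http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets",
    "http://www.backcountry.com/dakine-washburn-jacket-mens",
    "http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016",
    "http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys",
    "http://www.backcountry.com",
    "http://www.backcountry.com/pikolinos-verona-boot-womens",
]

# The vocab map exactly as it was printed by the job.
printed_vocab = {
    "http://www.backcountry.com/boys-jackets": 2,
    "http://www.backcountry.com/dakine-titan-mittens": 1,
    "https://www.backcountry.com/Store/account/account.jsp": 1,
    "http://www.backcountry.com/ski-clothing": 1,
    "http://www.backcountry.com/the-north-face-runners-1-etip-glove": 1,
    "http://www.backcountry.com/patagonia": 1,
    "http://www.backcountry.com/burton-boys-clothing": 1,
    "http://www.backcountry.com/mens-shorts": 1,
    "https://www.backcountry.com/Store/account/login.jsp": 1,
}

# Counts that a group-by-URL over the shown rows would produce.
expected_vocab = Counter(shown_urls)

# URLs that appear in both maps.
overlap = set(expected_vocab) & set(printed_vocab)

for url, count in sorted(expected_vocab.items()):
    print(url, count)
print("rows counted in printed vocab:", sum(printed_vocab.values()))
print("URLs common to both:", len(overlap))
```

Both maps account for ten rows, but they share no URL, so the vocab map
appears to have been computed over ten different rows than the ones shown.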

user_id_urls rdd:
[(u'4611687717086954899-2907911088913069555',
  u'http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys'),
 (u'2023386797-562458996', u'http://www.backcountry.com'),
 (u'6917530783747871522-2923626095076314968',
  u'http://www.backcountry.com/pikolinos-verona-boot-womens'),
 (u'6917530818644021208-2821777435347267515',
  u'http://www.backcountry.com,http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets,http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets,http://www.backcountry.com/dakine-washburn-jacket-mens'),
 (u'6917530152391623611-2707424459370863148',
  u'http://www.backcountry.com/Store/catalog/shopAllBrands.jsp'),
 (u'6917530609264617841-2788188800375174579',
  u'http://www.backcountry.com/Store/catalog/shopAllBrands.jsp'),
 (u'1657310128-1262694438',
  u'http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016')]
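
As a cross-check, the grouping step can be imitated locally. The sketch below
is a plain-Python stand-in for reduceByKey with the same lambda, applied to
the ten rows printed by tmp.show(). The helper reduce_by_key and the variable
names are hypothetical, and this is not how Spark implements the operation:
Spark merges values per partition and does not guarantee the order in which
they are combined, whereas this loop merges strictly in list order.

```python
def reduce_by_key(pairs, fn):
    """Local stand-in for RDD.reduceByKey: merge values that share a key."""
    merged = {}
    for key, value in pairs:
        # First value for a key is stored as is; later ones are merged in.
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged


# (user_id, post_page_url) pairs copied from the tmp.show() output.
shown_rows = [
    ("6917530152391623611-2707424459370863148",
     "http://www.backcountry.com/Store/catalog/shopAllBrands.jsp"),
    ("6917530609264617841-2788188800375174579",
     "http://www.backcountry.com/Store/catalog/shopAllBrands.jsp"),
    ("6917530818644021208-2821777435347267515",
     "http://www.backcountry.com"),
    ("6917530818644021208-2821777435347267515",
     "http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets"),
    ("6917530818644021208-2821777435347267515",
     "http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets"),
    ("6917530818644021208-2821777435347267515",
     "http://www.backcountry.com/dakine-washburn-jacket-mens"),
    ("1657310128-1262694438",
     "http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016"),
    ("4611687717086954899-2907911088913069555",
     "http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys"),
    ("2023386797-562458996",
     "http://www.backcountry.com"),
    ("6917530783747871522-2923626095076314968",
     "http://www.backcountry.com/pikolinos-verona-boot-womens"),
]

# Same merge function as in the job: join URLs with a comma.
grouped = reduce_by_key(shown_rows, lambda x, y: x + "," + y)

for user_id, urls in grouped.items():
    print(user_id, urls)
print("distinct users:", len(grouped))
```

On this data the local result has seven keys and agrees with the
user_id_urls output above, so that output and tmp.show() look like the same
ten rows. The vocab map is the one that differs.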


Thanks,
Params


