spark-user mailing list archives

From Params <parame...@gmail.com>
Subject Dataframe and corresponding RDD return different rows (PySpark)
Date Sat, 30 Jul 2016 22:35:01 GMT
Hi,

I am seeing strange behavior where a dataframe and the downstream list
and map generated from its RDD equivalent seem to return different
rows. What could possibly be going wrong? Any help is appreciated.

Below is a snippet of the code along with the output:
NOTE: [1] samples is a dataframe with 10 rows and three columns (the result
of sampling 10 random rows from another, larger dataframe). After that, I
concatenate the first two columns.

[2] The output of the highlighted statements is shown below. The outputs
differ. I would understand if only the order were different (because calling
.collect() on an RDD could produce a different ordering), but some of the
rows returned are completely different. For example, the third output seems
to produce several URLs that never appear in the dataframe from which this
RDD is generated. This seems really weird!

FULL CODE:

samples = subset_df.select("post_visid_low", "post_visid_high", "post_page_url").where(
    subset_df["post_page_url"] != "").sample(False, 0.1, seed=0).limit(num_samples)
tmp = samples.select(
    func.concat(func.col("post_visid_low"), func.lit("-"), func.col("post_visid_high")).alias('user_id'),
    "post_page_url")
print("tmp show:")
tmp.show(10, False)

# term freq computation
vocab = tmp.select("post_page_url").groupBy("post_page_url").count().rdd.collectAsMap()
for k,v in vocab.items():
    print(k,v)

# group by user_ids
user_id_urls = tmp.rdd.reduceByKey(lambda x,y: x + "," + y)
num_users = user_id_urls.count()
print("user_id_urls:")
user_id_urls.collect()

OUTPUT:
tmp dataframe show():
+---------------------------------------+--------------------------------------------------------------------------------------------+
|user_id                                |post_page_url                                                                               |
+---------------------------------------+--------------------------------------------------------------------------------------------+
|6917530152391623611-2707424459370863148|http://www.backcountry.com/Store/catalog/shopAllBrands.jsp                                  |
|6917530609264617841-2788188800375174579|http://www.backcountry.com/Store/catalog/shopAllBrands.jsp                                  |
|6917530818644021208-2821777435347267515|http://www.backcountry.com                                                                  |
|6917530818644021208-2821777435347267515|http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets                              |
|6917530818644021208-2821777435347267515|http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets                              |
|6917530818644021208-2821777435347267515|http://www.backcountry.com/dakine-washburn-jacket-mens                                      |
|1657310128-1262694438                  |http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016|
|4611687717086954899-2907911088913069555|http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys                            |
|2023386797-562458996                   |http://www.backcountry.com                                                                  |
|6917530783747871522-2923626095076314968|http://www.backcountry.com/pikolinos-verona-boot-womens                                     |
+---------------------------------------+--------------------------------------------------------------------------------------------+

vocab map:
http://www.backcountry.com/boys-jackets 2
http://www.backcountry.com/dakine-titan-mittens 1
https://www.backcountry.com/Store/account/account.jsp 1
http://www.backcountry.com/ski-clothing 1
http://www.backcountry.com/the-north-face-runners-1-etip-glove 1
http://www.backcountry.com/patagonia 1
http://www.backcountry.com/burton-boys-clothing 1
http://www.backcountry.com/mens-shorts 1
https://www.backcountry.com/Store/account/login.jsp 1

user_id_urls rdd:
[(u'4611687717086954899-2907911088913069555',
  u'http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys'),
 (u'2023386797-562458996', u'http://www.backcountry.com'),
 (u'6917530783747871522-2923626095076314968',
  u'http://www.backcountry.com/pikolinos-verona-boot-womens'),
 (u'6917530818644021208-2821777435347267515',
  u'http://www.backcountry.com,http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets,http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets,http://www.backcountry.com/dakine-washburn-jacket-mens'),
 (u'6917530152391623611-2707424459370863148',
  u'http://www.backcountry.com/Store/catalog/shopAllBrands.jsp'),
 (u'6917530609264617841-2788188800375174579',
  u'http://www.backcountry.com/Store/catalog/shopAllBrands.jsp'),
 (u'1657310128-1262694438',
  u'http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016')]
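
The mismatch described above can be made concrete with a small check. This is only a sketch, assuming the rows and the vocab map have already been collected into plain Python objects. The helper name urls_missing_from_frame is hypothetical and not part of the original job, and the sample values are copied from the output above.

```python
def urls_missing_from_frame(frame_rows, vocab):
    """Return the URLs that are keys of vocab but occur in no (user_id, url) row."""
    frame_urls = {url for _user_id, url in frame_rows}
    return sorted(set(vocab) - frame_urls)


# Illustrative values taken from the "tmp dataframe show()" output above.
frame_rows = [
    ("2023386797-562458996", "http://www.backcountry.com"),
    ("6917530783747871522-2923626095076314968",
     "http://www.backcountry.com/pikolinos-verona-boot-womens"),
]

# Illustrative values taken from the "vocab map" output above.
vocab = {
    "http://www.backcountry.com/boys-jackets": 2,
    "http://www.backcountry.com/patagonia": 1,
}

# Any URL listed here was counted in vocab but is absent from the frame rows.
for url in urls_missing_from_frame(frame_rows, vocab):
    print(url)
```

In the actual job, frame_rows would presumably come from something like tmp.collect(). That call is itself a separate Spark action, so the check only compares whatever rows that particular action returns.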


Thanks,
Params
