spark-user mailing list archives

From Nan Zhu <zhunanmcg...@gmail.com>
Subject shared variable and ALS in mllib
Date Mon, 06 Jan 2014 17:17:56 GMT
Hi, all

I have a question about how to share a variable among tasks. It seems that neither a broadcast
variable nor an accumulator solves my problem.

My dataset is a set of text files named 1.txt through 20000.txt.

Each file holds the users' ratings for one product. The product ID is given on the first
line of the file: "1:" ... "20000:".

The remaining lines are ratings in the form "userid, rating".

I want to parse the input files with Spark and pass the result to the ALS implementation in MLlib.

ALS requires an RDD of Rating objects, where a Rating is a 3-tuple (user, product,
rating).
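
To make the format concrete, here is a minimal sketch in plain Python (no Spark involved) of the parsing I have in mind. It assumes the whole contents of one file are available as a single string, that user and product IDs are integers, and that ratings are numeric. The function name parse_product_file and the sample values are made up for illustration.

```python
def parse_product_file(text):
    """Turn the contents of one product file into (user, product, rating) tuples."""
    # Drop blank lines and surrounding whitespace.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    # The first non-empty line is the product header, e.g. "1:".
    product = int(lines[0].rstrip(":"))
    ratings = []
    for line in lines[1:]:
        # Each remaining line is "userid, rating" (assumed layout).
        user, rating = line.split(",")
        ratings.append((int(user), product, float(rating)))
    return ratings


# Made-up sample in the described layout.
sample = "1:\n101, 3\n102, 5\n"
print(parse_product_file(sample))  # [(101, 1, 3.0), (102, 1, 5.0)]
```

This only works because the header line and the rating lines are read together.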

My problem is that a task may get a partition covering only part of a text file, so it never
sees the first line (e.g. "1:") and cannot tell which product the ratings belong
to.

How can I resolve this, other than writing a script that transforms the files by
appending the product ID to each line?

Best,

--  
Nan Zhu



