spark-user mailing list archives

From Nan Zhu <zhunanmcg...@gmail.com>
Subject Re: shared variable and ALS in mllib
Date Tue, 07 Jan 2014 04:40:29 GMT
Thanks Jason. Yes, that's true, but how do I accomplish the first step, i.e. assigning each file to its own partition?

It seems that sc.textFile() has no parameter for this.

The files are stored on S3.
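
For reference, one possible way around the partitioning question is to read each file as a single record, so the header line and its ratings always travel together. Below is a minimal sketch of the per-file parsing step in plain Scala, assuming the layout described in the quoted message (first line "<productId>:", then one "userid, rating" pair per line). The names RatingParser and parseRatingFile are made up for illustration, and the Spark usage in the trailing comment assumes a Spark version that provides SparkContext.wholeTextFiles; the S3 path there is a placeholder.

```scala
object RatingParser {
  // Parse the full text of one rating file into (user, product, rating) triples.
  // Assumed layout: first non-empty line is "<productId>:", and every later
  // non-empty line is "userid, rating".
  def parseRatingFile(content: String): Seq[(Int, Int, Double)] = {
    val lines = content.split("\n").iterator.map(_.trim).filter(_.nonEmpty)
    if (!lines.hasNext) Seq.empty
    else {
      // Header line, e.g. "1:" -> product ID 1
      val product = lines.next().stripSuffix(":").trim.toInt
      lines.map { line =>
        val fields = line.split(",")
        (fields(0).trim.toInt, product, fields(1).trim.toDouble)
      }.toList
    }
  }
}

// Hypothetical Spark usage (placeholder path, not from this thread):
//   val ratings = sc.wholeTextFiles("s3n://some-bucket/ratings/")
//     .flatMap { case (_, content) => RatingParser.parseRatingFile(content) }
//     .map { case (user, product, rating) => Rating(user, product, rating) }
```

Because each file arrives as one (path, content) pair, no task ever sees ratings without their header line. This only makes sense if each individual file fits comfortably in memory.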

Best,  

--  
Nan Zhu


On Monday, January 6, 2014 at 11:27 PM, Jason Dai wrote:

> If you assign each file to a standalone partition, then you can generate the Rating RDD using something like the following:
>  
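> files.mapPartitions { part =>
>    // The first line of each file holds the product ID, e.g. "1:"
>    val product = part.next().trim.stripSuffix(":").toInt
>    part.map { line =>
>      val fields = line.split(",")  // each remaining line is "userid, rating"
>      (fields(0).trim.toInt, product, fields(1).trim.toDouble)
>    }
> }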
>  
> Thanks,
> -Jason
>  
>  
>  
> On Tue, Jan 7, 2014 at 1:17 AM, Nan Zhu <zhunanmcgill@gmail.com> wrote:
> > Hi, all
> >  
> > I have a question about how to share a variable among tasks. It seems that neither broadcast variables nor accumulators can solve my problem.
> >  
> > My dataset is a set of txt files, named 1.txt through 20000.txt.
> >  
> > Each txt file holds the users' ratings of a single product. The product ID is given on the first line of each file: “1:” … “20000:”.
> >  
> > The lines after that are ratings, in the form “userid, rating”.
> >  
> > I want to parse the input files with Spark and pass the result to the ALS implementation in MLlib.
> >  
> > ALS requires an RDD of Rating objects, where a Rating is a 3-tuple (user, product, rating).
> >  
> > My problem is that a task may receive a partition of a text file that does not include the first line (e.g. “1:”), so it cannot tell which product its ratings correspond to.
> >  
> > How can I resolve this, other than writing a script that transforms the files by adding the product ID to each line?
> >  
> > Best,
> >  
> > --  
> > Nan Zhu
> >  
> >  
> >  
>  

