spark-user mailing list archives

From <jan.zi...@centrum.cz>
Subject Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()
Date Tue, 07 Oct 2014 07:47:48 GMT
The file itself is for now just the Wikipedia dump, which can be downloaded here: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2.
It's basically one big .xml file that I need to parse so that each line of the data contains a title + text pair. For this I currently use gensim.corpora.wikicorpus.extract_pages, which can be seen here: https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/wikicorpus.py.
This returns a generator from which I'd like to make the RDD.
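
A minimal PySpark sketch (not from the original message) of one way to feed such a generator into an RDD without holding the whole dump in driver memory: parallelize fixed-size batches and union them. The dump filename and batch size are placeholders, and the tuples yielded by extract_pages are kept as-is, since their exact shape depends on the gensim version.

    # Sketch only: consume the generator incrementally on the driver and build
    # the RDD out of fixed-size batches instead of one huge sc.parallelize().
    import bz2
    from itertools import islice

    from pyspark import SparkContext
    from gensim.corpora.wikicorpus import extract_pages

    sc = SparkContext(appName="wiki-pages")

    def batches(gen, size=10000):
        """Yield successive lists of at most `size` items from `gen`."""
        while True:
            chunk = list(islice(gen, size))
            if not chunk:
                return
            yield chunk

    pages = extract_pages(bz2.BZ2File("enwiki-latest-pages-articles.xml.bz2"))

    rdd = None
    for chunk in batches(pages):
        part = sc.parallelize(chunk)  # each chunk fits in driver memory
        rdd = part if rdd is None else rdd.union(part)

Repeated union() calls build up a long lineage, so with many batches a larger batch size or an occasional persist may be needed; the point is only that the generator is never materialized all at once.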
______________________________________________________________
> From: Steve Lewis <lordjoe2000@gmail.com>
> To: <jan.zikes@centrum.cz>
> Date: 07.10.2014 01:25
> Subject: Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()
>
Say more about the one file you have - is the file itself large and is it text? Here are
3 samples - I have tested the first two in Spark like this:

    Class inputFormatClass = MGFInputFormat.class;
    Class keyClass = String.class;
    Class valueClass = String.class;
    JavaPairRDD<String, String> spectraAsStrings = ctx.newAPIHadoopFile(
            path,
            inputFormatClass,
            keyClass,
            valueClass,
            ctx.hadoopConfiguration()
    );

I have not tested with a non-local cluster or gigabyte-sized files on Spark, but the
equivalent Hadoop code - like this but returning Hadoop Text - works well at those scales.
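
For the Python side of the question, a rough PySpark analogue of the Java call above (sc.newAPIHadoopFile exists in PySpark as of Spark 1.1); the fully qualified InputFormat class name below is a placeholder, and Hadoop Text key/value classes are used only for illustration.

    # Hypothetical sketch: the custom InputFormat jar must be on the classpath
    # (e.g. passed via --jars); "com.example.MGFInputFormat" is a stand-in for
    # the real package name.
    rdd = sc.newAPIHadoopFile(
        path,
        "com.example.MGFInputFormat",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.Text",
    )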
On Mon, Oct 6, 2014 at 2:33 PM, <jan.zikes@centrum.cz> wrote:
@Davies
I know that gensim.corpora.wikicorpus.extract_pages will for sure be the bottleneck on the
master node.
Unfortunately I am using Spark on EC2 and I don't have enough space on my nodes to store the
whole data set that needs to be parsed by extract_pages. I have my data on S3 and I kind of hoped
that after reading the data from S3 into an RDD (sc.textFile(file_on_s3)) it would be possible to
pass the RDD to extract_pages; unfortunately this does not work for me. If it worked, it would
be by far the best way to go for me.
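
To spell out the mismatch described above (the bucket name is made up): extract_pages expects a file-like object, while sc.textFile returns an RDD of individual lines with no guarantee that a <page> element stays within one partition, so the RDD cannot simply be handed to the parser.

    # Illustrative only: what the paragraph above describes trying.
    lines = sc.textFile("s3n://my-bucket/enwiki-latest-pages-articles.xml.bz2")  # RDD of lines
    # extract_pages(lines)  # fails: an RDD is not a file-like object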
 
@Steve
I can try a Hadoop custom InputFormat; it'd be great if you could send me some samples. But
if I understand it correctly, then I'm afraid it won't work for me, because I actually
don't have any URL to Wikipedia. I have only a file that is opened, parsed and returned as a
generator that yields the parsed page name and text from Wikipedia (it can also be some
non-public Wikipedia-like site).
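
Setting aside the disk-space constraint mentioned above, one workaround consistent with the "title + text on one line" format from the start of the thread (a sketch, not something proposed in the thread) would be to run the generator once, write each page as a single tab-separated line, and load the result with sc.textFile so Spark can split it across partitions.

    # Sketch under the assumptions above; sc is an existing SparkContext.
    import bz2
    import io
    from gensim.corpora.wikicorpus import extract_pages

    with io.open("pages.tsv", "w", encoding="utf-8") as out:
        for page in extract_pages(bz2.BZ2File("enwiki-latest-pages-articles.xml.bz2")):
            title, text = page[0], page[1]  # works whether or not a page id is also yielded
            out.write(u"%s\t%s\n" % (title, text.replace("\t", " ").replace("\n", " ")))

    pages_rdd = sc.textFile("pages.tsv").map(lambda line: tuple(line.split("\t", 1)))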
______________________________________________________________
> From: Steve Lewis <lordjoe2000@gmail.com>
> To: Davies Liu <davies@databricks.com>
> Date: 06.10.2014 22:39
> Subject: Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()
>
> CC: "user"
Try a Hadoop custom InputFormat - I can give you some samples. While I have not tried this,
an input split has only a length (which could be ignored if the format treats the file as
non-splittable) and a String for a location. If the location is a URL into Wikipedia, the whole
thing should work. Hadoop InputFormats seem to be the best way to get large (say multi-gigabyte)
files into RDDs.

-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com


