spark-user mailing list archives

From Steve Lewis
Subject Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()
Date Mon, 06 Oct 2014 20:39:24 GMT
Try a custom Hadoop InputFormat; I can give you some samples.
I have not tried this myself, but an input split has only a length (which can be
ignored if the format treats the input as non-splittable) and a String for a location.
If the location is a URL into Wikipedia, the whole thing should work.
Hadoop InputFormats seem to be the best way to get large (say, multi-gigabyte)
files into RDDs.
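
To make the idea concrete, here is a rough Python sketch of the same pattern. It is not the Hadoop InputFormat API, only an analogy: the driver holds small split descriptors (a location plus a length), and the data is read lazily wherever a split is processed. The names Split, make_splits, read_split and fake_open are all made up for illustration, and fake_open stands in for whatever would really fetch the URL.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass(frozen=True)
class Split:
    """Hypothetical stand-in for an input split: a location and a length."""
    location: str
    length: int = 0  # may be ignored if the source is treated as non-splittable


def make_splits(locations: Iterable[str]) -> List[Split]:
    # Only these tiny descriptors live on the driver, never the data itself.
    return [Split(location=loc) for loc in locations]


def read_split(split: Split,
               open_location: Callable[[str], Iterable[str]]) -> Iterator[str]:
    # Generator: records are produced one at a time, so a split's contents
    # never have to fit in memory all at once.
    for record in open_location(split.location):
        yield record


def fake_open(location: str) -> Iterator[str]:
    # Assumption: placeholder reader; a real one would stream from the URL.
    for i in range(3):
        yield f"{location}#line{i}"


splits = make_splits(["http://example.org/a", "http://example.org/b"])
records = [r for s in splits for r in read_split(s, fake_open)]
```

In PySpark the equivalent would be something like sc.parallelize(splits, len(splits)).flatMap(lambda s: read_split(s, fake_open)), so that only the descriptors pass through parallelize() and the reading happens on the workers. I have not run that part, so treat it as a starting point.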
