spark-user mailing list archives

From jamal sasha <jamalsha...@gmail.com>
Subject Re: Processing audio/video/images
Date Thu, 19 Jun 2014 21:14:06 GMT
So..
Here is my experimental code to get a feel for it:

def read_file(filename):
    with open(filename) as f:
        lines = [line for line in f]
        return lines


files = ["/somepath/.../test1.txt", "/somepath/.../test2.txt"]
test1.txt has:
foo bar
this is test1

test2.txt has:
bar foo
this is text2

rdd_files = sc.parallelize(files).foreach(read_file)
Now, what I am hoping to get from this is the lines (probably unordered).
But rdd_files.take(2) doesn't return anything (the take method is not
defined on this).
How do I do this?
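[For what it's worth, the likely issue here: foreach is a Spark action that runs read_file for its side effects and returns None, so there is no RDD to call take() on. A transformation such as flatMap is what returns the lines, i.e. rdd_files = sc.parallelize(files).flatMap(read_file). A minimal local sketch of those semantics in plain Python, no Spark needed; the file contents are the hypothetical test1.txt/test2.txt from the message above:]

```python
# Hypothetical contents of the two files from the message above,
# kept in a dict so this sketch runs without touching the filesystem.
lines_by_file = {
    "test1.txt": ["foo bar", "this is test1"],
    "test2.txt": ["bar foo", "this is text2"],
}

def read_file(filename):
    # Stand-in for the original read_file(): returns the list of lines.
    return lines_by_file[filename]

files = ["test1.txt", "test2.txt"]

# flatMap semantics: map each file to its list of lines, then flatten
# one level, giving a single collection of lines across all files.
flat_mapped = [line for f in files for line in read_file(f)]

# take(2) semantics: the first two elements of the result.
first_two = flat_mapped[:2]
print(first_two)
```

On a real cluster the order of lines across files is not guaranteed, which matches the "probably unordered" expectation above.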


On Mon, Jun 2, 2014 at 5:29 PM, jamal sasha <jamalshasha@gmail.com> wrote:

> Phoofff.. (Mind blown)...
> Thank you sir.
> This is awesome
>
>
> On Mon, Jun 2, 2014 at 5:23 PM, Marcelo Vanzin <vanzin@cloudera.com>
> wrote:
>
>> The idea is simple. If you want to run something on a collection of
>> files, do (in pseudo-python):
>>
>> def processSingleFile(path):
>>   # Your code to process a file
>>
>> files = [ "file1", "file2" ]
>> sc.parallelize(files).foreach(processSingleFile)
>>
>>
>> On Mon, Jun 2, 2014 at 5:16 PM, jamal sasha <jamalshasha@gmail.com>
>> wrote:
>> > Hi Marcelo,
>> >   Thanks for the response..
>> > I am not sure I understand. Can you elaborate a bit?
>> > So, for example, let's take a look at this example:
>> > http://pythonvision.org/basic-tutorial
>> >
>> > import mahotas
>> > from scipy import ndimage
>> >
>> > dna = mahotas.imread('dna.jpeg')
>> > dnaf = ndimage.gaussian_filter(dna, 8)
>> >
>> > But instead of one dna.jpeg, let's say I have millions of such files
>> > and I want to run the above logic on all of them.
>> > How should I go about this?
>> > Thanks
>> >
>> > On Mon, Jun 2, 2014 at 5:09 PM, Marcelo Vanzin <vanzin@cloudera.com>
>> wrote:
>> >>
>> >> Hi Jamal,
>> >>
>> >> If what you want is to process lots of files in parallel, the best
>> >> approach is probably to load all file names into an array and
>> >> parallelize that. Then each task will take a path as input and can
>> >> process it however it wants.
>> >>
>> >> Or you could write the file list to a file, and then use sc.textFile()
>> >> to open it (assuming one path per line), and the rest is pretty much
>> >> the same as above.
>> >>
>> >> It will probably be hard to process each individual file in parallel,
>> >> unless mp3 and jpg files can be split into multiple blocks that can be
>> >> processed separately. In that case, you'd need a custom (Hadoop) input
>> >> format that is able to calculate the splits. But it doesn't sound like
>> >> that's what you want.
>> >>
>> >>
>> >>
>> >> On Mon, Jun 2, 2014 at 5:02 PM, jamal sasha <jamalshasha@gmail.com>
>> wrote:
>> >> > Hi,
>> >> >   How does one process data sources other than text?
>> >> > Let's say I have millions of mp3 (or jpeg) files and I want to
>> >> > use Spark to process them. How does one go about it?
>> >> >
>> >> >
>> >> > I have never been able to figure this out..
>> >> > Let's say I have a library in python which works like the following:
>> >> >
>> >> > import audio
>> >> >
>> >> > song = audio.read_mp3(filename)
>> >> >
>> >> > Then most of the methods are attached to song, or maybe there is
>> >> > another function which takes the "song" type as an input.
>> >> >
>> >> > Maybe the above is just rambling.. but how do I use spark to
>> >> > process (say) audio files?
>> >> > Thanks
>> >>
>> >>
>> >>
>> >> --
>> >> Marcelo
>> >
>> >
>>
>>
>>
>> --
>> Marcelo
>>
>
>
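[Marcelo's pattern above, parallelize the file names and let each task process one whole file, can be sketched locally without a cluster. Here a standard-library thread pool stands in for Spark's executors, and the file names and per-file logic are made up for illustration:]

```python
from concurrent.futures import ThreadPoolExecutor

def process_single_file(path):
    # Placeholder for real per-file work (e.g. mahotas.imread on an
    # image, or audio.read_mp3 on a song, followed by some analysis).
    # Here we just return the path and its length as a dummy result.
    return (path, len(path))

# Hypothetical file list; in Spark this would be the parallelized array
# of paths: sc.parallelize(files).map(process_single_file).collect()
files = ["file1.jpg", "file2.jpg", "file3.jpg"]

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(process_single_file, files))

print(results)
```

The key point is the same as in the thread: parallelism is per file, each task gets a path and opens the file itself, so no splittable input format is needed.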
