spark-user mailing list archives

From "Venkat, Ankam" <>
Subject Processing .wav files in PySpark
Date Fri, 16 Jan 2015 22:11:35 GMT
I need to process .wav files in PySpark. When the files are on the local file system, I am able
to process them. Once I store them on HDFS, I run into issues. For example:

I can run sox on a local .wav file like this:

sox ext2187854_03_27_2014.wav -n stats  <-- works fine

sox hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav -n stats   <-- Does not work, as sox cannot read HDFS files.

So I do it like this:

hadoop fs -cat hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav | sox -t wav - -n stats   <-- This works fine

But I am not able to do the same thing in PySpark:

wavfile = sc.textFile('hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav')
wavfile.pipe(['sox', '-t', 'wav', '-', '-n', 'stats'])

I also tried other options, such as sc.binaryFiles and sc.pickleFile, without success.
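
For reference, here is a rough, untested sketch of what the "hadoop fs -cat ... | sox" pipeline might look like from Python. It assumes sc.binaryFiles returns (path, bytes) pairs, that sox is installed on every worker node, and that sox prints the stats report on stderr. The names run_on_bytes, wav_stats and SOX_STATS_CMD are made up for illustration and are not part of Spark or sox.

```python
import subprocess

# Same sox invocation as the shell pipeline: read WAV data from stdin ("-"),
# discard audio output ("-n"), and run the "stats" effect.
SOX_STATS_CMD = ["sox", "-t", "wav", "-", "-n", "stats"]


def run_on_bytes(data, cmd=SOX_STATS_CMD):
    """Feed raw bytes to cmd on stdin and return (stdout, stderr) as text.

    Hypothetical helper, not a Spark API.
    """
    proc = subprocess.run(
        cmd, input=data, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    return (
        proc.stdout.decode("utf-8", "replace"),
        proc.stderr.decode("utf-8", "replace"),
    )


def wav_stats(sc, path):
    """Return an RDD of (file path, sox report) pairs.

    Assumption: sc is a SparkContext and each file fits in executor memory,
    since binaryFiles loads a whole file as a single bytes value.
    """
    # Element [1] is stderr, where sox is assumed to write the stats report.
    return sc.binaryFiles(path).mapValues(lambda data: run_on_bytes(data)[1])
```

The idea is to keep each file's bytes intact rather than reading them as lines of text, e.g. wav_stats(sc, 'hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav').collect().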

Any thoughts?

Venkat Ankam
