spark-user mailing list archives

From Sun Rui <sunrise_...@163.com>
Subject Re: Tar File: On Spark
Date Fri, 20 May 2016 01:48:00 GMT
Sure. You can try pySpark, which is the Python API of Spark.
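
A minimal pySpark sketch of the idea below (not the poster's exact code): each line of an HDFS list file holds one tar path, and each archive is extracted into its own directory so that identically named members (f1.txt, f2.txt) do not overwrite each other. All paths and names here are hypothetical, and the tar files are assumed reachable from the executor nodes.

```python
import os
import tarfile

def untar(tar_path, dest_root="/tmp/extracted"):
    # Extract one archive into dest_root/<archive name>/ so that members
    # with identical names in different tars land in separate directories.
    name = os.path.splitext(os.path.basename(tar_path))[0]
    dest = os.path.join(dest_root, name)
    os.makedirs(dest, exist_ok=True)
    with tarfile.open(tar_path) as tf:
        tf.extractall(dest)
    return dest

# In the pyspark shell (illustrative; `sc` is the shell's SparkContext):
# rdd = sc.textFile("hdfs:///tmp/tar-list")   # hypothetical temp-dir path
# rdd.map(untar).collect()
```

The helper itself is plain Python, so it can be tried locally before handing it to Spark.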
> On May 20, 2016, at 06:20, ayan guha <guha.ayan@gmail.com> wrote:
> 
> Hi
> 
> Thanks for the input. Would it be possible to write it in Python? I think I can use FileUtil.unTar from the Hadoop jar. But can I do it from Python?
> 
> On 19 May 2016 16:57, "Sun Rui" <sunrise_win@163.com> wrote:
> 1. create a temp dir on HDFS, say “/tmp”
> 2. write a script to create in the temp dir one file for each tar file. Each file has only one line:
> <absolute path of the tar file>
> 3. Write a Spark application. It is like:
>   import scala.sys.process._
>   val rdd = sc.textFile(<HDFS path of the temp dir>)
>   rdd.foreach { line =>
>     // construct an untar command using the path in “line” and launch it
>     Seq("tar", "-xf", line).!
>   }
> 
> > On May 19, 2016, at 14:42, ayan guha <guha.ayan@gmail.com> wrote:
> >
> > Hi
> >
> > I have a few tar files in HDFS in a single folder. Each tar file contains multiple files.
> >
> > tar1:
> >       - f1.txt
> >       - f2.txt
> > tar2:
> >       - f1.txt
> >       - f2.txt
> >
> > (each tar file will have exact same number of files, same name)
> >
> > I am trying to find a way (spark or pig) to extract them to their own folders.
> >
> > f1
> >       - tar1_f1.txt
> >       - tar2_f1.txt
> > f2:
> >       - tar1_f2.txt
> >       - tar2_f2.txt
> >
> > Any help?
> >
> >
> >
> > --
> > Best Regards,
> > Ayan Guha
> 
> 
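
Step 2 of the suggestion above (one single-line list file per tar) could be sketched in Python. The `hadoop fs` CLI invocations and every path below are assumptions for illustration; the path parser is kept as a pure function so it can be tried without a cluster.

```python
import subprocess

def parse_tar_paths(ls_output):
    # `hadoop fs -ls` prints the file path as the last whitespace-separated
    # field of each entry line; skip summary lines such as "Found N items".
    return [line.split()[-1]
            for line in ls_output.splitlines()
            if line.strip().endswith(".tar")]

def write_list_files(hdfs_dir, temp_dir):
    # List the tars under hdfs_dir, then put one single-line file per tar
    # into temp_dir (hypothetical layout matching the steps above).
    ls = subprocess.check_output(["hadoop", "fs", "-ls", hdfs_dir], text=True)
    for i, path in enumerate(parse_tar_paths(ls)):
        local = f"list_{i}.txt"
        with open(local, "w") as f:
            f.write(path + "\n")            # the single line: the tar's path
        subprocess.check_call(
            ["hadoop", "fs", "-put", "-f", local, f"{temp_dir}/list_{i}.txt"])

# write_list_files("/data/tars", "/tmp/tar-list")   # illustrative only
```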

