spark-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Tar File: On Spark
Date Thu, 19 May 2016 22:42:51 GMT
See http://memect.co/call-java-from-python-so

You can also use Py4J

On Thu, May 19, 2016 at 3:20 PM, ayan guha <guha.ayan@gmail.com> wrote:

> Hi
>
> Thanks for the input. Would it be possible to write it in Python? I think
> I can use FileUtil.unTar from the Hadoop jar. But can I do it from Python?
> On 19 May 2016 16:57, "Sun Rui" <sunrise_win@163.com> wrote:
>
>> 1. create a temp dir on HDFS, say “/tmp”
>> 2. write a script that creates, in the temp dir, one file per tar file.
>> Each file contains a single line:
>> <absolute path of the tar file>
>> 3. Write a spark application. It is like:
>>   val rdd = sc.textFile(<HDFS path of the temp dir>)
>>   rdd.map { line =>
>>     // construct an untar command using the path information in "line"
>>     // and launch the command
>>   }
>>
>> > On May 19, 2016, at 14:42, ayan guha <guha.ayan@gmail.com> wrote:
>> >
>> > Hi
>> >
>> > I have few tar files in HDFS in a single folder. each file has multiple
>> files in it.
>> >
>> > tar1:
>> >       - f1.txt
>> >       - f2.txt
>> > tar2:
>> >       - f1.txt
>> >       - f2.txt
>> >
>> > (each tar file will have exact same number of files, same name)
>> >
>> > I am trying to find a way (Spark or Pig) to extract them into their own
>> > folders.
>> >
>> > f1:
>> >       - tar1_f1.txt
>> >       - tar2_f1.txt
>> > f2:
>> >       - tar1_f2.txt
>> >       - tar2_f2.txt
>> >
>> > Any help?
>> >
>> >
>> >
>> > --
>> > Best Regards,
>> > Ayan Guha
>>
>>
>>
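Sun Rui's three steps above could be sketched in PySpark roughly as follows. This is an illustrative sketch, not a tested recipe: the `/tmp/tar-list` path and helper names are hypothetical, and it assumes the `hdfs` and `tar` commands are on each executor's PATH. Note that `map` is lazy, so an action such as `count()` is needed to actually run the commands.

```python
import subprocess

def untar_commands(hdfs_path):
    """Step 3's 'construct an untar command': pull the tar into the
    executor's working directory, then extract it there."""
    local_name = hdfs_path.rsplit("/", 1)[-1]
    return [["hdfs", "dfs", "-get", hdfs_path, local_name],
            ["tar", "-xf", local_name]]

def untar_one(line):
    for cmd in untar_commands(line.strip()):
        subprocess.check_call(cmd)
    return line

# Inside a Spark application (sc is the SparkContext):
#   rdd = sc.textFile("/tmp/tar-list")   # the temp dir from step 1
#   rdd.map(untar_one).count()           # count() forces the side effects
```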
