spark-user mailing list archives

From Nick Allen <n...@nickallen.org>
Subject Re: How to 'Pipe' Binary Data in Apache Spark
Date Fri, 16 Jan 2015 18:46:15 GMT
I just wanted to reiterate the solution for the benefit of the community.

The problem is not my use of 'pipe', but that 'textFile' cannot be
used to read binary data. (Doh) There are a couple of options for moving
forward.

1. Implement a custom 'InputFormat' that understands the binary input data.
(Per Sean Owen)

2. Use 'SparkContext.binaryFiles' to read in the entire binary file as a
single record. This will hurt performance on large files, because the whole
file becomes one record and cannot be split across more than one mapper.
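
For option 2, here is a minimal sketch of what could be done with the bytes
once 'binaryFiles' hands over a whole file. It assumes the classic libpcap
file layout (24-byte global header, then a 16-byte header in front of each
packet) and does not handle pcapng or the nanosecond-timestamp variant. The
name 'pcap_records' is made up for illustration, and I have only exercised
the parsing part in plain Python, not on a cluster.

```python
import struct

PCAP_GLOBAL_HEADER_LEN = 24  # magic, version, timezone, sigfigs, snaplen, link type
PCAP_RECORD_HEADER_LEN = 16  # ts_sec, ts_usec, incl_len, orig_len


def pcap_records(data):
    """Yield (ts_sec, ts_usec, packet_bytes) for each packet in a classic pcap file.

    `data` is the full content of one file as bytes. `pcap_records` is a
    hypothetical helper, not part of Spark or of any pcap library.
    """
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":    # 0xa1b2c3d4 written by a little-endian host
        endian = "<"
    elif magic == b"\xa1\xb2\xc3\xd4":  # same magic written by a big-endian host
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")

    offset = PCAP_GLOBAL_HEADER_LEN
    while offset + PCAP_RECORD_HEADER_LEN <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", data, offset
        )
        offset += PCAP_RECORD_HEADER_LEN
        packet = data[offset:offset + incl_len]
        if len(packet) < incl_len:
            break  # file was truncated in the middle of a packet
        yield ts_sec, ts_usec, packet
        offset += incl_len


# Intended use with Spark (assumption: `sc` is an existing SparkContext):
#   files = sc.binaryFiles("binary-data.dat")            # RDD of (path, bytes)
#   packets = files.flatMap(lambda kv: pcap_records(kv[1]))
```

Each file is still parsed by a single task, so this only parallelizes across
files, not within one file.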

In my specific case, for #1 I can find only one project that does this, from
RIPE-NCC (https://github.com/RIPE-NCC/hadoop-pcap). Unfortunately, it
appears to support only a limited set of network protocols.
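
For anyone wondering why reading binary data as text breaks it, here is a
small plain-Python illustration. It is only an approximation of what a
line-oriented text reader does (split on newline bytes, then decode as
UTF-8); the sample bytes are made up, and this is not Spark's actual code
path.

```python
# Made-up binary payload: the 0x0a bytes are data, not line endings.
raw = b"\x00\x01\n\xff\xfe\n\x02"

# A line-oriented reader cuts the stream at every 0x0a byte,
# so one binary blob turns into several unrelated "lines".
records = raw.split(b"\n")

# Decoding as UTF-8 replaces bytes that are not valid UTF-8 with U+FFFD,
# so the original bytes cannot be recovered afterwards.
as_text = [r.decode("utf-8", errors="replace") for r in records]

# Re-encoding and re-joining does not give back the original data.
round_trip = "\n".join(as_text).encode("utf-8")
print(len(records), round_trip == raw)
```

By the time 'pipe' sees the data, the damage has already been done.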



On Fri, Jan 16, 2015 at 10:40 AM, Nick Allen <nick@nickallen.org> wrote:

> Per your last comment, it appears I need something like this:
>
> https://github.com/RIPE-NCC/hadoop-pcap
>
>
> Thanks a ton.  That gets me oriented in the right direction.
>
> On Fri, Jan 16, 2015 at 10:20 AM, Sean Owen <sowen@cloudera.com> wrote:
>
>> Well, it looks like you're reading some kind of binary file as text.
>> That isn't going to work, in Spark or elsewhere, as binary data is not
>> even necessarily a valid encoding of a string. There are no line
>> breaks to delimit lines and thus elements of the RDD.
>>
>> Your input has some record structure (or else it's not really useful
>> to put it into an RDD). You can encode this as a SequenceFile and read
>> it with objectFile.
>>
>> You could also write a custom InputFormat that knows how to parse pcap
>> records directly.
>>
>> On Fri, Jan 16, 2015 at 3:09 PM, Nick Allen <nick@nickallen.org> wrote:
>> > I have an RDD containing binary data. I would like to use 'RDD.pipe' to pipe
>> > that binary data to an external program that will translate it to
>> > string/text data. Unfortunately, it seems that Spark is mangling the binary
>> > data before it gets passed to the external program.
>> >
>> > This code is representative of what I am trying to do. What am I doing
>> > wrong? How can I pipe binary data in Spark?  Maybe it is getting corrupted
>> > when I read it in initially with 'textFile'?
>> >
>> > bin = sc.textFile("binary-data.dat")
>> > csv = bin.pipe ("/usr/bin/binary-to-csv.sh")
>> > csv.saveAsTextFile("text-data.csv")
>> >
>> > Specifically, I am trying to use Spark to transform pcap (packet capture)
>> > data to text/csv so that I can perform an analysis on it.
>> >
>> > Thanks!
>> >
>> > --
>> > Nick Allen <nick@nickallen.org>
>>
>
>
>
> --
> Nick Allen <nick@nickallen.org>
>



-- 
Nick Allen <nick@nickallen.org>
