sqoop-user mailing list archives

From DSuiter RDX <dsui...@rdx.com>
Subject Re: Preserving origin syslog information
Date Wed, 30 Oct 2013 15:10:36 GMT
I apologize, this was intended for the Flume mailing list.

Sorry about that!

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Wed, Oct 30, 2013 at 11:04 AM, DSuiter RDX <dsuiter@rdx.com> wrote:

> Hi, just a general behavioral question.
>
> We have a syslogTCP source catching remotely generated syslog events. They
> go to an Avro sink, which delivers them to an Avro source, then into an
> HDFS sink.
>
> I currently have a test replicating channel delivering it to HDFS with the
> avro_event serializer, and also delivering the same events to HDFS without
> the avro_event serializer. The latter results in a text-encoded aggregate
> file, which works well.
>
> The issue I would like clarification on is this:
>
> When it is saved to HDFS as Avro, there is an epoch timestamp, the
> hostname, and some severity and facility information being saved along with
> the message body. There is a "headers" and "body" section of the Avro
> schema, and the timestamp etc is in the "headers" section, and the actual
> text is the "body."
>
> However, when the file is saved to HDFS as text, the only thing we get is
> the content of the "body" field, and there is no longer any host,
> timestamp, etc., even though those are components of the original message.
>
> Where are the components from the generating server being stripped away?
> By syslogTCP source, or by HDFS sink serializing into text?
>
> Another way to summarize this is: When the server writing the events to
> syslog writes them, it writes with timestamp and host fields. If we use
> Avro the whole way, it keeps that information as headers, but if we save as
> text, no timestamp or host information is preserved. We would like it
> preserved so we can programmatically parse the timestamp to sort by day. We
> would also like to not have to deal with Avro MapReduce for the time being,
> as that has proved challenging. So, is there a way that I can get the WHOLE
> event body as the "body" using syslogTCP source, or do we need to look at
> exec source to tail the generating server /var/log/messages and send it
> that way?
>
> Thanks,
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>

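As a rough illustration of the goal described in the quoted message (keeping the timestamp and host next to the body so lines can be sorted by day), here is a minimal Python sketch. It is not Flume code and not a Flume configuration. The header keys "timestamp" and "host", the assumption that the timestamp is epoch milliseconds, and the function names format_event and group_by_day are all hypothetical and chosen only for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def format_event(headers, body):
    """Prefix the body with a UTC timestamp and the host taken from the headers.

    Assumes headers["timestamp"] is epoch milliseconds (an assumption, not
    confirmed by the message above) and that "host" may be missing.
    """
    ts_ms = int(headers["timestamp"])
    when = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    host = headers.get("host", "-")  # "-" marks an unknown host
    return f"{when.strftime('%Y-%m-%dT%H:%M:%SZ')} {host} {body}"


def group_by_day(events):
    """Group formatted lines by their UTC day (the YYYY-MM-DD prefix)."""
    days = defaultdict(list)
    for headers, body in events:
        line = format_event(headers, body)
        days[line[:10]].append(line)  # first 10 characters are the date
    return dict(days)


if __name__ == "__main__":
    sample = [
        ({"timestamp": "0", "host": "web01"}, "first message"),
        ({"timestamp": "86400000", "host": "web01"}, "second message"),
    ]
    for day, lines in sorted(group_by_day(sample).items()):
        print(day, lines)
```

If the headers can be folded into each text line like this before it is written, the resulting file can be split by day with ordinary text tools, without reading Avro in MapReduce.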