From: DSuiter RDX
To: user@sqoop.apache.org
Date: Wed, 30 Oct 2013 11:04:34 -0400
Subject: Preserving origin syslog information

Hi, just a general behavioral question.

We have a syslogTCP source catching remotely generated syslog events. They go to an Avro sink, which delivers them to an Avro source, then into an HDFS sink.

I currently have a test replicating channel delivering the events to HDFS with the avro_event serializer, and also delivering the same events to HDFS without the avro_event serializer. The latter results in a text-encoded aggregate file, which works well.

The issue I would like clarification on is this:

When an event is saved to HDFS as Avro, an epoch timestamp, the hostname, and some severity and facility information are saved along with the message body. The Avro schema has a "headers" section and a "body" section; the timestamp etc. are in "headers", and the actual text is the "body".

However, when the file is saved to HDFS as text, the only thing we get is the content of the "body" field. There is no longer any host, timestamp, etc., even though those are components of the original message.

Where are the components from the generating server being stripped away? By the syslogTCP source, or by the HDFS sink when it serializes to text?

Another way to summarize this: when the server writes events to syslog, it writes them with timestamp and host fields. If we use Avro the whole way, that information is kept as headers, but if we save as text, no timestamp or host information is preserved. We would like it preserved so we can programmatically parse the timestamp to sort by day. We would also like to avoid dealing with Avro MapReduce for the time being, as that has proved challenging.

So, is there a way to get the WHOLE event as the "body" using the syslogTCP source, or do we need to look at an exec source to tail /var/log/messages on the generating server and send it that way?

Thanks,
Devin Suiter
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
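To make the behaviour described in the message concrete, here is a minimal Python sketch of the headers/body split. It is only a model of what the message reports, not Flume code: the header names (`timestamp`, `host`, `Facility`, `Severity`), the millisecond epoch unit, the sample values, and all three function names are assumptions made for illustration. It shows why a body-only text output loses the origin fields, and how prepending them to each line would keep the output as text while still allowing a sort by day.

```python
from datetime import datetime, timezone

# Hypothetical event shaped like the description above: origin metadata sits
# in "headers", the remaining message text sits in "body".
# Header names, values, and the millisecond unit are assumptions.
event = {
    "headers": {
        "timestamp": "1383145474000",
        "host": "web01",
        "Facility": "1",
        "Severity": "6",
    },
    "body": b"sshd[1234]: Accepted publickey for deploy",
}


def serialize_body_only(evt):
    """Model of a text output that writes only the body; headers are dropped."""
    return evt["body"].decode("utf-8")


def serialize_with_headers(evt):
    """Prepend the timestamp and host headers so they survive in plain text."""
    headers = evt["headers"]
    # Convert the (assumed) millisecond epoch into an ISO-8601 UTC string.
    ts = datetime.fromtimestamp(int(headers["timestamp"]) / 1000, tz=timezone.utc)
    stamp = ts.strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{stamp} {headers['host']} {evt['body'].decode('utf-8')}"


def day_key(line):
    """Return the YYYY-MM-DD prefix of a line from serialize_with_headers."""
    return line[:10]


if __name__ == "__main__":
    print(serialize_body_only(event))
    print(serialize_with_headers(event))
    print(day_key(serialize_with_headers(event)))
```

With the origin fields at the front of each line, grouping or sorting by day reduces to a string-prefix operation on the text file, which avoids Avro MapReduce. Whether this is best done at the source, in an interceptor, or in a custom serializer depends on the Flume version in use.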