spark-user mailing list archives

From Chris Fregly <ch...@fregly.com>
Subject Re: Spark for Log Analytics
Date Thu, 31 Mar 2016 11:46:19 GMT
oh, and I forgot to mention Kafka Streams, which has been heavily discussed
the last few days at Strata here in San Jose.

Streams can simplify a lot of this architecture by performing some
light-to-medium-complexity transformations in Kafka itself.

I'm waiting anxiously for Kafka 0.10 with production-ready Kafka Streams,
so I can try this out myself - and hopefully remove a lot of extra plumbing.
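To make the "light-to-medium-complexity transformation" idea concrete, here is a minimal sketch in plain Java of the kind of per-record normalization a Kafka Streams topology could apply between an input and an output topic. The field layout and function name are hypothetical; in Streams itself this logic would sit inside a `mapValues()` step rather than a standalone method.

```java
import java.util.Locale;

public class LogTransform {
    // Hypothetical per-record transformation: normalize a raw log line into
    // "LEVEL<TAB>message". In a Kafka Streams topology this would be the body
    // of a mapValues() step between an input topic and an output topic.
    public static String normalize(String rawLine) {
        String trimmed = rawLine.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            // No level prefix found; tag the record so it can be routed aside.
            return "UNKNOWN\t" + trimmed;
        }
        String level = trimmed.substring(0, space).toUpperCase(Locale.ROOT);
        String message = trimmed.substring(space + 1).trim();
        return level + "\t" + message;
    }

    public static void main(String[] args) {
        System.out.println(normalize("warn  disk usage at 91%"));
    }
}
```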

On Thu, Mar 31, 2016 at 4:42 AM, Chris Fregly <chris@fregly.com> wrote:

> this is a very common pattern, yes.
>
> note that in Netflix's case, they're currently pushing all of their logs
> to a Fronting Kafka + Samza Router which can route to S3 (or HDFS),
> ElasticSearch, and/or another Kafka Topic for further consumption by
> internal apps using other technologies like Spark Streaming (instead of
> Samza).
>
> this Fronting Kafka + Samza Router also helps to differentiate between
> high-priority events (Errors or High Latencies) and normal-priority events
> (normal User Play or Stop events).
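A router like that boils down to a classification function per event. The sketch below shows one such predicate in plain Java; the topic names and the 1000 ms latency threshold are illustrative placeholders, not Netflix's actual values.

```java
public class EventRouter {
    // Hypothetical routing predicate in the spirit of a fronting
    // Kafka + Samza router: errors and slow requests go to a high-priority
    // topic, everything else to the normal-priority topic.
    public static String chooseTopic(String level, long latencyMillis) {
        boolean highPriority =
                "ERROR".equalsIgnoreCase(level) || latencyMillis > 1000;
        return highPriority ? "logs-high-priority" : "logs-normal";
    }

    public static void main(String[] args) {
        System.out.println(chooseTopic("INFO", 45));    // normal user event
        System.out.println(chooseTopic("ERROR", 45));   // error event
        System.out.println(chooseTopic("INFO", 5000));  // high-latency event
    }
}
```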
>
> here's a recent presentation i did which details this configuration
> starting at slide 104:
> http://www.slideshare.net/cfregly/dc-spark-users-group-march-15-2016-spark-and-netflix-recommendations
> .
>
> btw, Confluent's distribution of Kafka does have a direct HTTP/REST API.
> it's not recommended for production use, but it has worked well for me in
> the past.
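For reference, the REST Proxy accepts a POST to `/topics/<topic>` with content type `application/vnd.kafka.json.v1+json` and a body of the shape built below. This sketch only constructs the JSON body; the escaping is naive and only safe for the simple fields shown - a real producer would use a JSON library and an HTTP client.

```java
public class RestProxyPayload {
    // Builds the JSON request body for Confluent's Kafka REST Proxy
    // (POST /topics/<topic>, Content-Type application/vnd.kafka.json.v1+json).
    // The field names "level" and "message" are illustrative.
    public static String recordBody(String level, String message) {
        return "{\"records\":[{\"value\":{\"level\":\"" + level
                + "\",\"message\":\"" + message + "\"}}]}";
    }

    public static void main(String[] args) {
        System.out.println(recordBody("ERROR", "payment service timeout"));
    }
}
```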
>
> these are some additional options to think about, anyway.
>
>
> On Thu, Mar 31, 2016 at 4:26 AM, Steve Loughran <stevel@hortonworks.com>
> wrote:
>
>>
>> On 31 Mar 2016, at 09:37, ashish rawat <dceashish@gmail.com> wrote:
>>
>> Hi,
>>
>> I have been evaluating Spark for analysing Application and Server Logs. I
>> believe there are some downsides to doing this:
>>
>> 1. No direct mechanism of collecting log, so need to introduce other
>> tools like Flume into the pipeline.
>>
>>
>> you need something to collect logs no matter what you run. Flume isn't so
>> bad; if you bring it up on the same host as the app then you can even
>> collect logs while the network is playing up.
>>
>> Or you can just copy log4j files to HDFS and process them later
>>
>> 2. Need to write lots of code for parsing different patterns from logs,
>> while some of the log analysis tools like logstash or loggly provide it out
>> of the box
>>
>>
>>
>> Log parsing is essentially an ETL problem, especially if you don't try to
>> lock down the log event format.
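As a concrete instance of that ETL work, here is a small sketch that pulls the client IP, status, and response size out of a line in Apache Common Log Format - the kind of pattern that logstash ships out of the box as a ready-made "grok" expression. The regex covers only the happy path shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AccessLogParser {
    // Matches the start of an Apache Common Log Format line:
    // client, identd, user, timestamp, quoted request, status, bytes.
    private static final Pattern CLF = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[[^\\]]+\\] \"[^\"]*\" (\\d{3}) (\\d+|-)");

    public static String summarize(String line) {
        Matcher m = CLF.matcher(line);
        if (!m.find()) {
            return "unparsed";  // route to a dead-letter path in a real pipeline
        }
        return m.group(1) + " status=" + m.group(2) + " bytes=" + m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(summarize(
                "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326"));
    }
}
```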
>>
>> You can also configure Log4J to save stuff in an easy-to-parse format
>> and/or forward directly to your application.
>>
>> There's a log4j to flume connector to do that for you,
>>
>>
>> http://www.thecloudavenue.com/2013/11/using-log4jflume-to-log-application.html
>>
>> or you can output in, say, JSON (
>> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/log/Log4Json.java
>>  )
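A hypothetical log4j.properties snippet wiring that layout in might look like the following; the appender name and file path are placeholders. With this in place, downstream parsing becomes a plain JSON decode instead of regex work.

```properties
# Emit one JSON object per log event using Hadoop's Log4Json layout.
log4j.rootLogger=INFO, jsonfile
log4j.appender.jsonfile=org.apache.log4j.FileAppender
log4j.appender.jsonfile.File=/var/log/myapp/app.json.log
log4j.appender.jsonfile.layout=org.apache.hadoop.log.Log4Json
```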
>>
>> I'd go with Flume unless you have a need to save the logs locally and copy
>> them to HDFS later.
>>
>>
>>
>> On the benefits side, I believe Spark might be more performant (although
>> I have yet to benchmark it) and, being a generic processing engine, might
>> work with complex use cases where the out-of-the-box functionality of log
>> analysis tools is not sufficient (although I don't have any such use case
>> right now).
>>
>> One option I was considering was to use logstash for collection and basic
>> processing, and then sink the processed logs to both Elasticsearch and
>> Kafka, so that Spark Streaming can pick up data from Kafka for the complex
>> use cases, while logstash filters can be used for the simpler ones.
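The dual-sink part of that idea can be sketched as a logstash output section like the one below. Hostnames, the index pattern, and the topic name are placeholders, and the option names assume the Logstash 2.x-era kafka output plugin.

```
output {
  # Simple use cases: query parsed events directly in Elasticsearch.
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
  # Complex use cases: same events on a Kafka topic for Spark Streaming.
  kafka {
    bootstrap_servers => "localhost:9092"
    topic_id => "parsed-logs"
  }
}
```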
>>
>> I was wondering if someone has already done this evaluation and could
>> provide me some pointers on how/if to create this pipeline with Spark.
>>
>> Regards,
>> Ashish
>>
>>
>>
>>
>
>
> --
>
> *Chris Fregly*
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com
>



-- 

*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
