samza-dev mailing list archives

From Shekar Tippur <ctip...@gmail.com>
Subject Re: Samza as a Caching layer
Date Tue, 02 Sep 2014 20:21:09 GMT
The other observation is ..

http://10.132.62.185:8088/cluster shows that no applications are running.

- Shekar




On Tue, Sep 2, 2014 at 1:15 PM, Shekar Tippur <ctippur@gmail.com> wrote:

> YARN seems to be running ..
>
> yarn      5462  0.0  2.0 1641296 161404 ?      Sl   Jun20  95:26
> /usr/java/jdk1.6.0_31/bin/java -Dproc_resourcemanager -Xmx1000m
> -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn
> -Dhadoop.log.file=yarn-yarn-resourcemanager-pppdc9prd2y2.log
> -Dyarn.log.file=yarn-yarn-resourcemanager-pppdc9prd2y2.log
> -Dyarn.home.dir=/usr/lib/hadoop-yarn -Dhadoop.home.dir=/usr/lib/hadoop-yarn
> -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA
> -Djava.library.path=/usr/lib/hadoop/lib/native -classpath
> /etc/hadoop/conf:/etc/hadoop/conf:/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-yarn/lib/*:/etc/hadoop/conf/rm-config/log4j.properties
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
>
> I can tail the Kafka topic as well ..
>
> deploy/kafka/bin/kafka-console-consumer.sh  --zookeeper localhost:2181 --topic wikipedia-raw
>
>
>
>
> On Tue, Sep 2, 2014 at 1:04 PM, Chris Riccomini <
> criccomini@linkedin.com.invalid> wrote:
>
>> Hey Shekar,
>>
>> It looks like your job is hanging trying to connect to the RM on your
>> localhost. I thought that you said that your job was running in local
>> mode. If so, it should be using the LocalJobFactory. If not, and you
>> intend to run on YARN, is your YARN RM up and running on localhost?
>>
>> Cheers,
>> Chris
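
[Editor's note: for reference, the job factory Chris mentions is chosen by a single property in the job's configuration file. A sketch of the switch (class names as of Samza 0.7; treat them as an assumption to check against your Samza version):]

```properties
# Local mode (runs the job in-process, no YARN needed):
# job.factory.class=org.apache.samza.job.local.LocalJobFactory

# YARN mode (requires a running ResourceManager; its address comes
# from yarn-site.xml / yarn.resourcemanager.address):
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
```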
>>
>> On 9/2/14 12:22 PM, "Shekar Tippur" <ctippur@gmail.com> wrote:
>>
>> >Chris ..
>> >
>> >$ cat ./deploy/samza/undefined-samza-container-name.log
>> >
>> >2014-09-02 11:17:58 JobRunner [INFO] job factory:
>> >org.apache.samza.job.yarn.YarnJobFactory
>> >
>> >2014-09-02 11:17:59 ClientHelper [INFO] trying to connect to RM
>> >127.0.0.1:8032
>> >
>> >2014-09-02 11:17:59 NativeCodeLoader [WARN] Unable to load native-hadoop
>> >library for your platform... using builtin-java classes where applicable
>> >
>> >2014-09-02 11:17:59 RMProxy [INFO] Connecting to ResourceManager at /
>> >127.0.0.1:8032
>> >
>> >
>> >and Log4j ..
>> >
>> ><!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
>> >
>> ><log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
>> >
>> >  <appender name="RollingAppender"
>> >class="org.apache.log4j.DailyRollingFileAppender">
>> >
>> >     <param name="File"
>> >value="${samza.log.dir}/${samza.container.name}.log"
>> >/>
>> >
>> >     <param name="DatePattern" value="'.'yyyy-MM-dd" />
>> >
>> >     <layout class="org.apache.log4j.PatternLayout">
>> >
>> >      <param name="ConversionPattern" value="%d{yyyy-MM-dd HH:mm:ss} %c{1} [%p] %m%n" />
>> >
>> >     </layout>
>> >
>> >  </appender>
>> >
>> >  <root>
>> >
>> >    <priority value="info" />
>> >
>> >    <appender-ref ref="RollingAppender"/>
>> >
>> >  </root>
>> >
>> ></log4j:configuration>
>> >
>> >
>> >On Tue, Sep 2, 2014 at 12:02 PM, Chris Riccomini <
>> >criccomini@linkedin.com.invalid> wrote:
>> >
>> >> Hey Shekar,
>> >>
>> >> Can you attach your log files? I'm wondering if it's a mis-configured
>> >> log4j.xml (or a missing slf4j-log4j jar), which is leading to nearly
>> >> empty log files. Also, I'm wondering if the job starts fully. Anything
>> >> you can attach would be helpful.
>> >>
>> >> Cheers,
>> >> Chris
>> >>
>> >> On 9/2/14 11:43 AM, "Shekar Tippur" <ctippur@gmail.com> wrote:
>> >>
>> >> >I am running in local mode.
>> >> >
>> >> >S
>> >> >On Sep 2, 2014 11:42 AM, "Yan Fang" <yanfang724@gmail.com> wrote:
>> >> >
>> >> >> Hi Shekar.
>> >> >>
>> >> >> Are you running the job in local mode or YARN mode? If YARN mode,
>> >> >> the log is in the YARN container log.
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >> Fang, Yan
>> >> >> yanfang724@gmail.com
>> >> >> +1 (206) 849-4108
>> >> >>
>> >> >>
>> >> >> On Tue, Sep 2, 2014 at 11:36 AM, Shekar Tippur <ctippur@gmail.com> wrote:
>> >> >>
>> >> >> > Chris,
>> >> >> >
>> >> >> > Got some time to play around a bit more.
>> >> >> > I tried to edit
>> >> >> >
>> >> >> >
>> >> >>
>> >>
>>
>> >> >> > samza-wikipedia/src/main/java/samza/examples/wikipedia/task/WikipediaFeedStreamTask.java
>> >> >> > to add a logger info statement to tap the incoming message. I don't
>> >> >> > see the messages being printed to the log file.
>> >> >> >
>> >> >> > Is this the right place to start?
>> >> >> >
>> >> >> > public class WikipediaFeedStreamTask implements StreamTask {
>> >> >> >
>> >> >> >   private static final SystemStream OUTPUT_STREAM =
>> >> >> >       new SystemStream("kafka", "wikipedia-raw");
>> >> >> >
>> >> >> >   private static final Logger log =
>> >> >> >       LoggerFactory.getLogger(WikipediaFeedStreamTask.class);
>> >> >> >
>> >> >> >   @Override
>> >> >> >   public void process(IncomingMessageEnvelope envelope,
>> >> >> >       MessageCollector collector, TaskCoordinator coordinator) {
>> >> >> >     Map<String, Object> outgoingMap =
>> >> >> >         WikipediaFeedEvent.toMap((WikipediaFeedEvent) envelope.getMessage());
>> >> >> >     log.info(envelope.getMessage().toString());
>> >> >> >     collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, outgoingMap));
>> >> >> >   }
>> >> >> > }
>> >> >> >
>> >> >> >
>> >> >> > On Mon, Aug 25, 2014 at 9:01 AM, Chris Riccomini <
>> >> >> > criccomini@linkedin.com.invalid> wrote:
>> >> >> >
>> >> >> > > Hey Shekar,
>> >> >> > >
>> >> >> > > Your thought process is on the right track. It's probably best to
>> >> >> > > start with hello-samza, and modify it to get what you want. To
>> >> >> > > start with, you'll want to:
>> >> >> > >
>> >> >> > > 1. Write a simple StreamTask that just does something silly, like
>> >> >> > > printing the messages that it receives.
>> >> >> > > 2. Write a configuration for the job that consumes from just the
>> >> >> > > stream (alerts from different sources).
>> >> >> > > 3. Run this to make sure you've got it working.
>> >> >> > > 4. Now add your table join. This can be either a change-data
>> >> >> > > capture (CDC) stream, or via a remote DB call.
>> >> >> > >
>> >> >> > > That should get you to a point where you've got your job up and
>> >> >> > > running. From there, you could create your own Maven project, and
>> >> >> > > migrate your code over accordingly.
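
[Editor's note: for step 2 above, a Samza job configuration is just a properties file. A minimal sketch in the style of hello-samza's configs; the job name, task class, and topic name below are placeholders, and the Kafka/ZooKeeper settings assume a local single-node install:]

```properties
# Sketch only; class and topic names are hypothetical.
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=alert-print

task.class=samza.examples.alerts.task.AlertPrintStreamTask
task.inputs=kafka.alerts-raw

systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.consumer.zookeeper.connect=localhost:2181
systems.kafka.producer.metadata.broker.list=localhost:9092
```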
>> >> >> > >
>> >> >> > > Cheers,
>> >> >> > > Chris
>> >> >> > >
>> >> >> > > On 8/24/14 1:42 AM, "Shekar Tippur" <ctippur@gmail.com> wrote:
>> >> >> > >
>> >> >> > > >Chris,
>> >> >> > > >
>> >> >> > > >I have gone through the documentation and decided that the
>> >> >> > > >option that is most suitable for me is stream-table.
>> >> >> > > >
>> >> >> > > >I see the following things:
>> >> >> > > >
>> >> >> > > >1. Point Samza to a table (database)
>> >> >> > > >2. Point Samza to a stream - an alert stream from different sources
>> >> >> > > >3. Join on a key like a hostname
>> >> >> > > >
>> >> >> > > >I have Hello Samza working. To extend it to meet my needs, I am
>> >> >> > > >not sure where to start (more code changes, configuration
>> >> >> > > >changes, or both)?
>> >> >> > > >
>> >> >> > > >I have gone through
>> >> >> > > >http://samza.incubator.apache.org/learn/documentation/latest/api/overview.html
>> >> >> > > >
>> >> >> > > >Is my thought process on the right track? Can you please point
>> >> >> > > >me in the right direction?
>> >> >> > > >
>> >> >> > > >- Shekar
>> >> >> > > >
>> >> >> > > >
>> >> >> > > >
>> >> >> > > >On Thu, Aug 21, 2014 at 1:08 PM, Shekar Tippur <ctippur@gmail.com> wrote:
>> >> >> > > >
>> >> >> > > >> Chris,
>> >> >> > > >>
>> >> >> > > >> This is a perfectly good answer. I will start poking more into
>> >> >> > > >> option #4.
>> >> >> > > >>
>> >> >> > > >> - Shekar
>> >> >> > > >>
>> >> >> > > >>
>> >> >> > > >> On Thu, Aug 21, 2014 at 1:05 PM, Chris Riccomini
<
>> >> >> > > >> criccomini@linkedin.com.invalid> wrote:
>> >> >> > > >>
>> >> >> > > >>> Hey Shekar,
>> >> >> > > >>>
>> >> >> > > >>> Your two options are really (3) or (4), then. You can either
>> >> >> > > >>> run some external DB that holds the data set and query it from
>> >> >> > > >>> a StreamTask, or you can use Samza's state store feature to
>> >> >> > > >>> push data into a stream that you can then store in a
>> >> >> > > >>> partitioned key-value store along with your StreamTasks. There
>> >> >> > > >>> is some documentation here about the state store approach:
>> >> >> > > >>>
>> >> >> > > >>> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html
>> >> >> > > >>>
>> >> >> > > >>>
>> >> >> > > >>> (4) is going to require more up-front effort from you, since
>> >> >> > > >>> you'll have to understand how Kafka's partitioning model works,
>> >> >> > > >>> and set up some pipeline to push the updates for your state. In
>> >> >> > > >>> the long run, I believe it's the better approach, though. Local
>> >> >> > > >>> lookups on a key-value store should be faster than doing remote
>> >> >> > > >>> RPC calls to a DB for every message.
>> >> >> > > >>>
>> >> >> > > >>> I'm sorry I can't give you a more definitive answer. It's
>> >> >> > > >>> really about trade-offs.
>> >> >> > > >>>
>> >> >> > > >>> Cheers,
>> >> >> > > >>> Chris
>> >> >> > > >>>
>> >> >> > > >>> On 8/21/14 12:22 PM, "Shekar Tippur" <ctippur@gmail.com> wrote:
>> >> >> > > >>>
>> >> >> > > >>> >Chris,
>> >> >> > > >>> >
>> >> >> > > >>> >A big thanks for the swift response. The data set is huge and
>> >> >> > > >>> >the frequency comes in bursts.
>> >> >> > > >>> >What do you suggest?
>> >> >> > > >>> >
>> >> >> > > >>> >- Shekar
>> >> >> > > >>> >
>> >> >> > > >>> >
>> >> >> > > >>> >On Thu, Aug 21, 2014 at 11:05 AM, Chris Riccomini <
>> >> >> > > >>> >criccomini@linkedin.com.invalid> wrote:
>> >> >> > > >>> >
>> >> >> > > >>> >> Hey Shekar,
>> >> >> > > >>> >>
>> >> >> > > >>> >> This is feasible, and you are on the right thought process.
>> >> >> > > >>> >>
>> >> >> > > >>> >> For the sake of discussion, I'm going to pretend that you
>> >> >> > > >>> >> have a Kafka topic called "PageViewEvent", which has just
>> >> >> > > >>> >> the IP address that was used to view a page. These messages
>> >> >> > > >>> >> will be logged every time a page view happens. I'm also
>> >> >> > > >>> >> going to pretend that you have some state called "IPGeo"
>> >> >> > > >>> >> (e.g. the MaxMind data set). In this example, we'll want to
>> >> >> > > >>> >> join the long/lat geo information from IPGeo to the
>> >> >> > > >>> >> PageViewEvent, and send it to a new topic:
>> >> >> > > >>> >> PageViewEventsWithGeo.
>> >> >> > > >>> >>
>> >> >> > > >>> >> You have several options on how to implement this example.
>> >> >> > > >>> >>
>> >> >> > > >>> >> 1. If your joining data set (IPGeo) is relatively small and
>> >> >> > > >>> >> changes infrequently, you can just pack it up in your jar or
>> >> >> > > >>> >> .tgz file, and open it up in every StreamTask.
>> >> >> > > >>> >> 2. If your data set is small, but changes somewhat
>> >> >> > > >>> >> frequently, you can throw the data set on some HTTP/HDFS/S3
>> >> >> > > >>> >> server somewhere, and have your StreamTask refresh it
>> >> >> > > >>> >> periodically by re-downloading it.
>> >> >> > > >>> >> 3. You can do remote RPC calls for the IPGeo data on every
>> >> >> > > >>> >> page view event by querying some remote service or DB (e.g.
>> >> >> > > >>> >> Cassandra).
>> >> >> > > >>> >> 4. You can use Samza's state feature to send your IPGeo data
>> >> >> > > >>> >> as a series of messages to a log-compacted Kafka topic
>> >> >> > > >>> >> (https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction),
>> >> >> > > >>> >> and configure your Samza job to read this topic as a
>> >> >> > > >>> >> bootstrap stream
>> >> >> > > >>> >> (http://samza.incubator.apache.org/learn/documentation/0.7.0/container/streams.html).
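
[Editor's note: option 2 above can be sketched as a small in-memory table that a background executor refreshes on a schedule. This is illustrative plain Java, not Samza API: the Supplier stands in for whatever HTTP/HDFS/S3 download you use, and the class name is invented.]

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Holds the latest copy of a small joining data set and atomically swaps in
// a fresh copy each time the loader re-downloads it.
public class RefreshingTable {
    private final AtomicReference<Map<String, String>> current;
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    public RefreshingTable(Supplier<Map<String, String>> loader, long periodSeconds) {
        // Load once up front so lookups work immediately.
        this.current = new AtomicReference<>(loader.get());
        // Periodically replace the whole map; readers never see a partial update.
        scheduler.scheduleAtFixedRate(
            () -> current.set(loader.get()), periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    public String lookup(String key) {
        return current.get().get(key);
    }

    public void shutdown() {
        scheduler.shutdownNow();
    }
}
```

A StreamTask would call lookup() from process(); the swap is a single reference assignment, so no locking is needed on the read path.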
>> >> >> > > >>> >>
>> >> >> > > >>> >> For (4), you'd have to partition the IPGeo state topic
>> >> >> > > >>> >> according to the same key as PageViewEvent. If PageViewEvent
>> >> >> > > >>> >> were partitioned by, say, member ID, but you want your IPGeo
>> >> >> > > >>> >> state topic to be partitioned by IP address, then you'd have
>> >> >> > > >>> >> to have an upstream job that re-partitioned PageViewEvent
>> >> >> > > >>> >> into some new topic by IP address. This new topic will have
>> >> >> > > >>> >> to have the same number of partitions as the IPGeo state
>> >> >> > > >>> >> topic (if IPGeo has 8 partitions, then the new
>> >> >> > > >>> >> PageViewEventRepartitioned topic needs 8 as well). This will
>> >> >> > > >>> >> cause your PageViewEventRepartitioned topic and your IPGeo
>> >> >> > > >>> >> state topic to be aligned, such that the StreamTask that
>> >> >> > > >>> >> gets page views for IP address X will also have the IPGeo
>> >> >> > > >>> >> information for IP address X.
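
[Editor's note: the local-lookup join described above can be sketched in plain Java. The HashMap below is only a stand-in for Samza's partitioned key-value store, which a real job would obtain from its task context after bootstrapping the IPGeo topic; the class, method, and field names are invented for illustration.]

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the option-(4) join: enrich each PageViewEvent from local state
// instead of making a remote RPC per message.
public class IpGeoJoinSketch {

    // Local state: IP address -> "lat,long", fed from the IPGeo state topic.
    // In Samza this would be a partitioned KeyValueStore, not a HashMap.
    private final Map<String, String> ipGeoStore = new HashMap<>();

    // Stand-in for consuming an IPGeo state-topic message.
    public void putGeo(String ip, String latLong) {
        ipGeoStore.put(ip, latLong);
    }

    // What process() would do per PageViewEvent: copy the event, look up the
    // IP locally, and emit the enriched map for PageViewEventsWithGeo.
    public Map<String, Object> enrich(Map<String, Object> pageViewEvent) {
        Map<String, Object> out = new HashMap<>(pageViewEvent);
        String ip = (String) pageViewEvent.get("ip");
        out.put("geo", ipGeoStore.getOrDefault(ip, "unknown"));
        return out;
    }
}
```

The same shape applies to the alert/metric use case earlier in the thread: the store would map hostname or IP to contextual information instead of geo data.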
>> >> >> > > >>> >>
>> >> >> > > >>> >> Which strategy you pick is really up to you. :) (4) is the
>> >> >> > > >>> >> most complicated, but also the most flexible, and most
>> >> >> > > >>> >> operationally sound. (1) is the easiest if it fits your
>> >> >> > > >>> >> needs.
>> >> >> > > >>> >>
>> >> >> > > >>> >> Cheers,
>> >> >> > > >>> >> Chris
>> >> >> > > >>> >>
>> >> >> > > >>> >> On 8/21/14 10:15 AM, "Shekar Tippur" <ctippur@gmail.com> wrote:
>> >> >> > > >>> >>
>> >> >> > > >>> >> >Hello,
>> >> >> > > >>> >> >
>> >> >> > > >>> >> >I am new to Samza. I have just installed Hello Samza and
>> >> >> > > >>> >> >got it working.
>> >> >> > > >>> >> >
>> >> >> > > >>> >> >Here is the use case for which I am trying to use Samza:
>> >> >> > > >>> >> >
>> >> >> > > >>> >> >
>> >> >> > > >>> >> >1. Cache the contextual information, which contains more
>> >> >> > > >>> >> >information about the hostname or IP address, using
>> >> >> > > >>> >> >Samza/YARN/Kafka
>> >> >> > > >>> >> >2. Collect alert and metric events, which contain either a
>> >> >> > > >>> >> >hostname or an IP address
>> >> >> > > >>> >> >3. Append the contextual information to the alert or metric
>> >> >> > > >>> >> >and insert it into a Kafka queue from which other
>> >> >> > > >>> >> >subscribers read
>> >> >> > > >>> >> >
>> >> >> > > >>> >> >Can you please shed some light on:
>> >> >> > > >>> >> >
>> >> >> > > >>> >> >1. Is this feasible?
>> >> >> > > >>> >> >2. Am I on the right thought process?
>> >> >> > > >>> >> >3. How do I start?
>> >> >> > > >>> >> >
>> >> >> > > >>> >> >I now have 1 & 2 working disparately. I need to integrate
>> >> >> > > >>> >> >them.
>> >> >> > > >>> >> >
>> >> >> > > >>> >> >Appreciate any input.
>> >> >> > > >>> >> >
>> >> >> > > >>> >> >- Shekar
>> >> >> > > >>> >>
>> >> >> > > >>> >>
>> >> >> > > >>>
>> >> >> > > >>>
>> >> >> > > >>
>> >> >> > >
>> >> >> > >
>> >> >> >
>> >> >>
>> >>
>> >>
>>
>>
>
