chukwa-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ying Tang <ivytang0...@gmail.com>
Subject Re: chukwa agent doesn't collect the log suddenly , and after several days ,the agent crashes.
Date Fri, 29 Jul 2011 06:59:24 GMT
Hi Eric,
    I meet with an exception.

  Exception in thread "HTTP post thread" java.lang.IllegalArgumentException:
timeout value is negative
        at java.lang.Thread.sleep(Native Method)
        at
org.apache.hadoop.chukwa.datacollection.connector.http.HttpConnector.run(HttpConnector.java:187)
        at java.lang.Thread.run(Thread.java:636)

And the log " INFO Timer-1 HttpConnector - # http chunks ACK'ed since last
report: 0 " appears again.
On Thu, Jul 28, 2011 at 3:23 AM, Eric Yang <eric818@gmail.com> wrote:

> We need two numbers to track rotated logs properly without sending
> duplicates.  A number is the current file offset, and another number is the
> incremental sequence number.  The checkpoint file only preserves the
> incremental sequence number.  The current file offset is not tracked.  If
> agent crashes, agent will stream from offset 0 of the current file.  Hence,
> there maybe duplicate entries, when system recovers from agent crash.  I
> tested this, after file rotated, it is sending the new data from the current
> log file, but the file current offset is tracked in memory only.  It is not
> visible from port 9093.
>
> This issue is not directly related to the issue that you experienced which
> the file tailing adaptor.  From the code, it is still not clear why agent
> crashed.  However, I would recommend to use Chukwa trunk with SocketAdaptor
> with Log4JSocketAppender.  It is more reliable than UTF8 adaptor.
>
> See: http://wiki.apache.org/hadoop/Chukwa_Quick_Start
>
> for installation instructions.
>
> regards,
> Eric
>
> On Jul 26, 2011, at 10:54 PM, Ying Tang wrote:
>
> > And after i restart the chukwa agent , the telnet adaptor last number is
> still 10487067.
> >
> > On Wed, Jul 27, 2011 at 1:52 PM, Ying Tang <ivytang0812@gmail.com>
> wrote:
> > Does this mean the adaptor still transfer the previous log ? The current
> log is missing?
> >
> >
> > On Wed, Jul 27, 2011 at 1:08 PM, Eric Yang <eric818@gmail.com> wrote:
> > This looks like a bug, the last number should be in sync with the
> > current file's size, but the UTF adaptor is still tailing the previous
> > file (which rotated at 10487067)
> > It means there is a bug in handling the file rotation, but the adaptor
> > did not pick up the change.
> >
> > Please open a jira.  Thanks
> >
> > regards,
> > Eric
> >
> > On Tue, Jul 26, 2011 at 8:05 PM, Ying Tang <ivytang0812@gmail.com>
> wrote:
> > > The log didn't rotate very  rapidly.
> > >
> > > Now i can't rebuild the scenario . But when the chukwa agent log looks
> ok,
> > >
> > >  2011-07-27 10:57:38,967 INFO Timer-0 ChukwaAgent - writing checkpoint
> > > 1307083
> > > 2011-07-27 10:57:42,571 INFO HTTP post thread ChukwaHttpSender -
> collected 1
> > > chunks for post_745
> > > 2011-07-27 10:57:42,571 INFO HTTP post thread ChukwaHttpSender - >>>>>>
> HTTP
> > > post_745 to http://chukwacollector1.xingcloud.com:9095/ length = 1837
> > > 2011-07-27 10:57:42,574 INFO HTTP post thread ChukwaHttpSender - >>>>>>
> HTTP
> > > Got success back from
> http://chukwacollector1.xingcloud.com:9095/chukwa;
> > > response length 43
> > > 2011-07-27 10:57:42,574 INFO HTTP post thread ChukwaHttpSender -
> post_745
> > > sent 0 chunks, got back 1 acks
> > >
> > > The list in telnet agent 9093 is:
> > > adaptor_2963225a90653a309cf779d4a1d815a3)
> > >
> org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.CharFileTailingAdaptorUTF8
> > > Gamelog 0 /var/log/gamelog 10487067
> > > After several minites ,  the list is still
> > > adaptor_2963225a90653a309cf779d4a1d815a3)
> > >
> org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.CharFileTailingAdaptorUTF8
> > > Gamelog 0 /var/log/gamelog 10487067
> > >
> > > Is the 10487067 the offset number ?The number didn't changed , and the
> log
> > > file's size is from 0 to 10M .And now the log file's size is 1150872.
> > >
> > > On Wed, Jul 27, 2011 at 12:26 AM, Eric Yang <eric818@gmail.com> wrote:
> > >>
> > >> CharFileTailingAdaptorUTF should handle log rotation gracefully.  Is
> the
> > >> log rotating rapidly?
> > >> Run those command on chukwa agent:
> > >> telnet localhost 9093
> > >> list
> > >> This should show a list of tailing files, and check the offset number
> of
> > >> the tailing log file.  The most right number should be smaller than
> the size
> > >> of your log file.  If it is bigger and not changing, it is most likely
> there
> > >> is a bug that we haven't seen before.  It might be useful to turn on
> debug
> > >> on chukwa agent and see if this can be reproduced to nail down the
> root
> > >> cause.  Thanks
> > >> regards,
> > >> Eric
> > >> On Jul 26, 2011, at 6:13 AM, Ying Tang wrote:
> > >>
> > >> Is there the possibility that
> > >> when the log file reaches the log4g config file size ,the log4j will
> > >> rename this log file and create a new file with this name as the log
> file .
> > >> At the time ,the chukwa adaptor doesn't tail the log properly , and
> this
> > >> cause the chuwa agent can't collector the log any more.
> > >>
> > >> On Tue, Jul 26, 2011 at 2:07 PM, Ying Tang <ivytang0812@gmail.com>
> wrote:
> > >>>
> > >>> The log file is log4j log file ,and the size is 10M ,the
> maxbackupindex
> > >>> is 1.
> > >>>
> > >>>
> > >>> On Tue, Jul 26, 2011 at 1:42 PM, Eric Yang <eric818@gmail.com>
> wrote:
> > >>>>
> > >>>> Can you run "ls -l" to show the size and dateof the log files that
> you
> > >>>> are streaming?
> > >>>>
> > >>>> regards,
> > >>>> Eric
> > >>>>
> > >>>> On Mon, Jul 25, 2011 at 7:36 PM, Ying Tang <ivytang0812@gmail.com>
> > >>>> wrote:
> > >>>> > The chukwa version is 0.4.0 and the adaptor is
> > >>>> >
> > >>>> >
> org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.CharFileTailingAdaptorUTF8
> > >>>> >
> > >>>> > On Mon, Jul 25, 2011 at 11:50 PM, Eric Yang <eric818@gmail.com>
> wrote:
> > >>>> >>
> > >>>> >> Hi Ivy,
> > >>>> >>
> > >>>> >> When data is send from agent to collector, collector send
> > >>>> >> acknowledgment
> > >>>> >> of receiving of the chunks.  At 00:03:28, there are 5
chunks
> > >>>> >> acknowledged.
> > >>>> >>  This means communication between collector and agent
are working
> at
> > >>>> >> that
> > >>>> >> point in time.  However, there is no activity after 00:04:28.
>  This
> > >>>> >> looks
> > >>>> >> like adaptor did not handle the log rotation properly
at close to
> > >>>> >> midnight.
> > >>>> >>  Which version of Chukwa are you using and which adaptor
are you
> > >>>> >> using?
> > >>>> >>
> > >>>> >> regards,
> > >>>> >> Eric
> > >>>> >>
> > >>>> >> On Jul 25, 2011, at 12:40 AM, Ying Tang wrote:
> > >>>> >>
> > >>>> >> > Hi all,
> > >>>> >> >
> > >>>> >> > In my cluster , i have two chukwa agent and one collector
.
> > >>>> >> > At a time ,  both chukwa agents's log :
> > >>>> >> > 2011-07-18 00:03:28,688 INFO Timer-1 HttpConnector
- # http
> chunks
> > >>>> >> > ACK'ed since last report: 5
> > >>>> >> > 2011-07-18 00:04:28,697 INFO Timer-1 HttpConnector
- # http
> chunks
> > >>>> >> > ACK'ed since last report: 0
> > >>>> >> > 2011-07-18 00:05:28,706 INFO Timer-1 HttpConnector
- # http
> chunks
> > >>>> >> > ACK'ed since last report: 0
> > >>>> >> > 2011-07-18 00:06:28,714 INFO Timer-1 HttpConnector
- # http
> chunks
> > >>>> >> > ACK'ed since last report: 0
> > >>>> >> > 2011-07-18 00:07:29,340 INFO Timer-1 HttpConnector
- # http
> chunks
> > >>>> >> > ACK'ed since last report: 0
> > >>>> >> >
> > >>>> >> > And the collector
> > >>>> >> > 2011-07-17 11:02:32,155 INFO Timer-3 SeqFileWriter
-
> > >>>> >> > stat:datacollection.writer.hdfs dataSize=0 dataRate=0
> > >>>> >> > 2011-07-17 11:02:43,074 INFO Timer-1 root -
> > >>>> >> > stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> > >>>> >> > 2011-07-17 11:03:02,162 INFO Timer-3 SeqFileWriter
-
> > >>>> >> > stat:datacollection.writer.hdfs dataSize=0 dataRate=0
> > >>>> >> > 2011-07-17 11:03:32,168 INFO Timer-3 SeqFileWriter
-
> > >>>> >> > stat:datacollection.writer.hdfs dataSize=0 dataRate=0
> > >>>> >> > 2011-07-17 11:03:43,085 INFO Timer-1 root -
> > >>>> >> > stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> > >>>> >> > 2011-07-17 11:04:02,174 INFO Timer-3 SeqFileWriter
-
> > >>>> >> > stat:datacollection.writer.hdfs dataSize=0 dataRate=0
> > >>>> >> > 2011-07-17 11:04:32,180 INFO Timer-3 SeqFileWriter
-
> > >>>> >> > stat:datacollection.writer.hdfs dataSize=0 dataRate=0
> > >>>> >> > 2011-07-17 11:04:43,096 INFO Timer-1 root -
> > >>>> >> > stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> > >>>> >> > 2011-07-17 11:05:02,185 INFO Timer-3 SeqFileWriter
-
> > >>>> >> > stat:datacollection.writer.hdfs dataSize=0 dataRate=0
> > >>>> >> >
> > >>>> >> > (the collector and agent has  different  timezone)
> > >>>> >> > And the collector didn't collect any log.
> > >>>> >> >
> > >>>> >> >
> > >>>> >> > What dons the "http chunks ACK'ed since last report:
0" means?
> > >>>> >> > And from this log "http chunks ACK'ed since last
report: 0"
> appears
> > >>>> >> > to
> > >>>> >> >  agent crash, the chukwa port still on , but after
several
> days,
> > >>>> >> > both agents
> > >>>> >> > crashed without exceptions.
> > >>>> >> >
> > >>>> >> >
> > >>>> >> > --
> > >>>> >> > Best regards,
> > >>>> >> >
> > >>>> >> > Ivy Tang
> > >>>> >> >
> > >>>> >> >
> > >>>> >> >
> > >>>> >>
> > >>>> >
> > >>>> >
> > >>>> >
> > >>>> > --
> > >>>> > Best regards,
> > >>>> > Ivy Tang
> > >>>> >
> > >>>> >
> > >>>> >
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Best regards,
> > >>> Ivy Tang
> > >>>
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> Best regards,
> > >> Ivy Tang
> > >>
> > >>
> > >>
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Ivy Tang
> > >
> > >
> > >
> >
> >
> >
> > --
> > Best regards,
> >
> > Ivy Tang
> >
> >
> >
> >
> >
> >
> > --
> > Best regards,
> >
> > Ivy Tang
> >
> >
> >
>
>


-- 
Best regards,

Ivy Tang

Mime
View raw message