chukwa-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Eric Yang <eric...@gmail.com>
Subject Re: chukwa agent doesn't collect the log suddenly , and after several days ,the agent crashes.
Date Tue, 09 Aug 2011 05:19:37 GMT
Thanks Ivy.  Could you put a +1 on:
https://issues.apache.org/jira/browse/CHUKWA-595
Thanks

regards,
Eric

On Mon, Aug 8, 2011 at 8:24 PM, Ying Tang <ivytang0812@gmail.com> wrote:
> Hi Eric,
>    After using this patch , the chuwa in my environment  has run normally
> for almost 10 days . Maybe this exception is the cause of chukwa agent
> crashing. And the " INFO Timer-1 HttpConnector - # http chunks ACK'ed since
> last report: 0 " didn't appear.
>   And in my enviroment  ,we do need sync the machine's time , this should be
> this cause of the exception.
>
> On Sat, Jul 30, 2011 at 12:00 AM, Eric Yang <eric818@gmail.com> wrote:
>>
>> Hi Ivy,
>>
>> In Http Sender, there is a throttle to sleep 5 seconds, if the data
>> sending frequency is high.  If the lastPosted time is more recent than the
>> current time, then the value becomes negative.  This could happen when ntp
>> makes time correction.  When sleeping on negative millisecond, then it
>> throws exception, and prevents agent to send data to collector.  I filed a
>> JIRA (CHUKWA-595) for this, thanks for catching this.  :)
>>
>> regards,
>> Eric
>>
>> On Jul 28, 2011, at 11:59 PM, Ying Tang wrote:
>>
>> > Hi Eric,
>> >     I meet with an exception.
>> >
>> >   Exception in thread "HTTP post thread"
>> > java.lang.IllegalArgumentException: timeout value is negative
>> >         at java.lang.Thread.sleep(Native Method)
>> >         at
>> > org.apache.hadoop.chukwa.datacollection.connector.http.HttpConnector.run(HttpConnector.java:187)
>> >         at java.lang.Thread.run(Thread.java:636)
>> >
>> > And the log " INFO Timer-1 HttpConnector - # http chunks ACK'ed since
>> > last report: 0 " appears again.
>> > On Thu, Jul 28, 2011 at 3:23 AM, Eric Yang <eric818@gmail.com> wrote:
>> > We need two numbers to track rotated logs properly without sending
>> > duplicates.  A number is the current file offset, and another number is the
>> > incremental sequence number.  The checkpoint file only preserves the
>> > incremental sequence number.  The current file offset is not tracked.  If
>> > agent crashes, agent will stream from offset 0 of the current file.  Hence,
>> > there maybe duplicate entries, when system recovers from agent crash.  I
>> > tested this, after file rotated, it is sending the new data from the current
>> > log file, but the file current offset is tracked in memory only.  It is not
>> > visible from port 9093.
>> >
>> > This issue is not directly related to the issue that you experienced
>> > which the file tailing adaptor.  From the code, it is still not clear why
>> > agent crashed.  However, I would recommend to use Chukwa trunk with
>> > SocketAdaptor with Log4JSocketAppender.  It is more reliable than UTF8
>> > adaptor.
>> >
>> > See: http://wiki.apache.org/hadoop/Chukwa_Quick_Start
>> >
>> > for installation instructions.
>> >
>> > regards,
>> > Eric
>> >
>> > On Jul 26, 2011, at 10:54 PM, Ying Tang wrote:
>> >
>> > > And after i restart the chukwa agent , the telnet adaptor last number
>> > > is still 10487067.
>> > >
>> > > On Wed, Jul 27, 2011 at 1:52 PM, Ying Tang <ivytang0812@gmail.com>
>> > > wrote:
>> > > Does this mean the adaptor still transfer the previous log ? The
>> > > current log is missing?
>> > >
>> > >
>> > > On Wed, Jul 27, 2011 at 1:08 PM, Eric Yang <eric818@gmail.com> wrote:
>> > > This looks like a bug, the last number should be in sync with the
>> > > current file's size, but the UTF adaptor is still tailing the previous
>> > > file (which rotated at 10487067)
>> > > It means there is a bug in handling the file rotation, but the adaptor
>> > > did not pick up the change.
>> > >
>> > > Please open a jira.  Thanks
>> > >
>> > > regards,
>> > > Eric
>> > >
>> > > On Tue, Jul 26, 2011 at 8:05 PM, Ying Tang <ivytang0812@gmail.com>
>> > > wrote:
>> > > > The log didn't rotate very  rapidly.
>> > > >
>> > > > Now i can't rebuild the scenario . But when the chukwa agent log
>> > > > looks ok,
>> > > >
>> > > >  2011-07-27 10:57:38,967 INFO Timer-0 ChukwaAgent - writing
>> > > > checkpoint
>> > > > 1307083
>> > > > 2011-07-27 10:57:42,571 INFO HTTP post thread ChukwaHttpSender -
>> > > > collected 1
>> > > > chunks for post_745
>> > > > 2011-07-27 10:57:42,571 INFO HTTP post thread ChukwaHttpSender -
>> > > > >>>>>> HTTP
>> > > > post_745 to http://chukwacollector1.xingcloud.com:9095/ length =
>> > > > 1837
>> > > > 2011-07-27 10:57:42,574 INFO HTTP post thread ChukwaHttpSender -
>> > > > >>>>>> HTTP
>> > > > Got success back from
>> > > > http://chukwacollector1.xingcloud.com:9095/chukwa;
>> > > > response length 43
>> > > > 2011-07-27 10:57:42,574 INFO HTTP post thread ChukwaHttpSender -
>> > > > post_745
>> > > > sent 0 chunks, got back 1 acks
>> > > >
>> > > > The list in telnet agent 9093 is:
>> > > > adaptor_2963225a90653a309cf779d4a1d815a3)
>> > > >
>> > > > org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.CharFileTailingAdaptorUTF8
>> > > > Gamelog 0 /var/log/gamelog 10487067
>> > > > After several minites ,  the list is still
>> > > > adaptor_2963225a90653a309cf779d4a1d815a3)
>> > > >
>> > > > org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.CharFileTailingAdaptorUTF8
>> > > > Gamelog 0 /var/log/gamelog 10487067
>> > > >
>> > > > Is the 10487067 the offset number ?The number didn't changed , and
>> > > > the log
>> > > > file's size is from 0 to 10M .And now the log file's size is
>> > > > 1150872.
>> > > >
>> > > > On Wed, Jul 27, 2011 at 12:26 AM, Eric Yang <eric818@gmail.com>
>> > > > wrote:
>> > > >>
>> > > >> CharFileTailingAdaptorUTF should handle log rotation gracefully.
>> > > >>  Is the
>> > > >> log rotating rapidly?
>> > > >> Run those command on chukwa agent:
>> > > >> telnet localhost 9093
>> > > >> list
>> > > >> This should show a list of tailing files, and check the offset
>> > > >> number of
>> > > >> the tailing log file.  The most right number should be smaller
than
>> > > >> the size
>> > > >> of your log file.  If it is bigger and not changing, it is most
>> > > >> likely there
>> > > >> is a bug that we haven't seen before.  It might be useful to
turn
>> > > >> on debug
>> > > >> on chukwa agent and see if this can be reproduced to nail down
the
>> > > >> root
>> > > >> cause.  Thanks
>> > > >> regards,
>> > > >> Eric
>> > > >> On Jul 26, 2011, at 6:13 AM, Ying Tang wrote:
>> > > >>
>> > > >> Is there the possibility that
>> > > >> when the log file reaches the log4g config file size ,the log4j
>> > > >> will
>> > > >> rename this log file and create a new file with this name as the
>> > > >> log file .
>> > > >> At the time ,the chukwa adaptor doesn't tail the log properly
, and
>> > > >> this
>> > > >> cause the chuwa agent can't collector the log any more.
>> > > >>
>> > > >> On Tue, Jul 26, 2011 at 2:07 PM, Ying Tang <ivytang0812@gmail.com>
>> > > >> wrote:
>> > > >>>
>> > > >>> The log file is log4j log file ,and the size is 10M ,the
>> > > >>> maxbackupindex
>> > > >>> is 1.
>> > > >>>
>> > > >>>
>> > > >>> On Tue, Jul 26, 2011 at 1:42 PM, Eric Yang <eric818@gmail.com>
>> > > >>> wrote:
>> > > >>>>
>> > > >>>> Can you run "ls -l" to show the size and dateof the log
files
>> > > >>>> that you
>> > > >>>> are streaming?
>> > > >>>>
>> > > >>>> regards,
>> > > >>>> Eric
>> > > >>>>
>> > > >>>> On Mon, Jul 25, 2011 at 7:36 PM, Ying Tang
>> > > >>>> <ivytang0812@gmail.com>
>> > > >>>> wrote:
>> > > >>>> > The chukwa version is 0.4.0 and the adaptor is
>> > > >>>> >
>> > > >>>> >
>> > > >>>> > org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.CharFileTailingAdaptorUTF8
>> > > >>>> >
>> > > >>>> > On Mon, Jul 25, 2011 at 11:50 PM, Eric Yang <eric818@gmail.com>
>> > > >>>> > wrote:
>> > > >>>> >>
>> > > >>>> >> Hi Ivy,
>> > > >>>> >>
>> > > >>>> >> When data is send from agent to collector, collector
send
>> > > >>>> >> acknowledgment
>> > > >>>> >> of receiving of the chunks.  At 00:03:28, there
are 5 chunks
>> > > >>>> >> acknowledged.
>> > > >>>> >>  This means communication between collector
and agent are
>> > > >>>> >> working at
>> > > >>>> >> that
>> > > >>>> >> point in time.  However, there is no activity
after 00:04:28.
>> > > >>>> >>  This
>> > > >>>> >> looks
>> > > >>>> >> like adaptor did not handle the log rotation
properly at close
>> > > >>>> >> to
>> > > >>>> >> midnight.
>> > > >>>> >>  Which version of Chukwa are you using and which
adaptor are
>> > > >>>> >> you
>> > > >>>> >> using?
>> > > >>>> >>
>> > > >>>> >> regards,
>> > > >>>> >> Eric
>> > > >>>> >>
>> > > >>>> >> On Jul 25, 2011, at 12:40 AM, Ying Tang wrote:
>> > > >>>> >>
>> > > >>>> >> > Hi all,
>> > > >>>> >> >
>> > > >>>> >> > In my cluster , i have two chukwa agent
and one collector .
>> > > >>>> >> > At a time ,  both chukwa agents's log :
>> > > >>>> >> > 2011-07-18 00:03:28,688 INFO Timer-1 HttpConnector
- # http
>> > > >>>> >> > chunks
>> > > >>>> >> > ACK'ed since last report: 5
>> > > >>>> >> > 2011-07-18 00:04:28,697 INFO Timer-1 HttpConnector
- # http
>> > > >>>> >> > chunks
>> > > >>>> >> > ACK'ed since last report: 0
>> > > >>>> >> > 2011-07-18 00:05:28,706 INFO Timer-1 HttpConnector
- # http
>> > > >>>> >> > chunks
>> > > >>>> >> > ACK'ed since last report: 0
>> > > >>>> >> > 2011-07-18 00:06:28,714 INFO Timer-1 HttpConnector
- # http
>> > > >>>> >> > chunks
>> > > >>>> >> > ACK'ed since last report: 0
>> > > >>>> >> > 2011-07-18 00:07:29,340 INFO Timer-1 HttpConnector
- # http
>> > > >>>> >> > chunks
>> > > >>>> >> > ACK'ed since last report: 0
>> > > >>>> >> >
>> > > >>>> >> > And the collector
>> > > >>>> >> > 2011-07-17 11:02:32,155 INFO Timer-3 SeqFileWriter
-
>> > > >>>> >> > stat:datacollection.writer.hdfs dataSize=0
dataRate=0
>> > > >>>> >> > 2011-07-17 11:02:43,074 INFO Timer-1 root
-
>> > > >>>> >> > stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
>> > > >>>> >> > 2011-07-17 11:03:02,162 INFO Timer-3 SeqFileWriter
-
>> > > >>>> >> > stat:datacollection.writer.hdfs dataSize=0
dataRate=0
>> > > >>>> >> > 2011-07-17 11:03:32,168 INFO Timer-3 SeqFileWriter
-
>> > > >>>> >> > stat:datacollection.writer.hdfs dataSize=0
dataRate=0
>> > > >>>> >> > 2011-07-17 11:03:43,085 INFO Timer-1 root
-
>> > > >>>> >> > stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
>> > > >>>> >> > 2011-07-17 11:04:02,174 INFO Timer-3 SeqFileWriter
-
>> > > >>>> >> > stat:datacollection.writer.hdfs dataSize=0
dataRate=0
>> > > >>>> >> > 2011-07-17 11:04:32,180 INFO Timer-3 SeqFileWriter
-
>> > > >>>> >> > stat:datacollection.writer.hdfs dataSize=0
dataRate=0
>> > > >>>> >> > 2011-07-17 11:04:43,096 INFO Timer-1 root
-
>> > > >>>> >> > stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
>> > > >>>> >> > 2011-07-17 11:05:02,185 INFO Timer-3 SeqFileWriter
-
>> > > >>>> >> > stat:datacollection.writer.hdfs dataSize=0
dataRate=0
>> > > >>>> >> >
>> > > >>>> >> > (the collector and agent has  different
 timezone)
>> > > >>>> >> > And the collector didn't collect any log.
>> > > >>>> >> >
>> > > >>>> >> >
>> > > >>>> >> > What dons the "http chunks ACK'ed since
last report: 0"
>> > > >>>> >> > means?
>> > > >>>> >> > And from this log "http chunks ACK'ed since
last report: 0"
>> > > >>>> >> > appears
>> > > >>>> >> > to
>> > > >>>> >> >  agent crash, the chukwa port still on
, but after several
>> > > >>>> >> > days,
>> > > >>>> >> > both agents
>> > > >>>> >> > crashed without exceptions.
>> > > >>>> >> >
>> > > >>>> >> >
>> > > >>>> >> > --
>> > > >>>> >> > Best regards,
>> > > >>>> >> >
>> > > >>>> >> > Ivy Tang
>> > > >>>> >> >
>> > > >>>> >> >
>> > > >>>> >> >
>> > > >>>> >>
>> > > >>>> >
>> > > >>>> >
>> > > >>>> >
>> > > >>>> > --
>> > > >>>> > Best regards,
>> > > >>>> > Ivy Tang
>> > > >>>> >
>> > > >>>> >
>> > > >>>> >
>> > > >>>
>> > > >>>
>> > > >>>
>> > > >>> --
>> > > >>> Best regards,
>> > > >>> Ivy Tang
>> > > >>>
>> > > >>>
>> > > >>
>> > > >>
>> > > >>
>> > > >> --
>> > > >> Best regards,
>> > > >> Ivy Tang
>> > > >>
>> > > >>
>> > > >>
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Best regards,
>> > > > Ivy Tang
>> > > >
>> > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Best regards,
>> > >
>> > > Ivy Tang
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > > Best regards,
>> > >
>> > > Ivy Tang
>> > >
>> > >
>> > >
>> >
>> >
>> >
>> >
>> > --
>> > Best regards,
>> >
>> > Ivy Tang
>> >
>> >
>> >
>>
>
>
>
> --
> Best regards,
> Ivy Tang
>
>
>

Mime
View raw message