spark-dev mailing list archives

From Imran Rashid <iras...@apache.org>
Subject Re: [build system] jenkins wedged again, rebooting master node
Date Tue, 19 Mar 2019 14:18:05 GMT
seems wedged again?

sorry for the bad news Shane, thanks for all the work on fixing it

On Mon, Mar 18, 2019 at 4:02 PM shane knapp <sknapp@berkeley.edu> wrote:

> ok, i dug through the logs and noticed that rsyslogd was dropping messages
> due to imuxsock being spammed by postfix...  which i then tracked down to
> our installation of fail2ban being incorrectly configured and attempting to
> send IP ban/unban status emails to 'email@example.com'.
>
> since we're a university, and especially one w/a reputation like ours, we
> are constantly under attack.  the logs of the attempted dictionary attacks
> would astound you in their size and scope.  since we have so many ban/unban
> actions happening for all of these unique IP addresses, each of which
> generates an email that was directed to an invalid address, we ended up
> w/well over 100M of plain-text messages waiting in the mail queue.  postfix
> was continually trying to send these messages, which was causing the system
> to behave strangely, including breaking rsyslogd.
>
> so, i disabled email reports in fail2ban, restarted the impacted services,
> picked my sysadmin's brain and then purged the mail queue (when was the
> last time anyone actually used postfix?).  jenkins now seems to be behaving
> (maybe?).
>
> i'm not entirely sure that this will fix the strange GUI hangs, but all
> reports i found on stackoverflow and other sites detail strange system
> behavior across the board when rsyslogd starts dropping messages.  at the
> very least we won't be (potentially) losing system-level log messages
> anymore, which might actually help me track down what's happening if
> jenkins gets wedged again.
>
> and finally, the obligatory IT Crowd clip:
> https://www.youtube.com/watch?v=5UT8RkSmN4k
>
> shane (who expects jenkins to crash within 5 minutes of this email going
> out)
>
> On Fri, Mar 15, 2019 at 8:22 PM Sean Owen <srowen@gmail.com> wrote:
>
>> It's not responding again. Is there any way to kick it harder? I know
>> it's well understood but this means not much can be merged in Spark
>>
>> On Fri, Mar 15, 2019 at 12:08 PM shane knapp <sknapp@berkeley.edu> wrote:
>> >
>> > well, that box rebooted in record time!  we're back up and building.
>> >
>> > and as always, i'll keep a close eye on things today...  jenkins
>> > usually works great, until it doesn't.  :\
>> >
>> > On Fri, Mar 15, 2019 at 9:52 AM shane knapp <sknapp@berkeley.edu> wrote:
>> >>
>> >> as some of you may have noticed, jenkins got itself in a bad state
>> >> multiple times over the past couple of weeks.  usually restarting the
>> >> service is sufficient, but it appears that i need to hit it w/the reboot
>> >> hammer.
>> >>
>> >> jenkins will be down for the next 20-30 minutes as the node reboots
>> >> and jenkins spins back up.  i'll reply here w/any updates.
>> >>
>> >> shane
>> >> --
>> >> Shane Knapp
>> >> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> >> https://rise.cs.berkeley.edu
>> >
>>
>
>
