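Both monit and supervisor are process management tools, but process management is only one of monit's features. Monit is a lightweight open-source monitoring tool that can monitor a host at several levels, keep processes running automatically, and send email alerts.

System monitoring: process status, system load, CPU load, memory usage, and so on.
Process monitoring: Monit can monitor daemon processes and automatically restart a monitored process when it exits abnormally.
Filesystem: Monit can monitor changes to local files, directories, and filesystems, including changes in timestamp, checksum, and size. For example, it can watch a file's SHA1 or MD5 checksum to detect whether the file has changed.
Network monitoring: Monit can monitor network connections and supports TCP, UDP, Unix domain sockets, and protocols such as HTTP and SMTP.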

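Installation:
yum install monit -y
Configuration file: /etc/monit.conf (the path depends on the package version; some newer packages use /etc/monitrc)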

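Common commands:

monit -t # check the configuration file syntax
monit # start the monit daemon
monit -c /var/monit/monitrc # start the monit daemon with a specific configuration file
monit reload # reload after the configuration file has been changed
monit status # show the status of all services
monit status nginx # show the status of the nginx service
monit stop all # stop all services
monit stop nginx # stop the nginx service
monit start all # start all services
monit start nginx # start the nginx service
monit -V # show the version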
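
Since a reload with a broken configuration file is easy to avoid, the first and fourth commands above are often combined. The following is a minimal sketch, assuming that monit -t exits with a non-zero status when the syntax check fails; safe_reload is a hypothetical helper name, not part of Monit.

```shell
# Validate the Monit configuration file and reload only if the check passes.
# safe_reload is a hypothetical helper name, not a Monit command.
safe_reload() {
    if monit -t; then
        monit reload
    else
        echo "monit -t reported errors; not reloading" >&2
        return 1
    fi
}

# Usage (as root, on a host where monit is installed):
#   safe_reload
```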

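Configuring the alert recipient:
set alert 776711462@qq.com

Some commonly used features follow.
1) Monitoring files
Reload Nginx as soon as the checksum of its configuration file changes: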

check file nginx.conf path /usr/local/nginx/conf/nginx.conf
    if changed sha1 checksum
    then exec "/usr/local/nginx/sbin/nginx -s reload"

An expected checksum can also be specified, so that an alert is raised when the file no longer matches it:

check file nginx.conf path /usr/local/nginx/conf/nginx.conf
    if failed checksum and expect the sum 144f738eee9c0c0bb0b1e62c785e4a76 then alert
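
The expected value has to be computed beforehand. The 32 hex digits in the example above are the length of an MD5 digest, so a sketch like the following could produce it, assuming md5sum from GNU coreutils is installed (sha1sum would be the SHA1 counterpart). file_md5 is a hypothetical helper name.

```shell
# Print the MD5 checksum of a file as 32 hex digits.
# Assumes md5sum (GNU coreutils) is available; file_md5 is a hypothetical name.
file_md5() {
    md5sum "$1" | awk '{print $1}'
}

# Usage:
#   file_md5 /usr/local/nginx/conf/nginx.conf
```

The printed value can then be pasted after "expect the sum" in the check above.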

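Monit can also watch a file's modification time (for example, a database file that has not been modified for 15 minutes may indicate a problem with the service), as well as its permissions, owner, group, and size: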

check file database with path /data/mydatabase.db
    if failed permission 700 then alert
    if failed uid data then alert
    if failed gid data then alert
    if timestamp > 15 minutes then alert
    if size > 100 MB then exec "/my/cleanup/script" as uid dba and gid dba
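
To see what the timestamp test is comparing, the same condition can be checked by hand. This is only an illustration under the assumption that GNU stat is available; it is not how Monit itself implements the test, and file_is_stale is a hypothetical name.

```shell
# Succeed (exit 0) if the file was last modified more than max_age seconds ago.
# Assumes GNU stat: "stat -c %Y" prints the mtime as seconds since the epoch.
file_is_stale() {
    file=$1
    max_age=$2
    now=$(date +%s)
    mtime=$(stat -c %Y "$file") || return 2
    [ $((now - mtime)) -gt "$max_age" ]
}

# Usage: 15 minutes = 900 seconds
#   file_is_stale /data/mydatabase.db 900 && echo "not modified for 15 minutes"
```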

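2) Monitoring processes
Monitoring the Nginx process:

# Pidfile of the master process
check process nginx with pidfile /usr/local/nginx/logs/nginx.pid
    # Start command; an absolute path is required
    start program = "/usr/local/nginx/sbin/nginx" with timeout 30 seconds
    # Stop command
    stop program  = "/usr/local/nginx/sbin/nginx -s stop"
    # Port check: restart the service when the check fails
    if failed host 192.168.192.120 port 80 protocol http then restart
    # Port check: send an alert when the check fails
    if failed host 192.168.192.120 port 80 protocol http then alert
    # If the service has been restarted 3 times within 5 monitoring cycles,
    # time out and stop monitoring it. After several failed restarts, further
    # restarts are unlikely to succeed and only waste system resources and
    # disturb other processes on the host, so a person should be notified instead.
    if 3 restarts within 5 cycles then timeout
    # Alert if the service's CPU usage stays above 90% for 5 cycles
    if cpu usage > 90% for 5 cycles then alert
    # Optional: assign the service to a group
    group server
    # Optional: monitor the SSL port, if there is one
    # if failed port 443 type tcpssl protocol http
    #     with timeout 15 seconds
    #     then restart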
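
The pidfile is what ties the check to a running process. The sketch below shows the general idea of testing a pidfile by hand; it is an illustration only and not Monit's own implementation. pid_alive is a hypothetical helper name.

```shell
# Succeed if the pidfile exists and names a process that is currently running.
# Run as root (or as the process owner): kill -0 fails for processes
# that the caller is not allowed to signal.
pid_alive() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    # kill -0 sends no signal; it only tests whether the process can be signalled.
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# Usage:
#   pid_alive /usr/local/nginx/logs/nginx.pid || echo "nginx is not running"
```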

Monitoring the SSH process:

check process sshd with pidfile /var/run/sshd.pid
   start program  "/etc/init.d/sshd start"
   stop program  "/etc/init.d/sshd stop"
   if failed port 22 protocol SSH then restart
   if 5 restarts within 5 cycles then timeout

Monitoring the Apache process:

  check process apache with pidfile /usr/local/apache/logs/httpd.pid
    start program = "/etc/init.d/httpd start" with timeout 60 seconds
    stop program  = "/etc/init.d/httpd stop"
    if cpu > 60% for 2 cycles then alert
    if cpu > 80% for 5 cycles then restart
    if totalmem > 200.0 MB for 5 cycles then restart
    if children > 250 then restart
    if loadavg(5min) greater than 10 for 8 cycles then stop
    if failed host www.tildeslash.com port 80 protocol http and request "/somefile.html" then restart
    if failed port 443 type tcpssl protocol http with timeout 15 seconds then restart
    if 3 restarts within 5 cycles then unmonitor
    depends on apache_bin    # apache_bin must be defined as its own check elsewhere in the configuration
    group server

3) Monitoring system load

  check system $HOST
    if loadavg (1min) > 4 then alert
    if loadavg (5min) > 2 then alert
    if cpu usage > 95% for 10 cycles then alert
    if memory usage > 75% then alert
    if swap usage > 25% then alert
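
When picking thresholds for the loadavg tests, it helps to look at the current values first. A minimal sketch, assuming a Linux host where /proc/loadavg begins with the 1-, 5- and 15-minute load averages; load_exceeds is a hypothetical helper name.

```shell
# Succeed (exit 0) if the 1-minute load average is greater than the threshold.
# Assumes Linux: the first field of /proc/loadavg is the 1-minute average.
load_exceeds() {
    threshold=$1
    awk -v t="$threshold" '{ exit !($1 > t) }' /proc/loadavg
}

# Usage:
#   load_exceeds 4 && echo "1-minute load is above 4"
```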

4) Monitoring a script's exit status

check program myscript with path /usr/local/bin/myscript.sh
    if status != 0 then alert
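
The original does not show what /usr/local/bin/myscript.sh contains. The only contract the check above relies on is the exit status: 0 means healthy, anything else triggers the alert. The following is a hypothetical body that follows this contract; the function name and the condition it tests are assumptions made for illustration.

```shell
# Hypothetical body for a script used with "check program".
# Returns 0 when the given file exists and is non-empty, 1 otherwise.
check_nonempty_file() {
    if [ -s "$1" ]; then
        echo "OK: $1 exists and is non-empty"
        return 0
    fi
    echo "FAIL: $1 is missing or empty" >&2
    return 1
}

# When saved as a standalone script, end it with a call such as:
#   check_nonempty_file "$1"; exit $?
```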

5) Monitoring network interface status

  check network public with interface eth0
    if failed link then alert
    if changed link then alert
    if saturation > 90% then alert
    if download > 10 MB/s then alert
    if total upload > 1 GB in last hour then alert

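6) Monitoring services on a remote host

Check the availability of a remote host by pinging it, and check the content of the response returned by its web server. The example also tests the MySQL port.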

  check host myserver with address 192.168.192.120
    if failed ping then alert
    if failed port 3306 protocol mysql with timeout 15 seconds then alert
    if failed port 80 protocol http and request /1.html with content = "123" then alert

7) Monitoring filesystems

check filesystem datafs with path /dev/sdb1
 start program = "/bin/mount /data"
 stop program = "/bin/umount /data"
 if failed permission 660 then unmonitor
 if failed uid root then unmonitor
 if failed gid disk then unmonitor
 if space usage > 80% for 5 times within 15 cycles then alert
 if space usage > 99% then stop
 if inode usage > 30000 then alert
 if inode usage > 99% then stop
 group server
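
In the check above, the inode test appears in two forms: as far as I know, a bare number such as 30000 is an absolute inode count, while 99% is a percentage. To compare the space thresholds with the current state of a filesystem, the value can be read from df. A minimal sketch, assuming the POSIX output format of df -P; space_usage_pct is a hypothetical helper name.

```shell
# Print the space usage of the filesystem that contains the given path,
# as an integer percentage (the 5th column of "df -P", without the % sign).
space_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage:
#   [ "$(space_usage_pct /data)" -gt 80 ] && echo "/data is more than 80% full"
```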