Differences

This shows you the differences between two versions of the page.

biomed-shifts:nagios [2016/02/01 14:45]
fmichel [Proxy certificate renewal]
biomed-shifts:nagios [2017/08/28 11:41] (current)
fmichel
Line 1: Line 1:
-======  biomed VO Nagios ======
+====== biomed VO Nagios ======

-=====  Information for biomed shifters  =====
+<note warning>As of Aug. 2017, Nagios is replaced with [[biomed-shifts:argo|ARGO]]. This page is deprecated.</note>

-The biomed [[https://grid16.lal.in2p3.fr/nagios|Nagios box]] is hosted and maintained by site GRIF from the French NGI.
-It monitors SEs, CEs and WMSs of all sites that support the VO. It is the major reference for biomed support team members in term of resources monitoring.
+===== Information for biomed shifters =====

-For monitoring results, go to Service Groups -> Summary -> service name (like SERVICE_SRM_V2, or SERVICE_CREAM-CE), or bookmark direct links like these:
-  *  [[https://grid16.lal.in2p3.fr/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_SRMv2&style=detail&servicestatustypes=16&hoststatustypes=15|critical issues on service group SRM]]
-  *  [[https://grid16.lal.in2p3.fr/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_CREAM-CE&style=detail&servicestatustypes=16&hoststatustypes=15|critical issues on service group CREAM-CE]]
-  *  [[https://grid16.lal.in2p3.fr/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_WMS&style=detail&servicestatustypes=16&hoststatustypes=15|critical issues for service group WMS]]
+The biomed [[https://grid16.lal.in2p3.fr/nagios|Nagios box]] is hosted and maintained by site GRIF from the French NGI. It monitors SEs, CEs and WMSs of all sites that support the VO. It is the major reference for biomed support team members in term of resources monitoring.

-Clicking on the SE/CE/WMS host name gives information on the scheduled downtimes (host state information section).
-Only critical problems (showing in red) may lead to ticket submission
+For monitoring results, go to Service Groups → Summary → service name (like SERVICE_SRM_V2, or SERVICE_CREAM-CE), or bookmark direct links like these:
+
+  * [[https://grid16.lal.in2p3.fr/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_SRMv2&style=detail&servicestatustypes=16&hoststatustypes=15|critical issues on service group SRM]]
+  * [[https://grid16.lal.in2p3.fr/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_CREAM-CE&style=detail&servicestatustypes=16&hoststatustypes=15|critical issues on service group CREAM-CE]]
+  * [[https://grid16.lal.in2p3.fr/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_WMS&style=detail&servicestatustypes=16&hoststatustypes=15|critical issues for service group WMS]]
+
+Clicking on the SE/CE/WMS host name gives information on the scheduled downtimes (host state information section). Only critical problems (showing in red) may lead to ticket submission
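
The direct links listed on this page all follow the same pattern, and only the servicegroup parameter changes. A minimal shell sketch of that pattern, assuming the standard Nagios status.cgi bit masks (servicestatustypes=16 selects services in CRITICAL state, hoststatustypes=15 selects hosts in any state). The function name critical_url is hypothetical and not part of the page:

```shell
#!/bin/sh
# Base URL of the biomed Nagios box, taken from the links on this page.
NAGIOS_URL="https://grid16.lal.in2p3.fr/nagios"

# critical_url (hypothetical helper): print the status.cgi link that lists
# the services in CRITICAL state for one service group.
critical_url() {
    printf '%s/cgi-bin/status.cgi?servicegroup=%s&style=detail&servicestatustypes=16&hoststatustypes=15\n' \
        "$NAGIOS_URL" "$1"
}

# Print one link per service group mentioned on this page.
for group in SERVICE_SRMv2 SERVICE_CREAM-CE SERVICE_WMS; do
    critical_url "$group"
done
```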
  
 A description of Nagios probes is available from [[https://tomtools.cern.ch/confluence/display/SAMDOC/grid-monitoring-probes-org.sam|the SAM wiki]]. Source code can also be found from the [[https://svnweb.cern.ch/trac/sam/browser/trunk/probes/src/gridmetrics|CERN Trac server]] or directly from the [[http://svn.cern.ch/guest/sam/trunk/probes/src/gridmetrics/|SVN repository]].
Line 18: Line 19:
 A [[https://grid16.lal.in2p3.fr/nagios/cgi-bin/status.cgi?host=grid02.lal.in2p3.fr&style=detail|specific probe]] checks the status of Nagios itself, i.e. all critical processes for Nagios to run. It should be checked in case of suspicious behaviour of Nagios.

-The figure below depicts important graphical elements in Nagios referring to downtimes and comments:
+The figure below depicts important graphical elements in Nagios referring to downtimes and comments: {{:biomed-shifts:nagios_comment-24224655.png?direct&550}}

-{{:biomed-shifts:nagios_comment-24224655.png?600|}}
-=====  Information for administrators  =====
+===== Information for administrators =====

-====  Paths and configuration  ====
+==== Paths and configuration ====
  
 __Topology__: a VO feed is generated every day at 23h50 by script grid04.lal.in2p3.fr:/home/fmichel/vo-feed-biomed.py. The feed is created from the status of the GRIF top BDII, an EMI BDII with expiration delay set to 24 hours.
Line 31: Line 31:
 Consequently, the list of monitored resources is updated every day, which avoids monitoring decommissioned resources, but also with a delay of at least 24h to monitor resources that are down for just a few hours for instance.
  
-====  Paths and configuration  ====
+==== Paths and configuration ====

-  *  Documentation: http:<nowiki>//</nowiki>library.nagios.com/library/products/nagioscore/manuals/
-  *  Configuration: /etc/nagios: nagios.cfg, services.cfg, wlcg.d/<site name>/*.cfg
-  *  Probes path: /usr/libexec/grid-monitoring/probes/org.sam/
-  *  Actual code of probes: /usr/lib/python2.4/site-packages/gridmetrics
+  * Documentation: http:<nowiki>//</nowiki>library.nagios.com/library/products/nagioscore/manuals/
+  * Configuration: /etc/nagios: nagios.cfg, services.cfg, wlcg.d/<site name>/*.cfg
+  * Probes path: /usr/libexec/grid-monitoring/probes/org.sam/
+  * Actual code of probes: /usr/lib/python2.4/site-packages/gridmetrics

-**Soft/Hard states vs. max_check_attempts**: http://nagios.sourceforge.net/docs/nagioscore/3/en/statetypes.html
-  *  normal_check_interval           60
-  *  retry_check_interval            15
-  *  max_check_attempts              4
-=> each service is checked once an hour, if an error occurs, retry each 15 min until 4 times failed => hard state = notification, except for passive checks (passive_host_checks_are_soft=0).
+**Soft/Hard states vs. max_check_attempts**: [[http://nagios.sourceforge.net/docs/nagioscore/3/en/statetypes.html|http://nagios.sourceforge.net/docs/nagioscore/3/en/statetypes.html]]

-**Passive checks**: they are initiated and performed by external applications/processes.
-Passive check results are submitted to Nagios for processing.
+  * normal_check_interval 60
+  * retry_check_interval 15
+  * max_check_attempts 4
+
+⇒ each service is checked once an hour, if an error occurs, retry each 15 min until 4 times failed ⇒ hard state = notification, except for passive checks (passive_host_checks_are_soft=0).
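
The timing implied by these three settings can be worked out with a little arithmetic. A minimal sketch, assuming the intervals are expressed in minutes and that the first failed check counts as attempt 1, so that a problem becomes hard after three further retries. The function name is hypothetical:

```shell
#!/bin/sh
# Values quoted on this page (assumed to be minutes, minutes, attempts).
normal_check_interval=60
retry_check_interval=15
max_check_attempts=4

# minutes_to_hard_state (hypothetical helper): delay between the first
# failed check (attempt 1, soft state) and the attempt at which the
# problem becomes a hard state.
minutes_to_hard_state() {
    retry=$1
    attempts=$2
    echo $(( (attempts - 1) * retry ))
}

# With retry_check_interval=15 and max_check_attempts=4: (4 - 1) * 15 = 45
minutes_to_hard_state "$retry_check_interval" "$max_check_attempts"
```

Under these assumptions a persistent failure would trigger a notification 45 minutes after it is first detected, and up to 60 minutes may pass before the failure is detected at all.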
+
+**Passive checks**: they are initiated and performed by external applications/processes. Passive check results are submitted to Nagios for processing.
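
As an illustration of a passive check, an external process can hand its result to Nagios by writing a PROCESS_SERVICE_CHECK_RESULT line to the Nagios external command file. A minimal sketch of how such a line is built; the helper name, host name and service description are made up for the example and do not come from this page:

```shell
#!/bin/sh
# format_passive_result (hypothetical helper): build one external command
# line in the Nagios format
#   [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output
# where return_code is 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
format_passive_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example with placeholder values (host and service names are assumptions).
format_passive_result "$(date +%s)" "se01.example.org" "example_service" 2 "CRITICAL: example failure"
```

To actually submit the result, the line would be appended to the file set by the command_file directive in nagios.cfg; its location on the Nagios box is not given on this page.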
+
+==== Stop/start Nagios ====

-====  Stop/start Nagios  ====
 As root, run: service nagios restart
  
-====  Changing the grid certificate  ====
+==== Changing the grid certificate ====
 When the grid certificate of the user used to run tests is renewed once a year, copy the userkey.pem and usercert.pem files to .globus like on any UI. To do so, follow those steps:

-1. Copy the pem files to the gate machine grid11.lal.in2p3.fr: \\
+1. Copy the pem files to the gate machine grid11.lal.in2p3.fr:
 <code>
 eval `ssh-agent`
Line 61: Line 65:
  
 2. Then log into grid11.lal.in2p3.fr and copy the pem files to the Nagios box grid4:
+
 <code>
 gsissh -AX -p 2222 grid11.lal.in2p3.fr
Line 67: Line 72:
  
 3. Then test the new pem files:
+
 <code>
-ssh fmichel@grid04.lal.in2p3.fr<br>
+ssh fmichel@grid04.lal.in2p3.fr
 voms-proxy-init --voms biomed
 </code>
  
-====  Proxy certificate renewal  ====
+==== Proxy certificate renewal ====

 Ssh to any UI or to the Nagios server: grid04.lal.in2p3.fr
-  *  Create a valid proxy certificate:
-     <code>$ voms-proxy-init --voms biomed</code>
-  *  Renew the proxy:
+
+  * Create a valid proxy certificate:
 <code>
-myproxy-init --cred_lifetime 672 --credname NagiosRetrieve-grid04.lal.in2p3.fr-biomed --pshost myproxy.grif.fr --username nagios --regex_dn_match --retrievable_by_cert "/O=GRID-FR/C=FR/O=CNRS/OU=LAL/CN=grid04.lal.in2p3.fr"''
+voms-proxy-init --voms biomed
 </code>

-  *  Check the proxy:
-<code>$ myproxy-info -l nagios -s myproxy.grif.fr''</code>
+  * Renew the proxy:

-  *  Test the proxy retrieval probe:
-<code>Ssh to the Nagios server: grid04.lal.in2p3.fr
+<code>
+$ myproxy-init --cred_lifetime 672 --credname NagiosRetrieve-grid04.lal.in2p3.fr-biomed --pshost myproxy.grif.fr --username nagios --regex_dn_match --retrievable_by_cert "/O=GRID-FR/C=FR/O=CNRS/OU=LAL/CN=grid04.lal.in2p3.fr"
+</code>
+
+  * Check the proxy:
+
+<code>
+$ myproxy-info -l nagios -s myproxy.grif.fr
+</code>
+
+  * Test the proxy retrieval probe:
+
+<code>
+Ssh to the Nagios server: grid04.lal.in2p3.fr
 $ sudo su - nagios
 $ /usr/libexec/grid-monitoring/probes/hr.srce/refresh_proxy --myproxyuser nagios --cert /etc/nagios/globus/hostcert.pem --vo biomed --name NagiosRetrieve-grid04.lal.in2p3.fr-biomed -H myproxy.grif.fr --key /etc/nagios/globus/hostkey.pem -x /etc/nagios/globus/userproxy.pem-biomed
 </code>
+