biomed-shifts:argo

Differences

This shows you the differences between two versions of the page.

biomed-shifts:argo [2017/08/28 11:18]
fmichel created
biomed-shifts:argo [2017/08/29 09:38] (current)
fmichel
Line 4: Line 4:
  
 The biomed [[https://argo-mon-biomed.cro-ngi.hr/nagios|ARGO box]] is hosted and maintained by the Croatian NGI. It monitors the SEs and CEs of all sites that support the VO. It is the main reference for biomed support team members in terms of resource monitoring.
 +
 +**Probes documentation**: https://wiki.egi.eu/wiki/ROC_SAM_Tests
  
 For monitoring results, go to Service Groups → Summary → service name (SERVICE_SRM_V2 or SERVICE_CREAM-CE), or bookmark direct links like these:
Line 12: Line 14:
 Clicking on the SE/CE host name gives information on the scheduled downtimes (host state information section). <color red>Only critical problems (showing in red) may lead to ticket submission</color>.
  
-A description of ARGO probes is available from [[https://tomtools.cern.ch/confluence/display/SAMDOC/grid-monitoring-probes-org.sam|the SAM wiki]]. Source code can also be found from the [[https://svnweb.cern.ch/trac/sam/browser/trunk/probes/src/gridmetrics|CERN Trac server]] or directly from the [[http://svn.cern.ch/guest/sam/trunk/probes/src/gridmetrics/|SVN repository]].
 +A [[https://argo-mon-biomed.cro-ngi.hr/nagios/cgi-bin/status.cgi?host=argo-mon-biomed.cro-ngi.hr|specific probe]] checks the status of ARGO itself, i.e. all critical processes for ARGO to run. It should be checked in case of suspicious behaviour.
  
-A [[https://grid16.lal.in2p3.fr/nagios/cgi-bin/status.cgi?host=grid02.lal.in2p3.fr&style=detail|specific probe]] checks the status of Nagios itself, i.e. all critical processes for Nagios to run. It should be checked in case of suspicious behaviour of Nagios.
 +The figure below depicts important graphical elements in ARGO referring to downtimes and comments: {{:biomed-shifts:nagios_comment-24224655.png?direct&450}}
- +
-The figure below depicts important graphical elements in Nagios ​referring to downtimes and comments: {{:​biomed-shifts:​nagios_comment-24224655.png?​direct&​550}}+
  
 ===== Information for administrators =====
  
-==== Paths and configuration ====
 +The ARGO instance uses the following POEM profile: https://poem.egi.eu/poem/admin/poem/profile/2/
  
-__Topology__: a VO feed is generated every day at 23h50 by script grid04.lal.in2p3.fr:/home/fmichel/vo-feed-biomed.py. The feed is created from the status of the GRIF top BDII, an EMI BDII with expiration delay set to 24 hours.
 +The topology is fetched from VAPOR:
 +  * http://​operations-portal.egi.eu/vapor/downloadLavoisier/​option/​xml/​view/​vapor_sites/​param/vo=biomed 
 +  * http://​operations-portal.egi.eu/​vapor/​downloadLavoisier/​option/​xml/​view/​vapor_endpoints/​param/​vo=biomed
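A minimal sketch of how the two VAPOR feeds listed above could be inspected with Python's standard library. The XML layout of the feeds is not described on this page, so the helper only counts element tag names instead of assuming a schema. The names `FEEDS`, `fetch_feed` and `count_tags` are illustrative and are not part of ARGO or VAPOR.

```python
import urllib.request
import xml.etree.ElementTree as ET
from collections import Counter

# Feed URLs as listed on this page.
VAPOR_BASE = "http://operations-portal.egi.eu/vapor/downloadLavoisier/option/xml/view"
FEEDS = {
    "sites": VAPOR_BASE + "/vapor_sites/param/vo=biomed",
    "endpoints": VAPOR_BASE + "/vapor_endpoints/param/vo=biomed",
}


def fetch_feed(url, timeout=30):
    """Download one feed and return the raw XML bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def count_tags(xml_bytes):
    """Count how often each element tag occurs (root element included).

    No assumption is made about the feed schema; this only gives a
    first overview of what the document contains.
    """
    root = ET.fromstring(xml_bytes)
    return Counter(element.tag for element in root.iter())
```

For a quick look at a live feed, `count_tags(fetch_feed(FEEDS["sites"])).most_common(5)` would list the five most frequent element names, assuming the portal is reachable from the machine running the script.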
  
-The VO feed, biomed.xml, is then copied to a web server at 0h00 and used by Nagios to build the list of resources monitored.
 +**Soft/Hard states vs. max_check_attempts**: [[http://nagios.sourceforge.net/docs/nagioscore/4/en/statetypes.html|http://nagios.sourceforge.net/docs/nagioscore/4/en/statetypes.html]]
- +
-Consequently,​ the list of monitored resources is updated every day, avoiding to monitor decommissioned resources, but also with a delay of at least 24h to monitor resources that are down for just a few hours for instance. +
- +
-==== Paths and configuration ==== +
- +
-  * Documentation:​ http:<​nowiki>//</​nowiki>​library.nagios.com/​library/​products/​nagioscore/​manuals/​ +
-  * Configuration:​ /​etc/​nagios:​ nagios.cfg, services.cfg,​ wlcg.d/<​site name>/​*.cfg +
-  * Probes path: /​usr/​libexec/​grid-monitoring/​probes/​org.sam/​ +
-  * Actual code of probes: /​usr/​lib/​python2.4/​site-packages/​gridmetrics +
- +
-**Soft/Hard states vs. max_check_attempts**:​ [[http://​nagios.sourceforge.net/​docs/​nagioscore/​3/​en/​statetypes.html|http://​nagios.sourceforge.net/​docs/​nagioscore/​3/​en/​statetypes.html]]+
  
   * normal_check_interval 60
Line 44: Line 35:
  
 **Passive checks**: they are initiated and performed by external applications/processes. Passive check results are submitted to Nagios for processing.
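As an illustration of the passive checks described above, the sketch below builds a `PROCESS_SERVICE_CHECK_RESULT` line in the Nagios Core external command format and writes it to a command file. The host and service names used in the example and the helper names `passive_result_line` and `submit` are hypothetical. The location of the command file depends on the installation and is not given on this page, so it is passed in as a parameter.

```python
import time

# Standard Nagios plugin return codes.
STATES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def passive_result_line(host, service, state, output, now=None):
    """Format one passive service check result as an external command.

    Format: [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;code;output
    """
    timestamp = int(time.time() if now is None else now)
    # An external command is a single line, so flatten the plugin output.
    text = output.replace("\n", " ")
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, STATES[state], text)


def submit(line, command_file):
    """Write the command to the Nagios external command file.

    command_file is installation specific (assumption: the caller knows
    the path configured as command_file in nagios.cfg).
    """
    with open(command_file, "w") as handle:
        handle.write(line)
```

A call such as `passive_result_line("se.example.org", "example-service", "CRITICAL", "timeout")` returns the line that an external process would hand to `submit`; Nagios then processes it like the result of an active check.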
- 
-==== Stop/start Nagios ==== 
- 
-As root, run: service nagios restart 
- 
-==== Changing the grid certificate ==== 
- 
-When the grid certificate of the user used to run tests is renewed once a year, copy the userkey.pem and usercert.pem files to .globus like on any UI. To do so, follow those steps: 
- 
-1. Copy the pem files to the gate machine grid11.lal.in2p3.fr:​ 
- 
-<​code>​ 
-eval `ssh-agent` 
-ssh-add 
-gsiscp -P 2222 user*.pem grid11.lal.in2p3.fr:​ 
-</​code>​ 
- 
-2. Then log into grid11.lal.in2p3.fr and copy the pem files to the Nagios box grid04:
- 
-<​code>​ 
-gsissh -AX -p 2222 grid11.lal.in2p3.fr 
-scp user*.pem fmichel@grid04.lal.in2p3.fr:.globus/
-</​code>​ 
- 
-3. Then test the new pem files: 
- 
-<​code>​ 
-ssh fmichel@grid04.lal.in2p3.fr 
-voms-proxy-init --voms biomed 
-</​code>​ 
- 
-==== Proxy certificate renewal ==== 
- 
-Ssh to any UI or to the Nagios server: grid04.lal.in2p3.fr
- 
-  * Create a valid proxy certificate:​ 
- 
-<​code>​ 
-$ voms-proxy-init --voms biomed 
-</​code>​ 
- 
-  * Renew the proxy 
- 
-<​code>​ 
-$ myproxy-init --cred_lifetime 672 --credname NagiosRetrieve-grid04.lal.in2p3.fr-biomed --pshost myproxy.grif.fr --username nagios --regex_dn_match --retrievable_by_cert "/​O=GRID-FR/​C=FR/​O=CNRS/​OU=LAL/​CN=grid04.lal.in2p3.fr"​ 
-</​code>​ 
- 
-  * Check the proxy: 
- 
-<​code>​ 
-$ myproxy-info -l nagios -s myproxy.grif.fr 
-</​code>​ 
- 
-  * Test the proxy retrieval probe: 
- 
-<​code>​ 
-Ssh to the Nagios server: grid04.lal.in2p3.fr 
-$ sudo su - nagios 
-$ /​usr/​libexec/​grid-monitoring/​probes/​hr.srce/​refresh_proxy ​ --myproxyuser nagios --cert /​etc/​nagios/​globus/​hostcert.pem --vo biomed --name NagiosRetrieve-grid04.lal.in2p3.fr-biomed -H myproxy.grif.fr --key /​etc/​nagios/​globus/​hostkey.pem -x /​etc/​nagios/​globus/​userproxy.pem-biomed 
-</​code>​ 
- 
  
  • biomed-shifts/argo.1503911911.txt.gz
  • Last modified: 2017/08/28 11:18
  • by fmichel