Differences
This shows you the differences between two versions of the page.
biomed-shifts:argo [2017/08/28 11:20] fmichel |
biomed-shifts:argo [2017/08/29 09:38] fmichel |
||
---|---|---|---|
Line 14: | Line 14: | ||
A [[https://argo-mon-biomed.cro-ngi.hr/nagios/cgi-bin/status.cgi?host=argo-mon-biomed.cro-ngi.hr|specific probe]] checks the status of ARGO itself, i.e. all critical processes for ARGO to run. It should be checked in case of suspicious behaviour. | A [[https://argo-mon-biomed.cro-ngi.hr/nagios/cgi-bin/status.cgi?host=argo-mon-biomed.cro-ngi.hr|specific probe]] checks the status of ARGO itself, i.e. all critical processes for ARGO to run. It should be checked in case of suspicious behaviour. | ||
- | The figure below depicts important graphical elements in ARGO referring to downtimes and comments: {{:biomed-shifts:nagios_comment-24224655.png?direct&550}} | + | The figure below depicts important graphical elements in ARGO referring to downtimes and comments: {{:biomed-shifts:nagios_comment-24224655.png?direct&450}} |
===== Information for administrators ===== | ===== Information for administrators ===== | ||
- | ==== Paths and configuration ==== | + | The ARGO instance is using the following POEM profile: https://poem.egi.eu/poem/admin/poem/profile/2/ |
- | __Topology__: a VO feed is generated every day at 23h50 by script grid04.lal.in2p3.fr:/home/fmichel/vo-feed-biomed.py. The feed is created from the status of the GRIF top BDII, an EMI BDII with expiration delay set to 24 hours. | + | **Probes documentation**: https://wiki.egi.eu/wiki/ROC_SAM_Tests |
- | The VO feed, biomed.xml, is then copied to a web server at 0h00 and used by Nagios to build the list of resources monitored. | + | The topology is fetched from VAPOR: |
+ | * http://operations-portal.egi.eu/vapor/downloadLavoisier/option/xml/view/vapor_sites/param/vo=biomed | ||
+ | * http://operations-portal.egi.eu/vapor/downloadLavoisier/option/xml/view/vapor_endpoints/param/vo=biomed | ||
- | Consequently, the list of monitored resources is updated every day. This avoids monitoring decommissioned resources, but it also means a delay of at least 24h before monitoring resources that were down for just a few hours, for instance. | + | **Soft/Hard states vs. max_check_attempts**: [[http://nagios.sourceforge.net/docs/nagioscore/4/en/statetypes.html|http://nagios.sourceforge.net/docs/nagioscore/4/en/statetypes.html]] |
- | + | ||
- | ==== Paths and configuration ==== | + | |
- | + | ||
- | * Documentation: http:<nowiki>//</nowiki>library.nagios.com/library/products/nagioscore/manuals/ | + | |
- | * Configuration: /etc/nagios: nagios.cfg, services.cfg, wlcg.d/<site name>/*.cfg | + | |
- | * Probes path: /usr/libexec/grid-monitoring/probes/org.sam/ | + | |
- | * Actual code of probes: /usr/lib/python2.4/site-packages/gridmetrics | + | |
- | + | ||
- | **Soft/Hard states vs. max_check_attempts**: [[http://nagios.sourceforge.net/docs/nagioscore/3/en/statetypes.html|http://nagios.sourceforge.net/docs/nagioscore/3/en/statetypes.html]] | + | |
* normal_check_interval 60 | * normal_check_interval 60 | ||
Line 42: | Line 35: | ||
**Passive checks**: they are initiated and performed by external applications/processes. Passive check results are submitted to Nagios for processing. | **Passive checks**: they are initiated and performed by external applications/processes. Passive check results are submitted to Nagios for processing. | ||
- | |||
- | ==== Stop/start Nagios ==== | ||
- | |||
- | As root, run: service nagios restart | ||
- | |||
- | ==== Changing the grid certificate ==== | ||
- | |||
- | The grid certificate of the user that runs the tests is renewed once a year. When it is, copy the userkey.pem and usercert.pem files to .globus, as on any UI, by following these steps: | ||
- | |||
- | 1. Copy the pem files to the gate machine grid11.lal.in2p3.fr: | ||
- | |||
- | <code> | ||
- | eval `ssh-agent` | ||
- | ssh-add | ||
- | gsiscp -P 2222 user*.pem grid11.lal.in2p3.fr: | ||
- | </code> | ||
- | |||
- | 2. Then log into grid11.lal.in2p3.fr and copy the pem files to the Nagios box grid04: | ||
- | |||
- | <code> | ||
- | gsissh -AX -p 2222 grid11.lal.in2p3.fr | ||
- | scp user*.pem fmichel@grid04.lal.in2p3.fr:.globus/ | ||
- | </code> | ||
- | |||
- | 3. Then test the new pem files: | ||
- | |||
- | <code> | ||
- | ssh fmichel@grid04.lal.in2p3.fr | ||
- | voms-proxy-init --voms biomed | ||
- | </code> | ||
- | |||
- | ==== Proxy certificate renewal ==== | ||
- | |||
- | SSH to any UI or to the Nagios server: grid04.lal.in2p3.fr | ||
- | |||
- | * Create a valid proxy certificate: | ||
- | |||
- | <code> | ||
- | $ voms-proxy-init --voms biomed | ||
- | </code> | ||
- | |||
- | * Renew the proxy | ||
- | |||
- | <code> | ||
- | $ myproxy-init --cred_lifetime 672 --credname NagiosRetrieve-grid04.lal.in2p3.fr-biomed --pshost myproxy.grif.fr --username nagios --regex_dn_match --retrievable_by_cert "/O=GRID-FR/C=FR/O=CNRS/OU=LAL/CN=grid04.lal.in2p3.fr" | ||
- | </code> | ||
- | |||
- | * Check the proxy: | ||
- | |||
- | <code> | ||
- | $ myproxy-info -l nagios -s myproxy.grif.fr | ||
- | </code> | ||
- | |||
- | * Test the proxy retrieval probe: | ||
- | |||
- | <code> | ||
- | # SSH to the Nagios server: grid04.lal.in2p3.fr | ||
- | $ sudo su - nagios | ||
- | $ /usr/libexec/grid-monitoring/probes/hr.srce/refresh_proxy --myproxyuser nagios --cert /etc/nagios/globus/hostcert.pem --vo biomed --name NagiosRetrieve-grid04.lal.in2p3.fr-biomed -H myproxy.grif.fr --key /etc/nagios/globus/hostkey.pem -x /etc/nagios/globus/userproxy.pem-biomed | ||
- | </code> | ||
- | |||
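The soft/hard state distinction that this page links to (Soft/Hard states vs. max_check_attempts) can be illustrated with a small Python sketch. This is a simplified model written for illustration only, not Nagios code: the function name ''classify_states'' and its return format are hypothetical, and details such as retry intervals, flapping and host states are deliberately ignored. It assumes a service starts in a hard OK state, that a problem stays SOFT until the number of consecutive non-OK results reaches max_check_attempts, and that an OK result received while the problem is still SOFT counts as a soft recovery.

```python
def classify_states(results, max_check_attempts):
    """Return (status, state_type, attempt) after each check result.

    Simplified model of soft/hard states (assumption: the service
    starts in a HARD OK state; retry intervals are not modelled).
    """
    status, state_type, attempt = "OK", "HARD", 1
    history = []
    for result in results:
        if result == "OK":
            # Recovering from a SOFT problem is a soft recovery;
            # any other OK result leaves the service in a HARD OK state.
            soft_recovery = status != "OK" and state_type == "SOFT"
            state_type = "SOFT" if soft_recovery else "HARD"
            attempt = 1
        else:
            if status == "OK":
                # First non-OK result after an OK state.
                attempt = 1
            elif state_type == "SOFT":
                # Problem still being retried: count one more attempt.
                attempt += 1
            # A problem that is already HARD keeps its attempt count.
            state_type = "HARD" if attempt >= max_check_attempts else "SOFT"
        status = result
        history.append((status, state_type, attempt))
    return history


if __name__ == "__main__":
    checks = ["CRITICAL", "CRITICAL", "CRITICAL", "OK", "CRITICAL", "OK"]
    for step in classify_states(checks, max_check_attempts=3):
        print(step)
```

In this model, setting max_check_attempts to 1 makes every problem HARD on the first failed check, while larger values give transient failures a chance to recover before the problem becomes HARD.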