biomed-shifts:practices

  * <text type="danger">**Follow-up on open tickets**</text>, and verify solved tickets
  * <text type="danger">**Monitor critical services**</text> (VOMS, LFC).
  * <text type="danger">**Check ARGO alarms**</text> concerning SEs, CEs.
  * <text type="danger">**Deal with full SEs, and resource decommissioning**</text>
  * <text type="danger">**Report detected issues concerning the ARGO box**</text> by assigning a team ticket to the dedicated ARGO support unit.
  
<text type="danger">**Before submitting GGUS Team tickets**, have a **careful** look at the [[#Advices_about_ticket_submission|advice about ticket submission]]</text>.
======  Identification of issues  ======
  
Link to the Biomed ARGO page: [[https://biomed.ui.argo.grnet.gr/]]
=====  VOMS server  =====
The proxy certificate creation should work:
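The command for this check is not shown in this comparison view; below is a minimal sketch, assuming the standard VOMS clients (''voms-proxy-init'', ''voms-proxy-info'') are installed on the UI. The wrapper only skips when the clients are absent.

```shell
# Hedged sketch (the original snippet is not shown in this revision view):
# a typical proxy-creation check, assuming the standard VOMS clients are
# installed on the UI.
check_biomed_proxy() {
    if ! command -v voms-proxy-init >/dev/null 2>&1; then
        echo "SKIP: VOMS clients not installed on this host"
        return 0
    fi
    # Request a proxy carrying the biomed VO attributes, then print
    # the remaining lifetime in seconds (should be > 0).
    voms-proxy-init --voms biomed && voms-proxy-info --timeleft
}

check_biomed_proxy
```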
The VOMS administration interface should be available. From a UI, run the command:
<code>voms-admin --vo=biomed --host voms-biomed.in2p3.fr --port 8443 list-cas</code>
  
=====  Monitoring SEs  =====
  
====  Identify the problems  ====

SRM probes used by the ARGO box:

  - https://github.com/EGI-Foundation/nagios-plugins-srm
  - based on the gfal2 library for the storage operations (gfal-copy, etc.)
  - queries the BDII service in order to build the Storage URL to test, given the hostname and the VO name
  - a valid X.509 proxy certificate is needed to execute the probe (configured via the X509_USER_PROXY variable).
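The URL-building step performed by the probe can be sketched as a small helper. Both the helper and the ldapsearch target shown in the comment are illustrative assumptions, not the probe's actual code:

```shell
# Hypothetical helper mirroring what the probe does: compose the SRM Storage
# URL from the host name, the SRM port, and the GlueVOInfoPath published in
# the BDII. The path could be fetched with an ldapsearch such as (shown as a
# comment because it needs grid network access; host name is an assumption):
#
#   ldapsearch -x -LLL -H ldap://cclcgtopbdii01.in2p3.fr:2170 -b o=grid \
#     "(&(objectClass=GlueVOInfo)(GlueChunkKey=GlueSEUniqueID=marsedpm.in2p3.fr))" \
#     GlueVOInfoPath

build_srm_url() {
    host=$1; port=$2; vo_path=$3
    echo "srm://${host}:${port}${vo_path}"
}

build_srm_url marsedpm.in2p3.fr 8446 /dpm/in2p3.fr/home/biomed
```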
  
__Reminder__: do **NOT** submit a ticket if the service is in **downtime** or it is **not in proper production status**: see the supporting resources or faulty resources on [[http://operations-portal.in2p3.fr/vapor/resources/GL2ResVO?VOfilter=biomed|VAPOR]].
  
====  Reproduce the problem  ====

Manual SRM testing (copy a file to a SE)

From the biomed-ui.fedcloud.fr VM, where gfal2 is already installed:

1. Build the Storage URL following the model <code>srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed</code>

NOTE 1: the model works for DPM SEs; not sure about StoRM or dCache (StoRM example: srm://storm-01.roma3.infn.it:8444/srm/managerv2?SFN=/biomed).

NOTE 2: it would be interesting to use the probe for building this URL.
 + 
2. Use gfal-ls to check that we can list the folder
<code>gfal-ls srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/user/s/scamarasu</code>
3. Use gfal-copy to copy a file (in this case, job.jdl) to the above URL
<code>gfal-copy job.jdl srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/user/s/scamarasu/
  Copying file:///home/spop/dirac/job.jdl   [DONE]  after 17s</code>
4. Check that the file was copied and is now listed
<code>gfal-ls srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/user/s/scamarasu
  job.jdl</code>

Note that in some cases gfal-ls (as well as gfal-mkdir) may work, but not gfal-copy:
<code>gfal-mkdir srm://clrlcgse01.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/scamarasu
gfal-ls srm://clrlcgse01.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/scamarasu
gfal-copy dirac/job.jdl srm://clrlcgse01.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/scamarasu/
gfal-copy error: 70 (Communication error on send) - Could not open destination: globus_xio: Unable to connect to clrlcgse01.in2p3.fr:2811 globus_xio: System error in connect: Connection refused globus_xio: A system call failed: Connection refused</code>
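Steps 1 to 4 above can be wrapped in a small dry-run helper. This is a hypothetical convenience script, not part of the probes; with DRY_RUN=1 (the default) it only echoes the gfal commands, which is handy when gfal2 is not installed locally.

```shell
# Hypothetical wrapper around steps 1-4: print (or run) the gfal command
# sequence for a given Storage URL and local file.
srm_smoke_test() {
    url=$1; file=$2
    for cmd in \
        "gfal-ls $url" \
        "gfal-copy $file $url/" \
        "gfal-ls $url"
    do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "$cmd"        # dry run: show the command only
        else
            $cmd || return 1   # real run: stop at the first failure
        fi
    done
}

DRY_RUN=1 srm_smoke_test srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/user/s/scamarasu job.jdl
```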
 + 
====  Ignored alarms  ====
 + 
The error ''"No information for [attribute(s): ['GlueServiceEndpoint', 'GlueSAPath', 'GlueVOInfoPath']]"'' occurs when the SE does not publish information in the BDII. This may be due to a network outage, an unscheduled downtime, etc.

In such cases, **DO NOT SUBMIT** a ticket until the SE is back online.

Reminder: __before submitting a ticket, make sure one is not open yet__.
  
====  Dealing with full SEs  ====
When a SE is planned for decommissioning, launch the specific [[Biomed-Shifts:decommisioning|SE decommissioning procedure]].
  
The older decommissioning page is available here: [[Biomed-Shifts:old:old-decommisioning|Old SE decommissioning procedure]].
=====  Monitoring CEs  =====
  
====  Identify the problems  ====
The ARGO box is the best way to identify faulty resources.

====  Reproduce the problem  ====
  
1. Manual ARC CE submission

  * see https://www.nordugrid.org/arc/arc6/users/submit_job.html for more details and a job description example
  * submit with ''arcsub job.xrsl -c CENAME''

Further ARC CE documentation is available in French: https://grand-est.fr/support-utilisateurs/documentation-en-ligne/guide-dutilisation-de-arc-ce/

and for DIRAC: https://grand-est.fr/support-utilisateurs/documentation-en-ligne/guide-dutilisation-de-dirac/
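The submission step can be sketched as follows. The xRSL body is a generic example based on the NorduGrid guide linked above, not the probe's actual job description, and CENAME is a placeholder for the CE host name:

```shell
# Hedged sketch of a minimal ARC test submission. A valid VOMS proxy is
# required for a real submission; the arcsub call is guarded so the sketch
# does nothing when the ARC client tools are not installed.
cat > job.xrsl <<'EOF'
&(executable="/bin/hostname")
 (jobname="biomed-ce-test")
 (stdout="stdout.txt")
 (join="yes")
EOF

if command -v arcsub >/dev/null 2>&1; then
    arcsub -c CENAME job.xrsl    # prints the job ID on success
    # arcstat <jobid>            # check the job status
    # arcget <jobid>             # retrieve the job output
fi
```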
  
2. Manual HTCondorCE submission

TO BE DONE
====  Ignored alarms  ====
Shifters shall focus on failed job submissions in priority: probes ''emi.cream.CREAMCE-AllowedSubmission''.
However, other failing probes such as ''emi.cream.CREAMCE-JobCancel'' and ''emi.cream.CREAMCE-JobPurge'' may be a sign that the test did not follow the expected workflow, hence tests should also be performed in those cases.

Some investigation is often needed to understand what the problem is, and whether a ticket should be submitted. In particular, <text background="danger">hereafter are several reasons that should lead to ignoring alarms</text>, but not necessarily:
  
  *  The service is in **downtime** or it is **not in proper production status**: check out [[http://operations-portal.in2p3.fr/vapor/resources/GL2ResVO?VOfilter=biomed|VAPOR]] for supporting resources or faulty resources.
  *  **The queue is disabled**: its status may be "Closed" or "Draining", and the error log shows the message "queue is disabled".
  *  **Probe time-outs**: ARGO probes are configured to time out after some time. However, that should not raise critical alarms; warning or unknown would be more accurate.
  *  **"No compatible resources"**: this type of alarm is most likely a problem on the WMS that sent a job to an inappropriate CE. A non-urgent ticket may be submitted to request the opinion of the admin.
  *  **Maximum number of jobs already in queue MSG=total number of jobs in queue exceeds the queue limit**: each site decides on its policy as to accept or reject biomed jobs. We cannot submit a ticket when the queue is full, given that we use resources in an opportunistic manner.
  
====  Check CEs publishing bad number of running/waiting jobs  ====
Check the CEs that publish wrong (default) values for running or waiting jobs on the [[http://operations-portal.in2p3.fr/vapor/resources/GL2ResVO?VOfilter=biomed|VAPOR]] report: this can be global CE data (tab //Faulty Computing//) or per-share job counts (tab //Faulty jobs//).
  
The default figure is 4444 or 444444.
For each of those, __if the CE is in normal production (no downtime, production status)__, submit a __non urgent__ ticket asking to solve the problem.
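These placeholder values can also be spotted locally in raw BDII output. The filter below is a sketch, not an official tool; it assumes GLUE2 ldapsearch output as input (attribute names from the GLUE2 schema), with a sample inlined for illustration:

```shell
# Sketch: scan ldapsearch output for GLUE2 job-count attributes carrying
# the 4444/444444 placeholder values published by misconfigured CEs.
flag_bogus_counts() {
    awk -F': ' '/^(GLUE2ComputingShareRunningJobs|GLUE2ComputingShareWaitingJobs):/ {
        if ($2 == 4444 || $2 == 444444) print "BOGUS:", $1, $2
    }'
}

# Sample input; in practice pipe in an ldapsearch against the top BDII.
flag_bogus_counts <<'EOF'
GLUE2ComputingShareRunningJobs: 12
GLUE2ComputingShareWaitingJobs: 444444
EOF
```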
  
To make direct checks in the BDII, use the [[https://github.com/frmichel/vo-support-tools/blob/master/ldap_requests_glue2.txt|example LDAP requests]] provided in the [[https://github.com/frmichel/vo-support-tools|VO Support Tools project]].

This is a template ticket message that can be reused to report a faulty resource. <text background="danger">**Don't forget to customize the colored text.**</text>
  
<callout title="Subject">
CE <text background="danger">**hostname**</text> publishes invalid data
</callout>
  
  
The CE publishes erroneous default data:
- <text background="danger">**no running job**</text>
- <text background="danger">**444444**</text> waiting jobs
- ERT = <text background="danger">**2146660842**</text>
- WRT = <text background="danger">**2146660842**</text>

This was reported by VAPOR: http://operations-portal.in2p3.fr/vapor/resources/GL2ResVO?VOfilter=biomed (tab <text background="danger">**Faulty Computing|Faulty jobs**</text>).

Note that VAPOR reports job counts published in GLUE2 data, which may differ from GLUE1 data.
You may want to check the suggestions here: https:<nowiki>//</nowiki>wiki.egi.eu/wiki/Tools/Manuals/TS59

Thanks in advance for your support.
<text background="danger">**<shifter name>**</text>, for the Biomed VO.
</callout>
  
  
=====  CVMFS support  =====
=====  Before a ticket is submitted  =====
  
  -  Check that the ARGO alarm can be reproduced manually (make sure it is not a monitoring issue) and that it still happens some time after it was first detected (make sure it is not a temporary error).
  -  Check that **no GGUS ticket is already open on this issue**: look at [[https://ggus.eu/index.php?mode=ticket_search|open team tickets]].
  -  Check that the **concerned host is not in status "downtime", "not in production", or "not monitored"** using the supporting resources or faulty resources on [[http://operations-portal.in2p3.fr/vapor/resources/GL2ResVO?VOfilter=biomed|VAPOR]].
Dear site admin,
  
**<hostname>** is not working for Biomed users. The incident was detected from the Biomed ARGO box, which you may want to check to see the status:
**<link to the ARGO alarm page>**
The problem was reproduced by hand in the log below.
  
  • biomed-shifts/practices.txt
  • Last modified: 2022/05/19 16:32
  • by sorina