biomed-shifts:practices

Differences

This shows you the differences between two versions of the page.

biomed-shifts:practices [2017/09/28 23:44]
fmichel [Reproduce the problem]
biomed-shifts:practices [2022/05/19 16:32] (current)
sorina [Reproduce the problem]
Line 15: Line 15:
   * <text type="danger">**Check ARGO alarms**</text> concerning SEs, CEs.
   * <text type="danger">**Deal with full SEs, and resource decommissioning**</text>
-  * <text type="danger">**Report detected issues concerning ARGO box **</text> by assigning a team ticket to NGI_HR (Croatia).
 +  * <text type="danger">**Report detected issues concerning ARGO box **</text> by assigning a team ticket to the dedicated ARGO support unit.
  
 <text type="danger">**Before submitting GGUS Team tickets**, have a **careful** look at the [[#Advices_about_ticket_submission|advice about ticket submission]]</text>.
Line 57: Line 57:
 ======  Identification of issues  ======
  
 +Link to Biomed ARGO page: [[https://biomed.ui.argo.grnet.gr/]]
 =====  VOMS server  =====
 The proxy certificate creation should work:
Line 63: Line 64:
 The VOMS administration interface should be available. From a UI, run the command:
 <code>voms-admin --vo=biomed --host voms-biomed.in2p3.fr --port 8443 list-cas</code>
- 
-=====  LFC server ​ ===== 
-Command "''​time lfc-ls /​grid''"​ should return in less than 30 seconds. 
  
 =====  Monitoring SEs  =====
  
 ====  Identify the problems  ====
-''lcg-cr'' and ''lcg-del'' should work on every SE of the VO. The ARGO box will help you identify faulty servers. You may use the following straight links: [[https://argo-mon-biomed.cro-ngi.hr/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_SRM&style=overview|SRM service group status]] or [[https://argo-mon-biomed.cro-ngi.hr/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_SRM&style=detail&servicestatustypes=16&sorttype=2&sortoption=6|Critical issues for service group SRM]]
 +
 +SRM probes used by the ARGO box:
 +
 +  - https://github.com/EGI-Foundation/nagios-plugins-srm
 +  - based on the gfal2 library for the storage operations (gfal-copy, etc.)
 +  - queries the BDII service in order to build the Storage URL to test, given the host name and the VO name
 +  - a valid X509 proxy certificate is needed to execute the probe (configured via the X509_USER_PROXY variable).
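The proxy requirement above can be checked before running any manual test. The following is a minimal sketch, assuming a POSIX shell; `check_proxy_env` is a hypothetical helper name, not part of the probe, and it only checks that the variable points to a readable, non-empty file (it does not verify that the proxy is still valid).

```shell
#!/bin/sh
# Sketch: verify that X509_USER_PROXY points to a usable proxy file.
# check_proxy_env is a hypothetical helper name, not part of the SRM probe.
# It does NOT check the proxy lifetime or its VOMS attributes.
check_proxy_env() {
    if [ -z "${X509_USER_PROXY:-}" ]; then
        echo "X509_USER_PROXY is not set" >&2
        return 1
    fi
    if [ ! -r "$X509_USER_PROXY" ] || [ ! -s "$X509_USER_PROXY" ]; then
        echo "proxy file missing, unreadable or empty: $X509_USER_PROXY" >&2
        return 1
    fi
    return 0
}
```

A shifter could call `check_proxy_env || exit 1` at the top of a test script so that a missing proxy is reported before any storage command is attempted.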
  
 __Reminder__: do **NOT** submit a ticket if the service is in **downtime** or it is **not in proper production status**: see on [[http://operations-portal.in2p3.fr/vapor/resources/GL2ResVO?VOfilter=biomed|VAPOR]] the supporting resources or faulty resources.
  
 ====  Reproduce the problem  ====
-<code>lcg-cr -v --connect-timeout 30 --sendreceive-timeout 900 --bdii-timeout 30 --srm-timeout 300 --vo biomed file:///some_of_your_file -l lfn:/grid/biomed/yourhome/some_file_name -d SE_HOSTNAME</code>
  
-The [[https://github.com/frmichel/vo-support-tools/blob/master/SE/lcg-cr.sh|lcg-cr.sh]] script automates the lcg-cr and lcg-del commands on a SE.
 +Manual SRM testing (copy file to SE)
 + 
 +From the biomed-ui.fedcloud.fr VM, where gfal2 is already installed:
 +
 +1. Build the Storage URL following the model <code>srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed</code>
 +
 +NOTE 1: the model works for DPM SEs; it is not confirmed for StoRM or dCache (a StoRM example is srm://storm-01.roma3.infn.it:8444/srm/managerv2?SFN=/biomed)
 +
 +NOTE 2: it would be interesting to use the probe to build this URL
 + 
 +2. Use gfal-ls to check that we can list the folder 
 +<​code>​gfal-ls srm://marsedpm.in2p3.fr:​8446/dpm/in2p3.fr/​home/​biomed/​user/​s/​scamarasu </​code>​ 
 +3. Use gfal-copy to copy a file (in this case, job.jdl) to the above URL 
 +<code>gfal-copy job.jdl srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/user/s/scamarasu/
 +  Copying file:///home/spop/dirac/job.jdl   [DONE]  after 17s </code>
 +4. Check that the file was copied and is now listed
 +<​code>​gfal-ls srm://​marsedpm.in2p3.fr:​8446/​dpm/​in2p3.fr/​home/​biomed/​user/​s/​scamarasu 
 +  job.jdl </​code>​ 
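Steps 1 to 4 can be wrapped in a small script. This is only a sketch under stated assumptions: `build_dpm_surl` and `srm_copy_test` are hypothetical helper names, the URL model is the DPM one shown in step 1 (port 8446, domain taken as the host name without its first label), and the `DRY_RUN` variable is an illustrative convention that prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative, not from any grid tool.

# Build a DPM-style Storage URL: srm://HOST:8446/dpm/DOMAIN/home/VO
# Assumption: DOMAIN is the host name without its first label,
# which matches the marsedpm.in2p3.fr example but may not hold everywhere.
build_dpm_surl() {
    host=$1
    vo=$2
    domain=${host#*.}
    printf 'srm://%s:8446/dpm/%s/home/%s\n' "$host" "$domain" "$vo"
}

# Run the list / copy / list sequence on a target directory.
# With DRY_RUN=1 the commands are only printed, not executed.
srm_copy_test() {
    surl=$1      # target directory on the SE
    file=$2      # local file to copy
    run=""
    [ "${DRY_RUN:-0}" = "1" ] && run="echo"
    $run gfal-ls "$surl" &&
    $run gfal-copy "$file" "$surl/" &&
    $run gfal-ls "$surl"
}
```

For example, `srm_copy_test "$(build_dpm_surl marsedpm.in2p3.fr biomed)/user/s/scamarasu" job.jdl` would reproduce the sequence above; the sequence stops at the first failing command, so the exit status tells which step to look at.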
 + 
 +Note that in some cases gfal-ls (as well as gfal-mkdir) may work while gfal-copy fails:
 +<​code>​gfal-mkdir srm://​clrlcgse01.in2p3.fr:​8446/​dpm/​in2p3.fr/​home/​biomed/​scamarasu  
 +gfal-ls srm://​clrlcgse01.in2p3.fr:​8446/​dpm/​in2p3.fr/​home/​biomed/​scamarasu 
 +gfal-copy dirac/​job.jdl srm://​clrlcgse01.in2p3.fr:​8446/​dpm/​in2p3.fr/​home/​biomed/​scamarasu/​ 
 +gfal-copy error: 70 (Communication error on send) - Could not open destination:​ globus_xio: Unable to connect to clrlcgse01.in2p3.fr:​2811 globus_xio: System error in connect: Connection refused globus_xio: A system call failed: Connection refused </​code>​
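In the error above, the listing succeeds on port 8446 while the copy is refused on port 2811, which is the standard GridFTP control port. When reporting such a failure it can help to quote the exact endpoint that refused the connection. The sketch below extracts it from the error text; `refused_endpoint` is a hypothetical helper name, and the pattern assumes the message wording shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the HOST:PORT that refused the connection,
# taken from a gfal error message of the form
#   "... Unable to connect to HOST:PORT ..."
# Prints nothing if the message does not contain that wording.
refused_endpoint() {
    printf '%s\n' "$1" |
        sed -n 's/.*Unable to connect to \([^ ]*:[0-9][0-9]*\).*/\1/p'
}
```

The extracted endpoint can then be pasted into the ticket or passed to any TCP connectivity check available on the UI.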
  
 ====  Ignored alarms ====
Line 93: Line 120:
 When a SE is planned for decommissioning, launch the specific [[Biomed-Shifts:decommisioning|SE decommissioning procedure]].
  
 +The older decommissioning page is available here: [[Biomed-Shifts:old:old-decommisioning|Old SE decommissioning procedure]].
 =====  Monitoring CEs  =====
  
 ====  Identify the problems  ====
-The ARGO box is the best way to identify faulty resources. You may use the following straight link: [[https://argo-mon-biomed.cro-ngi.hr/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_CREAM-CE&style=detail&servicestatustypes=16&sorttype=2&sortoption=6|Critical issues for service group CREAM-CE]]
 +The ARGO box is the best way to identify faulty resources.
 +====  Reproduce the problem  ====
  
-=== Probes ===
 +1. Manual ARC CE submission
-Probes documentation is available at https://wiki.egi.eu/wiki/ROC_SAM_Tests.
 +
  
-**Ignore** alarms generated by probes ''emi.cream.CREAMCE-JobCancel'', ''emi.cream.CREAMCE-JobPurge''.
 +See https://www.nordugrid.org/arc/arc6/users/submit_job.html for more details and a job description example
  
-**Focus** more specifically on alarms generated by probes ''emi.cream.CREAMCE-AllowedSubmission''.
 +- submit with "arcsub job.xrsl -c CENAME"
  
-====  Reproduce the problem  ====
 +Further ARC CE documentation is available in French: https://grand-est.fr/support-utilisateurs/documentation-en-ligne/guide-dutilisation-de-arc-ce/
-Reproduce the problem by one of the two methods below.
 +
  
-1. Download this {{:biomed-shifts:test.jdl|test JDL}} files, rename it as test_ce.jdl and submit it to the concerned CE. In the line
 +and for DIRAC: https://grand-est.fr/support-utilisateurs/documentation-en-ligne/guide-dutilisation-de-dirac/
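For the ARC CE submission, a minimal test job can be generated on the fly. This is a sketch only: `write_test_xrsl` and `arc_submit_cmd` are hypothetical helper names, and the job content (running `/bin/hostname`, the job name, the output file names) is an illustrative assumption rather than the job description used by the ARGO probes.

```shell
#!/bin/sh
# Hypothetical helpers; the job content is illustrative only.

# Write a minimal xRSL job description to the given file.
write_test_xrsl() {
    cat > "$1" <<'EOF'
&(executable="/bin/hostname")
 (jobname="biomed-ce-test")
 (stdout="stdout.txt")
 (stderr="stderr.txt")
EOF
}

# Print the submission command quoted in the text for a given
# job description file and CE name.
arc_submit_cmd() {
    printf 'arcsub %s -c %s\n' "$1" "$2"
}
```

After `write_test_xrsl job.xrsl`, the command printed by `arc_submit_cmd job.xrsl CENAME` can be run from the UI; the job identifier it returns can then be followed with the ARC client tools (for example `arcstat` and `arcget`).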
-<code>Requirements = (RegExp("ce_name",other.GlueCEUniqueID));</code>
 +
-replace the "ce_name" with the hostname of the CE to check, and submit the job:
 +
-<code>glite-wms-job-submit -a test_ce.jdl</code>
 +
  
-2. Alternatively, download this {{:biomed-shifts:test.jdl|test JDL}}, rename it as test_ce_noreq.jdl and submit it to the concerned CE. Check the BDII (lcg-infosites) to get the full name of a queue on that CE and run the command:
 +2. Manual HTCondorCE submission
-<code>glite-wms-job-submit -a -r <CE hostname>:<port>/<queue_name> test_ce_noreq.jdl</code>
-Then check that the status and the output when the submit command has completed:
-<code>glite-wms-job-status <jobId>
-glite-wms-job-output --dir <local directory> <jobId></code>
- 
-Reminder: __before submitting a ticket make sure one is not open yet__.
  
 +TO BE DONE
 ==== Ignored alarms ====
-Shifters shall focus on failed job submissions in priority (probes ''emi.cream.CREAMCE-AllowedSubmission'')
 +Shifters shall focus on failed job submissions in priority: probes ''emi.cream.CREAMCE-AllowedSubmission''.
-Yet, some investigation is often needed to understand what the problem is, and whether a ticket should be submitted. In particular, <span style="background:#ffff00">hereafter are several reasons that should lead to ignoring alarms</font>, but not necessarily:
 +However, other failing probes such as ''emi.cream.CREAMCE-JobCancel'' and ''emi.cream.CREAMCE-JobPurge'' may indicate that the test did not follow the expected workflow, hence tests should also be performed in those cases.
 +
 +Some investigation is often needed to understand what the problem is, and whether a ticket should be submitted. In particular, <text background="danger">hereafter are several reasons that should lead to ignoring alarms</text>, but not necessarily:
  
   *  The service is in **downtime** or it is **not in proper production status**: check out [[http://operations-portal.in2p3.fr/vapor/resources/GL2ResVO?VOfilter=biomed|VAPOR]] for supporting resources or faulty resources.
   *  **The queue is disabled**: its status may be "Closed" or "Draining", the error on the log shows the message "queue is disabled".
-  *  **Job submission time-outs**: ARGO probes are configured to time out after some time. However, that should not raise critical alarms, warning or unknown would be more accurate.
 +  *  **Probe time-outs**: ARGO probes are configured to time out after some time. However, that should not raise critical alarms, warning or unknown would be more accurate.
   *  **"No compatible resources"**: this type of alarm is most likely a problem on the WMS that sent a job to an inappropriate CE. A non-urgent ticket may be submitted to request the opinion of the admin.
   *  **Maximum number of jobs already in queue MSG=total number of jobs in queue exceeds the queue limit**: each site decides on its policy whether to accept or reject biomed jobs. We cannot submit a ticket when the queue is full, given that we use resources in an opportunistic manner.
  • biomed-shifts/practices.1506635087.txt.gz
  • Last modified: 2017/09/28 23:44
  • by fmichel