--- biomed-shifts:practices [2018/01/08 10:17]
fmichel [Identify the problems]
+++ biomed-shifts:practices [2022/05/19 16:29]
sorina [Reproduce the problem]
@@ Line 15: / Line 15: @@
   * <text type="danger">**Check ARGO alarms**</text> concerning SEs, CEs.
   * <text type="danger">**Deal with full SEs, and resource decommissioning**</text>
-  * <text type="danger">**Report detected issues concerning ARGO box **</text> by assigning a team ticket to NGI_HR (Croatia).
+  * <text type="danger">**Report detected issues concerning ARGO box **</text> by assigning a team ticket to the dedicated ARGO support unit.
 <text type="danger">**Before submitting GGUS Team tickets**, have a **careful** look at the [[#Advices_about_ticket_submission|advices about ticket submission]]</text>.
@@ Line 57: / Line 57: @@
 ======  Identification of issues  ======
+Link to Biomed ARGO page: [[https://biomed.ui.argo.grnet.gr/]]
 =====  VOMS server  =====
 The proxy certificate creation should work:
@@ Line 63: / Line 64: @@
 The VOMS administration interface should be available. From a UI, run the command:
 <code>voms-admin --vo=biomed --host  voms-biomed.in2p3.fr --port 8443 list-cas</code>
-=====  LFC server  =====
-Command "''time lfc-ls /grid''" should return in less than 30 seconds.
 =====  Monitoring SEs  =====
 ====  Identify the problems  ====
-''lcg-cr'' and ''lcg-del'' should work on every SE of the VO. The ARGO box will help you identify faulty servers. You may use the following straight links: [[https://argo-mon-biomed.cro-ngi.hr/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_SRM&style=overview|SRM service group status]], or [[https://argo-mon-biomed.cro-ngi.hr/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_SRM&style=detail&servicestatustypes=16&sorttype=2&sortoption=6|Critical issues for service group SRM]]
+SRM probes used by ARGO box
+  - https://github.com/EGI-Foundation/nagios-plugins-srm
+  - based on the gfal2 library for the storage operations (gfal-copy, etc)
+  - queries the BDII service in order to build the Storage URL to test given the host-name and the VO name
+  - a valid X509 proxy certificate is needed to execute the probe (configured via the X509_USER_PROXY variable).
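Since the probe depends on a proxy certificate, a quick pre-check before running it by hand can save time. The following is a minimal sketch, not part of the probe itself: the function name check_proxy_file is hypothetical, and it only verifies that X509_USER_PROXY points to a readable, non-empty file. It does not check the proxy lifetime or its VOMS attributes.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if X509_USER_PROXY names a readable, non-empty file.
# It does NOT validate the certificate itself (lifetime, VOMS attributes).
check_proxy_file() {
    if [ -z "${X509_USER_PROXY:-}" ]; then
        echo "X509_USER_PROXY is not set" >&2
        return 1
    fi
    if [ ! -r "$X509_USER_PROXY" ] || [ ! -s "$X509_USER_PROXY" ]; then
        echo "proxy file missing, unreadable or empty: $X509_USER_PROXY" >&2
        return 1
    fi
    echo "proxy file found: $X509_USER_PROXY"
}
```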
 __Reminder__: do **NOT** submit a ticket if the service is in **downtime** or it is **not in proper production status**: see on [[http://operations-portal.in2p3.fr/vapor/resources/GL2ResVO?VOfilter=biomed|VAPOR]] the supporting resources or faulty resources.
 ====  Reproduce the problem  ====
-<code>lcg-cr -v --connect-timeout 30 --sendreceive-timeout 900 --bdii-timeout 30 --srm-timeout 300 --vo biomed file:///some_of_your_file -l lfn:/grid/biomed/yourhome/some_file_name -d SE_HOSTNAME</code>
-The [[https://github.com/frmichel/vo-support-tools/blob/master/SE/lcg-cr.sh|lcg-cr.sh]] script automates the lcg-cr and lcg-del commands on a SE.
+Manual SRM testing (copy file to SE)
+From the biomed-ui.fedcloud.fr VM, where gfal2 is already installed:
+. Build the Storage URL following the model <code>srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed</code>
+NOTE 1: the model works for DPM SEs; it has not been verified for StoRM or dCache (a StoRM example is srm://storm-01.roma3.infn.it:8444/srm/managerv2?SFN=/biomed)
+NOTE 2: it would be interesting to use the probe for building this URL
+. Use gfal-ls to check that we can list the folder
+<code>gfal-ls srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/user/s/scamarasu </code>
+. Use gfal-copy to copy a file (in this case, job.jdl) to the above URL
+<code>gfal-copy job.jdl srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/user/s/scamarasu/
+  Copying file:///home/spop/dirac/job.jdl   [DONE]  after 17s </code>
+. Check that the file was copied and is now listed
+<code>gfal-ls srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/user/s/scamarasu
+  job.jdl </code>
+Note that in some cases gfal-ls and gfal-mkdir may work while gfal-copy fails:
+<code>gfal-mkdir srm://clrlcgse01.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/scamarasu
+gfal-ls srm://clrlcgse01.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/scamarasu
+gfal-copy dirac/job.jdl srm://clrlcgse01.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/scamarasu/
+gfal-copy error: 70 (Communication error on send) - Could not open destination: globus_xio: Unable to connect to clrlcgse01.in2p3.fr:2811 globus_xio: System error in connect: Connection refused globus_xio: A system call failed: Connection refused </code>
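The manual steps above can be chained in a small script. This is a sketch under stated assumptions: the function names are hypothetical, the Storage URL follows the DPM model shown above (port 8446, path /dpm/DOMAIN/home/biomed, with the domain taken as the host name minus its first label), and it will not produce a correct URL for StoRM or dCache. Only gfal-ls and gfal-copy are used, as in the commands above.

```shell
#!/bin/sh
# Hypothetical helper: build a DPM-style Storage URL from an SE host name.
# Assumes the DPM layout shown above; not valid for StoRM or dCache.
build_dpm_surl() {
    se_host=$1
    domain=${se_host#*.}   # drop the first label, e.g. marsedpm.in2p3.fr -> in2p3.fr
    echo "srm://${se_host}:8446/dpm/${domain}/home/biomed"
}

# Hypothetical helper: list the target folder, copy a file, then check it is listed.
# $1 = SE host name, $2 = local file, $3 = sub-folder under the biomed home
test_se_copy() {
    target="$(build_dpm_surl "$1")/$3"
    name=$(basename "$2")
    gfal-ls "$target" || { echo "gfal-ls failed on $target" >&2; return 1; }
    gfal-copy "$2" "$target/" || { echo "gfal-copy failed on $target" >&2; return 2; }
    gfal-ls "$target" | grep -qF -- "$name" || { echo "$name not listed after copy" >&2; return 3; }
    echo "copy test OK on $1"
}
```

For example, `test_se_copy marsedpm.in2p3.fr job.jdl user/s/scamarasu` mirrors the commands above. The distinct return codes make it easy to tell a listing failure from the case where listing works but the copy does not.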
 ====  Ignored alarms ====
@@ Line 93: / Line 120: @@
 When a SE is planned for decommissioning, launch the specific [[Biomed-Shifts:decommisioning|SE decommissioning procedure]].
+An older decommissioning page is available here: [[Biomed-Shifts:old:old-decommisioning|Old SE decommissioning procedure]].
 =====  Monitoring CEs  =====
 ====  Identify the problems  ====
-The ARGO box is the best way to identify faulty resources. You may use the following straight link: [[https://argo-mon-biomed.cro-ngi.hr/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_CREAM-CE&style=detail&servicestatustypes=16&sorttype=2&sortoption=6|Critical issues for service group CREAM-CE]]
+The ARGO box is the best way to identify faulty resources.
+====  Reproduce the problem  ====
-Probes documentation is available at https://wiki.egi.eu/wiki/ROC_SAM_Tests.
+. Manual ARC CE submission
-====  Reproduce the problem  ====
+- see https://www.nordugrid.org/arc/arc6/users/submit_job.html for more details and a job description example
-Reproduce the problem by one of the two methods below.
-. Download this {{:biomed-shifts:test.jdl|test JDL}} files, rename it as test_ce.jdl and submit it to the concerned CE. In the line
+- submit with "arcsub job.xrsl -c CENAME"
-<code>Requirements = (RegExp("ce_name",other.GlueCEUniqueID));</code>
-replace the "ce_name" with the hostname of the CE to check, and submit the job:
-<code>glite-wms-job-submit -a test_ce.jdl</code>
-. Alternatively, download this {{:biomed-shifts:test.jdl|test JDL}}, rename it as test_ce_noreq.jdl and submit it to the concerned CE. Check the BDII (lcg-infosites) to get the full name of a queue on that CE and run the command:
+Further ARC CE documentation is available in French: https://grand-est.fr/support-utilisateurs/documentation-en-ligne/guide-dutilisation-de-arc-ce/
-<code>glite-wms-job-submit -a -r <CE hostname>><port>/<queue_name> test_ce_noreq.jdl</code>
-Then check that the status and the output when the submit command has completed:
-<code>glite-wms-job-status <jobId>
-glite-wms-job-output --dir <local directory> <jobId></code>
-Reminder: __before submitting a ticket make sure one is not open yet__.
+and for DIRAC: https://grand-est.fr/support-utilisateurs/documentation-en-ligne/guide-dutilisation-de-dirac/
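As an illustration of the arcsub step, a minimal sketch follows. The job description is a hypothetical example (it just runs /bin/hostname), the file name job.xrsl matches the command above, and the CE host name is expected as the first argument of the script. Refer to the NorduGrid page linked above for authoritative examples.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal xRSL job description that runs /bin/hostname.
write_test_xrsl() {
    cat > "$1" <<'EOF'
&(executable="/bin/hostname")
 (jobname="biomed-ce-test")
 (stdout="stdout.txt")
 (stderr="stderr.txt")
EOF
}

write_test_xrsl job.xrsl

# Submit only when the ARC client is installed and a CE name is given as first argument.
if command -v arcsub >/dev/null 2>&1 && [ -n "${1:-}" ]; then
    arcsub job.xrsl -c "$1"
fi
```

The job ID printed by arcsub can then be passed to arcstat to follow the job and to arcget to retrieve its output.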
+. Manual HTCondorCE submission
+TO BE DONE
 ==== Ignored alarms ====
-Shifters shall focus on failed job submissions in priority (probes ''emi.cream.CREAMCE-AllowedSubmission'').
+Shifters shall focus on failed job submissions in priority: probes ''emi.cream.CREAMCE-AllowedSubmission''.
-Yet, some investigation is often needed to understand what the problem is, and whether a ticket should be submitted. In particular, <text background="danger">hereafter are several reasons that should lead to ignoring alarms</text>, but not necessarily:
+However, other failing probes such as ''emi.cream.CREAMCE-JobCancel'' and ''emi.cream.CREAMCE-JobPurge'' may indicate that the test did not follow the expected workflow, hence tests should also be performed in those cases.
+Some investigation is often needed to understand what the problem is, and whether a ticket should be submitted. In particular, <text background="danger">hereafter are several reasons that should lead to ignoring alarms</text>, but not necessarily:
   *  The service is in **downtime** or it is **not in proper production status**: check out [[http://operations-portal.in2p3.fr/vapor/resources/GL2ResVO?VOfilter=biomed|VAPOR]] for supporting resources or faulty resources.
   *  **The queue is disabled**: its status may be "Closed" or "Draining", the error on the log shows the message "queue is disabled".
-  *  **Job submission time-outs**: ARGO probes are configured to time out after some time. However, that should not raise critical alarms, warning or unknown would be more accurate.
+  *  **Probe time-outs**: ARGO probes are configured to time out after some time. However, that should not raise critical alarms; a warning or unknown status would be more accurate.
   *  **"No compatible resources"**: this type of alarm is most likely a problem on the WMS, which sent a job to an inappropriate CE. A non-urgent ticket may be submitted to request the opinion of the admin.
   *  **Maximum number of jobs already in queue MSG=total number of jobs in queue exceeds the queue limit**: each site decides on its policy as to whether to accept or reject biomed jobs. We cannot submit a ticket when the queue is full, given that we use resources in an opportunistic manner.