Differences
This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision | ||
biomed-shifts:practices [2021/10/13 14:03] sorina [Reproduce the problem] |
biomed-shifts:practices [2022/05/19 16:32] (current) sorina [Reproduce the problem] |
||
---|---|---|---|
Line 57: | Line 57: | ||
====== Identification of issues ====== | ====== Identification of issues ====== | ||
+ | Link to Biomed ARGO page: [[https://biomed.ui.argo.grnet.gr/]] | ||
===== VOMS server ===== | ===== VOMS server ===== | ||
The proxy certificate creation should work: | The proxy certificate creation should work: | ||
Line 63: | Line 64: | ||
The VOMS administration interface should be available. From a UI, run the command: | The VOMS administration interface should be available. From a UI, run the command: | ||
<code>voms-admin --vo=biomed --host voms-biomed.in2p3.fr --port 8443 list-cas</code> | <code>voms-admin --vo=biomed --host voms-biomed.in2p3.fr --port 8443 list-cas</code> | ||
- | |||
- | ===== LFC server ===== | ||
- | Command "''time lfc-ls /grid''" should return in less than 30 seconds. | ||
===== Monitoring SEs ===== | ===== Monitoring SEs ===== | ||
Line 72: | Line 70: | ||
SRM probes used by ARGO box | SRM probes used by ARGO box | ||
- | - https://github.com/EGI-Foundation/nagios-plugins-srm | + | |
- | - based on the gfal2 library for the storage operations (gfal-copy, etc) | + | - https://github.com/EGI-Foundation/nagios-plugins-srm |
- | - queries the BDII service in order to build the Storage URL to test given the host-name and the VO name | + | - based on the gfal2 library for the storage operations (gfal-copy, etc) |
- | - a X509 valid proxy certificate is needed to execute the probe (configured via X509_USER_PROXY variable). | + | - queries the BDII service in order to build the Storage URL to test given the host-name and the VO name |
+ | - a valid X.509 proxy certificate is needed to execute the probe (configured via the X509_USER_PROXY variable). | ||
Line 82: | Line 81: | ||
==== Reproduce the problem ==== | ==== Reproduce the problem ==== | ||
+ | Manual SRM testing (copy file to SE) | ||
+ | |||
+ | From the biomed-ui.fedcloud.fr VM, where gfal2 is already installed: | ||
+ | |||
+ | 1. Build the Storage URL following the model <code>srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed</code> | ||
+ | | ||
+ | NOTE 1: the model works for DPM SEs; it has not been verified for StoRM or dCache (a StoRM example is srm:////storm-01.roma3.infn.it:8444/srm/managerv2?SFN=/biomed) | ||
- | 2. Manual SRM testing (copy file to SE) | ||
- | - from the biomed-ui.fedcloud.fr VM, where gfal2 is already installed : | ||
- | i) build the Storage URL following the model "srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed" ; | ||
- | NOTE 1: the model works for DPM SEs, not sure about storm or dCache | ||
NOTE 2: it would be interesting to use the probe for building this URL | NOTE 2: it would be interesting to use the probe for building this URL | ||
- | ii) use gfal-ls to check that we can list the folder | + | |
- | <code>[spop@biomed-ui ~]$ gfal-ls srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/user/s/scamarasu </code> | + | 2. Use gfal-ls to check that we can list the folder |
- | iii) use gfal-copy to copy a file (in this case, job.jdl) to the above URL | + | <code>gfal-ls srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/user/s/scamarasu </code> |
- | <code> gfal-copy job.jdl srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/user/s/scamarasu/ | + | 3. Use gfal-copy to copy a file (in this case, job.jdl) to the above URL |
- | Copying file:///home/spop/dirac/job.jdl [DONE] after 17s </code> | + | <code>gfal-copy job.jdl srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/user/s/scamarasu/ |
- | iv) check the copy was copied and is now listed | + | Copying file:///home/spop/dirac/job.jdl [DONE] after 17s </code> |
- | <code> gfal-ls srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/user/s/scamarasu | + | 4. Check that the file was copied and is now listed | ||
- | job.jdl </code> | + | <code>gfal-ls srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/user/s/scamarasu |
+ | job.jdl </code> | ||
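The four steps above can be chained in a small script. This is only a sketch: the helper names `build_dpm_surl` and `run_srm_test` are hypothetical, port 8446 and the path layout are taken from the DPM example above, and the domain part of the path is assumed to be the host name without its first label (which holds for marsedpm.in2p3.fr but may not hold everywhere).

```shell
#!/bin/sh
# Sketch of the manual SRM test. Only gfal-ls and gfal-copy come from this
# page; build_dpm_surl and run_srm_test are hypothetical helper names.

# Build a Storage URL following the DPM model
#   srm://<host>:8446/dpm/<domain>/home/<vo>
# ASSUMPTION: <domain> is the host name without its first label.
# $1 = SE host name, $2 = VO name
build_dpm_surl() {
    printf 'srm://%s:8446/dpm/%s/home/%s\n' "$1" "${1#*.}" "$2"
}

# List the folder, copy a file into it, then check that the file is listed.
# Stops at the first failing step and reports which one failed.
# $1 = Storage URL of the target folder, $2 = local file to copy
run_srm_test() {
    gfal-ls "$1" || { echo "gfal-ls failed" >&2; return 1; }
    gfal-copy "$2" "$1/" || { echo "gfal-copy failed" >&2; return 2; }
    gfal-ls "$1" | grep -qxF "$(basename "$2")" \
        || { echo "copied file is not listed" >&2; return 3; }
}

# Run the test only when four arguments are given, for example:
#   sh srm_test.sh marsedpm.in2p3.fr biomed user/s/scamarasu job.jdl
if [ "$#" -eq 4 ]; then
    run_srm_test "$(build_dpm_surl "$1" "$2")/$3" "$4"
fi
```

The distinct return codes make it easy to tell whether the listing or the copy is the failing operation.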
Note that in some cases, the gfal-ls may work (as well as gfal-mkdir), but not the gfal-copy: | Note that in some cases, the gfal-ls may work (as well as gfal-mkdir), but not the gfal-copy: | ||
- | <code> gfal-mkdir srm://clrlcgse01.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/scamarasu </code> | + | <code>gfal-mkdir srm://clrlcgse01.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/scamarasu |
- | <code> gfal-ls srm://clrlcgse01.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/scamarasu </code> | + | gfal-ls srm://clrlcgse01.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/scamarasu |
- | <code> gfal-copy dirac/job.jdl srm://clrlcgse01.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/scamarasu/ | + | gfal-copy dirac/job.jdl srm://clrlcgse01.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/scamarasu/ |
gfal-copy error: 70 (Communication error on send) - Could not open destination: globus_xio: Unable to connect to clrlcgse01.in2p3.fr:2811 globus_xio: System error in connect: Connection refused globus_xio: A system call failed: Connection refused </code> | gfal-copy error: 70 (Communication error on send) - Could not open destination: globus_xio: Unable to connect to clrlcgse01.in2p3.fr:2811 globus_xio: System error in connect: Connection refused globus_xio: A system call failed: Connection refused </code> | ||
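In the error above, the refused connection is to port 2811, the usual GridFTP control port, while the listing and directory creation only went through the SRM port 8446. This likely explains why gfal-ls and gfal-mkdir succeed when gfal-copy does not. A possible way to confirm it is to extract the endpoint from the error message and probe it directly. The sketch below assumes that a netcat (`nc`) supporting `-z` and `-w` is installed on the UI; the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: find out which endpoint refused the connection.
# failed_endpoint and check_endpoint are hypothetical helper names.

# Read gfal-copy output on stdin and print the first host:port found
# after "Unable to connect to".
failed_endpoint() {
    sed -n 's/.*Unable to connect to \([^ ]*:[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Try a plain TCP connection to host:port ($1).
# ASSUMPTION: nc is available and supports -z (scan only) and -w (timeout).
check_endpoint() {
    nc -z -w 5 "${1%:*}" "${1##*:}"
}

# Example use:
#   ep=$(gfal-copy job.jdl "$SURL/" 2>&1 | failed_endpoint)
#   [ -n "$ep" ] && { check_endpoint "$ep" || echo "$ep is not reachable"; }
```

If the port is not reachable from the UI either, the problem is probably on the site side (service down or firewall) rather than in the VO configuration.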
Line 117: | Line 120: | ||
When an SE is planned for decommissioning, launch the specific [[Biomed-Shifts:decommisioning|SE decommissioning procedure]]. | When an SE is planned for decommissioning, launch the specific [[Biomed-Shifts:decommisioning|SE decommissioning procedure]]. | ||
+ | The older decommissioning page is available here: [[Biomed-Shifts:old:old-decommisioning|Old SE decommissioning procedure]]. | ||
===== Monitoring CEs ===== | ===== Monitoring CEs ===== | ||
==== Identify the problems ==== | ==== Identify the problems ==== | ||
- | The ARGO box is the best way to identify faulty resources. You may use the following straight link: [[https://argo-mon-biomed.cro-ngi.hr/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_CREAM-CE&style=detail&servicestatustypes=16&sorttype=2&sortoption=6|Critical issues for service group CREAM-CE]] | + | The ARGO box is the best way to identify faulty resources. |
+ | ==== Reproduce the problem ==== | ||
- | Probes documentation is available at https://wiki.egi.eu/wiki/ROC_SAM_Tests. | + | 1. Manual ARC CE submission |
- | ==== Reproduce the problem ==== | + | - see https://www.nordugrid.org/arc/arc6/users/submit_job.html for more details and a job description example |
- | Reproduce the problem by one of the two methods below. | + | |
+ | - submit with "arcsub job.xrsl -c CENAME" | ||
+ | |||
+ | Further documentation is available in French for ARC CE: https://grand-est.fr/support-utilisateurs/documentation-en-ligne/guide-dutilisation-de-arc-ce/ | ||
- | Download this {{:biomed-shifts:test.jdl|test JDL}} (or {{:biomed-shifts:test2.jdl|this one}}, since the 1st one seems to fail) , rename it as test_ce_noreq.jdl and submit it to the concerned CE. Check the BDII (lcg-infosites) to get the full name of a queue on that CE and run the command: | + | and for DIRAC: https://grand-est.fr/support-utilisateurs/documentation-en-ligne/guide-dutilisation-de-dirac/ | ||
- | <code> glite-ce-job-submit -a -r <CE hostname>:<port>/<queue_name> test_ce_noreq.jdl</code> | + | |
- | Then check that the status and the output when the submit command has completed: | + | |
- | <code>glite-ce-job-status <jobId></code> | + | |
- | Reminder: __before submitting a ticket make sure one is not open yet__. | + | 2. Manual HTCondor-CE submission | ||
+ | TO BE DONE | ||
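For the manual ARC CE submission in item 1, a minimal test job could look like the sketch below. The job description is an illustrative example and not the one from the NorduGrid page: the executable, job name, and output file names are assumptions. Only the `arcsub job.xrsl -c CENAME` form comes from this page.

```shell
#!/bin/sh
# Sketch: write a minimal xRSL job description and submit it to an ARC CE.
# write_test_xrsl is a hypothetical helper; the job content is illustrative.

# Write a job that runs /bin/hostname on the worker node.
# $1 = path of the xRSL file to create
write_test_xrsl() {
    cat > "$1" <<'EOF'
&(executable="/bin/hostname")
 (jobName="biomed-ce-test")
 (stdout="stdout.txt")
 (stderr="stderr.txt")
EOF
}

# Submit only when a CE name is given, for example:
#   sh arc_test.sh CENAME
if [ "$#" -eq 1 ]; then
    write_test_xrsl job.xrsl
    arcsub job.xrsl -c "$1"
fi
```

Once arcsub has printed a job ID, `arcstat <jobId>` should report the job state and `arcget <jobId>` should retrieve the output files.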
==== Ignored alarms ==== | ==== Ignored alarms ==== | ||
Shifters shall focus on failed job submissions in priority: probes ''emi.cream.CREAMCE-AllowedSubmission''. | Shifters shall focus on failed job submissions in priority: probes ''emi.cream.CREAMCE-AllowedSubmission''. |