This section describes the usual course of a biomed support duty shift. Subsequent sections give further details on specific tasks.
Before submitting GGUS Team tickets, have a careful look at the advice about ticket submission.
The actions below should be performed daily, as far as possible.
The follow-up of open issues is as important as the monitoring of resources.
At least once a week, we have to check on open tickets and:
These are notified to the biomed technical shift list, but it is good practice to check on them regularly. Note that VOSupport tickets are team tickets when we submitted them; they are not team tickets when they were submitted by site admins or users.
Solved is not closed: validate the ticket once in status solved, or re-open it if the problem persists.
Link to Biomed ARGO page: https://biomed.ui.argo.grnet.gr/
The proxy certificate creation should work:
voms-proxy-init -voms biomed
The VOMS administration interface should be available. From a UI, run the command:
voms-admin --vo=biomed --host voms-biomed.in2p3.fr --port 8443 list-cas
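Before running the tests in the rest of this page, it can help to check that a valid proxy with enough remaining lifetime exists. The following is a minimal sketch, not an official shift tool: the function names and the one-hour threshold are assumptions made for illustration, and it relies on voms-proxy-info --timeleft printing the remaining lifetime in seconds.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the remaining proxy lifetime (in seconds)
# is at least the requested minimum (default 3600 s, an arbitrary choice).
proxy_has_time_left() {
    timeleft=$1
    minimum=${2:-3600}
    # Reject empty or non-numeric input (e.g. when no proxy exists).
    case $timeleft in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$timeleft" -ge "$minimum" ]
}

# Hypothetical wrapper: query the current proxy and renew it if needed.
ensure_biomed_proxy() {
    timeleft=$(voms-proxy-info --timeleft 2>/dev/null)
    if ! proxy_has_time_left "$timeleft"; then
        voms-proxy-init -voms biomed
    fi
}
```

Calling ensure_biomed_proxy at the start of a shift session only triggers voms-proxy-init when the proxy is missing or about to expire.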
SRM probes used by ARGO box
Reminder: do NOT submit a ticket if the service is in downtime or is not in proper production status: check the supporting resources or faulty resources on VAPOR.
Manual SRM testing (copy file to SE)
From the biomed-ui.fedcloud.fr VM, where gfal2 is already installed:
1. Build the Storage URL following the model
srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed
NOTE 1: the model works for DPM SEs; it has not been verified for StoRM or dCache (a StoRM example is srm://storm-01.roma3.infn.it:8444/srm/managerv2?SFN=/biomed)
NOTE 2: it would be interesting to use the probe to build this URL
2. Use gfal-ls to check that we can list the folder
gfal-ls srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/user/s/scamarasu
3. Use gfal-copy to copy a file (in this case, job.jdl) to the above URL
gfal-copy job.jdl srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/user/s/scamarasu/
Copying file:///home/spop/dirac/job.jdl [DONE] after 17s
4. Check that the file was copied and is now listed
gfal-ls srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/user/s/scamarasu
job.jdl
Note that in some cases gfal-ls (as well as gfal-mkdir) may work while gfal-copy fails:
gfal-mkdir srm://clrlcgse01.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/scamarasu
gfal-ls srm://clrlcgse01.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/scamarasu
gfal-copy dirac/job.jdl srm://clrlcgse01.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/scamarasu/
gfal-copy error: 70 (Communication error on send) - Could not open destination:
globus_xio: Unable to connect to clrlcgse01.in2p3.fr:2811
globus_xio: System error in connect: Connection refused
globus_xio: A system call failed: Connection refused
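The manual SRM test above (build the URL, list, copy, list again) can be sketched as two small shell functions. This is an illustration under stated assumptions, not the probe's actual logic: the function names and the DRY_RUN switch are hypothetical, and the URL layout (port 8446, /dpm/<domain>/home/biomed) is simply the DPM model shown above, so it may not apply to StoRM or dCache.

```shell
#!/bin/sh
# Build a storage URL following the DPM model shown above.
# Port 8446 and the /dpm/<domain>/home/biomed layout come from the example.
dpm_surl() {
    host=$1
    domain=$2
    subdir=$3   # optional, e.g. user/s/scamarasu
    url="srm://${host}:8446/dpm/${domain}/home/biomed"
    [ -n "$subdir" ] && url="${url}/${subdir}"
    printf '%s\n' "$url"
}

# Run the list / copy / list sequence and stop at the first failing step.
# With DRY_RUN=1 the commands are printed instead of executed.
# Return codes (1, 2, 3) tell which step failed.
srm_copy_test() {
    file=$1
    surl=$2
    run=${DRY_RUN:+echo}
    $run gfal-ls "$surl" || { echo "list failed: $surl" >&2; return 1; }
    $run gfal-copy "$file" "$surl/" || { echo "copy failed: $surl" >&2; return 2; }
    $run gfal-ls "$surl/$(basename "$file")" || { echo "file not listed" >&2; return 3; }
}
```

A return code of 2 corresponds to the situation described above, where listing works but the copy does not. Running with DRY_RUN=1 first shows the exact commands without touching the SE.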
The error “No information for [attribute(s): ['GlueServiceEndpoint', 'GlueSAPath', 'GlueVOInfoPath']]” occurs when the SE does not publish information in the BDII. This may be due to a network outage, unscheduled downtime…
In such cases, DO NOT SUBMIT a ticket until the SE is back on line.
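The rule above can be sketched as a small triage helper that inspects a captured error log. The function names and output wording are hypothetical. Only the matched error text comes from the message quoted above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the captured log shows that the SE is
# missing from the BDII (the "No information for [attribute(s): ...]" error).
is_bdii_publication_error() {
    printf '%s\n' "$1" | grep -q -F 'No information for [attribute(s):'
}

# Print a one-line recommendation for the shifter.
triage_se_error() {
    if is_bdii_publication_error "$1"; then
        echo "wait: SE not published in the BDII, do not submit a ticket yet"
    else
        echo "investigate: check for an existing ticket before submitting"
    fi
}
```

For example, triage_se_error "$(cat error.log)" prints the "wait" recommendation when the log contains the BDII error, and the "investigate" one otherwise.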
Reminder: before submitting a ticket, make sure one is not already open.
See the full SE procedure.
When an SE is planned for decommissioning, launch the specific SE decommissioning procedure.
The older decommissioning page is available here: Old SE decommissioning procedure.
The ARGO box is the best way to identify faulty resources.
1. Manual ARC CE submission
- see https://www.nordugrid.org/arc/arc6/users/submit_job.html for more details and a job description example
- submit with “arcsub job.xrsl -c CENAME”
Further documentation is available in French for ARC CE: https://grand-est.fr/support-utilisateurs/documentation-en-ligne/guide-dutilisation-de-arc-ce/
and for DIRAC: https://grand-est.fr/support-utilisateurs/documentation-en-ligne/guide-dutilisation-de-dirac/
2. Manual HTCondor-CE submission
TO BE DONE
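For the ARC CE case (item 1 above), a test job can be as small as running /bin/hostname on the CE. The sketch below writes such an xRSL description and prints the submission command in the form given above. The function names, the file name job.xrsl, and the job name are illustrative assumptions. The NorduGrid page linked above remains the reference for the job description syntax.

```shell
#!/bin/sh
# Write a minimal xRSL job description that runs /bin/hostname on the CE.
# File name and job name are illustrative.
write_test_xrsl() {
    out=${1:-job.xrsl}
    cat > "$out" <<'EOF'
&(executable="/bin/hostname")
 (jobname="biomed-shift-test")
 (stdout="stdout.txt")
 (stderr="stderr.txt")
EOF
}

# Print the submission command, in the form used above, for a given CE name.
arc_submit_command() {
    printf 'arcsub %s -c %s\n' "${2:-job.xrsl}" "$1"
}
```

After submission, the ARC client commands arcstat and arcget, given the returned job ID, can be used to follow the job and retrieve its output.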
Shifters shall give priority to failed job submissions (probe emi.cream.CREAMCE-AllowedSubmission). However, other failing probes such as emi.cream.CREAMCE-JobCancel and emi.cream.CREAMCE-JobPurge may indicate that the test did not follow the expected workflow, hence tests should also be performed in those cases.
Some investigation is often needed to understand what the problem is, and whether a ticket should be submitted. In particular, here are several reasons that may, but do not necessarily, justify ignoring an alarm:
Check the VAPOR report for CEs that publish wrong (default) values for running or waiting jobs: this can be global CE data (tab Faulty Computing) or per-share job counts (tab Faulty jobs).
The default figure is 4444 or 444444. For each of those, if the CE is in normal production (no downtime, production status), submit a non-urgent ticket asking to solve the problem.
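The check for default values can be sketched as a small filter. The function names and the two-column input format ("CE name, waiting jobs") are assumptions made for this illustration. VAPOR itself presents the data differently.

```shell
#!/bin/sh
# Succeed if a published job count is one of the default placeholder
# values mentioned above (4444 or 444444).
is_default_job_count() {
    case $1 in
        4444|444444) return 0 ;;
        *) return 1 ;;
    esac
}

# Read "CE waiting_jobs" pairs on stdin and print the CEs that look faulty.
# The two-column input format is an assumption for this sketch.
list_faulty_ces() {
    while read -r ce waiting; do
        if is_default_job_count "$waiting"; then
            printf '%s\n' "$ce"
        fi
    done
}
```

Each CE printed by list_faulty_ces still has to be checked for downtime and production status before a ticket is submitted.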
To make direct checks in the BDII, use the example LDAP requests provided in the VO Support Tools project.
This is a template ticket message that can be reused to report a faulty resource. Don't forget to customize the colored text.
The CE publishes erroneous default data:
- no running job
- 444444 waiting jobs
- ERT = 2146660842
- WRT = 2146660842
This was reported by VAPOR: http://operations-portal.in2p3.fr/vapor/resources/GL2ResVO?VOfilter=biomed (tab Faulty Computing|Faulty jobs).
Note that VAPOR reports job counts published in GLUE2 data which may differ from GLUE1 data. You may want to check the suggestions here: https://wiki.egi.eu/wiki/Tools/Manuals/TS59
Thanks in advance for your support.
<shifter name>, for the Biomed VO.
Biomed is progressively migrating to CVMFS (CERN Virtual Machine File System) to manage VO-specific software. In time, it should replace the variable VO_BIOMED_SW_DIR.
To do so, biomed VO administrators submit tickets to sites supporting CVMFS to ask them to enable biomed in their configuration. Shifters are requested to follow up on those tickets. In particular, when site admins agree to enable biomed, they ask us to test it; shifters then have to test the CVMFS service by submitting a job to one CE on that site. Use this test script and this jdl. Example:
export CE=ce04-lcg.cr.cnaf.infn.it:8443/cream-lsf-biomed
glite-wms-job-submit -a -o job_id.txt -r $CE test_ce.jdl
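The kind of check such a test job could run on the worker node can be sketched as follows. This is not the test script referred to above: the function name is hypothetical and the default repository path /cvmfs/biomed.egi.eu is an assumption that should be replaced by the path the site actually mounts.

```shell
#!/bin/sh
# Hypothetical job payload: check that a CVMFS repository is readable on
# the worker node. The default path is an assumption, adjust as needed.
check_cvmfs_repo() {
    repo=${1:-/cvmfs/biomed.egi.eu}
    if [ -d "$repo" ] && ls "$repo" >/dev/null 2>&1; then
        echo "OK: $repo is accessible on $(uname -n)"
    else
        echo "FAILED: $repo is not accessible on $(uname -n)"
        return 1
    fi
}
```

Listing the directory, rather than only testing that it exists, matters because CVMFS repositories are typically mounted on first access.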
If you need to ask a site to enable biomed, you may want to copy one of the tickets submitted recently.
Select biomed in the Concerned VO field of the GGUS submission form.
This is a template ticket message that can be reused to report a faulty resource. Don't forget to customize the bold text.
<hostname> is not working for Biomed users.
The incident was detected from the Biomed ARGO box that you may want to check to see the status: <link to the ARGO alarm page>
The problem was reproduced by hand in the log below.
Thanks in advance for your support,
<shifter name>, for the Biomed VO.
<detailed log>
lcg-infosites will be assumed in operations and are a potential ticket target.
solved without a single comment and the problem still persists). This may be the result of an automatic action from the local system used to answer GGUS tickets, and not necessarily from a person.