====== Daily tasks and best practices ======
This page describes the usual course of a biomed support duty shift: daily tasks and best practices. Subsequent sections give further details on specific tasks.
These recommendations aim to organize the handling of technical issues across shifters and to provide a coherent interface for effective communication.
When your shift starts, get the VO status summary (salient tickets, ongoing issues) from the previous team on shift.
**Daily tasks at a glance**:
* **Follow-up on open tickets**, and verify solved tickets
* **Monitor critical services** (VOMS, LFC).
* **Check ARGO alarms** concerning SEs, CEs.
* **Deal with full SEs, and resource decommissioning**
* **Report detected issues concerning the ARGO box** by assigning a team ticket to the dedicated ARGO support unit.
**Before submitting GGUS team tickets**, have a **careful** look at the [[#Advices_about_ticket_submission|advice about ticket submission]].
At the end of the shift, please report a VO status summary to the next team on shift to ensure a seamless handover.
The actions below should be performed daily whenever possible.
====== Follow-up on open tickets ======
The follow-up of open issues is as important as the monitoring of resources.
**__At least once a week__**, we have to check on open tickets and:
* send a reminder in case there has been no progress,
* answer questions or take appropriate action when admins expect input from us.
==== Check the status of open team tickets (sorted by last update) ====
{{url>https://ggus.eu/index.php?mode=ticket_search&show_columns_check[]=SUBMITTER&show_columns_check[]=TICKET_TYPE&show_columns_check[]=PRIORITY&show_columns_check[]=STATUS&show_columns_check[]=DATE_OF_CHANGE&show_columns_check[]=DATE_OF_CREATION&show_columns_check[]=SHORT_DESCRIPTION&ticket_id=&supportunit=&su_hierarchy=0&former_su=&vo=biomed&user=&keyword=&involvedsupporter=&assignedto=&affectedsite=&specattrib=Team-Ticket&status=open&priority=&typeofproblem=all&ticket_category=all&mouarea=&date_type=creation+date&tf_radio=1&timeframe=any&from_date=&to_date=&untouched_date=&orderticketsby=DATE_OF_CHANGE&orderhow=desc&search_submit=GO! 900px,250px left}}
==== Check the status of VOSupport tickets ====
These are notified to the biomed technical shift list, but it is good practice to check on them regularly. Note that VOSupport tickets are team tickets when we submitted them; they are not team tickets when they were submitted by site admins or users.
{{url>https://ggus.eu/index.php?mode=ticket_search&show_columns_check[]=SUBMITTER&show_columns_check[]=TICKET_TYPE&show_columns_check[]=AFFECTED_SITE&show_columns_check[]=PRIORITY&show_columns_check[]=DATE_OF_CHANGE&show_columns_check[]=DATE_OF_CREATION&show_columns_check[]=SHORT_DESCRIPTION&ticket_id=&supportunit=VOSupport&su_hierarchy=0&former_su=&vo=biomed&user=&keyword=&involvedsupporter=&assignedto=&affectedsite=&specattrib=none&status=open&priority=&typeofproblem=all&ticket_category=all&mouarea=&date_type=creation+date&tf_radio=1&timeframe=any&from_date=&to_date=&untouched_date=&orderticketsby=DATE_OF_CHANGE&orderhow=desc&search_submit=GO! 900px,250px left}}
==== Verify solved tickets ====
//Solved is not closed//: once a ticket is in status ''solved'', verify it if the problem is fixed, or re-open it if the problem persists.
{{url>https://ggus.eu/index.php?mode=ticket_search&show_columns_check[]=SUBMITTER&show_columns_check[]=TICKET_TYPE&show_columns_check[]=AFFECTED_SITE&show_columns_check[]=RESPONSIBLE_UNIT&show_columns_check[]=DATE_OF_CHANGE&show_columns_check[]=DATE_OF_CREATION&show_columns_check[]=SHORT_DESCRIPTION&ticket_id=&supportunit=&su_hierarchy=0&former_su=&vo=biomed&user=&keyword=&involvedsupporter=&assignedto=&affectedsite=&specattrib=Team-Ticket&status=solved&priority=&typeofproblem=all&ticket_category=all&mouarea=&date_type=creation+date&tf_radio=1&timeframe=any&from_date=&to_date=&untouched_date=&orderticketsby=DATE_OF_CHANGE&orderhow=desc&search_submit=GO! 900px,250px left}}
==== Team and VOSupport tickets solved/verified/unsolved during the last month ====
{{url>https://ggus.eu/index.php?mode=ticket_search&show_columns_check[]=SUBMITTER&show_columns_check[]=TICKET_TYPE&show_columns_check[]=AFFECTED_SITE&show_columns_check[]=RESPONSIBLE_UNIT&show_columns_check[]=DATE_OF_CHANGE&show_columns_check[]=DATE_OF_CREATION&show_columns_check[]=SHORT_DESCRIPTION&ticket_id=&supportunit=&su_hierarchy=0&former_su=&vo=biomed&user=&keyword=&involvedsupporter=&assignedto=&affectedsite=&specattrib=none&status=terminal&priority=&typeofproblem=all&ticket_category=all&mouarea=&date_type=creation+date&tf_radio=1&timeframe=any&from_date=&to_date=&untouched_date=&orderticketsby=DATE_OF_CHANGE&orderhow=desc&search_submit=GO! 900px,250px left}}
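The four searches above differ only in the ticket status, the special attribute and the support unit. As a minimal sketch, the hypothetical helper ''ggus_search_url'' below rebuilds such a URL from those three values. It assumes that the search page falls back to its defaults for the parameters left out here (column selection, time frame), which has not been verified.

```shell
#!/bin/sh
# Build a GGUS ticket-search URL for the biomed VO (illustrative helper).
# Arguments:
#   $1 status       open | solved | terminal   (default: open)
#   $2 specattrib   Team-Ticket | none         (default: Team-Ticket)
#   $3 supportunit  e.g. VOSupport             (default: empty, any unit)
# Assumption: parameters omitted here are optional on the server side.
ggus_search_url() {
    status=${1:-open}
    specattrib=${2:-Team-Ticket}
    supportunit=${3:-}
    printf '%s' 'https://ggus.eu/index.php?mode=ticket_search'
    printf '&vo=biomed&supportunit=%s&specattrib=%s&status=%s' \
        "$supportunit" "$specattrib" "$status"
    printf '&orderticketsby=DATE_OF_CHANGE&orderhow=desc\n'
}
```

For instance, ''ggus_search_url open none VOSupport'' prints the URL of the open VOSupport tickets, sorted by last update.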
====== Identification of issues ======
Link to Biomed ARGO page: [[https://biomed.ui.argo.grnet.gr/]]
===== VOMS server =====
The proxy certificate creation should work:
  voms-proxy-init -voms biomed
The VOMS administration interface should be available. From a UI, run the command:
  voms-admin --vo=biomed --host voms-biomed.in2p3.fr --port 8443 list-cas
===== Monitoring SEs =====
==== Identify the problems ====
SRM probes used by ARGO box
- https://github.com/EGI-Foundation/nagios-plugins-srm
- based on the gfal2 library for the storage operations (gfal-copy, etc)
- queries the BDII service in order to build the Storage URL to test given the host-name and the VO name
- a X509 valid proxy certificate is needed to execute the probe (configured via X509_USER_PROXY variable).
__Reminder__: do **NOT** submit a ticket if the service is in **downtime** or it is **not in proper production status**: see on [[http://operations-portal.in2p3.fr/vapor/resources/GL2ResVO?VOfilter=biomed|VAPOR]] the supporting resources or faulty resources.
==== Reproduce the problem ====
Manual SRM testing (copy a file to the SE).
From the biomed-ui.fedcloud.fr VM, where gfal2 is already installed:
1. Build the Storage URL following the model srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed
NOTE 1: this model works for DPM SEs; it has not been verified for StoRM or dCache (a StoRM example is srm://storm-01.roma3.infn.it:8444/srm/managerv2?SFN=/biomed).
NOTE 2: it would be interesting to use the probe to build this URL.
2. Use gfal-ls to check that the folder can be listed:
  gfal-ls srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/user/s/scamarasu
3. Use gfal-copy to copy a file (in this case, job.jdl) to the above URL:
  gfal-copy job.jdl srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/user/s/scamarasu/
  Copying file:///home/spop/dirac/job.jdl [DONE] after 17s
4. Check that the file was copied and is now listed:
  gfal-ls srm://marsedpm.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/user/s/scamarasu
  job.jdl
Note that in some cases gfal-ls (as well as gfal-mkdir) may work while gfal-copy fails:
  gfal-mkdir srm://clrlcgse01.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/scamarasu
  gfal-ls srm://clrlcgse01.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/scamarasu
  gfal-copy dirac/job.jdl srm://clrlcgse01.in2p3.fr:8446/dpm/in2p3.fr/home/biomed/scamarasu/
  gfal-copy error: 70 (Communication error on send) - Could not open destination: globus_xio: Unable to connect to clrlcgse01.in2p3.fr:2811 globus_xio: System error in connect: Connection refused globus_xio: A system call failed: Connection refused
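The manual steps above can be wrapped in a small script. The sketch below is only an illustration: the function names ''dpm_surl'' and ''se_copy_test'' and the ''DRY_RUN'' variable are hypothetical, and the URL layout (port 8446, ''/dpm/<domain>/home/<vo>'', with the domain taken as the host name minus its first label) is assumed from the DPM examples above. It does not apply to StoRM or dCache.

```shell
#!/bin/sh
# Build a DPM-style Storage URL from an SE host name and a VO name.
# Assumption: port 8446 and the /dpm/<domain>/home/<vo> layout, as in the
# marsedpm.in2p3.fr example; other SE flavours use different layouts.
dpm_surl() {
    host=$1
    vo=${2:-biomed}
    domain=${host#*.}   # drop the first label: marsedpm.in2p3.fr -> in2p3.fr
    printf 'srm://%s:8446/dpm/%s/home/%s\n' "$host" "$domain" "$vo"
}

# Run the list / copy / list sequence against a Storage URL.
# With DRY_RUN=1 the commands are only printed, not executed.
# Note: commands are word-split on spaces, so paths must not contain any.
se_copy_test() {
    surl=$1
    file=$2
    for cmd in "gfal-ls $surl" "gfal-copy $file $surl/" "gfal-ls $surl"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "$cmd"
        else
            $cmd || { echo "FAILED: $cmd" >&2; return 1; }
        fi
    done
}
```

A possible use is ''se_copy_test "$(dpm_surl marsedpm.in2p3.fr)/user/s/scamarasu" job.jdl''. The function stops at the first failing command, which shows whether the listing or the copy is the broken step.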
==== Ignored alarms ====
The error ''"No information for [attribute(s): ['GlueServiceEndpoint', 'GlueSAPath', 'GlueVOInfoPath']]"'' occurs when the SE does not publish information in the BDII. This may be due to a network outage, an unscheduled downtime, etc.
In such cases, **DO NOT SUBMIT** a ticket until the SE is back online.
Reminder: __before submitting a ticket, make sure one is not open yet__.
==== Dealing with full SEs ====
See the [[Biomed-Shifts:full |full SE procedure]].
==== SE Decommissioning ====
When an SE is planned for decommissioning, launch the specific [[Biomed-Shifts:decommisioning|SE decommissioning procedure]].
An older version of the decommissioning page is available here: [[Biomed-Shifts:old:old-decommisioning|Old SE decommissioning procedure]].
===== Monitoring CEs =====
==== Identify the problems ====
The ARGO box is the best way to identify faulty resources.
==== Reproduce the problem ====
1. Manual ARC CE submission
- see https://www.nordugrid.org/arc/arc6/users/submit_job.html for more details and a job description example
- submit with ''arcsub job.xrsl -c CENAME''
Further ARC CE documentation is available in French: https://grand-est.fr/support-utilisateurs/documentation-en-ligne/guide-dutilisation-de-arc-ce/
and for DIRAC: https://grand-est.fr/support-utilisateurs/documentation-en-ligne/guide-dutilisation-de-dirac/
2. Manual HTCondor-CE submission
TO BE DONE
==== Ignored alarms ====
Shifters shall focus first on failed job submissions: probes ''emi.cream.CREAMCE-AllowedSubmission''.
However, other failing probes such as ''emi.cream.CREAMCE-JobCancel'' and ''emi.cream.CREAMCE-JobPurge'' may indicate that the test did not follow the expected workflow, hence tests should also be performed in those cases.
Some investigation is often needed to understand what the problem is, and whether a ticket should be submitted. In particular, here are several reasons that usually, though not always, justify ignoring an alarm:
* The service is in **downtime** or it is **not in proper production status**: check out [[http://operations-portal.in2p3.fr/vapor/resources/GL2ResVO?VOfilter=biomed|VAPOR]] for supporting resources or faulty resources.
* **The queue is disabled**: its status may be "Closed" or "Draining", and the log shows the message "queue is disabled".
* **Probe time-outs**: ARGO probes are configured to time out after some time. However, a time-out should not raise a critical alarm; a warning or unknown status would be more accurate.
* **"No compatible resources"**: this type of alarm is most likely caused by the WMS sending a job to an inappropriate CE. A non-urgent ticket may be submitted to ask for the admin's opinion.
* **Maximum number of jobs already in queue MSG=total number of jobs in queue exceeds the queue limit**: each site decides on its own policy whether to accept or reject biomed jobs. We cannot submit a ticket when the queue is full, given that we use resources in an opportunistic manner.
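As a rough illustration of the list above, the sketch below maps a probe error message to a suggested first reaction. The function name ''triage_ce_alarm'' and the three result labels are hypothetical; the patterns are the message fragments quoted above. Time-outs are left out because their exact message text is not given here. The result is only a hint, since each case still needs some investigation.

```shell
#!/bin/sh
# Suggest a first reaction to a CE alarm from its error message.
# Prints one of: "ignore", "non-urgent ticket", "investigate".
# Assumption: the message contains one of the fragments quoted in the
# list above; anything else is left to the shifter's judgment.
triage_ce_alarm() {
    case $1 in
        *"queue is disabled"*)       echo "ignore" ;;
        *"exceeds the queue limit"*) echo "ignore" ;;
        *"No compatible resources"*) echo "non-urgent ticket" ;;
        *)                           echo "investigate" ;;
    esac
}
```

For example, ''triage_ce_alarm "queue is disabled"'' prints ''ignore'', while an unrecognized message falls through to ''investigate''.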
==== Check CEs publishing bad number of running/waiting jobs ====
Check the CEs that publish wrong (default) values for running or waiting jobs in the [[http://operations-portal.in2p3.fr/vapor/resources/GL2ResVO?VOfilter=biomed|VAPOR]] report: this can be global CE data (tab //Faulty Computing//) or per-share job counts (tab //Faulty jobs//).
The default figure is 4444 or 444444.
For each of those, __if the CE is in normal production (no downtime, production status)__, submit a __non urgent__ ticket asking to solve the problem.
To make direct checks in the BDII, use the [[https://github.com/frmichel/vo-support-tools/blob/master/ldap_requests_glue2.txt|example LDAP requests]] provided in the [[https://github.com/frmichel/vo-support-tools|VO Support Tools project]].
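When job counts are read from a script (for instance from the output of an LDAP request), the default figures mentioned above can be detected with a simple test. This is a minimal sketch; the function name ''is_default_job_count'' is hypothetical and only the two values 4444 and 444444 given above are recognized.

```shell
#!/bin/sh
# Return 0 (true) if the given job count is one of the default
# placeholder values published by a misconfigured CE, 1 otherwise.
is_default_job_count() {
    case $1 in
        4444|444444) return 0 ;;
        *)           return 1 ;;
    esac
}
```

It can be used in a condition such as ''if is_default_job_count "$waiting"; then echo "suspicious value"; fi'', where ''waiting'' is a variable holding the published count.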
This is a template ticket message that can be reused to report a faulty resource. **Don't forget to customize the bold text.**
CE **hostname** publishes invalid data
Dear site admin,
The CE publishes erroneous default data:
- **no running job**
- **444444** waiting jobs
- ERT = **2146660842**
- WRT = **2146660842**
This was reported by VAPOR: http://operations-portal.in2p3.fr/vapor/resources/GL2ResVO?VOfilter=biomed (tab **Faulty Computing|Faulty jobs**).
Note that VAPOR reports job counts published in GLUE2 data which may differ from GLUE1 data.
You may want to check the suggestions here: https://wiki.egi.eu/wiki/Tools/Manuals/TS59
Thanks in advance for your support.
****, for the Biomed VO.
===== CVMFS support =====
Biomed is progressively migrating to the [[https://wiki.egi.eu/wiki/CVMFS_Task_Force|CVMFS]] solution (CernVM File System) to manage VO-specific software. In time, it should replace the variable VO_BIOMED_SW_DIR.
To do so, biomed VO administrators submit tickets to sites supporting CVMFS to ask them to enable biomed in their configuration. Shifters are requested to follow up on those tickets. In particular, when site admins agree to enable biomed, they ask us to test it: shifters then have to test the CVMFS service by submitting a job to one CE on that site. Use this [[https://github.com/frmichel/vo-support-tools/blob/master/CVMFS/test.sh|test script]] and this [[https://github.com/frmichel/vo-support-tools/blob/master/CVMFS/test_ce.jdl|jdl]]. Example:
  export CE=ce04-lcg.cr.cnaf.infn.it:8443/cream-lsf-biomed
  glite-wms-job-submit -a -o job_id.txt -r $CE test_ce.jdl
If you need to ask a site to enable biomed, you may want to copy [[https://ggus.eu/?mode=ticket_info&ticket_id=108728|one of the tickets]] submitted recently.
====== Advices about ticket submission ======
===== Before a ticket is submitted =====
- Check that the ARGO alarm can be reproduced manually (make sure it is not a monitoring issue) and that it still happens some time after it was first detected (make sure it is not a temporary error).
- Check that **no GGUS ticket is already open on this issue**: look at [[https://ggus.eu/index.php?mode=ticket_search|open team tickets]].
- Check that the **concerned host is not in status "downtime", "not in production", or "not monitored"** using VAPOR's supporting resources or faulty resources on [[http://operations-portal.in2p3.fr/vapor/resources/GL2ResVO?VOfilter=biomed|VAPOR]].
===== During ticket submission =====
- [[https://ggus.eu/?mode=ticket_team|Submit a team ticket]]. This will ensure that the next teams on duty will follow up on the tickets submitted during your shift.
- Specify ''biomed'' in the ''Concerned VO'' field of the GGUS submission form.
- Clearly specify the concerned service name in the subject (e.g. "LFC", "VOMS", "SE" and the SE name, "CE" and the CE name, etc.) to facilitate further searching in the ticket database.
- Priority: local incidents (e.g. one SE is down) should be set to "urgent". If more than 50% of the sites are down, set the priority to "very urgent". Incidents stopping the production (e.g. LFC or VOMS down) should be set to "top priority".
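The priority rule above can be written down as a small decision function. This is a sketch under assumptions: the function name ''ticket_priority'' and its arguments (service type, number of sites down, total number of sites) are hypothetical, and only LFC and VOMS are treated as production-stopping services, as in the examples given above.

```shell
#!/bin/sh
# Suggest a GGUS priority for an incident.
# Arguments:
#   $1 service type (LFC, VOMS, SE, CE, ...)
#   $2 number of sites currently down   (default: 0)
#   $3 total number of supporting sites (default: 0)
ticket_priority() {
    service=$1
    down=${2:-0}
    total=${3:-0}
    case $service in
        # Central services: their failure stops the production.
        LFC|VOMS) echo "top priority"; return ;;
    esac
    # "More than 50% of the sites" means down/total > 1/2, i.e. 2*down > total.
    if [ $((down * 2)) -gt "$total" ]; then
        echo "very urgent"
    else
        echo "urgent"
    fi
}
```

For example, ''ticket_priority SE 1 40'' prints ''urgent'', and ''ticket_priority CE 21 40'' prints ''very urgent''.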
This is a template ticket message that can be reused to report a faulty resource. **Don't forget to customize the bold text.**
**{SE|CE|WMS|...} hostname** is not working for the biomed VO
Dear site admin,
**** is not working for Biomed users. The incident was detected from the Biomed ARGO box that you may want to check to see the status:
****
The problem was reproduced by hand in the log below.
Thanks in advance for your support,
****, for the Biomed VO.
****
===== Ticket follow-up: general remarks =====
- Sites announcing that they are dropping biomed VO support should be asked to send a notification to biomed-vo-managers [at] googlegroups [dot] com and to kindly keep the site up to allow for file migration. In this case the [[Biomed-Shifts:decommisioning|SE decommissioning procedure]] must be initiated.
- Sites claiming to be in downtime should be asked to remove their entries from the BDII. Sites showing in ''lcg-infosites'' will be assumed to be in operation and are potential ticket targets.
- Messages from site admins sometimes seem impolite (e.g. the ticket is put in status ''solved'' without a single comment while the problem still persists). This may be the result of an automatic action from the local system used to answer GGUS tickets, and not necessarily from a person.