Daily tasks and best practices

This section describes the usual course of a biomed support duty shift. Subsequent sections give further details on specific tasks.

Starting a shift

When your shift starts, get the VO status summary (salient tickets, on-going issues) from the previous team on shift.

During the shift

This page describes daily tasks and best practices. The recommendations aim to organize the handling of technical issues across shifters and to provide a coherent interface that supports successful communication. The daily tasks at a glance:
  • Follow up on open tickets, and verify solved tickets.
  • Monitor critical services (VOMS, LFC).
  • Check Nagios alarms concerning SEs, CEs, WMSs.
  • Deal with full SEs and resource decommissioning.
  • Report detected issues concerning the Nagios box by assigning a team ticket to site “GRIF”.

Before submitting GGUS team tickets, read the advice about ticket submission carefully.

Ending a shift

At the end of the shift, please report a VO status summary to the next team on shift to help ensure a seamless take-over.

The actions below should be performed on a daily basis as much as possible.

Follow-up on open tickets

The follow-up of open issues is as important as the monitoring of resources.

At least once a week, we have to check on open tickets and:

  • send a reminder in case there has been no progress,
  • answer questions or take appropriate actions in case admins expect some inputs from us.

Check the status of open team tickets (sorted by last update)

Check the status of VOSupport tickets

These are notified to the biomed technical shift list, but it is good practice to check on them regularly. Note that VOSupport tickets are team tickets when we submitted them; they are not team tickets when they were submitted by site admins or users.

Verify solved tickets

Solved is not closed: once a ticket is in status “solved”, validate it, or re-open it if the problem persists.

Team and VOSupport tickets solved/verified/unsolved during the last month

Identification of issues

VOMS server

The proxy certificate creation should work:

voms-proxy-init -voms biomed

The VOMS administration interface should be available. From a UI, run the command:

voms-admin --vo=biomed --host voms-biomed.in2p3.fr --port 8443 list-cas
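The two VOMS checks above can be wrapped in a small helper that prints a one-line verdict per command. This is only a minimal sketch: the function name run_check and the wording of its output are invented here, and only the two commands shown in the usage comment come from this page.

```shell
# run_check LABEL COMMAND [ARGS...]
# Runs the command (its own output and prompts stay visible), then prints
# "OK: LABEL" or "FAILED: LABEL". Returns non-zero when the command failed.
run_check() {
    label=$1
    shift
    if "$@"; then
        echo "OK: $label"
    else
        echo "FAILED: $label"
        return 1
    fi
}

# Usage on a UI:
#   run_check "VOMS proxy creation" voms-proxy-init -voms biomed
#   run_check "VOMS admin interface" \
#       voms-admin --vo=biomed --host voms-biomed.in2p3.fr --port 8443 list-cas
```

The command output is deliberately not hidden, so that the pass phrase prompt of voms-proxy-init remains visible.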

LFC server

Command “time lfc-ls /grid” should return in less than 30 seconds.
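The 30-second criterion can be checked automatically instead of reading the output of time. The sketch below is an assumption-laden helper, not part of the official procedure: timed_check is a hypothetical name, it measures whole seconds with date, and it does not interrupt a command that hangs.

```shell
# timed_check LIMIT_SECONDS COMMAND [ARGS...]
# Succeeds only if the command exits with status 0 AND takes fewer than
# LIMIT_SECONDS (whole seconds). The command's output is discarded.
timed_check() {
    limit=$1
    shift
    start=$(date +%s)
    status=0
    "$@" >/dev/null 2>&1 || status=$?
    elapsed=$(( $(date +%s) - start ))
    echo "exit status $status after ${elapsed}s (limit ${limit}s)"
    [ "$status" -eq 0 ] && [ "$elapsed" -lt "$limit" ]
}

# Usage on a UI:
#   timed_check 30 lfc-ls /grid || echo "LFC looks slow or unavailable"
```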

Monitoring SEs

Identify the problems

lcg-cr and lcg-del should work on every SE of the VO. The Nagios box will help you identify faulty servers. You may use the following direct links: SRM service group status, or Critical issues for service group SRM.

Reminder: do NOT submit a ticket if the service is in downtime or is not in proper production status: check the supporting resources or faulty resources on VAPOR.

Reproduce the problem

lcg-cr -v --connect-timeout 30 --sendreceive-timeout 900 --bdii-timeout 30 --srm-timeout 300 --vo biomed file:///some_of_your_file -l lfn:/grid/biomed/yourhome/some_file_name -d SE_HOSTNAME

The following script automates the lcg-cr and lcg-del commands on a SE: lcg-cr.sh
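For readers who cannot access the linked script, the sketch below shows one possible way to prepare the upload and clean-up commands for a given SE. It is an independent illustration and not the content of lcg-cr.sh: the function name se_test_commands is invented, the lcg-cr options are copied from the command above, and the clean-up uses lcg-del -a (delete all replicas), which is assumed to be the appropriate option here. The commands are only printed, so that they can be reviewed before being run.

```shell
# se_test_commands SE_HOSTNAME LOCAL_FILE LFN_PATH
# Prints (does not run) the upload command and the clean-up command.
# LOCAL_FILE must be an absolute path; LFN_PATH is given without "lfn:".
se_test_commands() {
    se=$1
    local_file=$2
    lfn=$3
    echo "lcg-cr -v --connect-timeout 30 --sendreceive-timeout 900 --bdii-timeout 30 --srm-timeout 300 --vo biomed file://$local_file -l lfn:$lfn -d $se"
    echo "lcg-del -a --vo biomed lfn:$lfn"
}

# Usage:
#   se_test_commands SE_HOSTNAME /tmp/some_file /grid/biomed/yourhome/some_file_name
```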

Reminder: before submitting a ticket, make sure one is not already open.

Dealing with full SEs

SE Decommissioning

When a SE is planned for decommissioning, launch the specific SE decommissioning procedure.

Monitoring CEs

Identify the problems

glite-wms-job-submit should work on every CE of the VO. The Nagios box remains the best way to identify faulty resources. You may use the following direct link: Critical issues for service group CREAM-CE

It is sometimes necessary to manually rerun tests using the Reschedule option. In that case, only ACTIVE checks must be rescheduled, as opposed to PASSIVE checks. For a CREAM CE, only org.sam.CREAMCE-JobState-biomed can be rescheduled.

Reproduce the problem

Reproduce the problem by one of the two methods below.

1. Download this test JDL file, rename it as test_ce.jdl and submit it to the concerned CE. In the line

Requirements = (RegExp("ce_name",other.GlueCEUniqueID));

replace “ce_name” with the hostname of the CE to check, and submit the job:

glite-wms-job-submit -a test_ce.jdl

2. Alternatively, download this test JDL, rename it as test_ce_noreq.jdl and submit it to the concerned CE. Check the BDII (lcg-infosites) to get the full name of a queue on that CE and run the command:

glite-wms-job-submit -a -r <CE hostname>:<port>/<queue_name> test_ce_noreq.jdl

Then check the status and the output when the submit command has completed:

glite-wms-job-status <jobId>
glite-wms-job-output --dir <local directory> <jobId>
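For method 1 above, the replacement of the “ce_name” placeholder can be scripted rather than done by hand. The following is a minimal sketch: the function name make_ce_jdl and the file name template.jdl are hypothetical, and it assumes the downloaded JDL contains the placeholder exactly as "ce_name" (with double quotes), as in the Requirements line quoted above.

```shell
# make_ce_jdl CE_HOSTNAME TEMPLATE_JDL
# Prints the template with the quoted "ce_name" placeholder replaced by the
# quoted CE hostname. The template file itself is left untouched.
make_ce_jdl() {
    ce_host=$1
    template=$2
    sed "s/\"ce_name\"/\"$ce_host\"/" "$template"
}

# Usage:
#   make_ce_jdl <CE hostname> template.jdl > test_ce.jdl
#   glite-wms-job-submit -a test_ce.jdl
```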

Reminder: before submitting a ticket, make sure one is not already open.

Ignored alarms

Shifters shall focus on failed job submissions as a priority. Other types of alarms generally need more investigation to understand what the problem is and whether a ticket should be submitted. In particular, here are several reasons that may, but do not necessarily, lead to ignoring an alarm:

  • The service is in downtime or it is not in proper production status: check out VAPOR for supporting resources or faulty resources.
  • The queue is disabled: its status may be “Closed” or “Draining”, and the error in the log shows the message “queue is disabled”.
  • Job submission time-outs: Nagios probes are configured to time out after some time. However, that should not raise critical alarms; warning or unknown would be more accurate.
  • “No compatible resources”: this type of alarm most likely indicates a problem on the WMS, which sent a job to an inappropriate CE. A non-urgent ticket may be submitted to ask for the admin's opinion.
  • Alarms of probes WNRep*: this type of issue is frequently related to SE issues or close SE configuration. First check whether there is an alarm on the SE; if not, submit a new ticket.
  • File was NOT copied to SE…: that test consists in copying a file from the CE to the close SE. If the SE is in alarm, then a ticket should be submitted regarding the SE, not the CE.
  • Probe org.sam.WN-SoftVer-biomed fails for most CREAM CEs. This is due to an old version of the probe that is not able to parse the EMI2 version. This should no longer appear: it is fixed in the probes installed with Nagios update 17.1.
  • Maximum number of jobs already in queue MSG=total number of jobs in queue exceeds the queue limit: each site decides on its policy as to whether to accept or reject biomed jobs. We cannot submit a ticket when the queue is full, given that we use resources in an opportunistic manner.
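Some of the cases listed above can be recognized from the text of the alarm, which suggests a simple first-pass triage. The sketch below is only an illustration: triage_ce_alarm is a hypothetical helper, the matched strings are the messages quoted in the list, and the suggested reactions are hints that do not replace the shifter's judgment.

```shell
# triage_ce_alarm "ALARM MESSAGE"
# Prints a suggested first reaction based on known message fragments.
triage_ce_alarm() {
    case $1 in
        *"queue is disabled"*)
            echo "ignore: queue is disabled" ;;
        *"Maximum number of jobs already in queue"*|*"exceeds the queue limit"*)
            echo "ignore: queue is full" ;;
        *"No compatible resources"*)
            echo "consider a non-urgent ticket: possible WMS problem" ;;
        *"File was NOT copied to SE"*)
            echo "check the close SE first" ;;
        *)
            echo "investigate manually" ;;
    esac
}

# Usage:
#   triage_ce_alarm "CRITICAL: queue is disabled"
```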

Check CEs publishing bad number of running/waiting jobs

Check the CEs that publish wrong (default) values for running or waiting jobs on VAPOR faulty resources report. The default figure is 4444 or 444444.

For each of those, if the CE is in normal production (no downtime, production status), submit a non-urgent ticket asking to solve the problem. Note that several lines may refer to the same CE with several queues; in that case only one ticket should be submitted.
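The "one ticket per CE" rule can be applied mechanically once the faulty entries are available as text. The sketch below assumes a hypothetical two-column input, one line per queue in the form "<CE hostname>:<port>/<queue> <waiting jobs>"; VAPOR is not known to export this format, so the lines would have to be prepared by hand. The function name faulty_ce_hosts is invented.

```shell
# faulty_ce_hosts
# Reads "<CE hostname>:<port>/<queue> <waiting jobs>" lines on stdin and
# prints, once each, the hostnames publishing a default value (4444 or 444444).
faulty_ce_hosts() {
    awk '$2 == 4444 || $2 == 444444 { split($1, parts, ":"); print parts[1] }' | sort -u
}

# Usage:
#   faulty_ce_hosts < queues.txt
```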

This is a template ticket message that can be reused to report a faulty resource. Don't forget to customize the bold text.

Subject

CE hostname publishes invalid data

Body

Dear site admin,

The CE publishes erroneous default data:
- no running job
- 444444 waiting jobs
- ERT = 2146660842
- WRT = 2146660842

This was reported by VAPOR: http://operations-portal.in2p3.fr/vapor/resources/GL2ResVO?VOfilter=biomed (tab faulty jobs).

Can this be fixed? You may want to check the suggestions here: https://wiki.egi.eu/wiki/Tools/Manuals/TS59

Thanks in advance for your support,

<shifter name>, for the Biomed VO.

Monitoring WMSs

glite-wms-job-submit should return a job ID on every WMS. Testing whether jobs eventually manage to reach a CE is currently out of scope ⇒ Ignore the cases when jobs stay in status “Submitted” forever.

The Nagios box will help you identify faulty servers. You may use the following direct link: Critical issues for service group WMS

Reminder: do NOT submit a ticket if the service is in downtime or is not in proper production status: check out supporting resources or faulty resources on VAPOR.

Reproduce the problem on a UI:

  • Check the BDII (lcg-infosites) to get the full name of the WMS endpoint:
lcg-infosites wms | grep <WMS host name>
  • Submit this test jdl to the WMS endpoint returned above:
glite-wms-job-submit -a -e <WMS endpoint> test.jdl
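Since the criterion for a WMS is simply that a job ID is returned, the output of the submit command can be checked automatically. This sketch assumes that the job identifier is printed as an https:// URL alone on its own line; has_job_id is a hypothetical helper name.

```shell
# has_job_id
# Reads the output of a submit command on stdin and succeeds if at least one
# line consists solely of an https:// URL (taken to be the job identifier).
has_job_id() {
    grep -Eq '^https://[^[:space:]]+$'
}

# Usage on a UI:
#   glite-wms-job-submit -a -e <WMS endpoint> test.jdl | has_job_id \
#       && echo "WMS returned a job ID" || echo "no job ID returned"
```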

CVMFS support

Biomed is progressively migrating to the CVMFS solution (CERN Virtual Machine File System) to manage VO-specific software. In time, it should replace the variable VO_BIOMED_SW_DIR.

To do so, biomed VO administrators submit tickets to sites supporting CVMFS, asking them to enable biomed in their configuration. Shifters are requested to follow up on those tickets. In particular, when site admins agree to enable biomed, they ask us to test it; shifters then have to test the CVMFS service by submitting a job to one CE on that site. Use this test script and this jdl. Example:

export CE=ce04-lcg.cr.cnaf.infn.it:8443/cream-lsf-biomed
glite-wms-job-submit -a -o job_id.txt -r $CE test_ce.jdl
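The linked test script is not reproduced here. As a rough idea of what such a job could verify on the worker node, the sketch below checks that a repository directory exists and is not empty. Both the function name check_cvmfs_repo and the repository path shown in the usage comment are assumptions, to be replaced by the values used in the real test script.

```shell
# check_cvmfs_repo REPOSITORY_PATH
# Succeeds if the path is a directory containing at least one entry.
# Accessing a path under /cvmfs normally triggers the automatic mount.
check_cvmfs_repo() {
    repo=$1
    if [ -d "$repo" ] && [ -n "$(ls -A "$repo" 2>/dev/null)" ]; then
        echo "CVMFS OK: $repo"
    else
        echo "CVMFS NOT AVAILABLE: $repo"
        return 1
    fi
}

# Usage inside the job (the repository name is an assumption):
#   check_cvmfs_repo /cvmfs/biomed.egi.eu
```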

If you need to ask a site to enable biomed, you may want to copy one of the tickets submitted recently.

Advice about ticket submission

Before a ticket is submitted

  1. Check that the Nagios alarm can be reproduced manually (make sure it is not a monitoring issue) and that it still happens some time after it was first detected (make sure it is not a temporary error).
  2. Check that no GGUS ticket is already open on this issue: look at open team tickets.
  3. Check that the concerned host is not in status “downtime”, “not in production”, or “not monitored”, using the supporting resources or faulty resources reports on VAPOR.

During ticket submission

  1. Submit a team ticket. This will ensure that the next teams on duty will follow up on the tickets submitted during your shift.
  2. Specify biomed in the Concerned VO field of the GGUS submission form.
  3. Clearly specify the concerned service name in the subject (e.g. “LFC”, “VOMS”, “SE” and the SE name, “CE” and the CE name, etc.) to facilitate further searching in the ticket database.
  4. Priority: local incidents (e.g. 1 SE is down) should be at priority “urgent” unless more than 50% of the sites are down (then set priority to “very urgent”). Incidents stopping the production (e.g. LFC or VOMS down) should be at “top priority”.
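The priority rule in item 4 can be written down as a small decision function. This is a sketch of the rule as stated above, nothing more: the function name ticket_priority and the two incident labels ("local", "production-stopping") are invented for the illustration.

```shell
# ticket_priority KIND [PERCENT_OF_SITES_DOWN]
# KIND is "local" (e.g. one SE down) or "production-stopping" (e.g. LFC or
# VOMS down). PERCENT_OF_SITES_DOWN is an integer and defaults to 0.
ticket_priority() {
    kind=$1
    percent_down=${2:-0}
    case $kind in
        production-stopping)
            echo "top priority" ;;
        local)
            if [ "$percent_down" -gt 50 ]; then
                echo "very urgent"
            else
                echo "urgent"
            fi ;;
        *)
            echo "unknown incident kind: $kind" >&2
            return 1 ;;
    esac
}

# Usage:
#   ticket_priority local 60
#   ticket_priority production-stopping
```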

This is a template ticket message that can be reused to report a faulty resource. Don't forget to customize the bold text.

Subject

{SE|CE|WMS|…} hostname is not working for the biomed VO

Body

Dear site admin,

<hostname> is not working for Biomed users. The incident was detected from the Biomed Nagios box that you may want to check to see the status: <link to the Nagios alarm page>

The problem was reproduced by hand in the log below.

Thanks in advance for your support,

<shifter name>, for the Biomed VO.

<detailed log>
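Filling in the template by hand is error-prone when several tickets are submitted in a row. The sketch below generates the subject and body from a few arguments and appends the log. It is a convenience illustration only: ticket_message is a hypothetical name, and the result still has to be pasted into the GGUS submission form.

```shell
# ticket_message SERVICE HOSTNAME NAGIOS_URL SHIFTER_NAME LOG_FILE
# Prints the ticket subject and body following the template above,
# followed by the content of LOG_FILE.
ticket_message() {
    service=$1 host=$2 url=$3 shifter=$4 log=$5
    cat <<EOF
Subject: $service $host is not working for the biomed VO

Dear site admin,

$host is not working for Biomed users. The incident was detected from the Biomed Nagios box that you may want to check to see the status: $url
The problem was reproduced by hand in the log below.

Thanks in advance for your support,

$shifter, for the Biomed VO.

EOF
    cat "$log"
}

# Usage:
#   ticket_message SE <hostname> <link to the Nagios alarm page> "<shifter name>" lcg-cr.log
```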

Ticket follow-up: general remarks

  1. Sites announcing that they are dropping support for the biomed VO should be asked to send a notification to biomed-vo-managers [at] googlegroups [dot] com and to kindly keep the site up to allow for file migration. In this case the SE decommissioning procedure must be initiated.
  2. Sites claiming to be in downtime should be asked to remove their entries from the BDII. Sites showing in lcg-infosites will be assumed to be in operation and are a potential ticket target.
  3. Messages from site admins sometimes seem impolite (e.g. the ticket is put in status solved without a single comment while the problem still persists). This may be the result of an automatic action from the local system used to answer GGUS tickets, and not necessarily from a person.