
4. System configuration

Without any specification, the installation phase will produce a working PBS system with the following defaults:

Once the system has been built and installed, the Server and Moms must be configured and the scheduling policy must be implemented. These items are closely coupled. There are several ways to manage which and how many jobs are scheduled into execution, and each one affects the implementation of the scheduling policy and the server attributes. One example is the decision to schedule jobs out of a single pool (queue) or to divide jobs among multiple queues, each of which is managed differently.

If you want to run jobs on more than a single computer, you will need to install the execution daemon (pbs_mom) on each host where jobs are expected to execute. If you are running the default scheduler, fifo, you will need to populate a nodes file (PBS_HOME/server_priv/nodes) with one entry for each execution host, specifying, if appropriate, the number of processors per host. For example:

node1 np=4
node2 np=4
node3 np=2
node4 np=2

If you write your own Scheduler, it can be told which hosts jobs may run on by means other than the Server's nodes file.
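
For sites with many execution hosts, the nodes file entries shown above could be generated rather than typed by hand. The following is a minimal Python sketch, assuming only the "hostname np=N" line format from the example; render_nodes_file is a hypothetical helper and not part of PBS.

```python
def render_nodes_file(hosts):
    """Return nodes-file text: one 'hostname np=N' line per execution host.

    hosts maps a host name to its number of processors (hypothetical input).
    """
    lines = []
    for name, processors in hosts.items():
        if processors < 1:
            raise ValueError(f"{name}: processor count must be at least 1")
        lines.append(f"{name} np={processors}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Host names and processor counts mirror the example entries.
    print(render_nodes_file({"node1": 4, "node2": 4, "node3": 2, "node4": 2}), end="")
```

The resulting text would then be written to PBS_HOME/server_priv/nodes by the administrator.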

4.1 qmgr - pbs batch system manager command

qmgr [-a] [-c command] [-e] [-n] [-z] [server...]

The qmgr command provides an administrator interface to the batch system. The command reads directives from standard input. The syntax of each directive is checked and the appropriate request is sent to the batch server or servers. The list or print subcommands of qmgr can be executed by general users. Creating or deleting a queue requires PBS Manager privilege. Setting or unsetting server or queue attributes requires PBS Operator or Manager privilege. The server operands identify the name of the batch server to which the administrator requests are sent. Each server conforms to the following syntax: host_name[:port] where host_name is the network name of the host on which the server is running and port is the port number to which to connect. If server is not specified, the administrator requests are sent to the local server.
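
The server operand syntax host_name[:port] can be illustrated with a small Python sketch. parse_server_operand is a hypothetical helper written for this page, not a PBS function, and the port number used in the example call is illustrative only.

```python
def parse_server_operand(operand):
    """Split 'host_name[:port]' into (host_name, port); port is None if absent."""
    host, sep, port = operand.partition(":")
    if not host:
        raise ValueError("host_name is required")
    if sep and not port.isdigit():
        raise ValueError(f"invalid port: {port!r}")
    return host, int(port) if sep else None


if __name__ == "__main__":
    # Illustrative operands; the host name and port are assumptions.
    print(parse_server_operand("big.foo.bar.com"))
    print(parse_server_operand("big.foo.bar.com:15001"))
```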

A qmgr directive is one of the following forms:

   command server [names] [attr OP value[,attr OP value,...]]
   command queue [names] [attr OP value[,attr OP value,...]]
   command node [names] [attr OP value[,attr OP value,...]]

where command is the command to perform on an object. The commands are:

active

sets the active objects. If active objects are specified and no name is given in a qmgr command, the active object names will be used.

create

creates a new object; applies to queues and nodes.

delete

destroys an existing object; applies to queues and nodes.

set

defines or alters attribute values of the object.

unset

clears the value of attributes of the object. Note that this form does not accept an OP and value, only the attribute name.

list

lists the current attributes and associated values of the object.

print

prints all the queue and server attributes in a format that is usable as input to the qmgr command.

names is a list of one or more names of specific objects. The name list is in the form: [name][@server][,queue_name[@server]...] with no intervening white space. The name of an object is declared when the object is first created. If the name is @server, then all the objects of the specified type at the server will be affected.

attr specifies the name of an attribute of the object which is to be set or modified. If the attribute is one which consists of a set of resources, then the attribute is specified in the form: attribute_name.resource_name

OP is the operation to be performed with the attribute and its value:

=

set the value of the attribute. If the attribute has an existing value, the current value is replaced with the new value.

+=

increase the current value of the attribute by the amount in the new value.

-=

decrease the current value of the attribute by the amount in the new value.

value is the value to assign to an attribute. If the value includes white space, commas or other special characters, such as the # character, the value string must be enclosed in quote marks (").
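
The directive syntax described above can be sketched as a small Python formatter. This is an illustration under stated assumptions, not part of qmgr: format_directive and quote_value are hypothetical helpers, only white space, commas and # are treated as special characters, and the unset form (attribute name only) is not covered.

```python
SPECIAL_CHARACTERS = set(" \t,#")


def quote_value(value):
    """Wrap value in quote marks if it holds white space, a comma, or '#'."""
    if any(ch in SPECIAL_CHARACTERS for ch in value):
        return f'"{value}"'
    return value


def format_directive(command, object_type, names=(), attrs=()):
    """Build 'command object [names] [attr OP value,...]' as one string.

    attrs is a sequence of (attr, op, value) tuples; op is '=', '+=' or '-='.
    """
    parts = [command, object_type]
    if names:
        parts.append(",".join(names))  # name list has no intervening white space
    if attrs:
        for _, op, _ in attrs:
            if op not in ("=", "+=", "-="):
                raise ValueError(f"unknown OP: {op!r}")
        parts.append(",".join(f"{a} {op} {quote_value(v)}" for a, op, v in attrs))
    return " ".join(parts)


if __name__ == "__main__":
    print(format_directive("set", "queue", ["dque"],
                           [("resources_max.cput", "=", "2:00:00")]))
```

The printed string could be passed to qmgr with the -c option or fed to it on standard input.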

4.2 Starting the daemons

All three of the daemon processes, Server, Scheduler and Mom, must run with the real and effective uid of root. Typically, the daemons are started from the system's boot files, e.g. /etc/rc.local. However, it is recommended that the Server be brought up "by hand" the first time and configured before being run at boot time.

Starting mom

Mom should be started at boot time. Typically there are no required options. It works best if Mom is started before the Server on every node so they will be ready to respond to the Server's "are you there?" ping. Start Mom with the line:

{sbindir}/pbs_mom [options]

in the /etc/rc2 or equivalent boot file. If the Server or Scheduler is running on a different host, the host name(s) must be specified in Mom's configuration file; see the pbs_mom configuration section.

Starting the server

The initial run of the Server or any first time run after recreating the home directory must be with the -t create option:

{sbindir}/pbs_server -t create

This option directs the Server to discard any existing configuration files, queues and jobs, and initialize configuration files to the default values. This is best done by hand. At this point it is necessary to configure the Server. See the pbs_server configuration section.

After the Server is configured it may be placed into service. Normally it is started in the system boot file via a line such as:

{sbindir}/pbs_server [options]

The -t start_type option may be specified where start_type is one of the options (hot|warm|cold) specified in the pbs_server man page. The default is warm.

Starting the scheduler

The Scheduler should also be started at boot time. Start it with an entry in the /etc/rc2 or equivalent file:

{sbindir}/pbs_sched [options]

There are no required options for the default fifo scheduler.

4.3 Configuring the Execution Server, pbs_mom

The function of pbs_mom is to place jobs into execution as directed by the server, establish resource usage limits, monitor the job's usage, and notify the server when the job completes. If they exist, pbs_mom will execute a prologue script before executing a job and an epilogue script after executing the job. Another function of pbs_mom is to provide information about the status of running jobs, available memory, etc. in response to a resource monitor request, typically submitted by the PBS scheduler. pbs_mom records a diagnostic message in a log file for any error that occurs. The log files are maintained in the mom_logs directory below the home directory of the server (default /usr/spool/PBS/mom_logs). If the log file cannot be opened, the diagnostic message is written to the system console.

Mom must know the name of the server that manages it; the name must be declared in the file PBS_HOME/server_name. Mom is configured via a configuration file which it reads at initialization time and whenever it receives a SIGHUP signal. This file is described in the pbs_mom(8) man page as well as in the following section. If the -c option is not specified when Mom is run, it will open PBS_HOME/mom_priv/config if it exists. If the file does not exist, Mom will continue anyway. The configuration file must be "secure": it must be owned by a user id and group id less than 10 and must not be world writable.
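
Before starting Mom, an administrator might want to confirm that the configuration file meets the "secure" rule stated above. The following Python sketch encodes only that rule (owner uid and gid below 10, not world writable); is_secure and config_is_secure are hypothetical helpers, and the check Mom itself performs may differ in detail.

```python
import os
import stat


def is_secure(uid, gid, mode):
    """Apply the rule: uid and gid less than 10, and not world writable."""
    return uid < 10 and gid < 10 and not (mode & stat.S_IWOTH)


def config_is_secure(path):
    """Check an existing file on disk against the same rule."""
    info = os.stat(path)
    return is_secure(info.st_uid, info.st_gid, info.st_mode)


if __name__ == "__main__":
    # Owned by root:root with mode 0644 passes; mode 0666 does not.
    print(is_secure(0, 0, 0o100644))
    print(is_secure(0, 0, 0o100666))
```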

The file provides several types of run time information to pbs_mom: static resource names and values, external resources provided by a program to be run on request via a shell escape, and values to pass to internal set up functions at initialization (and re-initialization).

Each item type is on a single line with the component parts separated by white space. If the line starts with a hash mark (pound sign, #), the line is considered to be a comment and is skipped. An example configuration file is:

$logevent 0x0ff             #enables logging of all events except debug events
$clienthost fe.widget.com   #Mom will accept privileged connections from this host,
                            #typically the host where the server and scheduler run
$restricted *.widget.com    #Mom will accept connections from these hosts,
                            #typically hosts on which a monitoring tool
                            #(such as xpbsmon) can be run
$ideal_load 2.0             #when the load average on the node drops below this value,
                            #Mom informs the server that the node is no longer busy
$max_load   3.5             #when the load average on the node exceeds this value,
                            #Mom informs the server that the node is busy
$cputmult 1.3               #factor used to adjust the cpu time usage of the job to allow
                            #comparison between nodes with different cpu performance
$wallmult 1.3               #factor used to adjust the wall time usage of the job to allow
                            #comparison between nodes with different cpu performance
$usecp bevyboss.widget.com:/u/home /r/home   #tells Mom to use cp instead of rcp or scp
                            #to transfer files from/to that destination because it is NFS mounted
tape8mm 2                   #tells Mom the value of a static resource
                            #(e.g. number of resources)
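
The $ideal_load and $max_load entries together describe a two-threshold (hysteresis) rule: the node becomes busy above $max_load and stops being busy only below $ideal_load. The Python sketch below shows that reading of the comments; node_is_busy is a hypothetical function, and the behaviour between the two thresholds (keep the previous state) is an assumption rather than something this page states.

```python
def node_is_busy(was_busy, load_average, ideal_load=2.0, max_load=3.5):
    """Return the new busy state for a node given its current load average.

    Defaults mirror the example configuration values.
    """
    if load_average > max_load:
        return True       # load exceeds $max_load: node is busy
    if load_average < ideal_load:
        return False      # load dropped below $ideal_load: no longer busy
    return was_busy       # in between: assumed to keep the previous state


if __name__ == "__main__":
    state = False
    for load in (1.0, 3.6, 2.5, 1.9):
        state = node_is_busy(state, load)
        print(load, state)
```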

The directories and files involved are:

$PBS_SERVER_HOME/mom_priv the default directory for configuration files, typically /usr/spool/PBS/mom_priv.

$PBS_SERVER_HOME/mom_logs directory for log files recorded by pbs_mom.

$PBS_SERVER_HOME/mom_priv/config the default configuration file

$PBS_SERVER_HOME/mom_priv/prologue the administrative script to be run before job execution.

$PBS_SERVER_HOME/mom_priv/epilogue the administrative script to be run after job execution.

4.4 Configuring the Job Server, pbs_server

Server management consists of configuring the Server attributes and establishing queues and their attributes. Unlike Mom and the Job Scheduler, the Job Server (pbs_server) is configured while it is running, except for the nodes file. Configuring server and queue attributes and creating queues is done with the qmgr command. This must be done either as root or as a user who has been granted PBS Manager privilege. Exactly what needs to be set depends on your scheduling policy and how you choose to implement it. The system needs at least one queue established and certain server attributes initialized.

The following are the "minimum required" server attributes and the recommended attributes; see the pbs_server_attributes man page for a complete list of server attributes. They are set via the set server (s s) subcommand to the qmgr command.

default_queue Declares the default queue to which jobs are submitted if a queue is not specified on the qsub command. The queue must be created first. Example:

   Qmgr: c q dque queue_type=execution
   Qmgr: s s default_queue=dque

acl_hosts A list of hosts from which jobs may be submitted. Example:

   Qmgr: s s acl_hosts=*.foo.bar.com,boss.hq.bar.com

acl_host_enable Enables the Server's host access control list, see above. Example:

   Qmgr: s s acl_host_enable=true

default_node Defines the node on which jobs are run if not otherwise directed. Example:

   Qmgr: s s default_node=big

managers Defines which users, at a specified host, are granted batch system administrator privilege. For example:

   Qmgr: s s managers=me@*.foo.bar.com,sam@big.foo.bar.com

node_pack Defines the order in which multiple cpu cluster nodes are allocated to jobs.

resources_default This attribute establishes the resource limits assigned to jobs that were submitted without a limit and for which there are no queue limits. See the pbs_resources_* man page for your system type (* is irix6, linux, solaris5, ...). Example:

   Qmgr: s s resources_default.cput=5:00
   Qmgr: s s resources_default.mem=4mb

resources_max This attribute sets the maximum amount of resources which can be used by a job entering any queue on the Server. This limit is checked only if there is not a queue specific resources_max attribute defined for the specific resource.

Queues Configuration

There are two types of queues defined by PBS, routing and execution. A routing queue is a queue used to move jobs to other queues which may even exist on different PBS Servers. Routing queues are similar to the old NQS pipe queues. A job must reside in an execution queue to be eligible to run. The job remains in the execution queue during the time it is running.

A Server may have multiple queues of either or both types. A Server must have at least one queue defined. Typically it will be an execution queue; jobs cannot be executed while residing in a routing queue.

Queue attributes fall into three groups: those which are applicable to both types of queues, those applicable only to execution queues, and those applicable only to routing queues. If an "execution queue only" attribute is set for a routing queue, or vice versa, it is simply ignored by the system. However, as this situation might indicate the administrator made a mistake, the Server will issue a warning message about the conflict. The same message will be issued if the queue type is changed and there are attributes that do not apply to the new type.

Not all of the Queue Attributes are discussed here, only what is needed to get a reasonable system up and running. See the pbs_queue_attributes man page for a complete list of queue attributes.

queue_type Must be set to either execution or routing (e or r will do). The queue type must be set before the queue can be enabled. Example:

   Qmgr: s q dque queue_type=execution

enabled If set to true, jobs may be enqueued into the queue. If false, jobs will not be accepted.

started If set to true, jobs in the queue will be processed, either routed by the Server if the queue is a routing queue or scheduled by the Scheduler if it is an execution queue.

route_destinations (Only for routing queues) Lists the local queues or queues at other Servers to which jobs in this routing queue may be sent. For example:

   Qmgr: s q routem route_destinations=dque,overthere@another.foo.bar.com

resources_max If you choose to have more than one execution queue based on the size or type of job, you may wish to establish maximum and minimum values for various resource limits. This will restrict which jobs may enter the queue and will override the resources_max value for the same resource defined at the Server level. If there is no maximum value declared for a resource type, there is no restriction on that resource. For example:

   Qmgr: s q dque resources_max.cput=2:00:00

places a restriction that no job requesting more than 2 hours of cpu time will be allowed in the queue. There is no restriction on the memory limit, mem, that a job may request.

resources_min Defines the minimum value of resource limit specified by a job before the job will be accepted into the queue. If not set, there is no minimum restriction.

resources_default Defines a set of default values for jobs entering the queue that did not specify certain resource limits. There is a corresponding server attribute which sets a default for all jobs.
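
The way resources_min and resources_max restrict entry to a queue can be sketched in Python as follows. This is an illustration only: queue_accepts is a hypothetical function, resource amounts are plain numbers (for example cpu time in seconds), and only resources the job actually requested are checked.

```python
def queue_accepts(requested, resources_min=None, resources_max=None):
    """Return True if every requested amount lies within the queue's limits.

    All three arguments map a resource name to a number. A resource with no
    declared minimum or maximum is unrestricted in that direction.
    """
    resources_min = resources_min or {}
    resources_max = resources_max or {}
    for name, amount in requested.items():
        if name in resources_max and amount > resources_max[name]:
            return False
        if name in resources_min and amount < resources_min[name]:
            return False
    return True


if __name__ == "__main__":
    two_hours = 2 * 60 * 60
    # A queue limited to 2 hours of cpu time, with no limit on mem.
    print(queue_accepts({"cput": 3600}, resources_max={"cput": two_hours}))
    print(queue_accepts({"cput": 10800}, resources_max={"cput": two_hours}))
```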

The limit for a specific resource usage is established by checking various job, queue, and server attributes. The following list shows the attributes and their order of precedence:

1. The job attribute Resource_List, i.e. what was requested by the user.
2. The queue attribute resources_default.
3. The Server attribute resources_default.
4. The queue attribute resources_max.
5. The Server attribute resources_max.
Please note, an unset resource limit for a job is treated as an infinite limit.
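
The order of precedence listed above amounts to "take the first value that is set". A minimal Python sketch, assuming each source is either a number or None when unset; effective_limit is a hypothetical name used only for illustration.

```python
import math


def effective_limit(job_request=None, queue_default=None, server_default=None,
                    queue_max=None, server_max=None):
    """Return the limit for one resource, following the listed precedence.

    The arguments are ordered from highest to lowest precedence. If none is
    set, the limit is treated as infinite.
    """
    for candidate in (job_request, queue_default, server_default,
                      queue_max, server_max):
        if candidate is not None:
            return candidate
    return math.inf


if __name__ == "__main__":
    # No job request, so the queue default applies before the server maximum.
    print(effective_limit(queue_default=300, server_max=7200))
```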

Recording Server Configuration

Should you wish to record the configuration of a Server for re-use, you may use the print subcommand of qmgr. For example,

qmgr -c "print server" > /tmp/server.con

will record in the file /tmp/server.con the qmgr subcommands required to recreate the current configuration, including the queues. The commands can be fed back into qmgr via standard input:

qmgr < /tmp/server.con

It isn't necessary to do this at every pbs_server startup because (unless -t create is specified) the Server maintains its current configuration in a private database (server_priv/serverdb).

4.5 Configuring the Scheduler, pbs_sched

PBS provides a separate process to schedule which jobs should be placed into execution. This is a flexible mechanism by which you may implement a very wide variety of policies. In fact it is possible to implement a replacement Scheduler using the provided APIs which will enforce the desired policies. The configuration required for a Scheduler depends on the Scheduler itself. The delivered FIFO Scheduler provides the ability to sort the jobs in several different ways, in addition to FIFO order. There is also the ability to sort on user and group priority. Mainly this Scheduler is intended to be a jumping-off point for a real Scheduler to be written. A good amount of code has been written to make it easier to change and add to this Scheduler. As distributed, the fifo Scheduler takes its options from the file PBS_HOME/sched_priv/sched_config.

To change them, change directory into PBS_HOME/sched_priv and edit the scheduling policy config file sched_config, or keep the default values. This file controls the scheduling policy (which jobs are run when). The format of the sched_config file is:

name: value [prime | non_prime | all]

Neither name nor value may contain white space. value can be: true | false | number | string. Any line starting with a '#' is a comment. A blank third word is equivalent to "all", which is both prime and non-prime. The values shipped as defaults are shown in braces {}. Here are some of the Scheduler attributes you can set:

round_robin {false all} boolean: If true - run jobs one from each queue in a circular fashion; if false - run as many jobs as possible up to queue/server limits from one queue before processing the next queue. The following server and queue attributes, if set, will control if a job "can be" run: resources_max, max_running, max_user_run, and max_group_run. See the man pages pbs_server_attributes and pbs_queue_attributes.

by_queue {true all} boolean: If true - the jobs will be run from their queues; if false - the entire job pool in the Server is looked at as one large queue.

strict_fifo {false all} boolean: If true - will run jobs in a strict FIFO order. This means if a job fails to run for any reason, no more jobs will run from that queue/server that scheduling cycle. If strict_fifo is not set, large jobs can be starved, i.e., not allowed to run because a never ending series of small jobs use the available resources. Also see the server attribute resources_max and the fifo parameter help_starving_jobs below.

fair_share {false all} boolean: This will turn on the fair share algorithm. It will also turn on usage collecting, and jobs will be selected using a function of their usage and priority (shares).

load_balancing {false all} boolean: If this is set the Scheduler will load balance the jobs between a list of time-shared hosts (:ts) obtained from the Server (pbs_server). The Server reads the list from its nodes file.

help_starving_jobs boolean: This bit will have the Scheduler turn on its rudimentary starving jobs support. Once jobs have waited for the amount of time given by max_starve, they are considered starving, i.e. no jobs will run until the starving job can be run. max_starve needs to be set also.

max_starve The amount of time before a job is considered starving. This config variable is not used if help_starving_jobs is not set.

sort_by {shortest_job_first} string: have the jobs sorted. sort_by can be set to a single sort type or multi_sort. If set to multi_sort, multiple key fields are used. Each key field will be a key for the multi sort. The order of the key fields decides which sort type is used first. Possible sort keys: no_sort, shortest_job_first, longest_job_first, smallest_memory_first, largest_memory_first, high_priority_first, low_priority_first, multi_sort, fair_share, large_walltime_first, short_walltime_first.

log_filter {256} What event types not to log. The value should be the addition of the event classes which should be filtered (i.e. ORing them together). The numbers are defined in src/include/log.h. NOTE: those numbers are in hex and log_filter is in base 10.

dedicated_prefix {ded} The queues with this prefix will be considered dedicated queues.
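
Because the event classes are defined in hex while log_filter is written in base 10, a small conversion step is needed. The Python sketch below ORs a list of event-class masks and returns the decimal string; log_filter_value is a hypothetical helper, and the hex values in the example are placeholders that should be replaced with the ones found in src/include/log.h.

```python
from functools import reduce
from operator import or_


def log_filter_value(event_classes):
    """OR the event-class bit masks and return the base-10 text for log_filter."""
    return str(reduce(or_, event_classes, 0))


if __name__ == "__main__":
    # Placeholder masks; look up the real event classes in src/include/log.h.
    print("log_filter:", log_filter_value([0x100]))
    print("log_filter:", log_filter_value([0x100, 0x080]))
```

ORing gives the same result as adding as long as each class appears once, and it stays correct if a class is listed twice.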

Example of FIFO Configuration file

#Set the boolean values which define how the scheduling policy finds
#the next job to consider to run.
round_robin: False      ALL
by_queue: True          prime
by_queue: false         non_prime
strict_fifo: true       ALL
fair_share: True        prime
fair_share: false       non_prime

# help jobs which have been waiting too long
help_starving_jobs: true        prime
help_starving_jobs: false       non_prime

# Set a multi_sort
# This example will sort jobs first by ascending cpu time requested, and then
# by ascending memory requested, and then finally by descending job priority
#
sort_by: multi_sort
key: shortest_job_first
key: smallest_memory_first
key: high_priority_first

# Set the debug level to only show high level messages.
# Currently this only shows jobs being run
debug_level: high_mess

# a job is considered starving if it has waited for this long
max_starve:     24:00:00

# If the Scheduler comes by a user which is not currently in the resource group
# tree, they get added to the "unknown" group.  The "unknown" group is in root's
# resource group.  This says how many shares it gets.
unknown_shares: 10

# The usage information needs to be written to disk in case the Scheduler
# goes down for any reason.  This is the amount of time between when the
# usage information in memory is written to disk.  The example syncs the
# information every hour.
sync_time: 1:00:00

# What events do you not want to log.  The event numbers are defined in
# src/include/log.h.  NOTE: the numbers are in hex, and log_filter is in
# base 10.
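
The sched_config line format, name: value [prime | non_prime | all], can be read with a few lines of Python. This is a sketch under stated assumptions, not the Scheduler's own parser: parse_sched_line is a hypothetical function, the third word is matched without regard to case, and the hyphenated spelling non-prime is also accepted as a leniency.

```python
def parse_sched_line(line):
    """Parse one sched_config line.

    Returns (name, value, when), or None for blank lines and comments.
    A missing third word is treated as "all".
    """
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    words = text.split()
    if len(words) not in (2, 3) or not words[0].endswith(":"):
        raise ValueError(f"malformed line: {line!r}")
    name = words[0][:-1]  # drop the trailing colon
    value = words[1]
    when = words[2].lower().replace("-", "_") if len(words) == 3 else "all"
    if when not in ("prime", "non_prime", "all"):
        raise ValueError(f"unknown time specifier: {words[2]!r}")
    return name, value, when


if __name__ == "__main__":
    for sample in ("round_robin: False      ALL", "sync_time: 1:00:00", "# comment"):
        print(parse_sched_line(sample))
```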
