The partition manager defines one or more load balancing policies. Whereas the PSU produces a prioritized set of candidate pools using a collection of rules defined by the administrator, the load balancing policy determines the specific pool to use. It is also the load balancing policy that determines when to fall back to links with lower priority, or when to trigger the creation of additional copies of a file.
Since the load balancing policy and parameters are defined per partition, understanding the partition manager is essential to tuning load balancing. This does not imply that one has to partition the dCache instance. It is perfectly valid to use a single partition for the complete instance.
This section documents the use of the partition manager: how to create partitions, set parameters, and associate links with partitions. The following sections describe the available partition types and their configuration parameters.
There are various parameters that affect the load balancing policy. Some of them are generic and apply to any load balancing policy, but many are specific to a particular policy. To avoid limiting the complete dCache instance to a single configuration, the choice of load balancing policy and the various parameters apply to partitions of the instance. The load balancing algorithm and the available parameters are determined by the partition type.
Each PSU link can be associated with a different partition, and the policy and parameters of that partition will be used to choose a pool from the set of candidate pools. The only partition that exists without being explicitly created is the partition called default. This partition is used by all links that do not explicitly identify a partition. Other partitions can be created or modified as needed.
The default partition is of the wass partition type. The classic partition type implements the one load balancing policy that was available in dCache before version 2.0 and is described later in this chapter. Other partitions have one of a number of available types. The system is pluggable, meaning that third party plugins can be loaded at runtime to add additional partition types, thus providing the ultimate control over load balancing in dCache. Documentation on how to develop plugins is beyond the scope of this chapter.
To ease the management of partition parameters, a common set of shared parameters can be defined outside all partitions. Any parameter not explicitly set on a partition inherits the value from the common set. If not defined in the common set, a default value determined by the partition type is used. Currently, the common set of parameters happens to be the same as the parameters of the default partition; however, this is only due to compatibility constraints and may change in future versions.
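Example:
For instance, a space cost factor can be set in the common set, overridden for a single partition, and the override removed again with the pm set command described below (the partition name special and the values are purely illustrative):

(PoolManager) admin > pm set -spacecostfactor=1.5
(PoolManager) admin > pm set special -spacecostfactor=2.0
(PoolManager) admin > pm set special -spacecostfactor=off

After the last command the special partition again inherits the value 1.5 from the common set.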
For each partition you can choose the load balancing policy. You do this by choosing the type of the partition.
Currently four different partition types are supported:

classic:
This is the pool selection algorithm used in the versions of dCache prior to version 2.0. See the section called “Classic Partitions” for a detailed description.

random:
This pool selection algorithm selects a pool randomly from the set of available pools.

lru:
This pool selection algorithm selects the pool that has not been used for the longest time.

wass:
This pool selection algorithm selects pools randomly, weighted by available space, while incorporating the age and amount of garbage-collectible files and information about load. This is the partition type of the default partition. See How to Pick a Pool for more details.
Commands related to dCache partitioning:
pm types
Lists available partition types. New partition types can be added through plugins.
pm create [-type=<partitionType>] <partitionName>
Creates a new partition. If no partition type is specified, then a wass partition is created.
pm set [<partitionName>] -<parameterName>=<value>|off
Sets a parameter <parameterName> to a new value. If <partitionName> is omitted, the common shared set of parameters is updated. The value is used by any partition for which the parameter is not explicitly set.
If a parameter is set to off then this parameter is no longer defined and is inherited from the common shared set of parameters, or a partition type specific default value is used if the parameter is not defined in the common set.
pm ls [-l] [<partitionName>]
Lists a single partition or all partitions, including the type of each partition. If a partition name or the -l option is used, then the partition parameters are shown too. Inherited and default values are identified as such.
pm destroy <partitionName>
Removes a partition from dCache. Any links configured to use this partition will fall back to the default partition.
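Example:
A short session that creates a partition of type random, inspects it and removes it again (the partition name test-partition is purely illustrative):

(PoolManager) admin > pm types
(PoolManager) admin > pm create -type=random test-partition
(PoolManager) admin > pm ls -l test-partition
(PoolManager) admin > pm destroy test-partition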
A partition, so far, is just a set of parameters which may or may not differ from the default set. To let a partition relate to a part of the dCache instance, links are used. Each link may be assigned to at most one partition. If none is set, or the assigned partition does not exist, the link defaults to the default partition.
psu set link <linkName> -section=<partitionName> [<other-options>...]
Whenever this link is chosen for pool selection, the associated parameters of the assigned partition will become active for further processing.
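Example:
Assuming a link named default-link exists and a partition named special has been created, the link is associated with the partition as follows:

(PoolManager) admin > psu set link default-link -section=special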
Depending on the way links are set up, it may very well happen that more than one link is triggered for a particular dCache request. This is not illegal, but leads to an ambiguity in selecting an appropriate dCache partition. If only one of the selected links has a partition assigned, this partition is chosen. Otherwise, if different links point to different partitions, the result is indeterminate. This issue is not yet solved and we recommend cleaning up the poolmanager configuration to eliminate links with the same preferences for the same type of requests.
In the Web Interface you will find a page summarizing the partition status of the system; this is essentially the output of the command pm ls -l.
Example:
For your dCache on dcache.example.org the address is

http://dcache.example.org:2288/poolInfo/parameterHandler/set/matrix/*
For the subsequent examples we assume a basic poolmanager setup:
Example:
#
# define the units
#
psu create unit -protocol */*
psu create unit -protocol xrootd/*
psu create unit -net 0.0.0.0/0.0.0.0
psu create unit -net 131.169.0.0/255.255.0.0
psu create unit -store *@*
#
# define unit groups
#
psu create ugroup any-protocol
psu create ugroup any-store
psu create ugroup world-net
psu create ugroup desy-net
psu create ugroup xrootd
#
psu addto ugroup any-protocol */*
psu addto ugroup any-store *@*
psu addto ugroup world-net 0.0.0.0/0.0.0.0
psu addto ugroup desy-net 131.169.0.0/255.255.0.0
psu addto ugroup xrootd xrootd/*
#
# define the pools
#
psu create pool pool1
psu create pool pool2
psu create pool pool3
psu create pool pool4
#
# define the pool groups
#
psu create pgroup default-pools
psu create pgroup special-pools
#
psu addto pgroup default-pools pool1
psu addto pgroup default-pools pool2
#
psu addto pgroup special-pools pool3
psu addto pgroup special-pools pool4
#
For a special set of pools, where we only allow the xrootd protocol, we do not want datasets to be replicated under high load, while for the rest of the pools we allow replication on hot spot detection.
Example:
#
pm create xrootd-section
#
pm set default -p2p=0.4
pm set xrootd-section -p2p=0.0
#
psu create link default-link any-protocol any-store world-net
psu add link default-link default-pools
psu set link default-link -readpref=10 -cachepref=10 -writepref=0
#
psu create link xrootd-link xrootd any-store world-net
psu add link xrootd-link special-pools
psu set link xrootd-link -readpref=10 -cachepref=10 -writepref=0
psu set link xrootd-link -section=xrootd-section
#
For one set of pools we select pools according to the default setting of the cpu and space related cost factors. For incoming traffic from outside, though, we select the same pools, but in a randomly distributed fashion. Please note that this is not really a physical partitioning of the dCache system, but rather a virtual one, applied to the same set of pools.
Example:
#
pm create incoming-section
#
pm set default -cpucostfactor=0.2 -spacecostfactor=1.0
pm set incoming-section -cpucostfactor=0.0 -spacecostfactor=0.0
#
psu create link default-link any-protocol any-store desy-net
psu add link default-link default-pools
psu set link default-link -readpref=10 -cachepref=10 -writepref=10
#
psu create link world-link any-protocol any-store world-net
psu add link world-link default-pools
psu set link world-link -readpref=10 -cachepref=10 -writepref=0
#
psu create link incoming-link any-protocol any-store world-net
psu add link incoming-link default-pools
psu set link incoming-link -readpref=10 -cachepref=10 -writepref=10
psu set link incoming-link -section=incoming-section
#
The classic partition type implements the load balancing policy known from dCache releases before version 2.0. Although it is no longer the default partition type, it remains available. This section describes this load balancing policy and the available configuration parameters.
Example:
To create a classic partition use the command:

pm create -type=classic <partitionName>
From the allowable pools as determined by the pool selection unit, the pool manager determines the pool used for storing or reading a file by calculating a cost value for each pool. The pool with the lowest cost is used.
If a client requests to read a file which is stored on more than one allowable pool, the performance costs are calculated for these pools. In short, this cost value describes how much the pool is currently occupied with transfers.
If a pool has to be selected for storing a file, which is either written by a client or restored from a tape backend, this performance cost is combined with a space cost value into a total cost value for the decision. The space cost describes how much it “hurts” to free space on the pool for the file.
The cost module is responsible for calculating the cost values for all pools. The pools regularly send all necessary information about space usage and request queue lengths to the cost module, which can thus be regarded as a cache for this information. This way it is not necessary to send “get cost” requests to the pools for each client request. The cost module interpolates the expected costs until new, precise information arrives from the pools. This mechanism prevents clumping of requests.
Calculating the cost for a data transfer is done in two steps. First, the cost module merges all information about space and transfer queues of the pools to calculate the performance and space costs separately. Second, in the case of a write or stage request, these two numbers are merged to build the total cost for each pool. The first step is isolated within a separate loadable class. The second step is done by the partition.
The load of a pool is determined by comparing the current number of active and waiting transfers to the maximum number of concurrent transfers allowed. This is done separately for each of the transfer types (store, restore, pool-to-pool client, pool-to-pool server, and client request) with the following equation:
perfCost(per Type) = ( activeTransfers + waitingTransfers ) / maxAllowed .
The maximum number of concurrent transfers (maxAllowed) can be configured with the commands st set max active (store), rh set max active (restore), mover set max active (client request), mover set max active -queue=p2p (pool-to-pool server), and pp set max active (pool-to-pool client).
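Example:
These limits are set in the pool cell, not in the pool manager. For instance, to allow at most 100 concurrent client movers and 4 concurrent restores (here on a pool named pool1; the numbers are purely illustrative):

(pool1) admin > mover set max active 100
(pool1) admin > rh set max active 4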
Then the average is taken over each transfer type for which maxAllowed is not zero. For a pool that allows store, restore and client transfers, e.g.,

perfCost(total) = ( perfCost(store) + perfCost(restore) + perfCost(client) ) / 3 ,

and for a read-only pool:

perfCost(total) = ( perfCost(restore) + perfCost(client) ) / 2 .
For a well balanced system, the performance cost should not exceed 1.0.
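Example:
As an illustration with made-up numbers, consider a pool that allows store, restore and client transfers, each with maxAllowed = 100. If 20 client movers are active, 10 are waiting, and the store and restore queues are empty, then

perfCost(client) = ( 20 + 10 ) / 100 = 0.3

and

perfCost(total) = ( 0 + 0 + 0.3 ) / 3 = 0.1 .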
In this section only the new scheme for calculating the space cost will be described. Be aware that the old scheme will be used if the breakeven parameter of a pool is larger than or equal to 1.0.
The cost value used for determining a pool for storing a file depends either on the free space on the pool or on the age of the least recently used (LRU) file, which would have to be deleted.
The space cost is calculated as follows:
If freeSpace > gapPara then
    spaceCost = 3 * newFileSize / freeSpace
If freeSpace <= gapPara and lruAge < 60 then
    spaceCost = 1 + costForMinute
If freeSpace <= gapPara and lruAge >= 60 then
    spaceCost = 1 + costForMinute * 60 / lruAge
where the variable names have the following meanings:
freeSpace:
The free space left on the pool.

newFileSize:
The size of the file to be written to one of the pools, and at least 50 MB.

lruAge:
The age of the least recently used file on the pool.
gapPara:
The gap parameter. Default is 4 GiB. The size of free space below which it will be assumed that the pool is full and consequently the least recently used file has to be removed. If, on the other hand, the free space is greater than gapPara, it will be expensive to store a file on the pool which exceeds the free space. It can be set per pool with the set gap command (see the example after these definitions). This has to be done in the pool cell and not in the pool manager cell. Nevertheless it only influences the cost calculation scheme within the pool manager and not the behaviour of the pool itself.
breakeven:
A parameter which fixes the space cost of a one-minute-old LRU file to (1 + costForMinute). It can be set with the set breakeven command, where

costForMinute = breakeven * 7 * 24 * 60 .

I.e. the space cost of a one-week-old LRU file will be (1 + breakeven). Note again that all this only applies if breakeven < 1.0.
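Example:
Both commands are executed in the pool cell. To set a gap of 4 GiB and a breakeven of 0.7 (here on a pool named pool1; the values are purely illustrative and the gap is given in bytes):

(pool1) admin > set gap 4294967296
(pool1) admin > set breakeven 0.7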
The prescription above can be stated a little differently as follows:
If freeSpace > gapPara then
    spaceCost = 3 * newFileSize / freeSpace
If freeSpace <= gapPara then
    spaceCost = 1 + breakeven * 7 * 24 * 60 * 60 / lruAge ,
where newFileSize is at least 50 MB and lruAge is at least one minute.
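Example:
As a numeric illustration (the values are made up), assume the default gap parameter of 4 GiB and breakeven = 0.5. A pool with 100 GiB of free space that receives a 1 GiB file falls into the first case:

spaceCost = 3 * 1 GiB / 100 GiB = 0.03 .

A pool with less than 4 GiB of free space whose LRU file is one day old (lruAge = 86400 seconds) falls into the second case:

spaceCost = 1 + 0.5 * 7 * 24 * 60 * 60 / 86400 = 1 + 0.5 * 7 = 4.5 .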
As the last version of the formula suggests, a pool can be in two states: either freeSpace > gapPara or freeSpace <= gapPara; that is, either there is free space left to store files without deleting cached files, or there is not.
Therefore, gapPara should be around the size of the smallest files which frequently might be written to the pool. If files smaller than gapPara appear very seldom or never, the pool might get stuck in the first of the two cases with a high cost.
If the LRU file is smaller than the new file, other files might have to be deleted. If these are much younger than the LRU file, this space cost calculation scheme might not lead to the selection of the optimal pool. However, in practice this happens very seldom and the scheme turns out to be very efficient.
The total cost is a linear combination of the performance and space cost, i.e.
totalCost = ccf * perfCost + scf * spaceCost ,
where ccf and scf are configurable with the command set pool decision. E.g.,

(PoolManager) admin > set pool decision -spacecostfactor=3 -cpucostfactor=1
will give the space cost three times the weight of the performance cost.
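Example:
Continuing the made-up numbers from the previous examples (scf = 3, ccf = 1), a pool with perfCost = 0.1 and spaceCost = 0.03 gets

totalCost = 1 * 0.1 + 3 * 0.03 = 0.19 ,

whereas a full pool with perfCost = 0.1 and spaceCost = 4.5 gets totalCost = 13.6, so the first pool is chosen.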
Classic partitions have a large number of tunable parameters. These parameters are set using the pm set command.
Example:
To set the space cost factor on the default partition to 0.3, use the following command:

pm set default -spacecostfactor=0.3
Command | Meaning | Type
---|---|---
-spacecostfactor=<scf> | Sets the space cost factor for the partition. The default value is 1.0. | float
-cpucostfactor=<ccf> | Sets the cpu cost factor for the partition. The default value is 1.0. | float
-idle=<idle-value> | The concept of the idle value will be turned on if <idle-value> > 0.0. A pool is idle if its performance cost is smaller than the <idle-value>; otherwise it is not idle. If one or more pools that satisfy the read request are idle, then only one of them is chosen for a particular file irrespective of total cost, i.e. if the same file is requested more than once it will always be taken from the same pool. This allows the copies on the other pools to age and be garbage collected. The default value is 0.0. | float
-p2p=<p2p-value> | Sets the static replication threshold for the partition. If the performance cost on the best pool exceeds <p2p-value> and the value for <slope> = 0.0, then this pool is called hot and a pool to pool replication may be triggered. The default value is 0.0. | float
-alert=<value> | Sets the alert value for the partition. If the best pool’s performance cost exceeds the p2p value and the alert value, then no pool to pool copy is triggered and a message is logged stating that no pool to pool copy will be made. The default value is 0.0. | float
-panic=<value> | Sets the panic cost cut level for the partition. If the performance cost of the best pool exceeds the panic cost cut level, the request will fail. The default value is 0.0. | float
-fallback=<value> | Sets the fallback cost cut level for the partition. If the best pool’s performance cost exceeds the fallback cost cut level, then a pool of the next level will be chosen. This means for example that instead of choosing a pool with readpref = 20, a pool with readpref < 20 will be chosen. The default value is 0.0. | float
-slope=<slope> | Sets the dynamic replication threshold value for the partition. If <slope> > 0.01, then the product of the best pool’s performance cost and <slope> is used as the threshold for pool to pool replication. If the performance cost on the best pool exceeds this threshold, then this pool is called hot. The default value is 0.0. | float
-p2p-allowed=<value> | This value can be specified if an HSM is attached to the dCache. If a partition has no HSM connected, then this option is overridden; this means that no matter which value is set for p2p-allowed, pool to pool replication is always allowed. By setting <value> = off the parameter is no longer defined and is inherited, as described for pm set. If <value> = yes then pool to pool replication is allowed. As a side effect of setting <value> = no, the values for p2p-oncost and p2p-fortransfer will also be set to no. The default value is yes. | boolean
-p2p-oncost=<value> | Determines whether pool to pool replication is allowed if the best pool for a read request is hot. The default value is no. | boolean
-p2p-fortransfer=<value> | If the best pool is hot and the requested file will be copied either from the hot pool or from tape to another pool, then the requested file will be read from the pool where it just had been copied to if <value> = yes. The default value is no. | boolean
-stage-allowed=<value> | Set the stage allowed value to yes if a tape backend is present. As a side effect, setting the value for stage-allowed to no also sets stage-oncost to no. The default value is no. | boolean
-stage-oncost=<value> | If the best pool is hot, p2p-oncost is disabled and an HSM is connected to a pool, then this parameter determines whether to stage the requested file to a different pool. The default value is no. | boolean
-max-copies=<copies> | Sets the maximal number of replicas of one file. If the maximum is reached, no more replicas will be created. The default value is 500. | integer
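Example:
To let the default partition replicate hot files, one could set a static replication threshold and enable replication on cost (the values are purely illustrative):

(PoolManager) admin > pm set default -p2p=0.8 -p2p-oncost=yes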