release notes | Book: 1.9.5, 1.9.12 (opt, FHS), 2.7 (FHS), 2.8 (FHS), 2.9 (FHS), 2.10 (FHS), 2.11 (FHS), 2.12 (FHS), 2.13 (FHS), | Wiki | Q&A black_bg
Web: Multi-page, Single page | PDF: A4-size, Letter-size | eBook: epub black_bg

The Partition Manager

The partition manager defines one or more load balancing policies. Whereas the PSU produces a prioritized set of candidate pools using a collection of rules defined by the administrator, the load balancing policy determines the specific pool to use. It is also the load balancing policy that determines when to fall back to lesser prirority links, or when to trigger creation of additional copies of a file.

Since the load balancing policy and parameters are defined per partition, understanding the partition manager is essential to tuning load balancing. This does not imply that one has to partition the dCache instance. It is perfectly valid to use a single partition for the complete instance.

This section documents the use of the partition manager, how to create partitions, set parameters and how to associate links with partitions. In the following sections the available partition types and their configuration parameters are described.

[return to top]

Overview

There are various parameters that affect the load balancing policy. Some of them are generic and apply to any load balancing policy, but many are specific to a particular policy. To avoid limiting the complete dCache instance to a single configuration, the choice of load balancing policy and the various parameters apply to partitions of the instance. The load balancing algorithm and the available parameters is determined by the partition type.

Each PSU link can be associated with a different partion and the policy and parameters of that partition will be used to choose a pool from the set of candidate pools. The only partition that exists without being explicitly created is the partition called default. This partition is used by all links that do not explicitly identify a partition. Other partitions can be created or modified as needed.

The default partition has a hard-coded partition type called classic. This type implements the one load balancing policy that was available in dCache before version 2.0. The classic partition type is described later. Other partitions have one of a number of available types. The system is pluggable, meaning that third party plugins can be loaded at runtime and add additional partition types, thus providing the ultimate control over load balancing in dCache. Documentation on how to develop plugins is beyond the scope of this chapter.

To ease the management of partition parameters, a common set of shared parameters can be defined outside all partitions. Any parameter not explicitly set on a partition inherits the value from the common set. If not defined in the common set, a default value determined by the partition type is used. Currently, the common set of parameters happens to be the same as the parameters of the default partition, however this is only due to compatibility constraints and may change in future versions.

[return to top]

Managing Partitions

For each partition you can choose the load balancing policy. You do this by chosing the type of the partition.

Currently four different partition types are supported:

classic:

This is the pool selection algorithm used in the versions of dCache prior to version 2.0. See the section called “Classic Partitions” for a detailed description.

random:

This pool selection algorithm selects a pool randomly from the set of available pools.

lru:

This pool selection algorithm selects the pool that has not been used the longest.

wass:

This pool selection algorithm selects pools randomly weighted by available space, while incorporating age and amount of garbage collectible files and information about load.

This is the partition type of the default partition. See How to Pick a Pool for more details.

wrandom:

This pool selection algorithm selects read pools randomly and write pools with a weighted probability (based on (free+removable/total) per pool.

The wrandom partition type is a special case of the wass partition type: by choosing the correct parameters, wass can be made to perform like wrandom.

The advantage of wass over wrandom is that it is more flexible and tries to take into account a file’s age. This is something like a distributed LRU cache: the system tries to delete older removable files to keep newer removable files. This is advantageous if files read recently are more likely to be re-read than older files.

Commands related to dCache partitioning:

  • pm types

    Lists available partition types. New partition types can be added through plugins.

  • pm create [-type=<partitionType>] <partitionName>

    Creates a new partition. If no partition type is specified, then a wass partition is created.

  • pm set [<partitionName>] -<parameterName> =<value>|off

    Sets a parameter <parameterName> to a new value.

    If <partitionName> is omitted, the common shared set of parameters is updated. The value is used by any partition for which the parameter is not explicitly set.

    If a parameter is set to off then this parameter is no longer defined and is inherited from the common shared set of parameters, or a partition type specific default value is used if the parameter is not defined in the common set.

  • pm ls [-l] [<partitionName>]

    Lists a single or all partitions, including the type of each partition. If a partition name or the -l option are used, then the partition parameters are shown too. Inherited and default values are identified as such.

  • pm destroy <partitionName>

    Removes a partition from dCache. Any links configured to use this partition will fall back to the default partition.

[return to top]

Using Partitions

A partition, so far, is just a set of parameters which may or may not differ from the default set. To let a partition relate to a part of the dCache, links are used. Each link may be assigned to exactly one partition. If not set, or the assigned partition doesn’t exist, the link defaults to the default partition.

psu set link [<linkName>] -section=<partitionName> [<other-options>...]

Whenever this link is chosen for pool selection, the associated parameters of the assigned partition will become active for further processing.

Warning

Depending on the way links are setup it may very well happen that more than just one link is triggered for a particular dCache request. This is not illegal but leads to an ambiguity in selecting an appropriate dCache partition. If only one of the selected links has a partition assigned, this partition is chosen. Otherwise, if different links point to different partitions, the result is indeterminate. This issue is not yet solved and we recommend to clean up the poolmanager configuration to eliminate links with the same preferences for the same type of requests.

In the Web Interface you can find a web page listing partitions and more information. You will find a page summarizing the partition status of the system. This is essentially the output of the command pm ls -l.

Example:

For your dCache on dcache.example.org the address is

http://dcache.example.org:2288/poolInfo/parameterHandler/set/matrix/*

[return to top]

Examples

For the subsequent examples we assume a basic poolmanager setup :

Example:

#
# define the units
#
psu create unit -protocol   */*
psu create unit -protocol   xrootd/*
psu create unit -net        0.0.0.0/0.0.0.0
psu create unit -net        131.169.0.0/255.255.0.0
psu create unit -store      *@*
#
#  define unit groups
#
psu create ugroup  any-protocol
psu create ugroup  any-store
psu create ugroup  world-net
psu create ugroup  xrootd
#
psu addto ugroup any-protocol */*
psu addto ugroup any-store    *@*
psu addto ugroup world-net    0.0.0.0/0.0.0.0
psu addto ugroup desy-net     131.169.0.0/255.255.0.0
psu addto ugroup xrootd       xrootd/*
#
#  define the pools
#
psu create pool pool1
psu create pool pool2
psu create pool pool3
psu create pool pool4

#
#  define the pool groups
#
psu create pgroup default-pools
psu create pgroup special-pools
#
psu addto pgroup default-pools pool1
psu addto pgroup default-pools pool2
#
psu addto pgroup special-pools pool3
psu addto pgroup special-pools pool4
#

[return to top]

Disallowing pool to pool transfers for special pool groups based on the access protocol

For a special set of pools, where we only allow the xrootd protocol, we don’t want the datasets to be replicated on high load while for the rest of the pools we allow replication on hot spot detection.

Example:

#
pm create xrootd-section
#
pm set default        -p2p=0.4
pm set xrootd-section -p2p=0.0
#
psu create link default-link any-protocol any-store world-net
psu add    link default-link default-pools
psu set    link default-link -readpref=10 -cachepref=10 -writepref=0
#
psu create link xrootd-link xrootd any-store world-net
psu add    link xrootd-link special-pools
psu set    link xrootd-link -readpref=10 -cachepref=10 -writepref=0
psu set    link xrootd-link -section=xrootd-section
#        

[return to top]

Choosing pools randomly for incoming traffic only

For a set of pools we select pools following the default setting of cpu and space related cost factors. For incoming traffic from outside, though, we select the same pools, but in a randomly distributed fashion. Please note that this is not really a physical partitioning of the dCache system, but rather a virtual one, applied to the same set of pools.

Example:

#
pm create incoming-section
#
pm set default          -cpucostfactor=0.2 -spacecostfactor=1.0
pm set incoming-section -cpucostfactor=0.0 -spacecostfactor=0.0
#
psu create link default-link any-protocol any-store desy-net
psu add    link default-link default-pools
psu set    link default-link -readpref=10 -cachepref=10 -writepref=10
#
psu create link default-link any-protocol any-store world-net
psu add    link default-link default-pools
psu set    link default-link -readpref=10 -cachepref=10 -writepref=0
#
psu create link incoming-link any-protocol any-store world-net
psu add    link incoming-link default-pools
psu set    link incoming-link -readpref=10 -cachepref=10 -writepref=10
psu set    link incoming-link -section=incoming-section
#

[return to top]

Classic Partitions

The classic partition type implements the load balancing policy known from dCache releases before version 2.0. This partition type is still the default. This section describes this load balancing policy and the available configuration parameters.

Example:

To create a classic partition use the command: pm create -type=classic <partitionName>

[return to top]

Load Balancing Policy

From the allowable pools as determined by the pool selection unit, the pool manager determines the pool used for storing or reading a file by calculating a cost value for each pool. The pool with the lowest cost is used.

If a client requests to read a file which is stored on more than one allowable pool, the performance costs are calculated for these pools. In short, this cost value describes how much the pool is currently occupied with transfers.

If a pool has to be selected for storing a file, which is either written by a client or restored from a tape backend, this performance cost is combined with a space cost value to a total cost value for the decision. The space cost describes how much it hurts to free space on the pool for the file.

The cost module is responsible for calculating the cost values for all pools. The pools regularly send all necessary information about space usage and request queue lengths to the cost module. It can be regarded as a cache for all this information. This way it is not necessary to send get cost requests to the pools for each client request. The cost module interpolates the expected costs until a new precise information package is coming from the pools. This mechanism prevents clumping of requests.

Calculating the cost for a data transfer is done in two steps. First, the cost module merges all information about space and transfer queues of the pools to calculate the performance and space costs separately. Second, in the case of a write or stage request, these two numbers are merged to build the total cost for each pool. The first step is isolated within a separate loadable class. The second step is done by the partition.

[return to top]

The Performance Cost

The load of a pool is determined by comparing the current number of active and waiting transfers to the maximum number of concurrent transfers allowed. This is done separately for each of the transfer types (store, restore, pool-to-pool client, pool-to-pool server, and client request) with the following equation:

perfCost(per Type) = ( activeTransfers + waitingTransfers ) / maxAllowed .

The maximum number of concurrent transfers (maxAllowed) can be configured with the commands st set max active (store), rh set max active (restore), mover set max active (client request), mover set max active -queue=p2p (pool-to-pool server), and pp set max active (pool-to-pool client).

Then the average is taken for each mover type where maxAllowed is not zero. For a pool where store, restore and client transfers are allowed, e.g.,

perfCost(total) = ( perfCost(store) + perfCost(restore) + perfCost(client) ) / 3 ,

and for a read only pool:

perfCost(total) = ( perfCost(restore) + perfCost(client) ) / 2 .

For a well balanced system, the performance cost should not exceed 1.0.

[return to top]

The Space Cost

In this section only the new scheme for calculating the space cost will be described. Be aware, that the old scheme will be used if the breakeven parameter of a pool is larger or equal 1.0.

The cost value used for determining a pool for storing a file depends either on the free space on the pool or on the age of the least recently used (LRU) file, which whould have to be deleted.

The space cost is calculated as follows:

If freeSpace > gapPara    then spaceCost = 3 * newFileSize / freeSpace
If freeSpace <= gapPara and lruAge < 60 then spaceCost = 1 + costForMinute
If freeSpace <= gapPara and lruAge >= 60 then spaceCost = 1 + costForMinute * 60 / lruAge

where the variable names have the following meanings:

freeSpace

The free space left on the pool

newFileSize

The size of the file to be written to one of the pools, and at least 50MB.

lruAge

The age of the least recently used file on the pool.

gapPara

The gap parameter. Default is 4 GiB. The size of free space below which it will be assumed that the pool is full and consequently the least recently used file has to be removed. If, on the other hand, the free space is greater than gapPara, it will be expensive to store a file on the pool which exceeds the free space.

It can be set per pool with the set gap command. This has to be done in the pool cell and not in the pool manager cell. Nevertheless it only influences the cost calculation scheme within the pool manager and not the bahaviour of the pool itself.

costForMinute

A parameter which fixes the space cost of a one-minute-old LRU file to (1 + costForMinute). It can be set with the set breakeven, where

costForMinute = breakeven * 7 * 24 * 60.

I.e. the the space cost of a one-week-old LRU file will be (1 + breakeven). Note again, that all this only applies if breakeven < 1.0

The prescription above can be stated a little differently as follows:

If freeSpace > gapPara then spaceCost = 3 * newFileSize / freeSpace
If freeSpace <= gapPara then spaceCost = 1 + breakeven * 7 * 24 * 60 * 60 / lruAge ,

where newFileSize is at least 50MB and lruAge at least one minute.

[return to top]

Rationale

As the last version of the formula suggests, a pool can be in two states: Either freeSpace > gapPara or freeSpace <= gapPara - either there is free space left to store files without deleting cached files or there isn’t.

Therefore, gapPara should be around the size of the smallest files which frequently might be written to the pool. If files smaller than gapPara appear very seldom or never, the pool might get stuck in the first of the two cases with a high cost.

If the LRU file is smaller than the new file, other files might have to be deleted. If these are much younger than the LRU file, this space cost calculation scheme might not lead to a selection of the optimal pool. However, in pratice this happens very seldomly and this scheme turns out to be very efficient.

[return to top]

The Total Cost

The total cost is a linear combination of the performance and space cost. I.e. totalCost = ccf * perfCost + scf * spaceCost , where ccf and scf are configurable with the command set pool decision. E.g.,

(PoolManager) admin > set pool decision -spacecostfactor=3 -cpucostfactor=1

will give the space cost three times the weight of the performance cost.

[return to top]

Parameters of Classic Partitions

Classic partitions have a large number of tunable parameters. These parameters are set using the pm set command.

Example:

To set the space cost factor on the default partition to 0.3, use the following command:

                  pm set default -spacecostfactor=0.3
              
CommandMeaningType

pm set [<partitionName>] -spacecostfactor=<scf>

Sets the space cost factor for the partition.

The default value is 1.0.

float

pm set [<partitionName>] -cpucostfactor=<ccf>

Sets the cpu cost factor for the partition.

The default value is 1.0.

float

pm set [<partitionName>] -idle=<idle-value>

The concept of the idle value will be turned on if <idle-value> > 0.0.

A pool is idle if its performance cost is smaller than the <idle-value>. Otherwise it is not idle.

If one or more pools that satisfy the read request are idle then only one of them is chosen for a particular file irrespective of total cost. I.e. if the same file is requested more than once it will always be taken from the same pool. This allowes the copies on the other pools to age and be garbage collected.

The default value is 0.0, which disables this feature.

float

pm set [<partitionName>] -p2p=<p2p-value>

Sets the static replication threshold for the partition.

If the performance cost on the best pool exceeds <p2p-value> and the value for <slope> = 0.0 then this pool is called hot and a pool to pool replication may be triggered.

The default value is 0.0, which disables this feature.

float

pm set [<partitionName>] -alert=<value>

Sets the alert value for the partition.

If the best pool’s performance cost exceeds the p2p value and the alert value then no pool to pool copy is triggered and a message will be logged stating that no pool to pool copy will be made.

The default value is 0.0, which disables this feature.

float

pm set [<partitionName>] -panic=<value>

Sets the panic cost cut level for the partition.

If the performance cost of the best pool exceeds the panic cost cut level the request will fail.

The default value is 0.0, which disables this feature.

float

pm set [<partitionName>] -fallback=<value>

Sets the fallback cost cut level for the partition.

If the best pool’s performance cost exceeds the fallback cost cut level then a pool of the next level will be chosen. This means for example that instead of choosing a pool with readpref = 20 a pool with readpref < 20 will be chosen.

The default value is 0.0, which disables this feature.

float

pm set [<partitionName>] -slope=<slope>

Sets the dynamic replication threshold value for the partition.

If <slope>> 0.01 then the product of best pool’s performance cost and <slope> is used as threshold for pool to pool replication.

If the performance cost on the best pool exceeds this threshold then this pool is called hot.

The default value is 0.0, which disables this feature.

float

pm set [<partitionName>] -p2p-allowed=<value>

This value can be specified if an HSM is attached to the dCache.

If a partition has no HSM connected, then this option is overridden. This means that no matter which value is set for p2p-allowed the pool to pool replication will always be allowed.

By setting <value> = off the values for p2p-allowed, p2p-oncost and p2p-fortransfer will take over the value of the default partition.

If <value> = yes then pool to pool replication is allowed.

As a side effect of setting <value> = no the values for p2p-oncost and p2p-fortransfer will also be set to no.

The default value is yes.

boolean

pm set [<partitionName>] -p2p-oncost=<value>

Determines whether pool to pool replication is allowed if the best pool for a read request is hot.

The default value is no.

boolean

pm set [<partitionName>] -p2p-fortransfer=<value>

If the best pool is hot and the requested file will be copied either from the hot pool or from tape to another pool, then the requested file will be read from the pool where it just had been copied to if <value> = yes. If <value> = no then the requested file will be read from the hot pool.

The default value is no.

boolean

pm set [<partitionName>] -stage-allowed=<value>

Set the stage allowed value to yes if a tape system is connected and to no otherwise.

As a side effect, setting the value for stage-allowed to no changes the value for stage-oncost to no.

The default value is no.

boolean

pm set [<partitionName>] -stage-oncost=<value>

If the best pool is hot, p2p-oncost is disabled and an HSM is connected to a pool then this parameter determines whether to stage the requested file to a different pool.

The default value is no.

boolean

pm set [<partitionName>] -max-copies=<copies>

Sets the maximal number of replicas of one file. If the maximum is reached no more replicas will be created.

The default value is 500.

integer