When a file is transferred into dCache, its replica is copied into one of the pools. Since this is the only replica and normally the required range is higher (e.g., (2,3)), the file will be replicated to other pools. When some pool goes down, the replica count for the files in that pool may fall below the valid range, and these files will be replicated. Replicas of a file whose replica count is below the valid range, and which therefore need replication, are called deficient replicas.
Later on, some of the failed pools may come back up and bring more valid replicas online. If there are too many replicas of some file, the extra replicas are called redundant replicas and they will be "reduced": the extra replicas will be deleted from the pools.
The Resilience Manager (RM) counts the number of replicas of each file in the pools which can be used online (see Pool States below) and keeps the number of replicas within the valid range (min, max).
RM keeps information about pool states, the list of replicas (file ID, pool) and the current copy/delete operations in a persistent database.
For each replica RM keeps the list of pools where it can be found. For each pool, the pool state is kept in the DB. There is also a table which keeps the ongoing operations (replication, deletion) for each replica.
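The bookkeeping described above can be sketched in a few lines. This is a simplified illustration, not the actual dCache schema: all names are invented, and plain in-memory dicts stand in for the persistent DB tables.

```python
# Invented stand-ins for the three persistent structures the text describes.

# pool name -> pool state
pool_states = {"pool1": "online", "pool2": "down", "pool3": "online"}

# file ID (pnfsId) -> set of pools holding a replica of that file
replicas = {"0001": {"pool1", "pool2"}, "0002": {"pool1", "pool3"}}

# (pnfsId, pool) -> ongoing operation ("replicate" or "reduce")
actions = {}

def countable_replicas(pnfs_id):
    """Replicas in pools whose state lets them be counted (online or offline)."""
    return [p for p in replicas[pnfs_id]
            if pool_states.get(p) in ("online", "offline")]
```

With the sample data, file 0001 has only one countable replica (pool2 is down), so it would be deficient for a (2,3) range.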
This is a description of the pool states as of v1.0 of the Resilience Manager. Some of the states and transitions will change in the next release.
- online
Normal operation. Replicas in this state are readable and can be counted. Files can be written (copied) to this pool.
- down
The dCache pool is stopped by the operator or has crashed. On startup, the pool comes briefly to the online state, and then goes "down" to do the pool "inventory" — to clean up files which were broken when the pool crashed during a transfer. When the pool comes online again, RM updates the list of replicas in the pool and stores it in the DB.
Replicas in pools which are "down" are not counted, so when a pool crashes it reduces the number of "online" replicas of some files. The crash of a pool (pool departure) may trigger replication of multiple files.
Pool recovery (arrival) may trigger massive deletion of file replicas, not necessarily in that pool.
There are special situations when the operator wants to change a pool's state without triggering massive replication. Or, vice versa, the operator wants to take a pool permanently out of operation and wants to make sure that the files in the pool will not be locked out and will remain available later.
- offline
Replicas in this pool are counted whether the pool is up or down. For replication purposes it does not matter whether an offline pool goes down or comes up. Rationale: the operator wants to bring the pool down briefly and knows that the replicas in the pool are safe. This state was introduced to avoid unnecessary massive replication. When the pool comes online from the offline state, the replicas in the pool are inventoried to make sure RM knows the real list of replicas in the pool.
- down
The operator needs to set a pool or a set of pools down permanently and wants to make sure that no replicas get "locked out"; a file is locked out when all of its known replicas are in unavailable pools. Thus, whether such a pool is really up or down, the replicas in it are not counted.
- drainoff, offline-prepare
Transient states between online and the down or offline state, respectively. If there are files which could be "locked out" in the down or offline state, they are evacuated — at least one replica of each such file is copied out. It is unlikely that a file gets locked out when a single pool goes down — normally several replicas are online. When several pools go down, or are set to drainoff or offline, file lockout may happen.
Currently replicas are counted separately in the groups of offline-prepare and drainoff pools.
RM needs only a single replica of each file to be copied out; the pool can then be taken down, and the remaining replicas will be made from the replica which is available online. To confirm that it is safe to take a pool down, there is a command to check the number of files which would be locked out in this pool.
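The "is it safe to take this pool down" check can be sketched as follows. This is an invented illustration of the rule, not the actual RM query: a file would be locked out if its only countable replica sits in the pool about to go down.

```python
def locked_out_if_down(pool, replicas, pool_states):
    """Count files whose only countable replica is in `pool`,
    i.e. files that would be locked out if `pool` went down."""
    locked = 0
    for pnfs_id, pools in replicas.items():
        countable = {p for p in pools
                     if pool_states.get(p) in ("online", "offline")}
        if countable == {pool}:
            locked += 1
    return locked

# Example: "b" is already down, so every file whose replicas live only
# in "a" and/or "b" would be locked out if "a" went down too.
pool_states = {"a": "online", "b": "down"}
replicas = {"f1": {"a"}, "f2": {"a", "b"}, "f3": {"b"}}
```

A result of zero means the pool is drained off and can be taken down safely.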
v1.0 — these states are called "transient", but the pool is not automatically taken down.
The number of pools in the system may be large, and it may be inconvenient to keep the configuration of the system predefined in some file. On startup the complete configuration is unknown, and RM tries to keep the number of replicas in the valid range as pools arrive and depart and files are copied in. On the other hand, when groups of pools arrive or depart, this leads to massive replica cloning or reduction. It is beneficial to suspend adjustments until the system arrives at a more or less stable configuration.
When RM starts, it cleans up the DB. Then it waits for some time to give the pools a chance to get connected. RM tries not to start too early, to give most of the pools in the system a chance to connect; otherwise unnecessary massive replication would start. While the configuration is unknown, RM waits until a "quorum" of the pools gets connected. Currently this is implemented as a delay before adjustments start, to give the pools a chance to connect.
Normally (during Cold Start) all information in the DB is cleaned up and recreated by polling the pools which come online shortly after some minimum delay after RM starts. RM starts to track pool states (pool up/down messages and polling the list of online pools) and updates the list of replicas in the pools which came online. This process lasts about 10-15 minutes, to make sure all pools come online and/or get connected. Pools which have connected to RM once are in the online or down state.
It can be annoying to wait for a long period of time until all known "good" pools get connected. The "Hot Restart" option accelerates the restart of the system after a crash of the head node.
On Hot Restart, RM retrieves the pool states from before the crash from the DB and saves them in an internal structure. When a pool gets connected, RM checks its old state and registers the old state in the DB again if it was offline, offline-prepare or drainoff. RM also checks whether the pool was online before the crash. Once all pools which were online have connected, RM assumes it has recovered its old configuration and starts adjustments. RM operates in a "fluid world": it does not require that pools stay online. The point is that once all previously online pools have connected, adjustments can start. If some pools went down during the connection process, they are already accounted for, and the adjustment will take care of them.
Example: suppose we had ten pools in the system, where eight pools were online and two were offline. RM does not wait for the two offline pools to connect before starting adjustments. Of the other eight pools which were online, suppose one gets connected and then goes down again while the other pools are still connecting. RM considers this pool to be in a known state, and when the other seven pools get connected it can start adjustments without waiting any longer. If the system was in an equilibrium state before the crash, RM may find some deficient replicas because of the crashed pool and start replication right away.
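The Hot Restart readiness rule above reduces to a set-containment check. This is a sketch under invented names, not RM code: adjustments may start once every pool that was online before the crash has connected at least once, even if some of them have since gone down again.

```python
def can_start_adjustments(previously_online, connected_once):
    """Hot Restart rule: start adjusting once every pool that was online
    before the crash has connected at least once. A pool that connected
    and crashed again still counts; the adjuster will handle it."""
    return previously_online <= connected_once

# The example above: eight pools were online before the crash; the two
# offline pools do not matter for this check.
was_online = {f"pool{i}" for i in range(8)}
```

Extra pools in `connected_once` (e.g. the formerly offline ones reappearing) do not delay the start.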
RM has several threads running at the same time. The Adjuster keeps the count of replicas within the valid range; the other threads help it do this.
Adjuster. Information about all replicas is kept in the DB. The Adjuster makes several DB queries during the adjustment cycle to get the lists of files whose replicas must be reduced or replicated:
redundant replicas, Nrep > max
unique replicas in drainoff pools
unique replicas in offline-prepare pools
deficient replicas, Nrep < min
The number of replicas is counted over pools which are online or offline. Offline-prepare and drainoff pools are considered read-only and can be used as source pools for replication. The last replica of a file in the system must never be removed.
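One pass of the redundant/deficient classification can be sketched as below. This is an invented illustration of the counting rule, not the actual DB queries; the unique-replica checks for drainoff and offline-prepare pools are omitted.

```python
def adjuster_worklist(replicas, pool_states, min_rep, max_rep):
    """Classify each file by its countable replica count:
    redundant if Nrep > max, deficient if Nrep < min."""
    redundant, deficient = [], []
    for pnfs_id, pools in replicas.items():
        n = sum(1 for p in pools
                if pool_states.get(p) in ("online", "offline"))
        if n > max_rep:
            redundant.append(pnfs_id)
        elif n < min_rep:
            deficient.append(pnfs_id)
    return redundant, deficient

# Four online pools; f1 is over-replicated, f2 under-replicated for (2,3).
states = {p: "online" for p in ("a", "b", "c", "d")}
reps = {"f1": {"a", "b", "c", "d"}, "f2": {"a"}, "f3": {"a", "b"}}
```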
The information in the DB is updated when a replica is added to or removed from a pool. When some pool changes its state, all replicas in that pool become available or unavailable, which changes the number of accessible replicas for the affected files. The current list is then marked as invalid and RM restarts the adjustment cycle from the beginning. When nothing happens for some time, an adjustment cycle is triggered by a timeout, to make sure RM did not miss anything because some messages got lost.
When it is found that a replica needs replication or reduction, a worker thread does the job asynchronously. The number of worker threads is limited to a maximum (default 6), separately for reducers and replicators. If no workers are available, the Adjuster waits for a worker thread. A worker thread starts an operation by sending a message to dCache and waits until the operation finishes or the timeout expires. The timeout differs for reduction (replica removal) and replication; the replication timeout shall be larger, to account for the time to transfer the largest file between pools. When a worker thread starts an operation, it marks the replica as "having the operation" in the action table in the DB, and this replica is excluded from other operations in the system until the operation is done or the timeout expires. When several replicas of the same file need to be replicated (or reduced), RM schedules one replica for the operation and proceeds with processing the other files. When the Adjuster reaches the end of the list, it may return to processing the other replicas of the first file without delay, considering the previous operation on the file complete.
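The scheduling constraints in this paragraph, the per-kind worker limit and the action-table exclusion, can be sketched as follows. All names are invented, and the action table is simplified to a dict keyed by file ID.

```python
def try_schedule(pnfs_id, op, actions, max_workers=6):
    """Sketch of worker scheduling: a replica already 'having an
    operation' in the action table is skipped, and at most
    `max_workers` operations of one kind run at a time."""
    running = sum(1 for o in actions.values() if o == op)
    if pnfs_id in actions or running >= max_workers:
        return False
    actions[pnfs_id] = op  # mark: excluded from other operations
    return True
```

Limits are counted per operation kind, mirroring the separate pools of reducer and replicator workers.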
Sometimes the Adjuster gets an error on an operation with a replica, and in some cases repeating the same operation on the same replica hits the same "unresolved" error again and again, keeping RM from processing other replicas. To avoid such loops and "dynamic deadlock", RM can put a replica which encountered a problem into the "exclude" state. To return such a replica into operation, the administrator shall manually "release" it.
When a pool changes its state, RM receives a message, which can be lost or, in some cases such as a pool crash, is never sent. To make sure RM has correct information about pool states, it runs the PoolsWatchDog thread. The WatchDog polls the pool states and compares them to the result of the previous poll to find out which pools departed from or arrived into the system. Then it sleeps for some time and does the check again. When there are no changes in the pool configuration, the WatchDog throttles the "no pool configuration change" messages in the log file — but it is still running.
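The poll-and-compare step reduces to two set differences. A minimal sketch with invented names:

```python
def poll_diff(previous, current):
    """One WatchDog cycle: compare the previous poll of connected
    pools with the current one to find departures and arrivals."""
    departed = previous - current
    arrived = current - previous
    return departed, arrived
```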
The cyclical threads — the Adjuster and the WatchDog — write and timestamp their current state in the DB. This is displayed on the Web page, so it is possible to check whether they are running. Excluded files are listed there too.
If you are an advanced user, have the proper privileges, and know how to issue commands via the admin interface, you may connect to the ReplicaManager cell and issue the following commands. You may find more commands in the online help; they are for debugging only — do not use them, as they can stop RM from operating correctly.
|set pool <pool> <state>|set pool state|
|show pool <pool>|show pool state|
|ls unique <pool>|check if the pool is drained off (has unique pnfsIds). Reports the number of such replicas in this pool; zero if there are no locked replicas.|
|exclude <pnfsId>|exclude <pnfsId> from adjustments|
|release <pnfsId>|removes the transaction/"BAD" status for <pnfsId>|
|debug true/false|enable/disable DEBUG messages in the log file|
A "hybrid" dCache operates on a combination of "normal" pools (backed up to tape, or "scratch" pools) and a set of resilient pools. The Resilience Manager takes care only of the subset of pools configured in the pool group named "ResilientPools" and ignores all other pools. Currently the resilient pool group name is hardcoded as "ResilientPools", and to use the replica manager in a hybrid dCache you shall create the replica manager cell by instantiating the class diskCacheV111.replicaManager.ReplicaManagerV2 (note "V2" in the class name).
psu create pgroup ResilientPools
psu addto pgroup ResilientPools <myPoolName001>
psu addto pgroup ResilientPools <myPoolName002>
psu addto pgroup ResilientPools <myPoolName003>
Pools included in the resilient pool group can also be included in other pool groups.
Default argument values as for
ReplicaManager.java,v 1.22 2004/08/25 22:32:07 cvs Exp
You do not need to put these arguments in the batch file unless you want to change the defaults and you know what you are doing. For normal operation you may want to choose the "-coldStart" or "-hotRestart" (default) startup mode and the (min, max) range for the desired number of replicas of a file.
Valid range for the replica count in "available" pools.
|-debug=false / true||
Disable / enable debug messages in the log file
|-hotRestart||
Startup is accelerated: adjustments begin once all "known" pools, registered in the DB as online before the crash, have re-connected during the hot restart. Opposite of -coldStart.
|-coldStart||
Good for the first start or after big changes in the pool configuration. Creates a new pool configuration in the DB. Opposite of -hotRestart.
|-delayDBStartTO=1200||
On Cold Start: the DB init thread sleeps for this time to give the pools a chance to get connected, to prevent massive replication from starting while not all pools are connected yet.
Normally the Adjuster waits for the DB init thread to finish. If for some abnormal reason it cannot find the DB thread, it sleeps for this delay. It should be slightly more than "delayDBStartTO".
Configure the host:port where the DB server is running, and the DB table name. For a DB on a remote host you shall enable remote connections to the DB.
DB driver. The Replica Manager was tested with the Postgres DB only.
Configure different DB user
Configure different DB path
Number of worker threads doing the replication; the same number of worker threads is used for reduction. It must be larger for a bigger system, but avoid the situation where requests get queued in the pools.
Timeout for pool-to-pool replica copy transfer.
Timeout to delete replica from the pool.
Adjuster cycle period. If nothing has changed, sleep for this time, then restart the adjustment cycle to query the DB and check whether there is work to do.
Pools WatchDog poll period. Poll the pools with this period to find whether some pool went down without sending notice (messages). It cannot be too short, because a pool can be under high load and not send pings for some time; it cannot be less than the pool ping period.
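The arguments described above could be combined in the batch file roughly as follows. This is a sketch only: the create-line layout and the "-min"/"-max" argument names are assumptions; the remaining arguments and the class name are those described in this section.

```
#  Hypothetical replica manager cell definition for a hybrid dCache
#  ("-min"/"-max" names are illustrative; check your installation).
create diskCacheV111.replicaManager.ReplicaManagerV2 ReplicaManager \
        "-min=2 -max=3 \
         -hotRestart \
         -debug=false \
         -delayDBStartTO=1200"
```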