1.9.4 Release Notes

The main focus areas of the 1.9.4 release are access control for staging from tape, a more scalable xrootd redirector, and a few performance improvements in Chimera.

Upgrade Instructions

Incompatibilities

Please consider the following changes when upgrading from a version before 1.9.4-1:

When upgrading the SRM door from 1.9.1-10, 1.9.2-9, or 1.9.4-1, the SRM database needs to be cleaned manually. See details below.

Compatibility

It is safe to mix pools of releases 1.9.1 to 1.9.4, and they can be used with any version of the head nodes. Head nodes and doors must be upgraded to 1.9.4 together and cannot be mixed with head nodes or doors of 1.9.1, 1.9.2, or 1.9.3. Components of different 1.9.4 releases can be mixed freely. 1.9.0 is no longer supported.

Compatibility Matrix

We distinguish between pool components and head nodes. Any component which is not a pool is considered a head node, including doors. The following table shows compatibility between different releases.

                           1.9.4-7 Head   1.9.4-7 Pool
Head  1.9.1-1..7,9..11     no             yes
      1.9.1-8              no             yes[1]
      1.9.2-1..5,8..11     no             yes
      1.9.2-6,7            no             yes[1]
      1.9.3-1..4           no             yes
      1.9.4-1..7           yes            yes
Pool  1.9.1-1..7,9..11     yes            yes
      1.9.1-8              yes[1]         yes
      1.9.2-1..5,8..11     yes            yes
      1.9.2-6,7            yes[1]         yes
      1.9.3-1..4           yes            yes
      1.9.4-1..7           yes            yes

[1] The migration module will not work for -target=pgroup and -target=link.
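
For scripted upgrade checks, the matrix above can be encoded in a few lines. The following Python sketch is only an illustration: the function name compatible and the role labels 'head' and 'pool' are hypothetical and not part of dCache, and the function does not check that the patch number actually exists in the matrix.

```python
# Releases marked [1] in the matrix: the migration module will not work
# for -target=pgroup and -target=link when mixed with these.
LIMITED = {"1.9.1-8", "1.9.2-6", "1.9.2-7"}
SERIES = {"1.9.1", "1.9.2", "1.9.3", "1.9.4"}


def compatible(new_role, other_role, other_release):
    """Look up the matrix for a 1.9.4-7 component.

    new_role      -- role of the 1.9.4-7 component: 'head' or 'pool'
    other_role    -- role of the component it runs alongside
    other_release -- release of that component, e.g. '1.9.2-6'
    Returns 'yes', 'no' or 'yes[1]'.
    """
    if new_role not in ("head", "pool") or other_role not in ("head", "pool"):
        raise ValueError("role must be 'head' or 'pool'")
    series = other_release.rsplit("-", 1)[0]
    if series not in SERIES:
        raise ValueError("release not covered by the matrix: " + other_release)
    if new_role == "head" and other_role == "head":
        # Head nodes and doors cannot be mixed with pre-1.9.4 head nodes.
        return "yes" if series == "1.9.4" else "no"
    if new_role != other_role and other_release in LIMITED:
        return "yes[1]"
    return "yes"
```

For example, compatible("pool", "head", "1.9.1-8") looks up the row "Head 1.9.1-8" in the column "1.9.4-7 Pool".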

1.9.4-7

Pools

Fixed a problem with error handling after stage failures. The bug caused an erroneous checksum verification to be performed after failed staging.

FTP

Fixed several race conditions with the proxy component for active transfers.

Detailed changelog 1.9.4-6 to 1.9.4-7

1.9.4-6

SRM Door

A race condition in a meta data printing routine was fixed. This race condition could lead to ArrayIndexOutOfBoundsExceptions in SRM and other components.

The handling of the SRM_PARTIAL_SUCCESS return code was fixed in the server side srmCopy implementation. The bug caused a compatibility issue with Castor.

A race condition that could lead to GSI authentication failures in the SRM has been fixed.

Pool

Logging of uncaught exceptions has been improved in several components. In older versions, critical errors could in some cases go unnoticed because the error was not logged.

A race condition was fixed in the pool code. On pools with -replicaOnArrival enabled, some sites have reported that this race condition causes 10% of uploads to fail. The symptom is that the upload hangs at the end of the transfer. We recommend that all sites using the -replicaOnArrival option upgrade their pools.

A couple of NullPointerException fixes have been made in the pool migration module.

An issue with recovery of broken SI files in the pool's control/ directory has been fixed. That issue prevented pools with such files from starting. With dCache 1.9.5-12 the SI file is recovered from PnfsManager.

Another problem affecting recovery of meta data on pools was that the recovery logic did not respect the lfs=volatile setting. This has been fixed to correctly mark files on such pools as cached.

PoolManager

PoolManager write balancing was broken since 1.9.3-1. This problem would mean that many concurrent writes had a tendency to go to the same pool. With 1.9.5-9 the old behaviour should now be restored.

Admin Shell

The SSH admin shell of dCache had an issue with how it closed the connection to the SSH client. This caused trouble in scripts, as the output from the session would sometimes be incomplete. This issue has now been fixed.

DCAP Doors

dCache supports running multiple DCAP doors on the same host. Support for this has however been broken since dCache 1.9.3-1 due to bugs in the init script. This has now been fixed.

GridFTP and GSIDCAP Doors

A race condition that could lead to GSI authentication failures in GridFTP has been fixed.

gPlazma

A couple of NullPointerException fixes have been made in GPlazma.

Detailed changelog 1.9.4-5 to 1.9.4-6

1.9.4-5

In all releases since 1.9.1-1, the pool would in some cases ignore certain pool parameters. In particular the tag.hostname parameter would in some cases be dropped, but other parameters could be affected as well. This has been fixed in 1.9.4-5.

The robustness of the pool-to-pool transfer component embedded in pools has been improved. In particular, the case where no TCP port is available in the configured port range no longer causes the component to die. The transfer still fails, but subsequent transfers may now succeed if a TCP port becomes available.

A memory leak related to logging in pools has been fixed.

Passive mode DCAP reliability has been improved. In rare cases two concurrent transfers could cross and cause file corruption.

For non-SRM transfers, the message "Already have 1 records(s)" would be printed in the log file. The transfers would complete and space management was correct. dCache 1.9.4-5 eliminates the log message for non-SRM transfers.

The performance of the SAML or XACML plugins for GPlazma has been improved.

Several minor bug fixes have been made in the message passing layer. Those fixes improve the robustness of dCache.

The stage protection that was introduced in 1.9.4-1 would break access through SRM when enabled. This problem has now been fixed.

Detailed changelog 1.9.4-4 to 1.9.4-5

1.9.4-4

Detailed changelog 1.9.4-3 to 1.9.4-4

1.9.4-3

Updates to FTP doors: Fixed Kerberos FTP door, which broke in 1.9.2.

Updates to srm: The update space reservation command of SpaceManager can now extend the lifetime of a reservation. Fixed a medium risk vulnerability in SpaceManager.

Updates to pools: Fixed race condition in the migration module which in case of circular moves could cause data loss. Fixed restore of zero length files from HSM.

Updates to infoDomain: Fixed a bug which caused some derived values to not be updated.

Updates to Chimera: Eliminated warnings of missing PinboardAppender in Chimera client tools. The init script no longer starts dirDomain when Chimera is used (it is not needed).

Updates to httpdDomain: Fixed compatibility with Safari.

Updates to PinManager: Fixed repinning issues when pools are down.

Updates to GSIDCAP: Fixed VOMS role handling.

Updates to PoolManager: The trigger mechanism for hot-pool replication has been enhanced by integrating an algorithm contributed by Jon Bakken, FNAL. The algorithm ranks pools by their CPU cost and selects the n-th percentile pool cost from that ranking: 0% selects the lowest pool cost, 50% selects the median cost, and 100% selects the highest pool cost. This cost is used as the threshold for establishing pool-to-pool "on cost" transfers. Specifying an on-cost value as a number not ending with "%" results in the old behaviour; all current dCache deployments will have such a value. Specifying a value ending with "%" causes the percentile cost to be calculated dynamically, and the resulting value is used as the threshold for on-cost pool-to-pool transfers.
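
The threshold selection described above can be sketched as follows. This is a minimal Python illustration, not the PoolManager implementation: the function name on_cost_threshold is hypothetical, and nearest-rank selection on the sorted costs is an assumption, since the notes do not say how percentiles between two pools are resolved.

```python
def on_cost_threshold(setting, pool_costs):
    """Return the cost threshold for on-cost pool-to-pool transfers.

    setting    -- configured on-cost value as a string, e.g. "2.0" or "50%"
    pool_costs -- current CPU costs of the pools (needed for "%" values)
    """
    setting = setting.strip()
    if not setting.endswith("%"):
        # Old behaviour: the configured number is the threshold itself.
        return float(setting)

    percentile = float(setting[:-1])
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    if not pool_costs:
        raise ValueError("no pool costs available")

    ranked = sorted(pool_costs)
    # Assumption: nearest-rank selection. 0% maps to the lowest cost,
    # 100% to the highest, 50% to the median for an odd number of pools.
    index = round(percentile / 100 * (len(ranked) - 1))
    return ranked[index]
```

With five pools whose costs are 0.2, 1.5, 0.7, 3.0 and 1.1, a setting of "50%" picks the median pool cost from the ranking, while a setting of "2.5" is used unchanged.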

Detailed changelog 1.9.4-2 to 1.9.4-3

1.9.4-2

When upgrading the SRM door from 1.9.1-10, 1.9.2-9, or 1.9.4-1, the content of the srmrequestcredentials table has to be deleted. This table is present in the database used by the SRM door. Use the psql utility to connect to the database and issue the commands:

\c dcache
delete from srmrequestcredentials;

For instance like this:

Welcome to psql 8.3.7, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

postgres=# \c dcache
You are now connected to database "dcache".
dcache=# delete from srmrequestcredentials ;
DELETE 149
dcache=# 

Updates to FTP doors: Fixed a race condition.

Updates to chimeraDomain and Chimera: Fixed PostgreSQL 8.1 compatibility. Eliminated warnings of missing PinboardAppender in Chimera client tools.

Updates to replicaDomain: Replica manager was broken in 1.9.4-1 and is now fixed.

Updates to info provider: The execution bit of info-based-infoProvider.sh is now set.

Updates to srm: Fixed a race condition. Fixed a security issue. Improved database connection handling in space manager. Fixed a database leak.

Updates to pools: Fixed race conditions in DCAP, XROOTD and pool-to-pool support.

Detailed changelog 1.9.4-1 to 1.9.4-2

1.9.4-1

Access control for staging

dCache was originally designed as a disk cache in front of a tape storage system, moving files to the tape back end and restoring them when needed. These operations are transparent to the user. The downside of this approach is that a simple read of a file that is not on disk automatically triggers a tape operation. Because tape operations are expensive and may interfere with storing RAW data coming from the Tier 0, this behaviour had to be reviewed. As a result, it has been agreed with the experiments that non-production users should not be allowed to trigger such tape operations. dCache now implements a first version of such a protective mechanism. A dCache system administrator may specify a set of DNs and FQANs that are allowed to trigger tape reads for files not available on disk. Users who request tape-only files and are not on that white list receive a permission error, and no tape operation is launched.

To enable stage protection, add the following line to config/dCacheSetup:

stageConfigurationFilePath=${ourHomeDir}/config/StageConfiguration.conf

The file config/StageConfiguration.conf will contain the white list. Lines starting with a hash symbol (#) are discarded as comments. Other lines may contain one or two regular expressions enclosed in double quotes. The first matches the DN, and the second the FQAN. The regular expression syntax follows the syntax defined for the Java Pattern class.

The following is an example:

# Allow all ATLAS users who have the role 'production'
".*" "/atlas/Role=production"

In the current release, the white list must be available to all doors and to the pin manager. This requirement will likely change in a future release.
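
To check a white list before deploying it, the file format described above can be approximated with a short script. The following Python sketch is an illustration under stated assumptions and not dCache code: the names parse_whitelist and may_stage are hypothetical, a pattern is assumed to have to match the whole DN or FQAN, a line with a single pattern is assumed to accept any FQAN, and Python's re module is used in place of the Java Pattern class, whose syntax differs in some details.

```python
import re


def parse_whitelist(text):
    """Parse white list text into (dn_pattern, fqan_pattern_or_None) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are skipped
        patterns = re.findall(r'"([^"]*)"', line)
        if len(patterns) not in (1, 2):
            raise ValueError("expected one or two quoted patterns: %r" % line)
        dn_re = re.compile(patterns[0])
        fqan_re = re.compile(patterns[1]) if len(patterns) == 2 else None
        entries.append((dn_re, fqan_re))
    return entries


def may_stage(entries, dn, fqan):
    """Return True if some entry matches the given DN and FQAN."""
    for dn_re, fqan_re in entries:
        if not dn_re.fullmatch(dn):
            continue
        # Assumption: an entry without an FQAN pattern accepts any FQAN.
        if fqan_re is None or fqan_re.fullmatch(fqan or ""):
            return True
    return False
```

Applied to the example entry above, may_stage accepts any DN presented with the FQAN /atlas/Role=production and rejects the same DN presented with a different FQAN.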

Chimera

It is vital that the Chimera schema is updated when upgrading to dCache 1.9.4. This can be done using the following commands:

psql -f /opt/d-cache/libexec/chimera/sql/create.sql chimera
psql -f /opt/d-cache/libexec/chimera/sql/pgsql-procedures.sql chimera

As most of the structures already exist, the above scripts will generate error messages about existing relations, entries, and triggers. These are safe to ignore.

Several performance improvements have been made in Chimera. Most notably, the path to PNFS ID translation should now be faster when using PostgreSQL. A functionality issue relating to the use of symbolic links has been fixed. This issue caused the path to PNFS ID translation to fail when symbolic links were used.

The Chimera NFS server has seen a couple of performance improvements too. More importantly, a conformance issue was fixed, which solves a problem with the find utility when used with Chimera.

Xrootd

The xrootd door (which serves the role of an xrootd redirector) has been reimplemented. The new implementation should be more scalable, more robust and consume fewer resources. The xrootd mover on the pool (which serves the role of an xrootd data server) has not been reimplemented.

The new xrootd door has a different Java package name. Hence it is essential that dCacheSetup is updated to refer to the correct authorization plugin. In config/dCacheSetup, the line

xrootdAuthzPlugin=org.dcache.xrootd.security.plugins.tokenauthz.TokenAuthorizationFactory

has to be changed to

xrootdAuthzPlugin=org.dcache.xrootd2.security.plugins.tokenauthz.TokenAuthorizationFactory

Without this change, all transfers will fail. If the above line is not present, then the xrootd door has not been configured for token based authorization and in that case no configuration changes are required.
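
Sites that manage dCacheSetup with scripts could apply the rename automatically. The following Python sketch shows one way to do so on the file content; the function name update_authz_plugin is hypothetical, and reading and writing config/dCacheSetup is left to the caller. Lines that are commented out or refer to a different plugin are left untouched.

```python
OLD = "org.dcache.xrootd.security.plugins.tokenauthz.TokenAuthorizationFactory"
NEW = "org.dcache.xrootd2.security.plugins.tokenauthz.TokenAuthorizationFactory"


def update_authz_plugin(setup_text):
    """Return dCacheSetup content with xrootdAuthzPlugin pointing to NEW."""
    result = []
    for line in setup_text.splitlines(keepends=True):
        key, sep, value = line.partition("=")
        if sep and key.strip() == "xrootdAuthzPlugin" and value.strip() == OLD:
            # Preserve the original line ending (if any).
            ending = line[len(line.rstrip("\r\n")):]
            line = "xrootdAuthzPlugin=" + NEW + ending
        result.append(line)
    return "".join(result)
```

Running the function on content that has already been updated returns it unchanged, so it is safe to apply more than once.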

The new door has no configurable upper limit on the number of active connections.

Detailed changelog 1.9.3-1 to 1.9.4-1