1.9.4 Release Notes

The main focus areas of the 1.9.4 release are access control for staging from tape, a more scalable xrootd redirector, and a few performance improvements in Chimera.

Upgrade Instructions

Incompatibilities

Please consider the following changes when upgrading from a version before 1.9.4-1:

chimera: The database schema must be updated when upgrading to dCache 1.9.4. See the instructions below for details.
xrootd: The xrootd door has been replaced with a new implementation. dCacheSetup needs to be updated to reference the new authorization plugin.
java: As of version 1.9.4-1, dCache requires Java 6.

When upgrading the SRM door from 1.9.1-10, 1.9.2-9, or 1.9.4-1, the SRM database needs to be cleaned manually. See details below.

Compatibility

It is safe to mix pools of releases 1.9.1 to 1.9.4, and they can be used with any version of the head nodes. Head nodes and doors must be upgraded to 1.9.4 together and cannot be mixed with head nodes or doors of 1.9.1, 1.9.2, or 1.9.3. Components of different 1.9.4 releases can be mixed freely. 1.9.0 is no longer supported.
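The mixing rules above can be expressed as a small check. The following Python sketch is illustrative only: the function names and the use of release strings such as 1.9.4-7 as input are assumptions made here, not part of dCache, and the per-release caveats of the compatibility matrix are not modelled.

```python
# Release series that may take part in a mixed installation (from the text above).
SUPPORTED_SERIES = {"1.9.1", "1.9.2", "1.9.3", "1.9.4"}


def series(release):
    """Return the release series, e.g. '1.9.4-7' -> '1.9.4'."""
    return release.split("-", 1)[0]


def check_mix(head_releases, pool_releases):
    """Return a list of problems; an empty list means the mix follows the rules."""
    problems = []
    for release in list(head_releases) + list(pool_releases):
        if series(release) not in SUPPORTED_SERIES:
            problems.append(release + " is outside the 1.9.1 to 1.9.4 range")
    # Head nodes and doors must be upgraded to 1.9.4 together.
    head_series = {series(r) for r in head_releases}
    if "1.9.4" in head_series and head_series != {"1.9.4"}:
        problems.append("head nodes and doors must all run 1.9.4")
    return problems
```

For example, check_mix(["1.9.4-7"], ["1.9.1-8", "1.9.3-2"]) returns an empty list, while adding a 1.9.3 head node to the first argument reports a problem.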

Compatibility Matrix

We distinguish between pool components and head nodes. Any component which is not a pool is considered a head node, including doors. The following table shows compatibility between different releases.

		1.9.4-7 Head	1.9.4-7 Pool
Head	1.9.1-1..7,9..11	no	yes
	1.9.1-8	no	yes^[1]
	1.9.2-1..5,8..11	no	yes
	1.9.2-6,7	no	yes^[1]
	1.9.3-1..4	no	yes
	1.9.4-1..7	yes	yes
Pool	1.9.1-1..7,9..11	yes	yes
	1.9.1-8	yes^[1]	yes
	1.9.2-1..5,8..11	yes	yes
	1.9.2-6,7	yes^[1]	yes
	1.9.3-1..4	yes	yes
	1.9.4-1..7	yes	yes

[1] The migration module will not work for -target=pgroup and -target=link.

1.9.4-7

Pools

Fixed a problem with error handling after stage failures. The bug caused an erroneous checksum verification to be performed after failed staging.

FTP

Fixed several race conditions with the proxy component for active transfers.

Detailed changelog 1.9.4-6 to 1.9.4-7

[r13379] pool: Fix stage error handling
[r13855] ftp: Fix race conditions and inconsistencies in ActiveAdapter

1.9.4-6

SRM Door

A race condition in a meta data printing routine was fixed. This race condition could lead to ArrayIndexOutOfBoundsExceptions in SRM and other components.

The handling of the SRM_PARTIAL_SUCCESS return code was fixed in the server side srmCopy implementation. The bug caused a compatibility issue with Castor.

A race condition that could lead to GSI authentication failures in the SRM has been fixed.

Pool

Logging of uncaught exceptions has been improved in several components. In older versions, critical errors could in some cases go unnoticed because the error was not logged.

A race condition was fixed in the pool code. On pools with -replicaOnArrival enabled, this race condition has on some sites been reported to cause 10% upload failures. The symptoms are that the upload hangs at the end of the transfer. We recommend that all sites using the -replicaOnArrival option upgrade their pools.

A couple of NullPointerException fixes have been made in the pool migration module.

An issue with recovery of broken SI files in the pool's control/ directory has been fixed. That issue prevented pools with such files from starting. With this release, as with dCache 1.9.5-12, the SI file is recovered from PnfsManager.

Another problem affecting recovery of meta data on pools was that the recovery logic did not respect the lfs=volatile setting. This has been fixed to correctly mark files on such pools as cached.

PoolManager

PoolManager write balancing had been broken since 1.9.3-1. As a result, many concurrent writes tended to go to the same pool. The old behaviour should now be restored (the same fix is in 1.9.5-9).

Admin Shell

The SSH admin shell of dCache had an issue with how it closed the connection to the SSH client. This caused trouble in scripts, as the output from the session would sometimes be incomplete. This issue has now been fixed.

DCAP Doors

dCache supports running multiple DCAP doors on the same host. Support for this has however been broken since dCache 1.9.3-1 due to bugs in the init script. This has now been fixed.

GridFTP and GSIDCAP Doors

A race condition that could lead to GSI authentication failures in GridFTP has been fixed.

gPlazma

A couple of NullPointerException fixes have been made in gPlazma.

Detailed changelog 1.9.4-5 to 1.9.4-6

[r12825] add CellVersion support to AbstractCell
[r12839] http://rb.dcache.org/r/1087/, srm: allow SRM_PARTIAL_SUCCESS as a request status
[r12843] core: Fix race condition in FileMetaData.toString
[r12861] fix misspelling billingDbUer -> billingDbUser in httpd.batch, dCacheSetup.template
[r12864] pool: Use FireAndForgetTask for our thread pools
[r12866] additional fix for misspelling, keeping existing configuration
[r12873] http://rb.dcache.org/r/1104/, srm: fix a bug that could lead to ArrayIndexOutOfBounds exception in multi file copy request
[r12879] PoolManager: Fix the magic feature
[r12881] pool: Don't log a kill of a missing mover as an error
[r12886] PoolManager: Ensure that magic feature doesn't make cost negative
[r12900] general: fix compiler warnings about UTF-8 literals
[r12950] update info-provider docs (README-GLUE) with fault-finding information
[r12962] Pool: Fix ConcurrentModificationException on upload
[r12987] cells: Fix socket closure in admin shell
[r13037] external: Downgrade to JGlobus 1.4
[r13060] Reverting r13037
[r13084] pool: Fix initialization failure in migration module
[r13085] external: Add fix for CRL race condition in JGlobus
[r13157] pool: Recover corrupted SI file
[r13183] http://rb.dcache.org/r/1345/
[r13187] pool: Respect LFS mode during meta data recovery
[r13269] srm: fix startup-ping so failures are reported
[r13296] scripts: Fix handling of multiple dcap doors

1.9.4-5

In all releases since 1.9.1-1, the pool would in some cases ignore certain pool parameters. In particular the tag.hostname parameter would in some cases be dropped, but other parameters could be affected as well. This has been fixed in 1.9.4-5.

The robustness of the pool-to-pool transfer component embedded in pools has been improved. In particular, the case where there is no available TCP port in the configured port range no longer causes the component to die. The transfer still fails, but subsequent transfers may now succeed if a TCP port becomes available.

A memory leak related to logging in pools has been fixed.

Passive mode DCAP reliability has been improved. In rare cases two concurrent transfers could cross and cause file corruption.

For non-SRM transfers, the message "Already have 1 record(s)" would be printed in the log file. The transfers would complete and space management was correct. dCache 1.9.4-5 eliminates the log message for non-SRM transfers.

The performance of the SAML and XACML plugins for gPlazma has been improved.

Several minor bug fixes have been made in the message passing layer. Those fixes improve the robustness of dCache.

The stage protection that was introduced in 1.9.4-1 would break access through SRM when enabled. This problem has now been fixed.

Detailed changelog 1.9.4-4 to 1.9.4-5

[r12597] cells: Protect cell against SerializationException
[r12645] pool: Make P2P Acceptor more robust
[r12647] cells: Clean up NDC references on thread termination
[r12651] PoolManager: Stop ping handler on error
[r12662] PnfsManager: Make permission update more robust
[r12684] UniversalSpringCell: Fix argument parsing in the presence of -${<number>}
[r12693] get rid of "Already have 1 record(s)" for non-SRM transfers
[r12728] http://rb.dcache.org/r/1000/, gPlazma: prevent unessesary mapping services callouts by synchronizing the gPlazma code which uses/updates caches of authorizations
[r12733] cells: Avoid InterruptedException in CellNucleus.addToEventQueue
[r12775] This patch fixes PinManager to avoid the problem of using SRM door with stage protection
[r12788] dcap: use UUID as a challenge

1.9.4-4

PinManager: Improved robustness against failures.
pool: Fixed race condition in migration module.
xrootd: Fixed processing of statx request in the door.
xrootd: Fixed timeout handling for writes. This solves the problem with zero length files being left in an error state on pools after xrootd timeouts.
xrootd: Delete name space entry in case the redirection fails for a write request.
gPlazma: Fixed NullPointerException.
infoDomain: Improved robustness against communication failures.
srm: Fixed deadlock that could happen when fetching expired SRM requests from the database.
dCacheConfigure: Generates nodes_config file using the SERVICE field rather than the deprecated service flags.
ReplicaManager: Improved performance for installations with a larger number of pools.
PoolManager: Enabled watchdog by default. We hope this solves the problem with disk-only files being suspended even after the pool goes online again.
dcap: Kill mover when client disconnects from the door.
srm: Avoid memory leak when running without a thread manager.
xrootd: Initialize request ID when creating a new mover. This suppresses warnings on the pools when the same file is read by multiple clients. It also allows resend requests to the pool to be collapsed (if configured on the pool).

Detailed changelog 1.9.4-3 to 1.9.4-4

[r12216] http://rb.dcache.org/r/643/, PinManager: Unpin Expired pin requests in PINNING state, resolves rt #4362, #4437, #4731
[r12254] pool: Fix race condition in migration module
[r12270] xrootd: Fix parsing bug in statx request
[r12286] xrootd: Fixed timeout handling and cleanup for write
[r12298] replaced say with logger.debug and esay with logger.error where appropriate
[r12312] Allow access to the info service XML Conduit from only the loopback device
[r12396] From http://rb.dcache.org/r/789
[r12432] Alter Info service to use AbstractCell instead of CellAdaptor
[r12439] Remove hard-coded registration of the info cell as well-known
[r12449] srm: Fix infinite job reload loop
[r12457] Fixing typos in install.sh which broke mounting correctly
[r12491] srm: fix regression introduced by r12451
[r12511] Changing to the new node.config format for dCacheConfigure.sh
[r12518] Backport the patch 739
[r12551] PoolManager: Always enable watchdog
[r12556] PoolManager: Fix PoolUp handling to take lost message into account
[r12562] dcap: remote pending movers on client disconnect
[r12567] http://rb.dcache.org/r/880/,
[r12573] xrootd: Initialize request ID in pool request message

1.9.4-3

Updates to FTP doors: Fixed Kerberos FTP door, which broke in 1.9.2.

Updates to srm: The update space reservation command of SpaceManager can now extend the lifetime of a reservation. Fixed a medium risk vulnerability in SpaceManager.

Updates to pools: Fixed race condition in the migration module which in case of circular moves could cause data loss. Fixed restore of zero length files from HSM.

Updates to infoDomain: Fixed a bug which caused some derived values to not be updated.

Updates to Chimera: Eliminated warnings of missing PinboardAppender in Chimera client tools. The init script no longer starts dirDomain when Chimera is used (it is not needed).

Updates to httpdDomain: Fixed compatibility with Safari.

Updates to PinManager: Fixed repinning issues when pools are down.

Updates to GSIDCAP: Fixed VOMS role handling.

Updates to PoolManager: The trigger mechanism for hot-pool replication has been enhanced by integrating an algorithm contributed by Jon Bakken, FNAL. The algorithm ranks pools based on their CPU cost and selects the cost at the n-th percentile of that ranking: 0% selects the lowest pool cost, 50% selects the median cost and 100% selects the highest pool cost. This cost is used as the threshold for establishing pool-to-pool "on cost" transfers. Specifying an on-cost value as a number not ending with "%" results in the old behaviour; all current dCache deployments have such a value. Specifying a value ending with "%" causes the percentile cost to be calculated dynamically and used as the threshold for on-cost pool-to-pool transfers.
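The percentile selection can be illustrated with a short Python sketch. This is not the dCache implementation: the function name is hypothetical, and the rule used to pick a pool when the percentile falls between two ranks (truncating to the lower rank) is an assumption, since these notes only fix the 0%, 50% and 100% cases.

```python
def on_cost_threshold(on_cost, pool_costs):
    """Return the cost threshold for on-cost pool-to-pool transfers.

    on_cost    -- configured value, e.g. "2.5" (fixed) or "75%" (percentile)
    pool_costs -- current CPU costs of the pools being ranked
    """
    value = on_cost.strip()
    if not value.endswith("%"):
        # Old behaviour: the configured number is the threshold itself.
        return float(value)

    percentile = float(value[:-1])
    if not 0.0 <= percentile <= 100.0:
        raise ValueError("percentile must be between 0% and 100%")
    if not pool_costs:
        raise ValueError("no pool costs available")

    ranked = sorted(pool_costs)
    # Assumption: truncate to the lower rank between two pools.
    # 0% maps to the lowest cost, 100% to the highest.
    index = int(percentile / 100.0 * (len(ranked) - 1))
    return ranked[index]
```

With pool costs of 0.4, 0.1, 0.9, 0.2 and 0.6, a setting of "50%" yields the median cost 0.4, whereas a setting of "2.5" is returned unchanged regardless of the current pool costs.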

Detailed changelog 1.9.4-2 to 1.9.4-3

[r11975] http://rb.dcache.org/r/447, reanimare kerberos ftp door, by propper initialization of _pathRoot and _curDirV
[r11990] http://rb.dcache.org/r/476/ - extend "update space reservation" admin
[r11993] xrootd: Stability fixes for xrootd
[r12001] http://rb.dcache.org/r/489 replace "while" with "if"
[r12013] pool: Make FTP block size configurable
[r12021] pool: Fixed a race condition in migration module
[r12024] Forgot to commit this file
[r12038] Add information about how to decommission the old info provider
[r12043] Fix bug in info service where secondary information isn't always updated
[r12059] http://rb.dcache.org/r/522/ - drop direct 2 remaining direct sql queries
[r12066] Add cells jar to satisfy log4j's default usage of PinboardAppender
[r12077] httpd: Set content type for CSS
[r12089] PnfsManager: Fix integer overflow
[r12102] http://rb.dcache.org/r/536, PinManag refactor, inttroduction of the PinManagerJob for incupsulation of the pin operations state and parameters
[r12116] Add support for Jon's p2p dynamic threshold
[r12125] gplazma - Edited httpd.batch file to show gplazma cell on cellinfo page
[r12131] Fix web server so start up doesn't emit meaningless error message
[r12142] init: Don't start dirDomain on Chimera installations
[r12169] Add spaces to log lines
[r12183] http://rb.dcache.org/r/578/, PinManager: detect if the operations on the existing pin are failing and attempt to create a new pin instead
[r12193] Fix HSM restore of zero length files
[r12199] dcap: gsi: fix voms role handlig

1.9.4-2

When upgrading the SRM door from 1.9.1-10, 1.9.2-9, or 1.9.4-1, the content of the srmrequestcredentials table has to be deleted. This table is present in the database used by the SRM door. Use the psql utility to connect to the database and issue the commands:

\c dcache
delete from srmrequestcredentials;

For instance like this:

Welcome to psql 8.3.7, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

postgres=# \c dcache
You are now connected to database "dcache".
dcache=# delete from srmrequestcredentials ;
DELETE 149
dcache=#

Updates to FTP doors: Fixed a race condition.

Updates to chimeraDomain and Chimera: Fixed PostgreSQL 8.1 compatibility. Eliminated warnings of missing PinboardAppender in Chimera client tools.

Updates to replicaDomain: Replica manager was broken in 1.9.4-1 and is now fixed.

Updates to info provider: The execution bit of info-based-infoProvider.sh is now set.

Updates to srm: Fixed a race condition. Fixed a security issue. Improved database connection handling in space manager. Fixed a database leak.

Updates to pools: Fixed race conditions in DCAP, XROOTD and pool-to-pool support.

Detailed changelog 1.9.4-1 to 1.9.4-2

[r11858] build: fix missing execution bit on info-based-infoProvider
[r11867] ftpdoor: Fixes race condition in use of SimpleDateFormat
[r11869] pool: Fix potential integer overflow issue
[r11876] pool: Fixed a couple of race conditions in DCAP and XROOTD movers
[r11881] pool: Fix correctness issue in p2p component
[r11887] xrootd: Fixed several bugs in token authorization plugin
[r11895] code: pool manager message requires reply
[r11898] chimera: sql: make path2inode stored procedure compatible with postgres 8.1
[r11913] Instead I marked the field volatile
[r11936] http://rb.dcache.org/r/434/, use prepared statements throughout srm sql code
[r11943] spacemanager: robust connection handling
[r11950] chimera: nfs: fix startup script

1.9.4-1

Access control for staging

dCache was originally designed as a disk cache in front of a tape storage system, moving files to the tape back end and restoring them when needed. These operations are transparent to the user. The downside of this approach is that a simple read of a file that is not on disk automatically triggers a tape operation. As tape operations are expensive and may interfere with storing RAW data coming from the Tier 0, this behaviour had to be reviewed. As a result, it has been agreed with the experiments that non-production users should not be allowed to trigger such tape operations. dCache now implements a first version of such a protective mechanism. A dCache system administrator may specify a set of DNs and FQANs that are allowed to trigger tape reads for files not available on disk. Users who request tape-only files and are not on that white list will receive a permission error, and no tape operation is launched.

To enable stage protection, add the following line to config/dCacheSetup:

stageConfigurationFilePath=${ourHomeDir}/config/StageConfiguration.conf

The file config/StageConfiguration.conf will contain the white list. Lines starting with a hash symbol (#) are discarded as comments. Other lines may contain one or two regular expressions enclosed in double quotes. The first matches the DN, and the second the FQAN. The regular expression syntax follows the syntax defined for the Java Pattern class.

The following is an example:

# Allow all ATLAS users who have the role 'production'
".*" "/atlas/Role=production"

In the current release, the white list must be available to all doors and to the pin manager. This requirement will likely change in a future release.
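To make the white list format concrete, the following Python sketch parses such a file and checks a DN/FQAN pair against it. It is an illustration, not dCache code. Several points are assumptions: Python's re module stands in for the Java Pattern class (the two agree for simple expressions like the example above), an expression must match the whole DN or FQAN, a line with only a DN expression accepts any FQAN, and blank lines are skipped. The function names and the sample DN are hypothetical.

```python
import re


def parse_stage_whitelist(text):
    """Parse white list text into (dn_pattern, fqan_pattern_or_None) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comments (and, by assumption, blank lines) are skipped
        quoted = re.findall(r'"([^"]*)"', line)
        if len(quoted) not in (1, 2):
            raise ValueError("expected one or two quoted expressions: " + line)
        dn_pattern = re.compile(quoted[0])
        fqan_pattern = re.compile(quoted[1]) if len(quoted) == 2 else None
        rules.append((dn_pattern, fqan_pattern))
    return rules


def may_stage(rules, dn, fqan):
    """Return True if some rule matches the DN and, if given, the FQAN."""
    for dn_pattern, fqan_pattern in rules:
        if not dn_pattern.fullmatch(dn):
            continue
        # Assumption: a rule without an FQAN expression accepts any FQAN.
        if fqan_pattern is None or fqan_pattern.fullmatch(fqan or ""):
            return True
    return False
```

With the example file above, may_stage(rules, "/DC=org/CN=Some User", "/atlas/Role=production") is True, while the same DN with the FQAN "/atlas" is rejected.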

Chimera

It is vital that the Chimera schema is updated when upgrading to dCache 1.9.4. This can be done using the following commands:

psql -f /opt/d-cache/libexec/chimera/sql/create.sql chimera
psql -f /opt/d-cache/libexec/chimera/sql/pgsql-procedures.sql chimera

As most of the structures already exist, the above scripts will generate error messages about existing relations, entries, and triggers. These are safe to ignore.

Several performance improvements have been made in Chimera. Most notably the path to PNFS ID translation should now be faster when using PostgreSQL. A functionality issue relating to the use of symbolic links has been fixed. This issue caused the path to PNFS ID translation to fail when symbolic links were used.

The Chimera NFS server has seen a couple of performance improvements too. More importantly, a conformance issue was fixed, which solves a problem with the find utility when used with Chimera.

Xrootd

The xrootd door (which serves the role of an xrootd redirector) has been reimplemented. The new implementation should be more scalable, more robust and consume fewer resources. The xrootd mover on the pool (which serves the role of an xrootd data server) has not been reimplemented.

The new xrootd door has a different Java package name. Hence it is essential that dCacheSetup is updated to refer to the correct authorization plugin. In config/dCacheSetup, the line

xrootdAuthzPlugin=org.dcache.xrootd.security.plugins.tokenauthz.TokenAuthorizationFactory

has to be changed to

xrootdAuthzPlugin=org.dcache.xrootd2.security.plugins.tokenauthz.TokenAuthorizationFactory

Without this change, all transfers will fail. If the above line is not present, then the xrootd door has not been configured for token based authorization and in that case no configuration changes are required.
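On installations with many nodes, the edit can be scripted. The Python sketch below rewrites the plugin setting in the text of a dCacheSetup file; the function name is hypothetical, and the handling of file contents as a plain string is an assumption made to keep the example self-contained. Lines that are commented out or that name a different plugin are left untouched.

```python
OLD_PLUGIN = "org.dcache.xrootd.security.plugins.tokenauthz.TokenAuthorizationFactory"
NEW_PLUGIN = "org.dcache.xrootd2.security.plugins.tokenauthz.TokenAuthorizationFactory"


def update_authz_plugin(setup_text):
    """Return (new_text, changed) for the contents of config/dCacheSetup."""
    changed = False
    lines = []
    for line in setup_text.splitlines(keepends=True):
        key, sep, value = line.partition("=")
        # Only touch an active xrootdAuthzPlugin line that names the old class.
        if sep and key.strip() == "xrootdAuthzPlugin" and value.strip() == OLD_PLUGIN:
            line = line.replace(OLD_PLUGIN, NEW_PLUGIN)
            changed = True
        lines.append(line)
    return "".join(lines), changed
```

A caller would read config/dCacheSetup, pass its contents to update_authz_plugin, and write the result back only if changed is True. Running the function a second time on the updated text reports no change.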

The new door has no configurable upper limit on the number of active connections.

Detailed changelog 1.9.3-1 to 1.9.4-1

[r11702] httpd: Fix NPE in case door info is missing
[r11704] common: Don't log InvocationTargetException in MonitoringProxy
[r11707] scripts: Check pnfsManager variable for custom node type
[r11712] info-provider: fix the Version property value for SRM GlueControlProtocol objects
[r11715] namespace: chimera: convert Chimera's FileNotFound into dCache's FileNotFound
[r11719] use single image for background
[r11730] doors: Fix timeout in PnfsManagerFileMetaDataSource
[r11734] build: fix procedure to create a tag and bump version number
[r11735] common: Fix message dispatching for forwarded messages
[r11737] srm: Added missing newline in srm_setup.env
[r11740] pnfsmanager: remove obsolete code
[r11782] scripts: Use correct xrootd setup file
[r11783] xrootd: Don't close file in mover
[r11785] config: Updated documentation in node_config.template
[r11787] Allow "pool create" to work on non-existing directories
[r11788] dcap: mover: fix checksum calculation on transfer
[r11798] xrootd: Lower log level of xrootd mover messages
[r11802] Added flags to make dcap compatable with solaris 10
[r11808] Commit first part of TapeProtection patch, concerning PoolManager
[r11816] poolmanager: Simplifies code by using the message dispatcher
[r11817] chimera: new version of chimera core and nfs
[r11818] pool: Log exception on file not found
[r11819] xrootd: Netty based implementation of Xrootd door
[r11821] JUnit test for CheckStagePermision class
[r11822] core: tape protection: add simple access control for stage requests
[r11823] fixes handling of apostrophes in DN
[r11827] chimera: sql: added path2inode stored procedure used by new chimera-core.jar
[r11828] chimera: sql: fixed line wrapping
[r11830] Tape Protection: Stager in dCap Door
[r11831] chimera: core: Handle empty path in path2inode
[r11840] TapeProtection on pinning
[r11841] better handling of cancelUseSpace call - do not print ERROR on failure
[r11848] pool: Refactors the migration server and first part of checksum handling
[r11849] deploy: check that jdk6 is used