1.9.6 Release Notes

The main focus areas of the 1.9.6 release are the addition of a WebDAV door, complete ACL support in the SRM, the ability to easily move cells between domains, and experimental Terracotta support in the SRM door.

Upgrade Instructions

Incompatibilities

Please consider the following changes when upgrading from a version before 1.9.6-1:

Compatibility

It is safe to mix pools of releases 1.9.4, 1.9.5 and 1.9.6. Head nodes and doors must be upgraded to 1.9.6 together and cannot be mixed with head nodes or doors of releases before 1.9.6. Components of different 1.9.6 releases can be mixed freely.

Compatibility Matrix

We distinguish between pool components and head nodes. Any component which is not a pool is considered a head node, including doors. The following table shows compatibility between different releases.

Component  Release             1.9.6-5 Head  1.9.6-5 Pool
Head       1.9.1-1..7,9..11    no            no
           1.9.1-8             no            no
           1.9.2-1..5,8..11    no            no
           1.9.2-6,7           no            no
           1.9.3-1..4          no            no
           1.9.4-1..5          no            no
           1.9.5-1..15         no            no
           1.9.6-1..5          yes           yes
Pool       1.9.1-1..7,9..11    no            no
           1.9.1-8             no            no
           1.9.2-1..5,8..11    no            no
           1.9.2-6,7           no            no
           1.9.3-1..4          no            no
           1.9.4-1..5          no            no
           1.9.5-1..15         yes[1]        yes[1]
           1.9.6-1..5          yes           yes

[1] dCache 1.9.6-1 head nodes were not compatible with dCache 1.9.5 pools.

Changes in 1.9.6-5

WebDAV

Propagate namespace permission errors: user will see an informative message instead of "500 Internal error".

PnfsManager

Fix interaction between ACLs and POSIX-like permissions that resulted in some valid requests being rejected for certain dCache configurations (if PnfsManager is the Policy Enforcement Point).

Chimera cleaner

Fix message timeout handling: messages that timed out were previously ignored.

Space Manager

Fix reserve admin command so it works if no retention policy is specified.

SRM

Fix handling of srmSetPermission operation to allow interoperability with DPM.

Provide Space Manager with all VOMS groups when a user attempts to reserve or release space. (NB: doesn't affect implicit space reservation)

FTP client

Check whether the remote FTP server supports uploading checksums before attempting to upload a file's checksum. This fixes an interoperability issue with DPM.

dcap door

The dcap door now rejects certain malformed URLs created by incorrect use of the dcap client software.

Web-admin

Cells are initially hidden in the Cell Services page until they start to respond. If a cell goes off-line, it continues to be displayed and is marked as off-line (as happened prior to 1.6.5-17).

Migrating namespace

Fix bug that caused the storage-info migration test to fail.

Info provider

Fix bug where some sites would publish only a subset of their SRM reservations.

JMS

Fix a couple of minor timing issues that resulted in some harmless errors during startup and, under rare conditions, incorrect behaviour when resolving well-known cells.

Detailed changelog 1.9.6-4 to 1.9.6-5

Changes in 1.9.6-4

DCAP

Fixed read permission check when permissionPolicyEnforcementPoint is configured to PnfsManager. We urge sites using earlier builds of 1.9.6 to upgrade to at least 1.9.6-4.

Info Provider

Glue requires that published SA objects (which represent some storage) have a ChunkKey attribute, which describes the relationship between the SA and its containing storage element (SE) object. These attributes were missing for nearline and offline accounting SA objects.

Pools

Fixed the meta2yaml script. The script didn't work for shells other than bash.

FTP

Added support for running multiple FTP doors on the same host. The mechanism for doing this is the same as for running multiple DCAP doors on the same host.

SRM

Starting with 1.9.6-1, the mounted name space is no longer needed by SRM. The init script however still insisted on mounting the name space. Starting with 1.9.6-4, the init script no longer mounts the name space on the SRM node.

Fixed an interoperability issue that affected srmCopy transfers between dCache and DPM.

WebDAV

Fixed the batch script for the WebDAV door. It accidentally contained leftovers from an SVN merge conflict.

Detailed changelog 1.9.6-3 to 1.9.6-4

Changes in 1.9.6-3

SRM

The DB schema detection code for the srmLs related tables has been fixed. The bug caused those tables to be recreated on every restart.

A deadlock in processing of file download requests has been fixed. When triggered, the bug could cause all download request processing to halt until the next restart.

Fixed a bug in the init script that caused the init script to fail to stop or restart the SRM. The bug was introduced in 1.9.5-13.

PoolManager

An error propagation bug was fixed. The bug caused failure to resolve client host names to be retried forever.

Fixed a bug in which Pool Manager could, on rare occasions, create an orphaned file.

DCAP

Failure logging in the DCAP door was improved. A number of non-critical stack traces have been removed.

Fixed an issue with stage authentication.

Pool

The migration module now ignores files that have been deleted in the name space. Previously, such files would be retried until the file was removed by the cleaner.

The fsync failure handling that was changed in 1.9.5-12 has been changed again. In 1.9.5-12 failure to fsync caused a pool to be disabled. We have discovered a number of common sources of fsync failures that are not to be considered critical. Hence in 1.9.5-13 we changed the behaviour such that failure to fsync causes a transfer to fail, but the pool is not disabled.

A bug related to HSM script failures has been fixed. The bug would cause the pool to essentially ignore an error code returned by the HSM script when trying to stage a file. The stage would still fail if checksum verification was turned on (which it is by default), but it would cause the error message in the log files to be misleading.

Fixed several interoperability issues and bugs in the xrootd mover. If xrootd is used we strongly recommend upgrading to 1.9.5-15 or 1.9.6-3.

WebDAV

File upload on multi-homed doors has been fixed. Without this fix pools would possibly connect to the wrong IP. The configuration parameters webdavAddress and webdavInternalAddress have been added to solve this problem.
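As a rough illustration, a multi-homed door might set the two parameters in config/dCacheSetup as shown here. The addresses are placeholders, and the reading that webdavAddress is the address the door listens on while webdavInternalAddress is the address pools connect back to is an assumption; the comments in etc/dCacheSetup.template are the authoritative description.

```
# config/dCacheSetup (illustrative values only)
# Assumed meaning: address on which the door accepts client connections
webdavAddress=192.0.2.10
# Assumed meaning: address that pools use to reach the door
webdavInternalAddress=10.0.0.10
```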

Error reporting to billing has been improved.

A race condition in handling of mover errors has been resolved.

The HTML rendering of a directory listing is now styled using CSS. The styling can be customized. Please read the comments in etc/dCacheSetup.template for details on how to customize the styling.

Authentication through the gPlazma module has been fixed. Authentication through the gPlazma cell worked fine; however, when using the module, all attempts to connect to the WebDAV door would result in an empty page.

Info

Fixed a potential starvation issue in the info service. This should improve responsiveness of the info service on systems that push lots of updates into the info service.

PnfsManager

Fixed a problem in which the Chimera cleaner would wait indefinitely for a lost reply from a pool. When the problem occurred the cleaner had to be restarted.

Added HSM support to Chimera cleaner. When enabled in config/dCacheSetup, Chimera cleaner now sends remove requests to HSM attached pools for files on tape. The pool will call the HSM integration script with the remove command. This mimics the behaviour of the HSM cleaner for PNFS.

Improved handling of flush notification. This improves the error reporting when using HSM integration scripts using the URI approach.

Chimera

Fixed a problem with setting file mode for files created through the mounted NFS 3 file system. The file mode for such files was previously hardcoded. This problem did not affect files created through dCache.

Admin

Fixed compatibility with the GUI.

Space Manager

Fixed update space reservation command such that lifetime="-1" sets reservation to be non-expiring.

Pin Manager

Added the bulk unpin command.

All

Error reporting in the dCache batch processor has been improved. The line number and file name are now reported on fatal errors.

A bug in the dCache init script (bin/dcache) related to stopping domains has been fixed. The bug would cause domains that were in the middle of an automatic restart to be ignored by the stop command. Also the start and restart commands would not recognize such domains as already running and would start a second instance. Part of the fix is that the bin/dcache status command now shows when a domain is restarting. One consequence of the fix is that we now store two PID files; one for the Java process and one for the shell script that handles the restart logic. The scripts shipped with dCache have been updated to handle this change, but if you have custom scripts that rely on the PID files, then those need to be updated.
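For sites adapting custom scripts to the two PID files, the following is a minimal sketch of how such a script might tell a running domain from one that is in the middle of an automatic restart. The function names are hypothetical, and because these notes do not give the PID file names, the paths are passed in as arguments. The interpretation that a live wrapper script with no live Java process means "restarting" is also an assumption.

```shell
#!/bin/sh
# Return 0 if the PID recorded in the given file belongs to a live process.
# Note: kill -0 only succeeds if we are allowed to signal that process.
pid_file_alive() {
    pid_file=$1
    [ -r "$pid_file" ] || return 1
    pid=$(cat "$pid_file") || return 1
    # Reject empty or non-numeric content
    case $pid in
        ''|*[!0-9]*) return 1 ;;
    esac
    kill -0 "$pid" 2>/dev/null
}

# Print the assumed state of a domain given its two PID files:
#   $1 = PID file of the Java process (path is site-specific)
#   $2 = PID file of the shell script handling restarts (path is site-specific)
domain_state() {
    java_pid_file=$1
    daemon_pid_file=$2
    if pid_file_alive "$java_pid_file"; then
        echo running
    elif pid_file_alive "$daemon_pid_file"; then
        echo restarting
    else
        echo stopped
    fi
}
```

A custom stop script would then need to treat both "running" and "restarting" as states in which the domain still has to be stopped.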

A script compatibility issue with Solaris has been fixed. The bug caused the automatic restart of domains to fail on Solaris.

The init script now uses an environment variable to pass the CLASSPATH to the Java Virtual Machine. This solves a problem where the command line length exceeded the maximum length allowed. It also makes the output of utilities like ps nicer to read as the excessively long CLASSPATH is no longer included.

The log4j related commands previously exposed through the System cell are now available in all cells. The commands still affect the logging of the complete domain.

Detailed changelog 1.9.6-2 to 1.9.6-3

Changes in 1.9.6-2

Important fixes

An XROOTD write corruption issue has been fixed in the pool. This issue was introduced in version 1.9.5-1 and we recommend that all data written through XROOTD to an affected version of the pool code is verified for integrity. Notice that the dCache checksum cannot be used to verify the integrity of the file, as the checksum is computed from corrupted data. We urge all sites running a prior version of 1.9.5 and using XROOTD for writing into dCache to immediately upgrade to at least version 1.9.5-11 or 1.9.6-2.

Compatibility

The SrmSpaceManager and the transfer managers have been taken out of the srm.cell file and moved into the spacemanager.cell and transfermanagers.cell files. Both of these new files are executed from config/srm.batch and the cells are still started inside the SRM domain; however, care should be taken on upgrade if config/srm.batch has been customized.

dCache 1.9.6-1 was unfortunately not compatible with pools from earlier versions. This has now been resolved and dCache 1.9.6-2 head nodes can work with dCache 1.9.5 pools. Pools from earlier versions are not supported.

Head nodes and doors

The Kerberos FTP door has been broken since the release of dCache 1.9.5-1. The problem has been resolved in dCache 1.9.5-11 and 1.9.6-2.

The SRM list implementation has been restructured to no longer require the mounted file system. Thus starting with dCache 1.9.6-2, the only node requiring access to the mounted file system is PnfsManager when using PNFS. With Chimera no dCache node requires access to the mounted file system. The performance characteristics of the new implementation are different. Single file listings and non-verbose listings should perform as before, while verbose listings appear faster on some setups and a bit slower on others. If you observe drastic changes in srmLs performance, please notify support at dcache dot org.

A new version of PNFS was released in November 2009. The new version supports registering all deleted files in a special table in the PNFS database. dCache 1.9.5-12 has new options to configure the PnfsManager to let dCache consult this table when it cannot find a PNFS entry for a file. If found in this table, dCache pools will delete such orphaned files. The configuration options are pnfsDeleteRegistration, pnfsDeleteRegistrationDbUser, pnfsDeleteRegistrationDbPass. Please consult etc/dCacheSetup.template for details.

Several possible sources of the spurious "Already have one record" write failures have been fixed. Please inform us if you observe any changes in behaviour after upgrading to dCache 1.9.6-2.

The command to stage files administratively through the PinManager was broken since dCache 1.9.5-1. This problem has been fixed.

Two configuration parameters of thread manager are now configurable. The configuration parameters are threadManagerThreads and threadManagerTimeout and can be redefined in config/dCacheSetup.

Several deadlocks in the SRM have been resolved.

dCache supports running multiple DCAP doors on the same host. Support for this has however been broken since dCache 1.9.3-1 due to bugs in the init script. This has now been fixed.

The feature to override configuration parameters on a per cell level by adding a list of key-value pairs in the batch files unfortunately did not work in dCache 1.9.6-1. In dCache 1.9.6-2 this should work as described in the 1.9.6 release notes.

Pools

An issue with recovery of broken SI files in the pool's control/ directory has been fixed. That issue prevented pools with such files from starting. With dCache 1.9.5-12 the SI file is recovered from PnfsManager.

Another problem affecting recovery of meta data on pools was that the recovery logic did not respect the lfs=volatile setting. This has been fixed to correctly mark files on such pools as cached.

Starting with dCache 1.9.3-1, dCache would call fsync(2) after upload to synchronize the new file's in-core state with the storage device. Failure during sync would be logged, but was not considered a fatal error. After talking to several sites, we have decided to consider failure to sync to be a fatal error. Starting with dCache 1.9.5-12 the transfer will be considered failed and the pool will be disabled.

The callout to the HSM script on dCache pools has been updated such that all HSM locations of a file are passed to the script.

Detailed changelog 1.9.6-1 to 1.9.6-2

Changes in 1.9.6-1

WebDAV

From Wikipedia: "Web-based Distributed Authoring and Versioning, or WebDAV, is a set of extensions to the Hypertext Transfer Protocol (HTTP) that allows computer-users to edit and manage files collaboratively on remote World Wide Web servers. RFC 4918 defines the extensions." Wikipedia further notes: "The WebDAV protocol allows interactivity, making the Web a readable and writable medium, in line with Tim Berners-Lee's original vision. It allows users to create, change and move documents on a remote server (typically a web server or "web share"). This has obvious uses when authoring the documents that a web server serves, but it can also be used for storing files on the web, so that the files can be accessed from anywhere."

WebDAV is supported by all modern operating systems out of the box. This includes Windows XP, Windows Vista, Windows 7, Mac OS X, and Gnome and KDE shells for Linux and Unixes.

dCache 1.9.6 introduces the WebDAV door. The WebDAV door supports both unauthenticated HTTP and HTTPS and client certificate based authentication over HTTPS. The current version does not support HTTP basic authentication, but this will likely be added in dCache 1.9.7. The WebDAV door supersedes the old HTTP door and we therefore no longer ship httpdoor.batch with dCache.

To enable the WebDAV door, specify webdav as a service in etc/node_config. You may want to adjust the default parameters for the WebDAV door in config/dCacheSetup. Consult etc/dCacheSetup.template for available parameters and documentation on those parameters.

To enable WebDAV over HTTPS, set webdavProtocol=https in config/dCacheSetup. The host and CA certificates need to be imported to be readable by the Java SSL libraries. To do this, run the commands bin/dcache import hostcert and bin/dcache import cacerts. Consult the man page of the dcache script and the documentation in etc/dCacheSetup.template for available options.
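The HTTPS steps above can be summarised as the following sketch. It assumes the commands are run from the dCache installation directory; the setting and the two import commands are the ones named in these notes, and any further options are described in the dcache man page and etc/dCacheSetup.template.

```
# 1. In config/dCacheSetup, switch the door to HTTPS:
webdavProtocol=https

# 2. Import the host and CA certificates so the Java SSL libraries can read them
#    (assumes the current directory is the dCache installation directory):
bin/dcache import hostcert
bin/dcache import cacerts
```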

dCache domains and batch files

A dCache installation consists of a number of dCache domains, each containing a number of cells. Each cell represents a service. dCache domains are defined through batch files in the config/ directory. We have urged sites not to modify these batch files as doing so would make upgrading more difficult. However, many sites wish to change which cells are running in which domain. Until now, there was no way to move a cell from one domain to another without modifying the batch files.

To solve this problem, we have refactored all batch files in dCache 1.9.6. Each batch file has been split into a number of .fragment and .cell files. Those .fragment and .cell files are not user editable. The batch files in config/ now consist of exec instructions that include the .fragment and .cell files. By rearranging those exec commands in the batch files, one can easily move a cell from one domain to another. To show an example, the definition of the dCacheDomain looks like this:

onerror shutdown

exec -shell file:${ourHomeDir}/share/cells/logging.fragment
exec -shell file:${ourHomeDir}/share/cells/setup.fragment
exec -shell file:${ourHomeDir}/share/cells/tunnel.fragment
exec -shell file:${ourHomeDir}/share/cells/poolmanager.cell
exec -shell file:${ourHomeDir}/share/cells/dummy-prestager.cell
exec -shell file:${ourHomeDir}/share/cells/broadcast.cell
exec -shell file:${ourHomeDir}/share/cells/loginbroker.cell

All domains must execute the first three fragment files, but the remaining execution of the cell files can be moved between domains.
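As a sketch of such a move, the batch file of some other domain could host the login broker by carrying the three mandatory fragments plus the loginbroker.cell line, with that line removed from the dCacheDomain definition shown above. Which domain receives the cell, and the name of its batch file, is up to the site and is hypothetical here.

```
# Hypothetical batch file of another domain that now hosts the login broker
onerror shutdown

exec -shell file:${ourHomeDir}/share/cells/logging.fragment
exec -shell file:${ourHomeDir}/share/cells/setup.fragment
exec -shell file:${ourHomeDir}/share/cells/tunnel.fragment
exec -shell file:${ourHomeDir}/share/cells/loginbroker.cell
```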

To make it easier to have several instances of the same cell, for instance a door, we have restructured the parameter parsing such that configuration parameters can be overridden on the exec line in the batch file. To show an example, this is the default batch file for a domain containing a GridFTP door:

onerror shutdown

exec -shell file:${ourHomeDir}/share/cells/logging.fragment
exec -shell file:${ourHomeDir}/share/cells/setup.fragment
exec -shell file:${ourHomeDir}/share/cells/tunnel.fragment
exec -shell file:${ourHomeDir}/share/cells/gridftp.cell GFTP-${thisHostname}

To start two GridFTP cells within the same domain, we could change this to:

onerror shutdown

exec -shell file:${ourHomeDir}/share/cells/logging.fragment
exec -shell file:${ourHomeDir}/share/cells/setup.fragment
exec -shell file:${ourHomeDir}/share/cells/tunnel.fragment
exec -shell file:${ourHomeDir}/share/cells/gridftp.cell GFTP1-${thisHostname} \
    gsiFtpPortNumber=2811
exec -shell file:${ourHomeDir}/share/cells/gridftp.cell GFTP2-${thisHostname} \
    gsiFtpPortNumber=2812 gsiftpIoQueue=local

This would start one GridFTP cell called GFTP1-<hostname> on port 2811 and another GridFTP cell called GFTP2-<hostname> on port 2812. The first would use the default mover queue on pools, the second would use the queue called local. All parameters definable in etc/dCacheSetup can be overridden in this way (this is currently not true for PoolManager and pools, but that will be fixed soon).

SRM on Terracotta

Terracotta is a distributed shared memory platform. With Terracotta, SRM can be scaled to run on multiple hosts. dCache 1.9.6 introduces Terracotta support for SRM. This is currently considered an experimental feature which we are testing in preproduction environments. If you are interested in this feature, please contact support at dcache.org for detailed instructions.

SRM and Access Control Lists

dCache 1.9.4 added ACL support to FTP and DCAP. dCache 1.9.5 added partial ACL support to the SRM door. In the dCache 1.9.6 release ACL support has been extended to all SRM operations, including srmCopy, srmPrepareToGet and srmPrepareToPut. ACL support needs to be configured in PnfsManager by setting aclEnabled to true in config/dCacheSetup.
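In config/dCacheSetup this amounts to a single line. A minimal sketch, assuming the same key=value syntax used for other settings in that file:

```
# config/dCacheSetup
aclEnabled=true
```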

Related to ACL support is that all permission checks for SRM are now delegated to PnfsManager.

Other changes

We have further cleaned up logging in SRM, SpaceManager and gPlazma. The log files should be less noisy now.

NFS 4.1 compliance has been improved a lot in dCache 1.9.6.

Persistent command history in the admin shell now has to be explicitly enabled by defining adminHistoryFile in config/dCacheSetup.
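A minimal sketch of enabling the history, assuming the parameter takes a file path; the location shown is hypothetical and should be replaced by a path suitable for the site:

```
# config/dCacheSetup
# Hypothetical location of the history file
adminHistoryFile=/var/lib/dcache/adminshell_history
```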

Detailed changelog 1.9.5-1 to 1.9.6-1