dCache 2.10 Release Notes

Executive summary

Improved SRM scheduling and protection of other components.
Reimplemented webadmin active transfers page.
Improved support for HTTP and HTTPS 3rd-party transfers.
Added support for 3rd-party transfers through WebDAV.
Improved support for HTTP 3rd-party transfers through SRM.
Improved default HTML rendering in WebDAV.
Added JSON support for info.

Incompatibilities

The NFS namespace cache configuration parameter nfs.namespace-cache.unit is renamed to nfs.namespace-cache.time.unit; this makes it compliant with dCache configuration guidelines.
By default, Kerberos uses the values in /etc/krb5.conf. If both kerberos.realm and kerberos.key-distribution-center-list are set, the configured values are used instead.
In billing log files, transfer-specific information is written as a colon-separated list of items. For HTTP and xrootd, the format has changed, with a colon now separating the protocol version information and the client IP address. The output for HTTP and xrootd is now consistent with information recorded for other protocol transfers.
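The Kerberos precedence rule in the incompatibilities list can be sketched as follows. This is a minimal illustration, assuming the dCache properties are available as a plain key/value mapping. Reading the note literally, the custom values apply only when both properties are set; otherwise /etc/krb5.conf is used. The function and type names are hypothetical and are not part of dCache.

```python
from typing import Mapping, NamedTuple, Optional


class KerberosSettings(NamedTuple):
    """Hypothetical result type: where the Kerberos configuration comes from."""
    source: str              # "/etc/krb5.conf" or "dCache properties"
    realm: Optional[str]
    kdc_list: Optional[str]


def resolve_kerberos(props: Mapping[str, str]) -> KerberosSettings:
    """Apply the 2.10 rule: custom values only when BOTH properties are set."""
    realm = props.get("kerberos.realm", "").strip()
    kdcs = props.get("kerberos.key-distribution-center-list", "").strip()
    if realm and kdcs:
        # Both properties set: the configured values take precedence.
        return KerberosSettings("dCache properties", realm, kdcs)
    # Otherwise fall back to the system file, even if only one property is set.
    return KerberosSettings("/etc/krb5.conf", None, None)
```

Setting only kerberos.realm therefore still leaves the system file in charge, which is the case most likely to surprise sites upgrading from earlier versions.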

Release 2.10.62

poolmanager

PoolManager was updated to properly handle the dcache.authz.staging.pep and dcache.authz.staging parameters. This allows stage protection to be enabled properly.

Changelog 2.10.61..2.10.62

29cbe36: [maven-release-plugin] prepare release 2.10.62
4247d9a: PoolManager : stage protection, fix error in stage.fragment
332e0ff: [maven-release-plugin] prepare for next development iteration

Release 2.10.61

pnfsmanager

Fixed a problem in pnfsmanager that prevented the use of chimera.db.password.file.

Changelog 2.10.60..2.10.61

53a6ba9: [maven-release-plugin] prepare release 2.10.61
4700ae2: pnfsmanager: Fix chimera.db.password.file support
a381b7e: [maven-release-plugin] prepare for next development iteration

Release 2.10.60

Changes affecting multiple services

Fixed an issue with the dcache dump heap command when called with a simple file name as the output path. In some cases the dump was written to a different directory while the script reported that the dump had failed. The dcache dump heap command also has a --force option for cases in which the JVM is unresponsive; this option was ignored for processes not running as root. Both problems are now fixed.

cells

A bug in the System cell that caused messages to bounce on no-route errors is fixed.

pool

Fix a race condition in the request scheduler.

Changelog 2.10.59..2.10.60

40e95fd: [maven-release-plugin] prepare release 2.10.60
1a19fdd: dcache: fix heap dump to simple file names
619d06f: script: Make dump heap --force work for non-root processes
ead548d: srm: Do not expose TURL before request is ready
6196710: pool: Fix race condition in request scheduler
b560f4e: billing: additional fixes to insert triggers
1aba4b3: system-test: update disposable-CA generated credentials
fef3b09: cells: Avoid bouncing message on no-route errors in System cell
770e269: [maven-release-plugin] prepare for next development iteration

Release 2.10.59

many

Checksums shown in the admin interface and in configuration files are now presented in an improved format.

Changelog 2.10.58..2.10.59

6f9dd3f: [maven-release-plugin] prepare release 2.10.59
78df3e6: common: fix ChecksumType.toString()
967f410: [maven-release-plugin] prepare for next development iteration

Release 2.10.58

cells

LoginManager would occasionally log error messages similar to “Discarding ‘listening on $LOCATION 53684’ because its age of 18721640 ms exceeds its time to live of 4500 ms.” This was caused by the erroneous reuse of old message envelopes in internal messaging. This change fixes that problem.

This change also addresses a potential problem in which messages sent between cells in the same domain could appear older than they actually are, and so risk being discarded because their time to live appears to have expired.

srm

A Tier-1 site reported problems with a major WLCG VO’s read requests. Investigating the source of the problems showed that the srm_ifce library, used by the (outdated) GFAL v1 and the (supported) GFAL v2 SRM libraries, drastically limits the permitted lifetime of requests without providing admins any way to configure this.

For sites seeing errors related to desiredTotalRequestTime being exceeded, this change provides the new configuration option srm.request.maximum-client-assumed-bandwidth in srm.properties as a work-around.

Sites not observing such errors do not need to change anything with regard to this value.

Changelog 2.10.57..2.10.58

a73a6ed: [maven-release-plugin] prepare release 2.10.58
831dad1: srm: add short request lifetime work-around
838e4f9: chimera: Fix regression in inheriting ACLs on directory creation (HSQLDB)
9cf030e: cells: Improve robustness of message time to live
bd4bcbc: cells: Fix erroneous reuse of message envelope in location manager registration
337ba79: [maven-release-plugin] prepare for next development iteration

Release 2.10.57

pnfsmanager

A recent regression caused billing entries for SRM uploads to lose their storage class. This update fixes that issue.

poolmanager

This release fixes a potential race condition in the pool manager’s pool selection unit.

webdav

This release corrects error reporting in WebDAV. When attempting to delete a nonexistent file, unauthenticated users now receive a 401 Unauthorized response, while authenticated users receive a 404 Not Found response.

Changelog 2.10.56..2.10.57

66254ce: [maven-release-plugin] prepare release 2.10.57
b3aa010: pnfsmanager: Fix regression in SRM billing entries
18d6a05: poolmanager: Fix race condition in pool selection unit
bd0791c: webdav: fix 404 error if attempting to delete a nonexistent file
aa6eef6: [maven-release-plugin] prepare for next development iteration

Release 2.10.56

pnfsmanager

With this release, PnfsManager adds safety checks that reject invalid upload paths that SRM might erroneously supply. This hardens an installation against bugs that could otherwise trigger data loss.

This release also adds a check that detects failed or incomplete SRM uploads and prevents such files from being committed to their final path. Common symptoms of this problem were zero-sized files that experiment catalogues had registered as successfully uploaded.
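A minimal sketch of the kind of commit-time check described above, assuming a hypothetical TemporaryUpload record that carries the stored size, a completion flag and an optional size announced by the client. It only illustrates the idea named in commit f9b2635 (check file size and upload completion when committing temporary upload paths); it is not PnfsManager’s actual code.

```python
from dataclasses import dataclass
from typing import Optional


class UploadIncompleteError(Exception):
    """Hypothetical error: the temporary upload must not be moved to its final path."""


@dataclass
class TemporaryUpload:
    temp_path: str
    final_path: str
    size: int                             # bytes actually stored
    transfer_complete: bool               # did the write finish?
    expected_size: Optional[int] = None   # size announced by the client, if any


def check_before_commit(upload: TemporaryUpload) -> None:
    """Raise instead of committing an upload that failed or is incomplete."""
    if not upload.transfer_complete:
        raise UploadIncompleteError(f"{upload.temp_path}: transfer has not completed")
    if upload.expected_size is not None and upload.size != upload.expected_size:
        # Catches the zero-sized files that catalogues wrongly registered.
        raise UploadIncompleteError(
            f"{upload.temp_path}: size {upload.size} differs from expected {upload.expected_size}")
```

Refusing the commit leaves the file out of its final path, so a client that checks the srmPutDone result sees a failure instead of a registered zero-sized file.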

srm

This release adds a check, performed during srmPutDone (the final stage of an SRM upload), that detects broken uploads. Although this makes transfers take slightly longer, it increases resilience against upload failures.

Changelog 2.10.55..2.10.56

60d6f92: [maven-release-plugin] prepare release 2.10.56
0f3e6d7: srm: Check for broken files during srmPutDone
f9b2635: pnfsmanager: Check file size and upload completion when committing temporary upload paths
a99b248: pnfsmanager: Protect against erroneous upload paths
f5b9fc4: [maven-release-plugin] prepare for next development iteration

Release 2.10.55

Changes affecting multiple services

Several cases of slow performance were reported when deleting directories in Chimera. This is now fixed.

pool

When a command that migrates files between pools (e.g., migration concurrency or migration copy) failed because the migration job could not be found, the error was reported as a bug. This is now fixed: the returned message states that the requested job does not exist.

srm

When a file upload is cancelled, the temporary upload path tracked by SRM may differ from a regular upload path, either because it was changed outside of dCache or because it was created by a very old version of dCache. Cancelling such an upload could result in data loss. This release adds a safeguard against this potential data-loss scenario.

statistics

The statistics service creates static HTML pages describing dCache usage over time, as simple files that the webadmin service can serve. These include information about pools and store-units. Previously, the statistics web pages did not show any information about a pool or store-unit whose name contains a /. This is now fixed. As a side effect, the history of any pool or store-unit whose name contains a ^ is lost.
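The loss of history for names containing ^ suggests that ^ serves as the escape character when / is encoded into file names. A minimal sketch of such an escaping scheme follows; the ^XX hexadecimal format is an assumption for illustration, not necessarily the encoding the statistics service uses. The escape character itself must also be escaped, otherwise decoding would be ambiguous.

```python
ESCAPE = "^"
RESERVED = "^/"  # hypothetical: the escape character and the path separator


def encode_name(name: str) -> str:
    """Make a pool or store-unit name safe to use as a single file name."""
    return "".join(f"{ESCAPE}{ord(ch):02X}" if ch in RESERVED else ch for ch in name)


def decode_name(encoded: str) -> str:
    """Invert encode_name by expanding each ^XX sequence."""
    out = []
    i = 0
    while i < len(encoded):
        if encoded[i] == ESCAPE:
            out.append(chr(int(encoded[i + 1:i + 3], 16)))
            i += 3
        else:
            out.append(encoded[i])
            i += 1
    return "".join(out)
```

Under a scheme like this, a store-unit named exp^1 is stored as exp^5E1 rather than under its old file name, which is one way the earlier history for such names could become unreachable.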

Changelog 2.10.54..2.10.55

940029f: [maven-release-plugin] prepare release 2.10.55
f3a19ff: chimera: Alter statistics target for t_tags(itagid)
6a2f876: srm: Add safe-guard against invalid file ID in put requests
e05f2b0: pool: Don’t consider failure to find migration job a bug
d9503fe: statistics: encode ‘/’ in filenames
194d2af: [maven-release-plugin] prepare for next development iteration

Release 2.10.54

nfs

Pinning files is now a non-blocking operation. For files stored on tape, this should result in a more responsive system behaviour, avoiding NFS blocking in situations with many concurrent pin requests.

Changelog 2.10.53..2.10.54

c27ed41: [maven-release-plugin] prepare release 2.10.54
7b3c317: adapted commit 3eab402754b814f681a14d296d808031b05f2737 for 2.10 branch
e83c140: nfs: use noitify instead of blocking sendAndWait when sending pin/unpin messages via touch “.(get)(<file_name>)(pin)” command
c00a14b: [maven-release-plugin] prepare for next development iteration

Release 2.10.53

Changes affecting multiple services

Previously, if a cell’s start-up was interrupted, an error message was sometimes logged as a bug. This is now fixed.

info-provider

The GLUE information provider supplies information about the dCache instance that is important for WLCG clients. Because different doors in dCache can have different roots, clients may need to adjust their paths when accessing dCache through different doors. The info-provider now publishes each door’s root path, allowing clients to modify paths as necessary. Note that the SRM door already performs this translation when redirecting clients for transfers.

Changelog 2.10.52..2.10.53

627dfda: [maven-release-plugin] prepare release 2.10.53
1e40f5f: info-provider: publish door root path
902503e: cells: Suppress illegal state exception during initialization
8e7fbb8: [maven-release-plugin] prepare for next development iteration

Release 2.10.52

pool

Revert Netty back to v3.9.9. A previous release upgraded Netty, but this appears to have caused pools to run out of memory during some HTTP transfers.

spacemanager

Spacemanager backs off when it encounters a problem writing to the database. Previously, if the problem was a deadlock, both tasks involved were delayed by the same amount, so their next attempts could deadlock again. This release randomises the delay to reduce the likelihood of this problem occurring.
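A minimal sketch of randomised backoff (exponential backoff with full jitter), assuming a hypothetical TransientError that stands in for a deadlock reported by the database. Because each task draws its own random delay, two tasks that deadlocked together are unlikely to retry at the same moment. The formula spacemanager actually uses may differ.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable database error such as a deadlock."""


def retry_with_jitter(task: Callable[[], T], attempts: int = 5,
                      base: float = 0.05, cap: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> T:
    """Run task, retrying transient failures after a randomised delay."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    rng = rng if rng is not None else random.Random()
    for attempt in range(attempts):
        try:
            return task()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries: propagate the last error
            # Full jitter: a delay drawn uniformly between 0 and an
            # exponentially growing ceiling, so colliding tasks diverge.
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

Injecting sleep and rng keeps the retry loop testable; a fixed, shared delay (the pre-fix behaviour described above) would correspond to replacing rng.uniform(0, ceiling) with a constant.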

Changelog 2.10.51..2.10.52

770e638: [maven-release-plugin] prepare release 2.10.52
985cbae: pool: revert Netty back to v3.9.9
b3edbfc: spacemanager: Randomize backoff in case of transient errors
82c9902: [maven-release-plugin] prepare for next development iteration

Release 2.10.51

xrootd

Fix the handling of open requests in which uploads were misclassified as downloads.

Changelog 2.10.50..2.10.51

cf8c162: [maven-release-plugin] prepare release 2.10.51
bd1bf66: xrootd: Fix classification of uploads
94f6494: [maven-release-plugin] prepare for next development iteration

Release 2.10.50

Changes affecting multiple services

Don’t log “Error while reading from tunnel: java.nio.channels.AsynchronousCloseException” when a domain shuts down.

Changelog 2.10.49..2.10.50

a13edb1: [maven-release-plugin] prepare release 2.10.50
2e55290: cells: Don’t log AsynchronousCloseException when tunnel closes
3baf68f: [maven-release-plugin] prepare for next development iteration

Release 2.10.49

Changes affecting multiple services

Update the Spring, Milton, AspectJ, Jetty and DataNucleus-core libraries to their latest versions. All dCache services are affected.

pool

Previously, if an HTTP third-party transfer failed, the pool could log and report incomplete information about why it failed. This release fixes this problem.

Changelog 2.10.48..2.10.49

b107659: [maven-release-plugin] prepare release 2.10.49
52b890a: 2.10: upgrade third party dependencies
8fa4aa4: http-3rd-party: ensure IOException logged with toString
f5bba65: info: fix test to be less critical on timing
5047e84: [maven-release-plugin] prepare for next development iteration

Release 2.10.48

nfs

Report EOF when a client with a mixed read/write workload attempts to read beyond the data written so far.
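A minimal sketch of tracking written bytes so that a read past the written data reports EOF. The WrittenData class is hypothetical; it only mirrors the idea behind commit 761dd30 (nfs-proxy: keep track of written bytes) using an in-memory buffer instead of a real NFS mover.

```python
from typing import Tuple


class WrittenData:
    """Hypothetical in-memory file that remembers how many bytes were written."""

    def __init__(self) -> None:
        self._data = bytearray()

    def write(self, offset: int, data: bytes) -> None:
        end = offset + len(data)
        if end > len(self._data):
            # Extend the file; any gap before offset is filled with zeros.
            self._data.extend(bytes(end - len(self._data)))
        self._data[offset:end] = data

    def read(self, offset: int, count: int) -> Tuple[bytes, bool]:
        """Return (data, eof); eof is True once the read reaches the written size."""
        written = len(self._data)
        chunk = bytes(self._data[offset:offset + count])
        eof = offset + len(chunk) >= written
        return chunk, eof
```

A read that starts beyond the written size returns no data with eof set, rather than an error, which is the behaviour a client mixing reads and writes expects.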

poolmanager

Previous releases of dCache contained a bug where replicas created by pool-to-pool copies lacked the access latency and retention policy. Although this did not directly affect dCache operations, it means this information is no longer reliable for such replicas. This release fixes the bug.

spacemanager

Fix listing by PNFS-ID. Glob support is removed as it was non-functional.

Changelog 2.10.47..2.10.48

6cd514d: [maven-release-plugin] prepare release 2.10.48
4422057: poolmanager: Fix missing access latency and retention policy on pool to pool copy
05f9430: spacemanager: Fix listing by pnfs id
f6aada3: [maven-release-plugin] prepare for next development iteration
761dd30: nfs-proxy: keep track of written bytes

Release 2.10.47

Changes affecting multiple services

This release fixes a caching issue where changes to inode metadata (e.g., ownership or permissions) of / (the root directory of Chimera) were not visible until the service was restarted. This affects the nfs and pnfsmanager services.

Changelog 2.10.46..2.10.47

69ef1b6: [maven-release-plugin] prepare release 2.10.47
c54b8b5: chimera: Prevent filling of stat cache of root inode
15012ac: [maven-release-plugin] prepare for next development iteration

Release 2.10.46

Changes affecting multiple services

The chimera library, used by PnfsManager and NFS, contained race conditions that could lead to a NullPointerException. These races are now detected, so Chimera reports the correct error message under these circumstances.

The chimera library, used by the pnfsmanager and nfs services, contained a bug where near-simultaneous attempts to delete an empty directory and to write a file into that directory could both succeed, leaving an orphaned file: the file exists in the t_dirs table but its parent does not exist in the t_inodes table. This seems to be triggered when an ftp door fails because no write pool is configured. This release fixes this problem.

pnfsmanager

Fix the error message logged by the domain hosting pnfsmanager when an attempt to finalise or cancel an SRM upload fails within pnfsmanager.

webdav

Update to the latest version of milton.

xrootd

Update the alice-token plugin to allow the host name check to succeed on dual-stack (IPv4 and IPv6) machines.

Changelog 2.10.45..2.10.46

7b4db8b: [maven-release-plugin] prepare release 2.10.46
06416d1: xrootd: Update alice token plugin to fix IPv6 compatibility
6540fed: chimera: Detect races in directory deletion
5e4d00b: chimera: Detect races during move
6d150cb: webdav: update to latest milton
5b065e8: PnfsManager: remove copy-n-paste error in error message
96a2000: system-test: Fix grid-security settings
2386c80: rpm: enforce SL5 compatibility when building RPM packages
948023b: [maven-release-plugin] prepare for next development iteration

Release 2.10.45

Changes affecting multiple services

Eliminate race condition that can lead to a NullPointerException if a cell does not shut down cleanly.

The dcache script no longer checks whether the hostkey.pem file is in PKCS#8 format when dcache stop is invoked. Previously this could lead to orphaned dCache domains; for example, when upgrading the dCache RPM.

admin

Fix potential IndexOutOfBoundsException should the response from the acm cell be malformed.

pnfsmanager

Contribution from Kurchatov: update the ChimeraCleaner to work around the Java compiler’s inability to produce compatible Java 7 binaries when compiling with JDK 8. Note: sites using official dCache.org packages do NOT need to upgrade as a result of this change.

pool

Fix an intended pool-to-pool transfer optimisation: the receiving pool failed to reuse a delayed mover if the pool-to-pool request timed out and was retried.

Fix pools so that they no longer log a NullPointerException when they receive a request to restore a file from an HSM to which they have no access.

Scripts

Fix how the writedata command in the chimera shell accepts data. Previously, a command-line argument was ignored if supplied and the data was read from stdin instead; if no argument was supplied, the command failed with a NullPointerException. Note: this command does NOT write data into dCache, but into Chimera.

srm

Fix the context information included when logging failures to write job information to the database.

Changelog 2.10.44..2.10.45

f0cd764: [maven-release-plugin] prepare release 2.10.45
03207f3: srm-client, dcache: fixed passing incompatible arguments to functions
049be02: dcache: removed unecessary use of non-short-circuit logic
6212e25: pool: Fix NPE when restoring file
21b618f: ChimeraCleaner: reallow to be compiled/run on different Java versions
bc2c0da: scripts: do not check for PKCS#8 formatted hostkey.pem on shutdown
3f10a11: chimera: Null value passed to non-null parameter in org.dcache.chimera.cli.Shell$WriteCommand.call()
434b345: srm: Use correct logging context when saving jobs
5d75167: cells: Fix NPE during shutdown
3d7d49e: [maven-release-plugin] prepare for next development iteration

Release 2.10.44

Changes affecting multiple services

When starting up, all doors (dcap, ftp, nfs, srm, webdav, xrootd) and pools previously advertised their presence to other dCache components before they were able to handle incoming requests. This could lead to subsequent queries timing out while the service finished starting up. With this version of dCache, doors and pools only advertise their presence once they can handle incoming requests.

pool

This release updates how dCache configures Berkeley DB when it is used to store pool metadata. In addition, dCache no longer disables the pool after a Berkeley DB-related problem if the Berkeley DB environment is still valid. Combined, these two changes should greatly reduce the occurrences of pools disabling themselves under heavy IO load.

Changelog 2.10.43..2.10.44

92d00ca: [maven-release-plugin] prepare release 2.10.44
56cd33a: pool: Refine Berkeley DB failure handling
7253888: Don’t announce cells to other services until they have started
2e2680a: [maven-release-plugin] prepare for next development iteration

Release 2.10.43

Changes affecting multiple services

The ftp, webdav and xrootd doors will delete the target file if an upload was unsuccessful. The copy manager (part of the transfermanagers service) has a similar behaviour if an internal copy is unsuccessful. If this delete was unsuccessful (e.g., the client deleted the file itself) previous dCache versions would log this at ERROR level. With this dCache version, such occurrences are logged at DEBUG level.

nfs

Fix race condition that can occur when a pool is first accepting pNFS transfers if multiple requests are processed almost simultaneously.

webdav

The webdav door has separate configuration allowing the admin to configure the door-local path that contains site-local files and the URI prefix to access those files. Earlier versions of dCache mistakenly used the former for the latter, which this release fixes.

Changelog 2.10.42..2.10.43

1d3e3ad: [maven-release-plugin] prepare release 2.10.43
f74cb79: Revert “webdav: Add robots.txt”
1b30b12: webdav: Respect webdav.static-content.uri property
cf20217: doors: Do not log failure to delete absent files on upload failures:
d4d8283: nfs4: fix race in request processing
afddd5b: webdav: Add robots.txt
e066291: [maven-release-plugin] prepare for next development iteration

Release 2.10.42

dcap

Doors describe their root path to SRM so it can calculate appropriate TURLs. Previous versions of dCache had dcap doors register incorrect paths, which this release fixes.

dCache configuration allows an admin to control which ciphers are allowed. In particular, this allows sites to remove support for problematic ciphers or hashing algorithms. This release fixes a problem where the GSI-dcap door failed to honour such settings.

pool

The replica-manager periodically requests a list of the file replicas that a pool is hosting. In previous versions of dCache, if the pool found a broken file then an error was returned to the replica-manager, which then considered the entire pool to be offline. With this version of dCache, such errors are logged on the pool. The replica-manager will not consider the pool as hosting that file’s data, but will otherwise consider the pool online.
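A minimal sketch of the per-file error handling described above, assuming a hypothetical check function that raises BrokenReplicaError for a replica with corrupt metadata. The broken replica is logged and left out of the list instead of failing the whole request; the names are illustrative and not dCache’s pool code.

```python
import logging
from typing import Callable, Iterable, List

log = logging.getLogger("pool.repository")


class BrokenReplicaError(Exception):
    """Hypothetical error for a replica whose metadata cannot be read."""


def list_replicas(pnfsids: Iterable[str], check: Callable[[str], None]) -> List[str]:
    """Return the healthy replicas; broken ones are logged and skipped."""
    healthy: List[str] = []
    for pnfsid in pnfsids:
        try:
            check(pnfsid)
        except BrokenReplicaError as e:
            # Before the fix, one broken file failed the whole listing,
            # which made the replica-manager treat the pool as offline.
            log.error("Skipping broken replica %s: %s", pnfsid, e)
            continue
        healthy.append(pnfsid)
    return healthy
```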

The different HSM operations (flush, stage and remove) have internal timeouts after which the pool considers the request as failed. In previous versions of dCache, the default pool setup included a four-hour timeout for flush and stage but neglected to set a default for remove (delete). This omission caused delete operations to time out very quickly. With this release, delete operations also have a default of four hours.

Fix the Ruby implementation of the hsmcp script (hsmcp.rb) so it can parse new command-line arguments that include concurrency options.

The concurrency for active HSM operations is configurable and may be adjusted dynamically. In earlier versions of dCache, decreasing the concurrency only took effect once that operation became idle. This has been fixed so that a lowered limit starts to take effect as running operations complete.
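A minimal sketch of a concurrency cap that can be lowered while operations run, assuming a hypothetical AdjustableLimiter class rather than the script nearline storage provider itself. Running operations are never interrupted; because every new start is checked against the current cap, a lowered limit takes effect as running operations complete instead of waiting for the operation type to go idle.

```python
import threading


class AdjustableLimiter:
    """Hypothetical cap on concurrent HSM operations that may change at runtime."""

    def __init__(self, limit: int) -> None:
        self._lock = threading.Lock()
        self._limit = limit
        self._active = 0

    def set_limit(self, limit: int) -> None:
        # Running operations are not interrupted; a lower limit simply
        # blocks new starts until enough of them have completed.
        with self._lock:
            self._limit = limit

    def try_start(self) -> bool:
        with self._lock:
            if self._active < self._limit:
                self._active += 1
                return True
            return False

    def finish(self) -> None:
        with self._lock:
            self._active -= 1

    @property
    def active(self) -> int:
        with self._lock:
            return self._active
```

With a limit of 3 lowered to 1, the next operation starts only after two of the three running ones have finished.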

Each mover has one of three priorities (LOW, MEDIUM, HIGH), and each queue has a selection discipline (FIFO or LIFO). The documented behaviour was that queues whose names start with - use LIFO discipline and those whose names start with any other character use FIFO discipline. Due to a bug, the order was wrong: the priorities were inverted and the two disciplines swapped, so LOW priority movers were started before MEDIUM priority movers, and MEDIUM before HIGH. This release fixes the priorities so that HIGH priority movers are selected in preference to MEDIUM, and MEDIUM in preference to LOW. However, the disciplines have been kept as in previous versions and the documentation has been updated accordingly. There are several reasons for this: first, there is no difference between LIFO and FIFO when movers are not queued; second, neither discipline helps if the pool is persistently overloaded; third, LIFO discipline (although unfair) is documented as providing better overall throughput during a time-limited overload; fourth, by default dCache has been running with LIFO discipline since v1.9.11 (released 2011-01-13) without any apparent problems.
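A minimal sketch of the selection rule after this fix, assuming one waiting list per priority level. Since the disciplines were kept as they actually behaved (swapped relative to the old documentation), the sketch takes queue names starting with - to be FIFO and all others LIFO; that mapping is inferred from the note rather than stated by it, and MoverQueue is a hypothetical class.

```python
from collections import deque
from enum import IntEnum
from typing import Deque, Dict, Optional


class Priority(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


class MoverQueue:
    """Hypothetical mover queue: strict priority, then FIFO or LIFO within a level."""

    def __init__(self, name: str) -> None:
        # Inferred mapping, not confirmed by the source: "-" prefix is FIFO, else LIFO.
        self.lifo = not name.startswith("-")
        self._waiting: Dict[Priority, Deque[str]] = {p: deque() for p in Priority}

    def add(self, mover: str, priority: Priority) -> None:
        self._waiting[priority].append(mover)

    def select_next(self) -> Optional[str]:
        # Highest priority first: HIGH before MEDIUM before LOW.
        for priority in sorted(Priority, reverse=True):
            waiting = self._waiting[priority]
            if waiting:
                # LIFO takes the newest waiting mover, FIFO the oldest.
                return waiting.pop() if self.lifo else waiting.popleft()
        return None
```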

HTTP third-party transfers report back if there was a problem verifying that the transfer was successful. One possible problem is that the remote server failed to supply checksum information. Reporting of such situations is now fixed.

webdav

In previous versions of dCache, if a user cancelled a transfer shortly after a mover was created, there was a risk that the mover would be abandoned. This is fixed with this release.

xrootd

In earlier versions of dCache, if the xrootd door timed out on a write request while waiting for the mover to send the redirection information, the mover was abandoned. This is fixed with this release.

Changelog 2.10.41..2.10.42

ade3c9b: [maven-release-plugin] prepare release 2.10.42
4710de1: hsmcp: update to match new HSM interface.
abf2356: (2.10) old replica manager: prevent pool being listed as offline when there are files with corrupt metadata
524026d: webdav: Kill abandoned movers
64b14ef: xrootd: Kill mover on aborted write
7d25a32: pool: Fix transfer prioritization
f3082be: pool: Add nearline storage default timeouts
b9ab9bd: pool: Let script nearline storage provider scale down when lowering limits
9b110dc: dcap: Fix broken argument parsing
a07edc5: dcap: Fix socket factory argument parsing
75c5e85: pool: Fix error reporting in remote HTTP mover
1c08f8c: prepare for next development iteration

Release 2.10.41

srm

Fix a security vulnerability in srm: EGI-SVG-2015-9495 (restricted).

Release 2.10.40

Changes affecting multiple services

Fix a performance regression when deleting directories; the fix affects the pnfsmanager and nfs services.

In many cases, poolmanager would time out after ten seconds when asked which pool to use for a transfer. This behaviour was not intended. The consequences of this bug are protocol specific: for some protocols the door retries internally, while other doors propagate the error to the client. Another consequence was an increased risk of the domain hosting poolmanager running out of memory, particularly when staging files. This release fixes the underlying problem. It is recommended that all doors be upgraded.

spacemanager

Fix a bug that can result in leaked entries in space-manager file management from failed uploads. The problem is most likely triggered when a client cancels an FTP upload at the same time as the corresponding SRM upload request expires. It may also be triggered by a communication failure with PnfsManager combined with the user deleting the failed upload before the pool retries.

Changelog 2.10.39..2.10.40

e420596: [maven-release-plugin] prepare release 2.10.40
c67c041: spacemanager: Fix race condition leading to leaked reservation entries
5dbfa08: chimera: Resolve performance regression in directory deletion
b022089: doors: Fix pool selection timeout handling
fc881e4: Fix timeout math to avoid overflow
b77a047: Preparing for next release cycle

Release 2.10.39

ftp

Fix a security vulnerability in GSI- and Kerberos-authenticated FTP.

Release 2.10.38

Changes affecting multiple services

The System cell of each domain contains a version command that allows discovery of which dCache version is running. This release fixes this command. Note: there is no problem with the dcache script’s version command.

nfs

The NFS protocol provides access to additional information through dot commands. This release fixes the nameof and pathof commands for non-ASCII filenames.

pnfsmanager

Fix dCache so that it respects the setgid bit on a parent directory when a user uploads a file via the SRM protocol. Important: the srm node should be updated at the same time.

srm

Enforce authorisation of requests to finalise or cancel an upload. When initiating an upload, the user’s uid and primary gid are taken as the request owner-uid and owner-gid respectively. Only users that have the same uid as the request’s owner-uid or are a member of the request’s owner-gid are allowed to cancel or finalise an upload. Important: the srm must be updated if the pnfsmanager is updated.
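The ownership rule above can be sketched as follows. Subject, UploadRequest and the two functions are hypothetical names; the only logic taken from the note is that the request records the initiator’s uid and primary gid, and that a later finalise or cancel is allowed for the same uid or for any member of that gid.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Subject:
    """Hypothetical user identity: uid, primary gid and all group memberships."""
    uid: int
    primary_gid: int
    gids: FrozenSet[int]


@dataclass(frozen=True)
class UploadRequest:
    owner_uid: int
    owner_gid: int


def start_upload(user: Subject) -> UploadRequest:
    # The initiating user's uid and primary gid become the request owner.
    return UploadRequest(owner_uid=user.uid, owner_gid=user.primary_gid)


def may_finalise_or_cancel(user: Subject, request: UploadRequest) -> bool:
    # Allowed for the owning uid, or for any member of the owning group.
    if user.uid == request.owner_uid:
        return True
    return request.owner_gid == user.primary_gid or request.owner_gid in user.gids
```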

webadmin

This release fixes a bug with the periodic building of billing plots. Previously, if the billing service took too long to reply then there would be no further updates to the billing plots.

Changelog 2.10.37..2.10.38

a2c9f0c: [maven-release-plugin] prepare release 2.10.38
5feb45b: rpm: remove “commented out” macros lines from spec file
511e4fe: chimera: fix nameof and pathof for paths containing unicode
dacd102: chimera: Let SRM respect setgid on upload
ec811ad: srm: Add authorization to put done and abort requests
f2f1f44: module: cells
bc0f510: (2.10) dcache-webadmin: add TimeoutCacheException to catch clause in billing service
ecc0aed: [maven-release-plugin] prepare for next development iteration

Release 2.10.37

Changes affecting multiple services

Specifying the DISABLE_BROKEN_DH flag in the dcache.authn.ciphers configuration property disables all Diffie-Hellman ciphers if Java 7 is used; if dCache is run with Java 8 then this flag has no effect. Disabling DH ciphers was necessary because earlier Java 7 releases contained a broken implementation of Diffie-Hellman, which was fixed with the release of Java 7 update 51. This dCache release updates the behaviour of the DISABLE_BROKEN_DH flag to allow Diffie-Hellman ciphers if dCache is run with Java 7 update 51 or later.

The descriptions of the DISABLE_EC and DISABLE_RC4 flags have been expanded and updated.
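A minimal sketch of the updated DISABLE_BROKEN_DH rule, assuming legacy Java version strings of the form 1.<major>.0_<update> (for example, 1.7.0_51). The function names are hypothetical and this is not dCache’s implementation; it only encodes the behaviour stated above for Java 7 and Java 8.

```python
import re
from typing import Tuple

_LEGACY = re.compile(r"1\.(\d+)\.\d+(?:_(\d+))?(?:-.*)?")


def parse_legacy_java_version(version: str) -> Tuple[int, int]:
    """Parse strings such as '1.7.0_51' or '1.8.0_20-b26' into (major, update)."""
    m = _LEGACY.fullmatch(version)
    if m is None:
        raise ValueError(f"unrecognised Java version string: {version!r}")
    return int(m.group(1)), int(m.group(2) or 0)


def dh_ciphers_allowed(version: str, disable_broken_dh: bool) -> bool:
    """Apply the DISABLE_BROKEN_DH rule as described in these release notes."""
    if not disable_broken_dh:
        return True
    major, update = parse_legacy_java_version(version)
    # Only Java 7 before update 51 has the broken Diffie-Hellman implementation.
    return major != 7 or update >= 51
```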

infoprovider

An earlier patch added support for publishing a single dCache instance with multiple SRM endpoints. This proved incompatible with sites that use a DNS alias for their official SRM endpoint, so that change is reverted with this release. Support for publishing multiple SRM endpoints is available with dCache v2.13.

nfs

Add support for the TEST_STATEID and FREE_STATEID RPC methods. These are used by the Linux kernel during its recovery procedure. The previous lack of support for these methods could lead to leaked stateids, resulting in “NFS4ERR_RESOURCE : Too many states” being logged.

pool

Improve the error message (logged by the pool and in billing) should the FTP mover fail to connect to the client.

This release updates the error the pool reports to an NFS client when the client attempts to read or write and the pool cannot find the mover. This situation is most likely caused by restarting the pool. When it receives the modified response, the client falls back to using proxy-IO. This allows NFS clients that were reading a file to survive a pool restart.

Changelog 2.10.36..2.10.37

09b0d9d: [maven-release-plugin] prepare release 2.10.37
22bc8dd: pool: fix ftp mover to provide better logging when failing to connect
364d627: crypto: refine handling of broken ciphers
2a76e9a: Revert “infoprovider: remove single SRM instance limitation”
2d3dfa2: libs: update to nfs4j-0.9.7
354a7d6: pool: report IO error if we cant find NFS mover
7957f75: [maven-release-plugin] prepare for next development iteration

Release 2.10.36

Changes affecting multiple services

The event logger records when messages are received and sent by cells. Some cell messages send string commands; if so, the log entry contains that string. In previous releases, such string commands were mistakenly double-escaped. This is fixed with this release.

ftp

Fix debug output to include the flavour of GSS implementation; for example, GssFtpDoorV1::secure_reply: going to authorize using k5

pnfsmanager

When a user attempted to delete a symbolic link using a non-NFS door, previous versions of dCache would resolve the symbolic link to determine whether the user is allowed to delete it: the symbolic link was removed only if the user was allowed to delete the link’s target. With this release, the check verifies whether the user is allowed to delete the symbolic link itself.

pool

This release updates the JVM command-line to make it explicit that compressed object references are in use. This allows the Berkeley DB library to calculate a more accurate cache size, potentially improving pool performance.

In previous releases, any attempt to query a pool’s info (e.g., via the admin interface) while the pool was initialising would block until initialisation had completed. This had the knock-on effect of blocking all subsequent messages. With this release, requesting information about a pool does not block during initialisation.

Fix high memory usage during pool initialisation if pool has any precious files.

xrootd

Earlier releases logged a stack-trace if xrootd received a malformed request. This is now fixed.

Changelog 2.10.35..2.10.36

46cc3fd: [maven-release-plugin] prepare release 2.10.36
052822f: cell: Don’t quote string command in event logger
ca11c50: pool: Explicitly enable compressed oops to calculate correct cache size
70a70e1: pool: Fix locking bug causing high memory usage during pool initialization
c781e18: chimera: Fix path resolution on delete
f3007c3: xrootd: Fix ‘xrootd logs stack-trace on malformed request’
1b70957: move execution of the superclass method before any concrete class initializations
b87523b: pool: do not list a repository during initialization
a5c220a: [maven-release-plugin] prepare for next development iteration

Release 2.10.35

pool

The Berkeley DB based metadata storage can sometimes fail. Should this happen, the pool must be restarted. In previous releases, such problems were logged with an unclear message and a stack-trace; the pool would continue to operate but would fail all subsequent transfers, and nothing made it clear that the pool must be restarted. With this release, such Berkeley DB problems are logged with a concise error message and the pool is disabled, making the restart requirement explicit.

srm

In previous versions of dCache, the ls operation in SRM occasionally returned incomplete or inconsistent results. This is now fixed.

Changelog 2.10.34..2.10.35

472479e: [maven-release-plugin] prepare release 2.10.35
70b12c0: srm: fix race condition in ls response
de527be: pool: Disable pool on meta data failures
1f7f733: [maven-release-plugin] prepare for next development iteration

Release 2.10.34

Changes affecting multiple services

In previous releases, dCache required a layout file to be present, even if that file was empty. This had two negative impacts: the dcache stop command (typically invoked automatically when updating a package) would not work, and nodes where only scripts (e.g., info-provider) are used required unnecessary configuration. With this release, a missing layout file generates a warning but does not prevent the dcache stop command or scripts from working. This warning may be suppressed by setting dcache.layout.uri to an empty string.
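
The decision described above can be sketched in Python as follows. This is an illustration only, not dCache's boot code: the function name resolve_layout and the log_warning callback are hypothetical, and the configured value is treated as a plain filesystem path for simplicity.

```python
import os


def resolve_layout(layout_uri, log_warning):
    """Return the layout file to use, or None to run without a layout (hypothetical)."""
    if layout_uri == "":
        # An empty dcache.layout.uri means "no layout file": nothing to load, no warning.
        return None
    # Simplification: the configured value is treated as a plain filesystem path.
    if not os.path.exists(layout_uri):
        # A missing file is no longer fatal; warn and carry on so that
        # "dcache stop" and helper scripts still work.
        log_warning(f"layout file {layout_uri} not found; continuing without one")
        return None
    return layout_uri
```

The key design point is that only an explicitly empty value is silent; a configured but missing file still produces a warning.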

httpd

The tape-related queues (flush and restore) have no maximum limit, yet both the old web information and the new webadmin showed a dummy maximum value for these queues. This meaningless maximum value is no longer shown.

info

The info service collects information about, amongst other things, the interface(s) a door listens on. This is made available in different formats. The URL-formatted version, used by info-provider, always used an IP address even when a name was known. This is now fixed.

infoprovider

It is possible to run multiple SRM endpoints in dCache, provided certain restrictions are upheld. With this release, the info-provider publishes multiple SRM endpoints correctly.

nfs

Some shells, when attempting to overwrite a tag’s content using NFS, do so in a way that Chimera previously failed to support. This failure was reported back to the user as a remote I/O error. This release fixes this problem.

pool

Reduce latency when a pool processes a request to start a read mover. This improves dCache responsiveness when a client opens a file for reading.

The NFS mover uses the file’s size when processing read and write requests. For read operations, the file size cannot change. This release takes advantage of this to reduce the load on the underlying filesystem.

poolmanager

Poolmanager may attempt to create additional copies of a file, only to discover such attempts fail because of other constraints. This leads to the log file containing entries like P2P denied: already too many copies and P2P denied: No pool candidates available/configured/left for p2p or file already everywhere. With this release, such entries are logged at info level: they no longer appear in the log file, but are available via the poolmanager’s pinboard.

webdav

Fix proxied uploads when the client does not send the file size, either directly or via SRM; in particular, this fixes compatibility with ARC.

Changelog 2.10.33..2.10.34

ea62128: [maven-release-plugin] prepare release 2.10.34
136a382: systemtest: fix install command in credentials command
085ef11: infoprovider: remove single SRM instance limitation
6dcb36e: info: fix url-name to publish hostname
3d69717: chimera: throw FileExistChimeraException if tag already exists
7f5b622: (2.10) webadmin: do not display numerical value for max restores or stores
700915f: boot: Don’t fail on missing layout file
c7fc663: httpd: Do now show maximum for restore and flush queues
2291a98: poolmanager: Reduce log level of p2p denial
63d5f7b: webdav: Fix proxied upload with chunked encoding
c981400: pool: simplify duplicate request handling
85cf92e: [maven-release-plugin] prepare for next development iteration
efaff21: pool: reduce load on back-end file system

Release 2.10.33

Changes affecting multiple services

A new security configuration option allows the dCache admin to ban all SSL/TLS cipher suites that use the RC4 cipher. RFC 7465 states that services MUST NOT accept an RC4-based cipher suite. Adding the DISABLE_RC4 option to the dcache.authn.ciphers property makes dCache compliant with RFC 7465. This option is not enabled by default to avoid possible regressions with clients that require the RC4 cipher. This property affects the dcap (with GSI), ftp (with GSI), srm, webadmin, webdav (with SSL/TLS) and xrootd (GSI plugin) services.
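
As a rough illustration of what banning RC4 means, the following Python sketch filters RC4-based suites out of a list of cipher-suite names. It is not how dCache implements DISABLE_RC4; the function name without_rc4 is hypothetical and the suite names are standard IANA names used only as sample input.

```python
def without_rc4(cipher_suites):
    """Drop every cipher suite whose name mentions RC4 (cf. RFC 7465)."""
    return [suite for suite in cipher_suites if "RC4" not in suite.upper()]


offered = [
    "TLS_RSA_WITH_RC4_128_SHA",
    "TLS_ECDHE_RSA_WITH_RC4_128_SHA",
    "TLS_RSA_WITH_AES_128_CBC_SHA",
]
print(without_rc4(offered))  # ['TLS_RSA_WITH_AES_128_CBC_SHA']
```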

pnfsmanager

Fixed ACL inheritance when uploading data through SRM. In earlier versions of dCache, a file uploaded through SRM failed to inherit any inheritable ACEs from the parent directory.

This release brings some modest performance improvements when creating upload directories. This improvement is available automatically only to sites that have not yet upgraded to 2.10 (or later). Sites already running 2.10 or later can enjoy the same improvements by deleting the upload directory (/upload by default) to allow dCache to recreate it. Important: deleting the upload directory will fail any current SRM uploads; it is recommended to do this during down-time.

pool

The pool’s migration module may be invoked with different pool selection modes: the -select option. The random selection option (-select=random) excludes pools that are full, but mistakenly considers replicas that could be deleted (i.e., non-sticky cached replicas) as part of the used space; this treats a pool as full even when the pool has removable files. With this release, pools that are full but contain some cached files are potential targets for random pool selection.
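
A minimal Python sketch of the corrected candidate test, assuming each pool is described by a dictionary with hypothetical free and removable byte counts (the names and data layout are illustrative, not the migration module's).

```python
import random


def random_target(pools, rng=random):
    """Pick a random migration target among pools that can accept data.

    A pool qualifies if it has free space or holds removable (non-sticky,
    cached) replicas that could be evicted to make room.  The bug described
    above amounted to ignoring the "removable" part of this test.
    """
    candidates = [p for p in pools if p["free"] + p["removable"] > 0]
    if not candidates:
        return None
    return rng.choice(candidates)["name"]


pools = [
    {"name": "pool_a", "free": 0, "removable": 0},          # genuinely full
    {"name": "pool_b", "free": 0, "removable": 5_000_000},  # full, but has cached replicas
]
print(random_target(pools))  # pool_b
```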

In earlier releases of dCache, the save command failed to record the stage, flush and remove timeouts for nearline storage (rh set timeout, st set timeout, rm set timeout respectively). This is now fixed.

This release introduces the pool.mover.nfs.port.min and pool.mover.nfs.port.max configuration properties. Previously, pools listened on a random port between dcache.net.lan.port.min and dcache.net.lan.port.max; the two new properties default to the values of these two dcache.net.lan.port properties. Once the pool is listening on a particular port, it will try to listen on the same port after restarting. If this proves impossible, another port is selected and used subsequently. Important: reusing the same port matters because a pool listening on a different TCP port can trigger high load on the client machine.
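
The port-selection behaviour can be sketched as below. This is a hypothetical Python outline, not the pool's code: choose_port and the is_free callback are invented for illustration, and the port numbers in the usage are arbitrary.

```python
import random


def choose_port(saved_port, port_min, port_max, is_free, rng=random):
    """Prefer the port used before the restart; otherwise pick a free one in range.

    The caller is assumed to persist the returned port so that the next
    restart can try it first.
    """
    if saved_port is not None and is_free(saved_port):
        return saved_port
    candidates = [p for p in range(port_min, port_max + 1) if is_free(p)]
    if not candidates:
        raise RuntimeError(f"no free port in {port_min}-{port_max}")
    return rng.choice(candidates)


busy = {20001}
print(choose_port(20005, 20000, 20010, lambda p: p not in busy))  # 20005 (reused)
```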

scripts

Fix JAR selection when a short-lived java process is started. This is typically done when using one of the scripts.

spacemanager

Fix writing into a reservation when using a protocol that does not provide a username or FQAN; for example, when writing into dCache using NFS with the WriteToken directory tag set. Previously, writing would fail with the error Message processing failed: null group.

srm

The srm service can generate RemoteException stack traces when dCache is behaving correctly. These are now logged at debug level and without a stack trace.

Changelog 2.10.32..2.10.33

d443035: [maven-release-plugin] prepare release 2.10.33
52eac62: chimera: Fix merge conflict and Java 7 compatibility
1e39414: pool: introduce unique port number for nfs mover
f1ec21a: pool: dedicated port range for nfs
61caf5a: pool: Store nearline storage timeouts to pool setup file
e98b145: pnfsmanager: Create base upload directories without tags and acls
b5a9992: pnfsmanager: Inherit ACLs on upload with SRM
dcb821e: chimera: Add ACL insert triggers for HSQLDB
528d5ed: Fix limited class path generation
0ed000f: pool: Let random pool selection select pools with removable files
9248f25: crypto: allow banning of RC4 cipher suites
73c07e0: spacemanager: Allow unowned files in reservations
54a6df2: srm: Don’t log erroneous stack trace
9ae8661: [maven-release-plugin] prepare for next development iteration

Release 2.10.32

Changes affecting multiple services

This release upgrades BouncyCastle from v1.45 to v1.46. The main motivation is improved concurrency, which gives better performance on multi-core machines. The update affects the xacml and voms plugins to gPlazma and any door configured to use (or that always uses) GSI authentication: dcap, ftp and srm. There is no cross-dependency; domains hosting these services may be updated independently.

pnfsmanager

Fix possible leaked upload directory if PnfsManager takes too long to create the upload directory.

scripts

The chimera script provides both an interactive shell and the ability to run a Chimera operation as a single command-line invocation. This release fixes a problem where a single Chimera operation failed to provide its output if that output was too short.

srm

If a client releases a reservation using the SRM protocol while another SRM client is querying for information about that reservation then there is a risk that the reservation will appear to exist for some 30 seconds, despite the reservation being successfully released. This release fixes this problem.

Changelog 2.10.31..2.10.32

c2452f6: [maven-release-plugin] prepare release 2.10.32
4751d27: srm: Fix cache invalidation of space meta data
7d4d98d: libs: Update to voms-api-java 2.0.10.1
eeaf8cc: libs: use bouncycastle–1.46
f33549d: libs: update jglobus to 2.0.6.9d
19baddc: chimera: Fix single command invocation of chimera utility
277cde1: pnfsmanager: Resolve upload directory leak caused by missing reply flag
dd3b1ac: [maven-release-plugin] prepare for next development iteration

Release 2.10.31

Changes affecting multiple services

dCache uses a standard format to monitor the performance of various components: in the srm door to record how quickly SRM requests are processed (print srm counters command), in the nfs door and pool to monitor NFS performance (stats and nfs stats commands, respectively), the generic cell message monitoring (monitoring info command), and pnfsmanager service (the “Statistics” section in info). This release fixes a rounding error that prevents these statistics from including long-lived requests.

pinmanager

The pinmanager service has the ls command that allows the admin to limit the results to a specific pin or all pins against some PNFS-ID. This release fixes listing by pin id.

pnfsmanager

With this release, the path of automatically generated upload directories has changed slightly to improve SRM upload performance. Previously, these generated paths had the form <upload>/<unique-ID>, where <upload> is the value of the dcache.upload-directory configuration property (/upload by default) and <unique-ID> is some unique value (a UUID). With this release, these directories have the form <upload>/<processor-ID>/<unique-ID>, where <processor-ID> is some small integer value. Standard-conforming clients are unaffected by this change and no action is needed by the admin.
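
For illustration, the new path layout can be produced with a Python sketch like this; upload_path is a hypothetical helper, and how dCache actually derives the processor-ID is not described here (the changelog only mentions per-thread upload directories to reduce lock contention).

```python
import posixpath
import uuid


def upload_path(upload_root, processor_id, unique_id=None):
    """Build <upload>/<processor-ID>/<unique-ID> (hypothetical helper)."""
    if unique_id is None:
        unique_id = str(uuid.uuid4())  # a UUID, as for the old <upload>/<unique-ID> form
    return posixpath.join(upload_root, str(processor_id), unique_id)


print(upload_path("/upload", 3, "a-unique-id"))  # /upload/3/a-unique-id
```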

Update pnfsmanager to avoid creating temporary directories if the srm has already discarded the request. This helps dCache recover more quickly when it is overloaded by SRM uploads.

pool

Fixed pool’s erroneous interpretation of the HSM timeout (4 hours, by default) as being from when a staging request was initially received, rather than from when it started processing the request (by starting the HSM script or via the new plug-in mechanism).
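
The difference between the old and the fixed behaviour is where the timeout clock starts. A hypothetical Python sketch (class and method names invented; times in seconds):

```python
HSM_TIMEOUT = 4 * 3600  # seconds; 4 hours is the default mentioned above


class StageRequest:
    """Hypothetical model of a stage request as seen by a pool."""

    def __init__(self, received_at):
        self.received_at = received_at  # when the pool accepted the request
        self.started_at = None          # when the HSM script or plug-in started

    def start(self, now):
        self.started_at = now

    def timed_out(self, now, timeout=HSM_TIMEOUT):
        # Fixed behaviour: measure from the start of processing, not from
        # received_at, so time spent queued does not count against the timeout.
        if self.started_at is None:
            return False
        return now - self.started_at > timeout


request = StageRequest(received_at=0)
request.start(now=5 * 3600)              # queued for 5 hours first
print(request.timed_out(now=8 * 3600))   # False: 3 hours of HSM work so far
print(request.timed_out(now=10 * 3600))  # True: 5 hours of HSM work
```

Under the old interpretation, the same request would already have been considered timed out when processing started, because 5 hours had passed since it was received.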

This release fixes how a pool invokes the HSM script. Previously, the pool mistakenly omitted the additional arguments that an admin may configure the pool to include.

replicamanager

This release fixes the Can't clear the tables error reported when starting the replicamanager service with certain PostgreSQL versions.

spacemanager

Fix the update space admin command so it can remove any ownership from a reservation. The reservation’s owner is allowed to release the reservation via the SRM protocol. If the reservation has no owner, it may only be released through the admin interface.

srm

The info admin command provides details about the srm service, including information specifically about SRM activity that can be processed synchronously or asynchronously: get, put, reserve-space, ls, and bring-online. This information describes how many requests are in each of the possible states, one of which is Waiting for CPU. This release fixes the output of the info command to show the correct Waiting for CPU values.

This release drastically improves srm service performance when there are a large number (e.g., thousands) of queued requests; for example, this brings considerable improvement for bulk bring-online requests.

This release also brings improved performance when the SRM has only a few queued requests; more specifically, when the number of queued requests is less than or equal to the number of currently unoccupied max-inprogress slots. For example, if dCache is not processing any GET requests then the first srm.request.get.max-inprogress GET requests are processed more quickly.

Fix the srm service switching from synchronous to asynchronous processing when processing a bulk request with many files; in earlier releases, such large requests could prevent the request from falling back to asynchronous processing.

This release fixes the srm service so it cancels corresponding pinmanager requests when an SRM client aborts a bring-online request.

Changelog 2.10.30..2.10.31

a660869: [maven-release-plugin] prepare release 2.10.31
2ccd1b1: (2.10) replica manager: fix table truncation
27f8e91: common: Fix division by zero regression in gauges
4bec44b: common: Fix rounding error in request gauge
0208029: pnfsmanager: Discard upload path creation request on TTL expiration
a4ef57a: pnfsmanager: Use per-thread upload directory to reduce lock contention
6a51067: srm: Optimize scheduler performance
7f750bd: srm: Further optimize SRM scheduler
7638d21: srm: Abort pinning when cancelling bring-online requests
3a8298d: system-test: Enable MVCC and logging for HSQLDB
37ee6d7: srm: Fix sync to async mode timeout
b737a5e: pinmanager: Fix listing by id
fdcb0fa: pool: Add HSM options to hsm script remove callout
6511528: srm: Fix queue size reporting
3098128: pool: Fix timeout behavior of HSM requests
857e8b1: spacemanager: Allow spaces to become unowned
501eb58: [maven-release-plugin] prepare for next development iteration

Release 2.10.30

nfs

There was an error in how the nfs door interpreted ACLs: multiple ACEs for the same user were merged, ignoring their flags. With this release, ACEs for the same user but with different flags are honoured.

Changelog 2.10.29..2.10.30

8eeaaf3: [maven-release-plugin] prepare release 2.10.30
cd071b1: libs: update to nfs4j–0.9.6
57787cc: [maven-release-plugin] prepare for next development iteration

Release 2.10.29

Changes affecting multiple services

This release fixes an NFS problem triggered when a second process opens a file that is already opened by another process on the same computer. The NFS specification allows the second open to reuse the “layout” (== mover) from the first open. With this release, the door will detect this and reuse the existing mover; the mover is updated to allow this. The nfs door and all pools accessible via the NFS protocol must be updated to this release or newer.

admin

Restore compatibility with loginbroker information for pcells; in particular, the problem affected information provided by the srm door.

dcap

No longer log a stack-trace when transferring a file if the client is expected to connect to the pool (the “active client mode” option; -A in the dccp command) and the pool is shut down before the client connects.

srm

Some SRM operations support bulk requests, where multiple SURLs are processed in the same fashion. The assumption was that the SURLs in a bulk request are distinct. Bulk requests have been observed that violate this assumption: the same SURL appears more than once in a request. This causes problems when staging files from tape. With this release, a SURL that appears multiple times in a request is processed exactly once.
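
In Python, order-preserving de-duplication of the SURLs in a bulk request can be sketched as follows; unique_surls is a hypothetical helper and the SURLs are sample values, not dCache's implementation.

```python
def unique_surls(surls):
    """Drop repeated SURLs, keeping the first occurrence and the original order."""
    # dict preserves insertion order (Python 3.7+), so this acts as an ordered set.
    return list(dict.fromkeys(surls))


request = [
    "srm://srm.example.org/data/run1.root",
    "srm://srm.example.org/data/run2.root",
    "srm://srm.example.org/data/run1.root",  # duplicate within the same request
]
print(unique_surls(request))
# ['srm://srm.example.org/data/run1.root', 'srm://srm.example.org/data/run2.root']
```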

Changelog 2.10.28..2.10.29

6008a13: [maven-release-plugin] prepare release 2.10.29
d9eb548: dcap: fix stack-trace when shutting down pool waiting for connection
3d261ec: srm: Remove duplicate SURLs in bringonline and get requests
0097928: admin: Restore pcells compatibility with loginrbroker
8756386: nfs: share mover for the same client
79a956c: nfs: use NFSv4MoverHandler instead of Map in embedded NFS server
ea49803: system-test: add missing dCache disposible CA certificate
cb794a7: system-test: add regenerated host and user credentials
6275545: [maven-release-plugin] prepare for next development iteration

Release 2.10.28

Changes affecting multiple services

In previous versions of dCache, deleting the directory holding the last reference to a tag did not remove that tag. With the introduction of upload directories this problem became acute, as many such directories are created and deleted. With this release, deleting the directory with a tag’s last reference is very likely (but not guaranteed) to remove that tag’s inode. Further information about this will be made available via user-forum. The update affects both the nfs door and pnfsmanager; however, most sites will see the most benefit from updating pnfsmanager. Important #1: an updated nfs door will wait for the database changes that updating pnfsmanager enacts. Important #2: the database changes enacted by pnfsmanager can take a while (an hour or so for large dCache instances) as the change adds an index to the t_inodes database table.

The “standard” Linux flag to mark an ACE as inherit-only is i, yet previous versions of dCache accepted only ‘o’ as the inherit-only flag. With this release, both o and i are accepted. The chimera shell, pnfsmanager and nfs door should be updated.
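
A hypothetical Python sketch of flag parsing that accepts both letters; is_inherit_only is invented for illustration, and the other letters in the sample input (f and d, file-inherit and directory-inherit in the Linux nfs4_acl notation) are only there to show a realistic flag string.

```python
# Hypothetical: the dCache-specific 'o' and the Linux-style 'i' are both
# accepted as the inherit-only flag.
INHERIT_ONLY_FLAGS = frozenset("oi")


def is_inherit_only(flags):
    """Return True if an ACE flag string marks the ACE as inherit-only."""
    return any(flag in INHERIT_ONLY_FLAGS for flag in flags)


print(is_inherit_only("fdi"), is_inherit_only("fdo"), is_inherit_only("fd"))
# True True False
```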

dcap

Previously, the dcap door provided the dcap client with the wrong errno (error number) should the client attempt to operate on a non-existent file or directory: EIO was returned instead of ENOENT. This is now fixed.

nfs

Information about the filesystem, as reported by the df command, is expensive to calculate. Previously, the result was calculated when a client requested the value and was then cached for an hour. This proved awkward as all clients blocked until the calculation completed. With this release, updating the cached value is a background activity: the previously cached value is returned until the update completes. Additionally, the cache lifetime is configurable via the nfs.fs-stat-cache.time and nfs.fs-stat-cache.time.unit properties.
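
The background-refresh pattern can be sketched in Python as follows. It is an outline under stated assumptions rather than the nfs door's code: the class name is hypothetical, the first value is computed synchronously, and wait_for_refresh exists only to make the behaviour observable in a test.

```python
import threading
import time


class BackgroundRefreshCache:
    """Return a cached value; once it is stale, refresh it in a background thread."""

    def __init__(self, compute, lifetime_seconds):
        self._compute = compute
        self._lifetime = lifetime_seconds
        self._lock = threading.Lock()
        self._value = compute()          # assumption: first value computed synchronously
        self._updated = time.monotonic()
        self._refresher = None           # thread currently refreshing, if any

    def _refresh(self):
        value = self._compute()          # the expensive work runs outside the lock
        with self._lock:
            self._value = value
            self._updated = time.monotonic()
            self._refresher = None

    def get(self):
        with self._lock:
            stale = time.monotonic() - self._updated >= self._lifetime
            if stale and self._refresher is None:
                # Start at most one refresh at a time.
                self._refresher = threading.Thread(target=self._refresh, daemon=True)
                self._refresher.start()
            return self._value           # callers never wait for the refresh

    def wait_for_refresh(self):
        with self._lock:
            thread = self._refresher
        if thread is not None:
            thread.join()
```

The trade-off is that callers may briefly see a stale value, in exchange for never blocking on the expensive calculation.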

pool

The file integrity checking (single file, entire pool one-off check and background checking) produced ambiguous output; for example, a file that could not be scanned because it was still being uploaded counted towards the “error” count, but no corresponding log message was included. With this release, the output is less ambiguous and distinguishes between corruption, temporary problems and more permanent problems.

replicamanager

Added an admin command to query pool-manager for an updated list of pools that are a member of the resilient pool group. This allows adjustments to the set of pools participating in replication without restarting the replicamanager service.

Changelog 2.10.27..2.10.28

698724f: [maven-release-plugin] prepare release 2.10.28
1021708: replicamanager: add an admin command to re-fetch resilient pool group
f00a040: acl: fix compatibility with linux ace
af7b0ad: chimera: Delete unreferences tag inodes
6487ea2: pool: update scrubber messages to be less ambigous
a143cfa: dcap: fileAttributesNotAvailable must set pass ENOENT to the client
969eb14: [maven-release-plugin] prepare for next development iteration
f94ed28: chimera: do not maintain time-based cached value of FsStat
0f93f35: chimera: squash usedFiles() and usedSize() into a single query

Release 2.10.27

Changes affecting multiple services

In prior versions of dCache, the path field that billing logs for transfers contained the actual transfer path; i.e., for SRM-initiated uploads this was the auto-generated path from the TURL and not the user-supplied path from the SURL. This proved confusing so, with this release, the path field now has the user-supplied path (i.e., from the SURL). An additional field (transferPath) has the transfer path. While this is not logged by default, billing configuration (the billing.text.format.mover-info-message and similar properties) may be updated to include it. All doors used by srm clients and the billing service should be updated.

dcap

This version of dCache uses the connected socket when discovering the IP address of the client for “channel binding”, rather than a constant value. This is important for dual-stack machines (with both IPv4 and IPv6 addresses) that host dcap doors with either gsi- or kerberos-based authentication.

spacemanager

In earlier versions of dCache, if a file was deleted while being uploaded, the space-manager would consider the transfer successful (and so count some of the reservation’s capacity as used) whereas the pool would delete the file immediately after the upload completed. Such “leaks” result in a reservation being reported as having more used capacity and less free capacity than it should. This race condition is fixed with this release.

srm

The srm service maintains a counter of the number of requests of each type. With earlier versions of dCache, if the srm encountered jobs that had timed out while the service was not running, these counters became inaccurate. This is now fixed.

When an srm request times out, dCache may need to take some action to “clean up”; for example, removing the upload directory. If srm is configured to discard all requests on start-up these cleanup operations did not happen. This is fixed with this release. As a consequence, startup times will be longer.

If srm.persistence.enable.store-transient-state is set to false (the default value) then transfer requests do not survive an srm service restart if the limit on concurrent transfers prevented handing out the TURLs. With this release, such requests survive an srm restart.

Previously, if srm.persistence.enable.store-transient-state is set to false (as is the default) then information needed to clean up a job might not be stored in the database. If, after restarting the srm, the job times out or is aborted then there is insufficient information to clean up, resulting in upload directories not being cleaned, pins (for bring-online requests) not being cleared, copy requests not being cancelled and lifetime extensions being lost. These problems are fixed with this release.

Changelog 2.10.25..2.10.27

a94ac75: [maven-release-plugin] prepare release 2.10.27
eeace29: all: Fix several NPEs when submitting billing messages
daf84a3: [maven-release-plugin] prepare for next development iteration
12c07e1: [maven-release-plugin] prepare release 2.10.26
7c4c24d: srm: Fix scheduler counter initialization on restart
66f8951: doors: Log real path in billing
475fa58: srm: Allow the SRM to take action upon cleaned requests
4789735: srm: Force save jobs when adding information needed for cancellation
bb4acc2: srm: Force save when job becomes RQUEUED
d1ee02c: javatunnel: use connected socket to discover local inet address
7512d5c: [maven-release-plugin] prepare for next development iteration
c4c8640: spacemanager: Fix race condition leading to leaked files

Release 2.10.25

Changes affecting multiple services

When dCache suffers a sufficiently high inrush of requests that a cell’s request queue is exhausted, new requests to that cell are rejected. While overall dCache handles this situation correctly (degrading gracefully), the internal message counting and event logging were not updated correctly. This is now fixed.

Fix a bug that triggers a NullPointerException in webadmin. Although the problem is logged in webadmin, the cause is in the doors supplying the information. Therefore, the dcap and ftp doors are to be updated.

When publishing an IPv6 interface, do not include the zone information. Zone information is an extension to IPv6, which appends a ‘%’ plus some opaque identifier to the regular address. Not all clients understand this extension and reject the address as invalid. As zone information is not useful anyway, with this release zones are no longer published. This update is for all doors.
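
Stripping the zone is a simple string operation. A hypothetical Python helper (the name strip_zone is invented; addresses are sample values):

```python
def strip_zone(address):
    """Remove an IPv6 zone suffix such as '%eth0'; other addresses pass through."""
    return address.split("%", 1)[0]


print(strip_zone("fe80::1%eth0"))  # fe80::1
print(strip_zone("2001:db8::1"))   # 2001:db8::1
```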

The gplazma, pinmanager, pool and srm services can process several dCache-internal messages concurrently. Previously, shutting down these services failed to wait for concurrent activity to finish (within a reasonable deadline), which risked partially committed activity and logging errors when shutting down a busy system. This is now fixed.

Fix possible NullPointerException in httpd and admin services.

In most cases dCache will publish door URLs with a hostname; if the address cannot be resolved then the address is used instead. Previously IPv6 addresses were written incorrectly: without the square brackets. This is now fixed if the info service and the info-provider scripts are updated.
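
The bracketing rule comes from the URI syntax (RFC 3986), where an IPv6 literal in the host part must be enclosed in square brackets. A hypothetical Python sketch (helper names invented; sample addresses from the documentation ranges and a sample port):

```python
import ipaddress


def url_host(host):
    """Wrap IPv6 literals in square brackets; leave hostnames and IPv4 as-is."""
    try:
        if ipaddress.ip_address(host).version == 6:
            return f"[{host}]"
    except ValueError:
        pass  # not an IP literal, so treat it as a hostname
    return host


def door_url(scheme, host, port):
    return f"{scheme}://{url_host(host)}:{port}"


print(door_url("srm", "2001:db8::1", 8443))  # srm://[2001:db8::1]:8443
```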

admin

Fix a division-by-zero error when the SSH client reports zero width or height. This happens when forcing allocation of a pseudo TTY without having a real TTY (see the -t option of the OpenSSH client).
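
One defensive approach is to substitute a default when the client reports a zero dimension, before doing any division. This Python sketch is hypothetical: the 80x24 fallback and the function names are assumptions, not what the admin shell does.

```python
DEFAULT_WIDTH, DEFAULT_HEIGHT = 80, 24  # assumed fallback values


def terminal_size(reported_width, reported_height):
    """Replace non-positive dimensions reported by the client with defaults."""
    width = reported_width if reported_width > 0 else DEFAULT_WIDTH
    height = reported_height if reported_height > 0 else DEFAULT_HEIGHT
    return width, height


def column_count(words, reported_width, gap=2):
    """How many columns of words fit on one line; never divides by zero."""
    width, _ = terminal_size(reported_width, 0)
    cell = max(len(w) for w in words) + gap
    return max(1, width // cell)


print(terminal_size(0, 0))            # (80, 24)
print(column_count(["abc", "de"], 0)) # 16
```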

pinmanager

Fix pinmanager so it does not log a stack-trace if it failed to fetch fresh pool status information from poolmanager while unpinning a file.

pool

Previously, reconstructing the metadata of a file whose upload had not completed, and for which the client had not specified a retention policy, would trigger a stack-trace java.lang.IllegalStateException: Attribute is not defined: RETENTION_POLICY. This is now fixed.

Sites have reported pools becoming stuck with messages like UNEXPECTED_STATE_FATAL: Unexpected internal state, unable to continue., which is due to a problem with the Berkeley DB. This release includes an updated version that should fix these problems.

spacemanager

A number of sites have reported problems with the database resolving deadlocks. While such reports are expected and dCache behaves correctly when they happen, this release should reduce the likelihood of them appearing.

Changelog 2.10.24..2.10.25

52b4482: [maven-release-plugin] prepare release 2.10.25
026bd03: info/info-provider: publish valid IPv6 addresses
df0a450: loginbroker: strip off zone off published interface name
4aa0d5b: admin: Avoid division by zero when the client reports a zero sized terminal
42a1929: spacemanager: Optimize space record deletion
69aca22: Update netty and berkeley DB to latest version
3a92a16: dcache: Make cell communication use the correct timeout
186166c: httpd,admin: Fix NPE in transfer collectors
955ae65: doors: Fix race condition that causes NPE in webadmin
de754a4: cells: Orderly shutdown of multi-threaded cells
1f58c6f: cells: Fix event queue counting bug
d869b04: pinmanager: Don’t log stack trace when unable to fetch pool monitor
6f4358b: pool: Fix meta data reconstruction
74f8190: [maven-release-plugin] prepare for next development iteration

Release 2.10.24

Changes affecting multiple services

Avoid a potential message loop when both sender and recipient domains are restarted while the message is in flight.

admin

pcells distinguishes between a failure to send a message to a cell and that cell taking too long to respond. Support for that distinction was lost with 2.10; with this release, it is restored.

The tab-completion feature of admin parses the help-hint to discover what expansions are available. This has been updated to support more commands.

httpd

Fix compatibility with pcells. This requires a corresponding update to pcells.

Changelog 2.10.23..2.10.24

0a220c4: [maven-release-plugin] prepare release 2.10.24
9bab859: admin: Fix command completion
25eff9b: admin: Propagate NoRouteToCellException to pcells
ff3541f: cells: Restore CellExceptionMessage encoding
dadae07: httpd: Fix pcells compatibility
9f44f34: [maven-release-plugin] prepare for next development iteration

Release 2.10.23

admin

Restore support for pcells to gracefully handle timeouts.

Restore compatibility for pcells when querying space-manager.

cleaner

Fix bug reported as java.lang.ClassCastException: java.lang.String cannot be cast to [Ljava.lang.String;.

pool

Fix pool reconstruction. If, on startup, the pool detects that it is in an inconsistent state, it attempts to recover. With this release, this recovery works as intended.

srm

Update SRM default property values. From various reports, it is clear that the current default values are not ideal for many sites. The following properties are adjusted:

srm.limits.jetty-connector.backlog increased to support bursts of activity; a larger value may be appropriate but is limited by the default Linux configuration.
srm.request.threads reduced, as processing is asynchronous.
srm.request.ls.threads default changed to srm.request.ls.max-in-progress, as ls requests are blocking.
srm.request.max-requests increased to satisfy user demand; this is needed as clients typically request more concurrent TURLs than they make concurrent requests.
srm.request.max-transfers increased to the same value as srm.request.max-requests. This way, by default, dCache never blocks requests pending a client returning a TURL.
srm.persistence.remove-expired-period increased to 10 minutes to reduce stress on the database.
srm.service.pnfsmanager.timeout decreased to two minutes, as PnfsManager should respond within that time and clients will likely disconnect if they don’t receive a response within that time.
srm.service.spacemanager.timeout decreased to 30 seconds, as the service should respond quickly.
srm.protocols.disallowed.get and srm.protocols.disallowed.put now include the file protocol by default.
srm.protocols.loginbroker.timeout decreased to 20 seconds, as this is a very light-weight service.

Support catching and logging some bugs that previously would be silently ignored.

Increase default value for srm.limits.db.queue to survive database activity when faced with bulk staging requests.

Fix a ConcurrentModificationException caused when SRM processes two requests close together.

Changelog 2.10.22..2.10.23

32230d6: [maven-release-plugin] prepare release 2.10.23
d817f6f: admin: Restore pcells compatibility
4d51df9: srm: log more SRM bugs
b453bb7: admin: Restore timeout semantics for pcells compatibility
25b076e: cleaner: Fix class cast exception
3702d41: srm: Increase default database queue
ba0fd18: srm: Use more sensible default values
431e4e0: srm: Fix ConcurrentModificationException in Axis
e3b85a2: pool: Fix pool entry reconstruction
c082039: [maven-release-plugin] prepare for next development iteration

Release 2.10.22

Changes affecting multiple services

This release improves the quality of information recorded when logging problems and fixes the synchronization of lease time. This affects the nfs door and pools.

pool

When a client reads a file, the pool reads blocks of data from the local filesystem. When reading such a block, the pool could receive fewer bytes than requested. Previously, the pool assumed that this only happens when the end-of-file is reached; however, this is not guaranteed. Should this assumption be violated then the data sent to the client will be corrupt. In practise, the pool’s assumption is true for Linux and local filesystems; however, the code has been updated to remove this theoretical cause of corruption.
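
The corrected assumption, that a short read only means "fewer bytes so far" and an empty read means end-of-file, can be sketched in Python. read_block and the ShortReader test double are hypothetical; the pool itself is written in Java.

```python
import io


def read_block(stream, size):
    """Read up to `size` bytes, looping over short reads; stop early only at EOF."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:  # an empty read is the only reliable end-of-file signal
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


class ShortReader:
    """Test double that never returns more than `limit` bytes per read."""

    def __init__(self, data, limit):
        self._inner = io.BytesIO(data)
        self._limit = limit

    def read(self, n):
        return self._inner.read(min(n, self._limit))


reader = ShortReader(b"abcdefghij", limit=3)
print(read_block(reader, 8))  # b'abcdefgh' despite 3-byte short reads
print(read_block(reader, 8))  # b'ij' (end-of-file reached)
```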

Fix a problem where a door can trigger the pool to post-process a file many times. Each trigger starts a new thread, resulting in very large system load.
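
A minimal Python sketch of guarding against duplicate triggers, assuming a per-file identifier; PostProcessor, trigger and the sample PNFS-IDs are hypothetical names for illustration.

```python
import threading


class PostProcessor:
    """Start at most one post-processing thread per file at a time."""

    def __init__(self, work):
        self._work = work
        self._lock = threading.Lock()
        self._active = set()

    def trigger(self, file_id):
        """Return the new worker thread, or None if one is already running."""
        with self._lock:
            if file_id in self._active:
                return None  # duplicate trigger: ignore it instead of starting a thread
            self._active.add(file_id)
        thread = threading.Thread(target=self._run, args=(file_id,), daemon=True)
        thread.start()
        return thread

    def _run(self, file_id):
        try:
            self._work(file_id)
        finally:
            with self._lock:
                self._active.discard(file_id)
```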

Fix a problem where, during startup, the pool would fail when recovering a broken file whose access-latency and retention-policy are determined by spacemanager.

Extend the nearline SPI so it provides plugins with the file’s path and an easier way of reporting errors.

spacemanager

Improve the logging and handling of transitory errors.

Changelog 2.10.21..2.10.22

70fa3b0: [maven-release-plugin] prepare release 2.10.22
2b734fd: pool: Fix and align pool meta data recovery with current pnfs manager
58ca239: pool: ignore duplicated mover kill requests
48b3a47: spacemanager: Making logging and handling of transient errors more robust
0a95f67: pool: Fix read corruption in HTTP mover
c484e1a: pool: Extend nearline storage SPI with path and custom error codes
ed56243: libs: update to nfs4j–0.9.5
0e8edbd: [maven-release-plugin] prepare for next development iteration

Release 2.10.21

Changes affecting multiple services

If dCache encounters a bug then it should be logged. Previously, for the older admin commands, only an uninformative message was logged.

The LocationManager service is started automatically in the broker domain (typically dCacheDomain). This answers requests from other domains with instructions on how to connect to the dCache cluster (typically a star topology, other domains connecting to the dCacheDomain). With this release, the FQDN is sent rather than the hostname. The node hosting the broker domain should be updated.

The version of HikariCP has been updated to 2.0.1. This corrects a bug in v1.3.9 that triggers logging a stack-trace with:

Internal accounting inconsistency, totalConnections=-1

Previously, if a client attempted to write a file with an Access Latency and Retention Policy that conflict with the selected reservation, a stack-trace was logged. This is now fixed; both the FTP door and spacemanager should be updated.

Previously, if a bug was discovered when starting up a service, dCache would abort starting up with a non-informative log message. Now, the stack-trace is logged.

nfs

Log abandoned movers with the corresponding stateid.

pnfsmanager

When writing into dCache with SRM, the Access Latency (AL) and Retention Policy (RP) may be specified or omitted. Additionally, the client may specify a space reservation into which the file should be written. If the client specifies both, they must match.

In dCache, there are three mechanisms to support a client that specifies neither AL/RP nor a space reservation: the directory can have AccessLatency and RetentionPolicy tags, the directory can have the WriteToken tag, and there are system-wide default AL/RP values.

Previously, dCache would reject uploads where the user-supplied AL/RP information did not match the AccessLatency/RetentionPolicy tags, despite the latter being intended as default values.

With this release, if the client specifies neither a space token nor AL or RP, then the directory tags are used. If a directory specifies both WriteToken and AccessLatency/RetentionPolicy tags, then these must be consistent. If the directory contains a WriteToken tag and the client specifies AL/RP, then the client-specified values must be consistent with the WriteToken tag.
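
The rules above can be modelled as a precedence order plus two consistency checks. The following Python sketch is a hypothetical, simplified model (`Policy`, `effective_policy` and the precedence client, WriteToken reservation, directory tags, system default are assumptions for illustration, not dCache's implementation); it ignores the case where the client names its own space reservation:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    """Access latency (e.g. "ONLINE") and retention policy (e.g. "REPLICA"); None = unspecified."""
    access_latency: Optional[str] = None
    retention_policy: Optional[str] = None


def _conflicts(a: Policy, b: Policy) -> bool:
    # Two policies conflict if some field is set in both but with different values.
    pairs = ((a.access_latency, b.access_latency),
             (a.retention_policy, b.retention_policy))
    return any(x is not None and y is not None and x != y for x, y in pairs)


def effective_policy(client: Policy, dir_tags: Policy,
                     write_token: Optional[Policy], default: Policy) -> Policy:
    """Hypothetical model of AL/RP selection for an upload without a client space token.

    write_token is the AL/RP of the reservation named by the directory's
    WriteToken tag, or None if the directory has no such tag.
    """
    if write_token is not None:
        if _conflicts(dir_tags, write_token):
            raise ValueError("directory tags conflict with WriteToken reservation")
        if _conflicts(client, write_token):
            raise ValueError("client AL/RP conflicts with WriteToken reservation")
    # Directory tags are only defaults: a client-supplied value always wins.
    sources = [client, write_token or Policy(), dir_tags, default]
    al = next((p.access_latency for p in sources if p.access_latency), None)
    rp = next((p.retention_policy for p in sources if p.retention_policy), None)
    return Policy(al, rp)
```

In this model, a client that supplies AL/RP differing from the AccessLatency/RetentionPolicy tags is accepted (the 2.10.21 behaviour), whereas a conflict with a WriteToken reservation is still rejected.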

pool

When attempting to upgrade non-precious and non-cached files (e.g. a file marked broken), the receiving end of the migration module would answer twice: first (correctly) with a failure and then (incorrectly) with a success. This is now fixed.

spacemanager

With RDBMSs, transactional deadlock rollbacks are normal behaviour: they happen when the database must choose between two conflicting, concurrent changes. Previously, spacemanager would retry aggressively when this happened, which has been observed to degrade performance. This release includes several strategies to minimise the impact of this.
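
The release notes do not say which strategies spacemanager adopted. One widely used alternative to aggressive retrying is jittered exponential backoff, sketched here in Python; `TransientDeadlock` and `retry_with_backoff` are hypothetical names, and the `sleep`/`rng` parameters exist only so the behaviour can be tested deterministically:

```python
import random
import time


class TransientDeadlock(Exception):
    """Hypothetical stand-in for a database deadlock/serialization-failure error."""


def retry_with_backoff(operation, attempts=5, base_delay=0.05,
                       sleep=time.sleep, rng=random.random):
    """Call `operation`, retrying on TransientDeadlock with jittered exponential backoff.

    The delay before retry n (0-based) is a random value in [0, base_delay * 2**n),
    so transactions that collided once are unlikely to collide again immediately.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientDeadlock:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller see the failure
            sleep(base_delay * (2 ** attempt) * rng())
```

Spreading retries out in time reduces the chance that the same pair of transactions deadlocks again, which is the failure mode that aggressive retrying amplifies.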

Previously, the default number of space-manager threads was the same as the number of database connections. This does not take into account that there is background activity that also needs access to the database. With this release, the number of threads has been lowered; having a large number of threads also increases the likelihood of seeing transactional deadlock rollbacks.

Fix the shutdown sequence of spacemanager. Previously, shutting down a busy spacemanager could lead to attempts to modify the database after all database connections were closed; such failures were logged.

Reduce logging on various DB errors; generic transient errors are now warnings and transactional deadlock rollbacks are logged at debug level.

Changelog 2.10.20..2.10.21

9ce8df8: [maven-release-plugin] prepare release 2.10.21
2f41134: LocationManager: Use fqdn instead of hostname
a4b8812: spacemanager: Controlled shutdown
7fc3600: spacemanager: Make request processing more robust
41305ce: spacemanager: Reduce log level on various transient DB errors
e05fb97: spacemanager: Minor simplification to link group updates
7d325d8: spacemanager: Don’t log stack-trace on AL/RP/Reservation conflict
d5a9928: spacemanager: Lower default for number of threads
40efb76: 2.10: Upgrade HikariCP to 2.0.1
8608a2d: nfs4: log abandoned movers with WARN
5e35b46: cells: log bugs found by CellShell
bd53579: cells: fix how bugs are reported from ac_ command.
4c6eb11: pool: Fix bug in migration module upgrade logic
4f1c69b: pnfsmanager: Fix upload to space token that conflicts with AL and RP tags
740b40c: [maven-release-plugin] prepare for next development iteration

Release 2.10.20

Changes affecting multiple services

Various scripts, including the dcache command, invoke the java command with a list of directories in which Java should look for support libraries. Previously, the current working directory was (mistakenly) included in that list. This could lead to odd behaviour; one particular example is running a dcache database command from the /etc/dcache directory. This release fixes this problem by excluding the current directory.

Normally, if running an admin command in some cell triggers a bug then the log file of the domain hosting that cell will contain a stack-trace. Previously, for certain admin commands (spacemanager and sweeper ls in the pool) this did not happen. This is fixed with this release of dCache.

When there is some problem in the communication between domains, an error message is logged. Previously the explanation for the problem was logged as “null”. With this release, a more descriptive explanation is provided.

The help text within the chimera CLI and for the pool’s migration commands was badly formatted; this release fixes this.

spacemanager

By default, when logging messages some contextual information is included (in square brackets). This typically includes the cell name and the kind of activity that triggered the log message. Previously, some messages omitted all contextual information, which is fixed with this release of dCache.

Fixed the ls spaces and ls files commands so they do not fail if there is a reservation without an owner.

Every space-reservation may have an owner: a username or a VOMS group; ownership may be further restricted by VOMS role. When created through the admin interface, a reservation’s ownership is optional. With previous versions of dCache, if a space-reservation has no owner then anyone can release it. With this release of dCache, a reservation without an owner may only be released through the admin interface.

Changelog 2.10.19..2.10.20

1cdaf24: [maven-release-plugin] prepare release 2.10.20
07b031f: spacemanager: Change release authorization for unowned reservations
275bbf8: Fix valueSpec help parser
a8ca6ed: spacemanager: Fix NPE in listing space reservations
e85713a: Maintain CDC of threads in decorated thread pool
806e8c0: spacemanager: Log unexpected exceptions
8d37d0a: tunnel: use toString if IOException#getMessage returns null
67d047c: Exclude cwd from classpath
9feb238: [maven-release-plugin] prepare for next development iteration

Release 2.10.19

Changes affecting multiple services

Fix potential database connection leak for nfs door and pnfsmanager service. This was triggered when attempting to create an already existing non-zero level via the .(access) or .(use) dot commands.

This patch fixes a race condition in Chimera that affects the nfs door and the pnfsmanager service. The effect is that, if two clients attempt to delete the same target (a file, link or directory) at the same time, then the nlink count for the parent directory is decreased twice. “At the same time” means within the time taken to process the deletion; this is instance-specific but should be much less than 1 ms for well-configured systems. Sites can repair any incorrect nlinks with the following SQL:

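-- For every directory (itype 16384 = S_IFDIR, octal 040000), reset the link
-- count to the number of t_dirs entries whose parent is that directory.
UPDATE t_inodes SET inlink = (
    SELECT COUNT(*) FROM t_dirs WHERE t_inodes.ipnfsid = t_dirs.iparent
) WHERE itype = 16384;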

This is safe to run on a running production instance, but may take some time and will affect dCache’s responsiveness while running.

gplazma

The description for how to migrate away from using the forbidden useGPlazmaAuthorizationModule and useGPlazmaAuthorizationCell properties had caused confusion. The description has now been updated to be more explicit.

httpd

Fix filtering boxes and sorting on Pool Admin, Pool Usage, Poolgroups, Space Tokens and Tape Transfer Queue.

pnfsmanager

When a user uploads a file via SRM, a directory is created under the upload directory (/upload by default). Should this fail, the upload will fail; however, this was not logged. With this release, pnfsmanager logs why such failures happen.

pool

Previously, the NFS mover stopped when the client disconnected. This had two problems. First, a client that never connects to the pool leaves a mover that never dies. Second, file transfers are not robust against transitory networking problems, even though the client attempts to reconnect. The latter problem is particularly bad when writing data, as falling back to the door is not supported and such failures are not handled well by the Linux kernel.

With this release of dCache, if the client is not connected then the mover queries the door every 7.5 minutes to check whether it is still needed. The mover dies only when the client closes the file, the door declares the client is lost, or the nfs door is stopped or restarted.

IMPORTANT Any pool upgraded to this version of dCache that is used for NFS transfers requires all NFS doors to be upgraded to this version or newer.

Previously, if there was a problem while a file was being uploaded using HTTP chunked transfer encoding, then dCache would contain an incomplete file with no error mentioned in any log. With this release, such partially uploaded files still exist (due to the partial upload semantics of pools) but the problem is logged on the pool and with billing.

spacemanager

Add a -blocking option to the update link groups admin command to ensure all subsequent commands will use the updated information.

srm

When uploading files, the client may choose not to specify an access latency, a retention policy or a space reservation for these files. Likewise, when reserving space, the client may choose not to specify an access latency for the reservation. Previously, the detailed view of the admin ls command showed null for these fields if the client did not specify them. With this release, those fields are omitted if they have no useful data.

Uploading a file with SRM involves three steps: preparing for the upload (srmPrepareToPut), uploading the file, and marking the upload finished (srmPutDone). The third step can fail, but previously the response from dCache was always “Upload failed.”. With this release, a meaningful error message is returned.

Changelog 2.10.18..2.10.19

53b9c39: [maven-release-plugin] prepare release 2.10.19
67f3a8b: pool: Propagate HTTP mover failures to pool
1e29206: chimera: fix potential transaction leak on error path
7f17d90: gplazma: update error message for forbidden properties
84ad1e2: spacemanager: add blocking option to update link groups command
b45ebdd: srm: don’t list absent information in ls
6cc547a: fix broken merge
c550606: pnfsmanager: log problems when creating upload directory
e03d1f9: (2.10) webadmin: make jquery selector specific to individual tables
38628b4: (2.10) webadmin: restore missing components to respect jquery script options
adf3ae1: srm: include the reason why upload failed
b74d4a4: chimera: fix race condition on remove
044c33a: webadmin: ensure unique id attributes for all (currently) tested UI elements
c6c6e87: pool: update NFS transfer service to validate inactive movers
6e47601: [maven-release-plugin] prepare for next development iteration
a5634b0: libs: use nfs 0.9.x

Release 2.10.18

ftp

Fix default value for ftp.authz.readonly for plain (unencrypted) doors. This restores the default value to the dCache v2.6 default value of true.

Add the DN of a user in the access log file for “Grid FTP” access.

The response from the plain (unencrypted) ftp door if the user specifies the wrong password was badly formed. Although some clients may be robust against such incorrect responses, with this release the ftp door responds correctly.

nfs

Previously, when enabling the nfs door’s namespace cache, querying information about a freshly created file could provide stale information; for example, querying a recently uploaded file’s length could mistakenly show the file length is zero. With this release, the close after a successful write will invalidate that file’s entry in the nfs door’s cache, allowing it to report up-to-date information.

Prevent leaking memory if a client’s data transfer is proxied (i.e., no use of pNFS) and there is a communication error between the pool and the nfs door.

webdav

Fixes a bug where, if a double-slash is present, all parts of the path leading up to the double-slash are ignored; for example, with the bug, a path like /a/b//c/d is handled as if /c/d was specified. With this release, double-slashes are treated like single slashes; the above example is handled as if /a/b/c/d was specified.
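
The corrected behaviour is equivalent to collapsing every run of slashes into one separator. A minimal Python sketch (`collapse_slashes` is an illustrative name; the actual fix came from a patched Milton library):

```python
import re


def collapse_slashes(path: str) -> str:
    """Treat any run of slashes as a single separator: '/a/b//c/d' -> '/a/b/c/d'."""
    return re.sub(r"/{2,}", "/", path)
```

Python's `posixpath.normpath` is not a drop-in replacement here: POSIX leaves a leading `//` implementation-defined, so `normpath("//a")` keeps `//a`, and `normpath` also resolves `..` components, which changes which entry a path refers to.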

Fix the NullPointerException triggered if a client attempts to upload a file as a child of some existing file.

Changelog 2.10.17..2.10.18

ef63f5d: [maven-release-plugin] prepare release 2.10.18
c928fff: ftp: fix response if user fails to authenticate to weak FTP door
7064cd3: nfs-proxy: remove proxy adapter on IO errors
1d069d7: ftp: fix invalid default for ftp.authz.readonly property
4d54186: webdav: fix double-slash bug by upgrading to patched milton
7c754df: webdav: fix NullPointerException when PUT as a child of a file
7ade469: ftp: add user to access log
48178df: nfs4: invalidate vfs cache on successful write
fecbf9c: nfs4: add NFS file handle into NfsTransfer class
aaa07fe: libs: update to nfs4j–0.8.6
2f4a5e2: [maven-release-plugin] prepare for next development iteration

Release 2.10.17

httpd

Fix two minor issues when authenticating with the webadmin interface: “unauthorised access” and being redirected to the home page. The unauthorised access error can occur when selecting “Login” under the bird logo (top right corner); this is now fixed. The redirection problem occurs when selecting a tab that requires administrative privileges while not logged in; this redirects the browser to the login page. Previously, after a successful login, the browser was redirected to the home page. Now the browser is redirected to the selected tab.

info-provider

The previous bug-fix release of dCache included a regression in the info-provider. The SRM endpoint URL (which starts with httpg://) omitted the port number. This is now fixed.

In previous releases, the info-provider assumed the broker domain is dCacheDomain. This assumption has been removed.

pool

Fix logging that the HTTP mover returned an error so that a generic message (“An unexpected server error has occurred.”) is logged if no more concrete message is available, rather than “null”.

srm

The info command in the srm service shows the current number of requests in each state for each request-type, along with the maximum allowed. Previously, the total failed to include requests in READY state.

webdav

A previous bug-fix release fixed how dCache responds when the client attempts to DELETE a non-existent file. Unfortunately, this triggered a different problem where such activity results in a stack-trace that starts java.lang.ClassCastException: java.lang.String cannot be cast to javax.security.auth.Subject. This second problem is now fixed.

Changelog 2.10.16..2.10.17

ba2064d: [maven-release-plugin] prepare release 2.10.17
6cfd529: Revert “info-provider: fix publishing SRM port number”
fb9e33d: info-provider: remove dCacheDomain assumption.
c77a2e2: pool: Fix logging in HTTP mover
9d617f6: srm: Fix calculation of total number of requests
23f966a: webdav: Alternative to fixing return code of DELETE of absent file
6767c95: (2.10) webadmin: fix login redirect bug
73d9d2d: [maven-release-plugin] prepare for next development iteration

Release 2.10.16

Changes affecting multiple services

Reduce memory usage by avoiding multiple instances of common strings. Although this improves many services, pools are most affected.

Previous dCache releases included some of the code necessary for pcells, even though dCache made no use of this code. As part of an ongoing consolidation effort, this code has now been removed from regular dCache releases.

The cell.name is a core property within dCache. Its omission (most easily, by forgetting to specify pool.name) prevented the dcache script from working. The script is now robust against such errors.

If the dcache check-config command discovers that a deprecated property is being configured, it looks for the alternative property that should be used instead. In earlier releases, under certain circumstances, the wrong alternative property is selected. This is now fixed.

It is possible that a sudden burst of activity from many clients exceeds dCache’s capacity to queue such requests. Although dCache is designed to degrade gracefully under such circumstances, certain requests could become stuck or memory could leak. This is now fixed.

nfs

Upgrade to nfs4j v0.8.5. This fixes export file parsing if a host is mentioned multiple times; note localhost must now have an explicit entry in the exports file. This also fixes a deadlock when closing a file on a busy dCache instance.

Allow an nfs door to avoid being published to loginbroker. Not being published in loginbroker means the door is unavailable to SRM and is not published by info-provider.

pinmanager

If the maximum concurrency (pinmanager.cell.threads.max) exceeds the number of database connections (pinmanager.db.connections.max) then there is a risk of pinmanager becoming deadlocked under heavy load. This release reduces the default maximum concurrency to avoid this.

pnfsmanager

Fix NullPointerException when a file is stored in a directory with an empty tag.

pool

Previously, reloading the pool configuration produced an error when a nearline storage was already defined in the existing configuration. This is now fixed.

A recent release fixed a bug that caused pool.mover.ftp.allow-incoming-connections to be ignored. Fixing that bug revealed another that caused the property to have the opposite effect. This is now fixed.

dCache pool configuration allows passing (fixed) arguments to the HSM script. In earlier releases, these arguments were supplied to the script in an arbitrary order. While HSM scripts should not depend on the order, they might; so, with this release, the order is preserved.

webdav

Although dCache behaves correctly if the client interrupts a proxied transfer, this was logged as a bug. This is now fixed.

Make HTTP third-party copy feedback and detecting when client disconnects more robust against internal dCache problems.

Changelog 2.10.14..2.10.16

e747823: [maven-release-plugin] prepare release 2.10.16
0f760f6: info-provider: fix publishing SRM port number
6d67a58: [maven-release-plugin] prepare for next development iteration
7306099: [maven-release-plugin] prepare release 2.10.15
dcf8d77: webdav: don’t log a stack-track when proxy transfer is interrupted
1144520: fix broken commit
bee43f4: webdav: make progress markers more robust
4932840: Fix compilation with Java 7
2b5cdda: cells: Fix message timeout in case of thread pool overflows
5516165: chimera: handle NULL field of directory tags
6277cd0: pinmanager: Adjust pinmanager defaults
e766e8e: Internalize common strings in StorageInfo
634bbe1: bootloader: Don’t fail on missing cell name
0ebd0af: check-config: fix warning for deprecated properties
6ef87e1: pool: Fix regression preventing configuration to be reloaded
2b27ad6: (2.10) gitignore additions for IntelliJ
6fd254d: libs: update to nfs4j–0.8.5
7e52c73: pool: Fix regression causing FTP movers to default to proxy mode
477f0c8: nfs: Allow not to publish to loginbroker
cd8b59e: pool: Preserve HSM option ordering
bbd5570: [maven-release-plugin] prepare for next development iteration
bf26db9: cells: Delete pcells related classes

Release 2.10.14

Changes affecting multiple services

Although dCache system configuration property names do not contain spaces, it is possible to define such properties. Previously, doing so broke the dcache command. This is now fixed.

Update the nfs door and the pools to allow faster restarts. The door is also more responsive to namespace operations when handling a large number of proxy reads.

If the number of concurrent connections is set to -1 then the ftp and dcap doors will leak memory, eventually triggering an out-of-memory error that will restart the domain. This is now fixed.

admin

Fix erroneous reporting of bugs when the user supplies incorrect arguments to an admin command.

dcap

Fix the permissions check for bring-online with plain dcap and a NFSv3 mounted dCache.

Fix regression against v2.6 and earlier dCache in how gsidcap doors are known to SRM and how they are published in BDII/GLUE.

ftp

The ftp door supports the HELP command, as some clients use this command’s output to discover whether certain optional functionality is available. This release fixes the output from this command.

The ftp door’s access log recorded some responses without the corresponding command line from the client: the continuation lines of multi-line responses and all responses to STOR and RETR commands were affected. This is now fixed.

nfs

The info command now lists the proxy adaptors. A proxy adaptor is created when the client rejects the pool and falls back to reading data from the door.

Improve shutdown of nfs door when the embedded portmapper is used.

pool

Improve which interface the ftp mover selects when the client is redirected to the pool. In particular, the door will only redirect the client if the IP protocol matches (IPv4 vs IPv6); for example, if a client connects with IPv6 and the pool has no IPv6 address then the door will now proxy the data connection.
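
The family-matching rule can be sketched in a few lines of Python. This is an illustration of the rule as described, not dCache’s code; `choose_pool_address` is a hypothetical name, and picking the first matching address is a simplifying assumption:

```python
import ipaddress
from typing import Iterable, Optional


def choose_pool_address(client_ip: str, pool_ips: Iterable[str]) -> Optional[str]:
    """Return the first pool address in the client's IP family, or None.

    None means there is no family match, so the door would proxy the data
    connection instead of redirecting the client to the pool.
    """
    family = ipaddress.ip_address(client_ip).version  # 4 or 6
    for ip in pool_ips:
        if ipaddress.ip_address(ip).version == family:
            return ip
    return None
```

For example, an IPv6 client and a pool with only `192.0.2.10` yields None (proxy), while an IPv4 client and a pool with `2001:db8::5` and `192.0.2.10` yields `192.0.2.10`.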

The pool.mover.ftp.allow-incoming-connections property had no effect. This is now fixed.

Add the IoMode (whether the mover is accepting new data, or supplying existing data) to the status line describing the mover.

Improve shutdown of nearline storage subsystem; now dCache will attempt to cancel all ongoing activity before shutting down the pool. The previous asynchronous approach could lead to IllegalStateException being logged if the pool was busy at that time.

spacemanager

Adjust spacemanager schema migration to be aware of earlier dCache bugs and to work-around site-local indexes that clash with new indexes that dCache needs.

In previous versions of dCache, both spacemanager and pnfsmanager had default access-latency and retention-policy settings. The spacemanager default retention-policy is only used if the admin doesn’t specify a value when creating a reservation through the admin interface. The spacemanager default access-latency is also used as a default for the admin command; additionally, it is used when processing an SRM reserveSpace request that omits the (optional) access-latency information — the retention-policy is mandatory in such SRM requests.

With this release, the spacemanager no longer has default access-latency or retention-policy properties. Creating a space-reservation in the admin interface now requires specifying both the access-latency and retention-policy. When processing an SRM reserve space request without an access-latency, the spacemanager will check from which linkgroups the user is authorised to reserve space. It will choose an access-latency based on the request’s retention-policy. If the user is authorised to reserve space from link-groups such that both access-latency options (ONLINE and NEARLINE) are possible, space-manager will prefer to reserve ONLINE if the retention-policy is REPLICA and NEARLINE if the retention-policy is CUSTODIAL.
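
The access-latency choice for an SRM reserve-space request without an access-latency can be sketched as below. It assumes the set of possible access-latencies has already been derived from the link groups the user may reserve from; `choose_access_latency` and `PREFERRED_LATENCY` are illustrative names, and returning None for retention policies not covered above (such as OUTPUT) is an assumption:

```python
from typing import AbstractSet, Optional

# Preference stated above when both access latencies are possible.
PREFERRED_LATENCY = {"REPLICA": "ONLINE", "CUSTODIAL": "NEARLINE"}


def choose_access_latency(retention_policy: str,
                          possible: AbstractSet[str]) -> Optional[str]:
    """Pick an access latency for a reserve-space request that omitted one.

    possible: access latencies ("ONLINE"/"NEARLINE") offered by the link
    groups the user is authorised to reserve space from, for this
    retention policy.
    """
    if not possible:
        return None  # no suitable link group: the reservation cannot be made
    if len(possible) == 1:
        return next(iter(possible))  # only one option, so no choice to make
    # Both options possible: apply the preference; None for uncovered policies.
    return PREFERRED_LATENCY.get(retention_policy)
```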

srm

Previously, if the SRM client requests listing a directory, specifies a non-zero offset and does not limit the response size then dCache would fail this request with an IllegalArgumentException. This is now fixed.
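
The failing request is a "semi-infinite" range: a start offset with no end. A minimal Python sketch of handling such a request without special-casing it, representing "no limit" as an open-ended slice; `list_page` is a hypothetical name and this does not reproduce dCache’s actual fix:

```python
from typing import Optional, Sequence


def list_page(entries: Sequence, offset: int = 0, count: Optional[int] = None) -> list:
    """Return entries[offset:offset+count]; count=None means 'no limit'."""
    if offset < 0 or (count is not None and count < 0):
        raise ValueError("offset and count must be non-negative")
    end = None if count is None else offset + count  # None: run to the end
    return list(entries[offset:end])
```

For example, `list_page(range(10), offset=7)` returns the last three entries, and an offset past the end simply yields an empty listing.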

webdav

With the recent upgrade of the Milton library some new behaviour was introduced. One example is that, under certain circumstances (and to support certain clients) the Milton library returns a 401 (not authorised) when attempting to delete a non-existing file. Unfortunately, this change then broke ATLAS clients. This patch updates dCache so it returns 404 (not found) under these circumstances.

Fix the response when a client requests a byte-range beyond the end of a file. This is necessary for compatibility with ARC clients.
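
The release note does not spell out which response was corrected. For reference, HTTP range semantics (RFC 7233) clamp a last-byte-pos past the end to the final byte, treat a range starting at or beyond the end as unsatisfiable (416), and give a partial response a Content-Length of last − first + 1. A Python sketch of those rules for a single range (`resolve_byte_range` is an illustrative name, not dCache or Milton code):

```python
from typing import Optional, Tuple


def resolve_byte_range(header: str, length: int) -> Optional[Tuple[int, int]]:
    """Resolve a single 'bytes=...' range against a file of `length` bytes.

    Returns an inclusive (first, last) pair, or None if the range is
    unsatisfiable (answer 416 Range Not Satisfiable). Raises ValueError
    for syntax this sketch does not handle.
    """
    unit, sep, spec = header.partition("=")
    if sep != "=" or unit.strip().lower() != "bytes" or "," in spec:
        raise ValueError("only a single byte range is handled here")
    first_s, dash, last_s = spec.strip().partition("-")
    if dash != "-":
        raise ValueError("missing '-' in byte range")
    if first_s == "":  # suffix range 'bytes=-N': the last N bytes
        suffix = int(last_s)
        if suffix < 0:
            raise ValueError("invalid suffix length")
        if suffix == 0 or length == 0:
            return None
        return (max(length - suffix, 0), length - 1)
    first = int(first_s)
    if first >= length:  # starts at or beyond end-of-file
        return None
    last = length - 1 if last_s == "" else int(last_s)
    if last < first:
        raise ValueError("last-byte-pos before first-byte-pos")
    # A last-byte-pos past the end is clamped to the final byte.
    return (first, min(last, length - 1))
```

For a 50-byte file, `bytes=0-99` resolves to (0, 49), so the partial response carries 50 bytes rather than the 100 the client asked for.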

Changelog 2.10.13 to 2.10.14

e35c5ab: [maven-release-plugin] prepare release 2.10.14
37af85b: webdav: Fix reported content length for partial GETs
87eb009: ftp: ensure client command line is available in access log
70afdaa: webdav: Fix return code on DELETE of absent file
409228b: srm: fix semi-infinite ls range with non-zero offset
ab86627: pool: Partially fix interface selection for FTP mover
bc3e762: pool: Fix typo that breaks pool.mover.ftp.allow-incoming-connections
7b19902: shell: fix shell oracle for configuration keys with a space
23715ac: nfs: show current proxy-io transfers in the door
cfaa0e9: chimera-cli: fix chown
7540735: dcap: Fix regression in published protocol family
0aac485: ftp: fix help output
5175403: loginmanager: Fix leak caused by absent child limit
f540180: pool: Fix shutdown of nearline storage subsystem
83608c6: spacemanager: Fix null constraints and other schema migration issues
fe1da05: dcap: check for url in some commands
3901c80: admin: Fix error reporting
553fdce: pool: add Io mode into mover’s status line
a275f1e: spacemanager: Get rid of access latency and retention policy defaults
b079809: nfs: stop embedded portmap on shutdown
0063089: libs: update to nfs4j–0.8.4
0312ac0: [maven-release-plugin] prepare for next development iteration
907f1da: dcap: refactor PnfsSessionHandler to unify permission check and url handling

Release 2.10.13

ftp

Relax requirements for the EPSV and EPRT commands. Previous versions of dCache would reject all RFC 2428 commands from an IPv4 client. This has been relaxed with this release. Now, IPv4 clients can use EPRT. IPv4 clients can also use EPSV if delayed passive is enabled.
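
For reference, an RFC 2428 EPRT argument has the form `<d><net-prt><d><net-addr><d><tcp-port><d>`, where the delimiter is a printable ASCII character (usually `|`) and net-prt is 1 for IPv4 or 2 for IPv6; with this release an IPv4 client may send, for example, `EPRT |1|132.235.1.2|6275|`. A Python parsing sketch (`parse_eprt` is an illustrative name; dCache’s own validation is not shown here):

```python
import ipaddress


def parse_eprt(argument: str):
    """Parse an RFC 2428 EPRT argument such as '|1|132.235.1.2|6275|'.

    Returns (address, port). The first character is the delimiter; the
    network-protocol field is 1 for IPv4 and 2 for IPv6.
    """
    if len(argument) < 2:
        raise ValueError("EPRT argument too short")
    delim = argument[0]
    if not 33 <= ord(delim) <= 126:
        raise ValueError("delimiter must be a printable ASCII character")
    fields = argument.split(delim)
    # '|1|addr|port|' splits into ['', '1', 'addr', 'port', '']
    if len(fields) != 5 or fields[0] or fields[4]:
        raise ValueError("malformed EPRT argument")
    proto, host, port_s = fields[1], fields[2], fields[3]
    address = ipaddress.ip_address(host)  # raises ValueError if invalid
    if {"1": 4, "2": 6}.get(proto) != address.version:
        raise ValueError("network protocol does not match address family")
    port = int(port_s)
    if not 0 < port < 65536:
        raise ValueError("port out of range")
    return address, port
```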

info

Info service includes a safety feature that prevents it from bombarding the rest of dCache with too many messages. With this release, the safety limit is reduced allowing info to send messages more often.

Reduce the delay between subsequent messages of the same type. This allows the info service to recover information more quickly after being restarted.

pool

The NFS specification allows the server to specify multiple addresses when telling the client where to connect; for example, specifying both an IPv4 and an IPv6 address, or both addresses for multi-homed machines. This requires the client to choose the appropriate interface. For Scientific Linux 6, the kernel client will always use the first supplied address in the list and fail if it cannot access the pool with that address. With this release, pools will order the list, using heuristics to select which IPv4 address is “correct” and list it first.

Prior to flushing a file to tape, the pool checks if the name-space entry still exists and deletes the replica (rather than flushing) if not. Version 2.10.7 introduced a regression where, if this happens, there is an IllegalStateException stack-trace. This is fixed with this release.

poolmanager

With previous dCache versions, the WAAS selection algorithm had a bug where it could (mistakenly) consider pools full if all pools had very fresh files. This is fixed with this release.

spacemanager

Fix various issues to improve the robustness of spacemanager: handle failed uploads correctly, handle files deleted during upload correctly, and increase robustness against dCache losing (internal) messages.

webdav

Upgrade to Milton v2.6. This fixes the buffering problem where a proxied vector read request results in the entire file being written to a tmp directory and not deleted. With this release, requests for 100 kiB or less data result in no data being written to disk; requests for more than 100 kiB are still written to disk, but only the data needed to satisfy the request is stored and the file is deleted once the response has been sent. Some issues persist: the data isn’t deleted if sending it to the client fails, and the whole file is still requested from the pool.
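
The size-dependent buffering described above can be sketched in Python as follows. This illustrates the policy, not Milton’s implementation; `open_response_buffer` and `IN_MEMORY_LIMIT` are hypothetical names, with the threshold taken from the note:

```python
import io
import tempfile

# Threshold from the release note: requests of 100 kiB or less stay in memory.
IN_MEMORY_LIMIT = 100 * 1024


def open_response_buffer(requested_bytes: int):
    """Return a writable binary buffer sized to the request.

    Small requests use an in-memory BytesIO; larger ones use an anonymous
    temporary file, which is removed once it is closed.
    """
    if requested_bytes <= IN_MEMORY_LIMIT:
        return io.BytesIO()
    return tempfile.TemporaryFile()
```

Opening the buffer in a `with` block closes it (and so deletes any temporary file) even when sending the response fails, which is the case the note says still leaks data. When the size is not known in advance, `tempfile.SpooledTemporaryFile(max_size=IN_MEMORY_LIMIT)` keeps data in memory and rolls over to disk only once it grows past that size.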

Changelog 2.10.12 to 2.10.13

04efb0b: [maven-release-plugin] prepare release 2.10.13
4431ddf: ftp: Relax requirements for EPSV and EPRT
1f7afdb: poolmanager: Fix full pool detection for WASS
2e1ecd5: ftp: Relax requirements for EPSV and EPRT
27783b3: Upgrade to Milton 2.6
b74db9d: info: Reduce safety limit to 50 ms
e4fe223: pool: reorder ip addresses returned to NFS client
a157ef8: pool: Fix ISE when flushing deleted file
53bbe92: info: Reduce delay between messages
c509fbc: spacemanager: Fix various error recovery scenarios
39dadfa: [maven-release-plugin] prepare for next development iteration

Release 2.10.12

Changes affecting multiple services

Restarting a domain within an active dCache instance can lead to a domain receiving messages for a cell as it is starting. Strict control is required to avoid the cell attempting to process messages before it is ready. This release fixes one place where this control was missed, which could lead to a NullPointerException. While this problem can affect any core dCache service, it was noticed with the spacemanager service.

httpd

dCache versions including and after 2.11.0, 2.10.9, 2.9.12, 2.8.16, 2.7.21 and 2.6.36 required sites to delete existing RRD files when upgrading; i.e., run the command rm -f /var/lib/dcache/plots/*.rrd when the domain hosting the httpd service is stopped. This release reverts that change, but requires sites that have already upgraded to repeat the rm command. Sites upgrading from an earlier dCache version do not need to delete anything.

Fix a potential NullPointerException in poolCollector.

nfs

Protect against a NullPointerException if the client attempts to read the contents of a file’s level where that level exists in the database but contains a Nil value. This does not happen under normal circumstances.

pool

Previously, any problems found when the xrootd client is writing to a pool were silently ignored: neither the pool, the client, nor billing registered that there was a problem. With this release, problems are logged in billing and reported back to the client. The client is also disconnected.

The Berkeley DB, which may be used to store file metadata on the pool, does not like being interrupted. The pool tries hard to avoid interrupting reading or writing; this release fixes one place that slipped through.

Fix a regression in the output from the info command: it did not include statistics about the number of HSM requests and HSM timeouts. This release also fixes how active HSM jobs are counted: cancelled jobs are still active until the underlying job has ended.

Improve the error message reported back to the xrootd client should its request have an invalid or missing UUID.

Changelog 2.10.11 to 2.10.12

9de625c: [maven-release-plugin] prepare release 2.10.12
4af8b23: chimera: protect against NPE in FsSqlDriver#read
df5bab0: xrootd: Propagate mover errors to dCache
fb61ae4: pool: Restore flush and stage stats in info
3f9044d: pool: Improve xrootd error message on missing UUID
6762e04: Fix NPE in cell initialization
46940a6: Fix NPE in httpd service
036e80f: pool: Avoid interrupting Berkeley DB in migration module server
8257601: dcache-webadmin: revert rrd data source names
9c9bc38: [maven-release-plugin] prepare for next development iteration

Release 2.10.11

Changes affecting multiple services

dCache will discard queued internal messages when the sender is no longer expecting a reply (due to an internal timeout); this helps an overloaded system recover quickly. Such discarded messages are logged. This release improves that logging: string commands are now logged correctly, and each log entry includes both the message's time-to-live and its age.
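To illustrate the discard rule, here is a minimal Python sketch, not dCache's cell implementation: it assumes each queued message carries a time-to-live (how long the sender will wait for a reply) and a creation time. The `Envelope` class and `drain` function are hypothetical names; the log line mirrors the kind of information (TTL and age) that this release adds.

```python
import time
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Envelope:
    """Hypothetical queued message: not dCache's actual message class."""
    payload: str
    ttl: float  # seconds the sender is willing to wait for a reply
    created: float = field(default_factory=time.monotonic)


def drain(queue, now, log):
    """Deliver queued messages, discarding those whose sender has already timed out."""
    delivered = []
    while queue:
        msg = queue.popleft()
        age = now - msg.created
        if age > msg.ttl:
            # Nobody is waiting for the reply any more, so processing it is wasted work.
            log.append(f"Discarding {msg.payload!r}: ttl={msg.ttl:.1f}s age={age:.1f}s")
            continue
        delivered.append(msg.payload)
    return delivered
```

Skipping expired messages, rather than processing them late, is what lets an overloaded system catch up.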

Log messages from the xrootd door and from the pool's HTTP and xrootd movers included incorrect context information, always showing the first connection. This has been fixed.

alarms

Fix minor corruption at beginning of the alarms defaults file.

httpd

In previous versions of dCache, the webadmin war file was automatically unpacked. This was problematic because a dCache upgrade did not always trigger updating the unpacked webadmin, resulting in dCache running the older webadmin. With this release, dCache no longer unpacks the war file; the webadminWarunpackdir and httpd.container.webapps.tmp-dir properties are now obsolete.

Changelog from 2.10.10 to 2.10.11

8448a04: [maven-release-plugin] prepare release 2.10.11
3cbd3f8: cells: Refine TTL discard message
f436cad: xrootd,pool: Fix xrootd and http logging context
5ca7b4b: alarms.properties fix accidental header corruption
6f37b1c: [maven-release-plugin] prepare for next development iteration
cca38b7: (2.10) dcache-webadmin: change Jetty setting so .war is not unpacked

Release 2.10.10

Changes affecting multiple services

Doors report to billing when a file is deleted. Previously, many doors neglected to include the PNFS-ID and sent only the path. This can be ambiguous so, with this release, all doors send the PNFS-ID in addition to the path. Some doors also failed to send the file size (if known) and the client IP address. These, too, have been fixed. In summary:

* dcap additionally sends: PNFS-ID, file size and client address.
* ftp additionally sends: PNFS-ID.
* srm additionally sends: client address.
* webdav additionally sends: PNFS-ID and file size.
* xrootd additionally sends: PNFS-ID.

A bug was discovered that resulted in the nlink count becoming negative. Besides being wrong, this had a knock-on effect that prevented the pnfsmanager and nfs services from listing the affected directory. With this release, directory listings are robust against such problems.

alarms

Fix alarm definition that triggers when the pool discovers a file with the wrong checksum. Previously it mistakenly triggered if an upload was incomplete.

dcap

Create a session identifier for the duration of each client connection. This session identifier is available in log messages and in the billing entries.

gplazma

Update ldap plugin to support both RFC 2307 and RFC 2307bis LDAP schema types. Additionally, the memory footprint of this plugin has been reduced and its performance improved.

nfs

Update nfs4j to v0.8.3. This brings some small performance benefits when listing a directory and when reading data through NFS v3.

Fix a bug where, when a client's read or write activity is proxied, the corresponding dcap mover was not removed if the client did not wait for the queued mover to start. Previously, such movers would accumulate on the pool.

pnfsmanager

Prevent attempts to remove the . and .. directories.

This release improves the shutdown procedure. It is now faster and does not generate error messages if pnfsmanager is shut down while dCache is active.

spacemanager

When uploading a file outside a space reservation (i.e., using implicit space reservation) on a loaded dCache instance, the spacemanager may report a database deadlock. These messages do not indicate a problem: dCache behaves correctly and the upload is not affected; however, there is an impact on performance. With this release, there should be no further deadlock messages and spacemanager should place less load on the database, with a corresponding increase in performance.

webadmin

Display the command output in the cell admin page using a monospace font.

Fix a JavaScript error caused by a missing clojure dependency.

Sending commands from the cell admin page has been fixed.

webdav

Fix webdav door so it no longer reports a stack-trace when the client issues a USERINFO request. The davfs2 client, in particular, issues these requests.

xrootd

Add a logging plugin. This allows an admin to investigate problems between an xrootd client and either the xrootd door or the pool. It is not enabled by default: sites must explicitly include the plugin in their configuration to achieve additional logging.

Create a session identifier for the duration of each client connection. This session identifier is available in log messages and in the billing entries.

Changelog from 2.10.9 to 2.10.10

4c162a3: [maven-release-plugin] prepare release 2.10.10
592236f: doors: Add pnfs ID to billing remove entries
d4d868e: webadmin: Fix ClassCastException in cell admin
bec30a9: alarms: fix regex for checksum alarm
1ef0e2a: Fix compilation for Java 7
d741823: webadmin: remove redundant head element in alarms panel html
a46f2f1: dcache-webadmin: change output field of cell admin page to monospace font
7274f12: (2.11) dcache-webadmin: eliminate clojure dependency
525962e: libs: update to nfs4j-0.8.3
115cfe1: dcache-xrootd: Extend and improve xrootd access log plugin
c5c4775: gplazma-ldap: pull only required attribute
8e55ee2: gplazma-ldap: support uniqueMember based group membership query
6c51061: nfs-proxy: kill mover on close
5d7820b: chimera: protect list initialization from FS inconsistencies
8757c04: webadmin: tidy up unavailable page slightly
320324b: xrootd: Add access-log plugin
3e58f42: xrootd: Fix compilation of backported session initialization
d176e4c: fix WebDAV door logging stacktrace on USERINFO request
15ec5c8: dcap: Initialize CDC session
a6215e6: xrootd: Initialize session
3dbecd4: chimera: prevent attempts to remove ‘.’ and ‘..’
077c6d8: spacemanager: Avoid a transaction deadlock and reduce DB overhead
0b54294: pnfsmanager: Controlled shutdown of processing threads
fe21ad9: srm-client: Report file level errors
393fad1: [maven-release-plugin] prepare for next development iteration

Release 2.10.9

Fixes affecting multiple services

Update dcache check-config to print an error if the site-local configuration contains scoped properties: ones that contain a ‘/’ character.
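As a rough illustration of what such a check involves, the Python sketch below scans `key = value` lines and flags keys containing ‘/’. It is not the check-config implementation: it assumes a simplified file format (comments start with `#`, one assignment per line), and the example property names in the test are hypothetical. Only the key is inspected, so values that are paths are not flagged.

```python
def find_scoped_properties(lines):
    """Return (line_number, key) for every property key that contains '/'."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, _value = line.partition("=")
        if not sep:
            continue  # not a key=value assignment
        key = key.strip()
        if "/" in key:
            problems.append((number, key))
    return problems
```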

Fix a problem where, under certain circumstances, the FTP and dcap doors could log error messages against the wrong cell. This release also makes shutting down the doors a more orderly process.

billing

Make shutdown quieter and faster.

httpd

Fix pool queue plots in webadmin.

nfs

Performance improvements.

pnfsmanager

Fix the message reported when moving (or renaming) a file or directory fails because of certain problems: the destination directory isn’t a directory, the source doesn’t exist, or the overwrite involves different types (e.g., overwriting a directory with a file).

Requests to upload a file now fail earlier if that file’s path already exists as a directory. Previously, dCache would fail such uploads after the file was uploaded (during srmPutDone). With this release, dCache will fail during srmPrepareToPut instead.

poolmanager

If a site using WAAS specifies too large a Space Cost Factor then the algorithm used for write pool selection breaks down, logging “Unreachable statement.” Starting with this release, a warning is logged indicating the cause of the problem.

Made shutting down poolmanager more orderly and faster.

srm

The srm.enable.space-reservation.implicit property is now marked forbidden. This means a dCache instance with this property configured will not start. Consider the spacemanager.enable.unreserved-uploads-to-linkgroups property instead.

webdav

The third-party copy implementation supports the Credential HTTP header. If an incorrect value was specified, the error returned to the user provided an incorrect list of valid values. This is now fixed.

Changelog from 2.10.8 to 2.10.9

bd39c7a: [maven-release-plugin] prepare release 2.10.9
cdb2028: webdav: correct error message for incorrect Credential header value
6cf3a1d: srm-client: delegate by default for https 3rd-party transfers
8c061b0: chimera: Add messages to JdbcFs#move exceptions
f37fb2a: poolmanager: Warn when spacecostfactor is too big
e7589b2: poolmanager: Warn when spacecostfactor is too big
ac6cd39: pnfsmanager: Check file type before overwrite
cf33a8b: srm-client: Report srmPutDone failure
d07c1a7: check-config: Produce error when using scoped properties
ec26c94: srm: Mark srm.enable.space-reservation.implicit forbidden
9a8fa95: loginmanager: Associate ScheduledExecutor with the cell’s thread group
c59f668: pom: update to nfs4j-0.8.2
0cf0d3f: (2.10) pool queue plots - fix partial package refactoring and neglected removal of jndi arguments
0f82d09: billing: Improve shutdown for db backend
80186eb: poolmanager: Shut down request container thread pool during shutdown
7bf0578: [maven-release-plugin] prepare for next development iteration

Release 2.10.8

Fixes affecting multiple services

If the eval admin command returned a non-zero return code, the admin interface responded as if a bug had been found. This has been fixed.

A named environment may be executed and can contain a reference to itself. If this recursion is unchecked, memory is eventually exhausted and the domain restarts. Now an error is logged and the domain is not killed.
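One common way to stop such recursion is to track which environments are currently being expanded and fail as soon as one repeats. The Python sketch below shows that idea; it is not dCache's interpreter. It assumes, for illustration, that environments are stored as a mapping from name to lines and that a line of the form `exec env NAME` pulls in another environment; `expand` and `EnvironmentCycleError` are hypothetical names.

```python
class EnvironmentCycleError(Exception):
    """Raised when a named environment (directly or indirectly) executes itself."""


def expand(envs, name, _active=None):
    """Flatten environment `name` into its commands, rejecting recursive references."""
    active = [] if _active is None else _active
    if name in active:
        raise EnvironmentCycleError("recursive environment: " + " -> ".join(active + [name]))
    active.append(name)
    try:
        commands = []
        for line in envs[name]:
            words = line.split()
            if len(words) == 3 and words[:2] == ["exec", "env"]:
                commands.extend(expand(envs, words[2], active))  # nested environment
            else:
                commands.append(line)
        return commands
    finally:
        active.pop()  # leave the stack consistent even when an error propagates
```

Raising an ordinary, catchable error lets the caller log the problem instead of letting unbounded recursion exhaust memory.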

The RPM package now includes the file /etc/logrotate.d/dcache. This is a default configuration for the logrotate application so that domain logs are rotated weekly, with compression, retaining one year of history.

Previously, some exceptions were logged with the wrong cell context or without any context. This has been fixed.

Relative upload directories

This release adds support for upload directories that are relative to the user’s root directory. If the dcache.upload-directory configuration property is defined as a relative path, it is now interpreted relative to the account’s root directory.
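The resolution rule can be summarised in a few lines of Python. This is a sketch of the rule as stated above, not dCache code: the function name is hypothetical, and normalising `.` and `..` components is an assumption made for the example.

```python
import posixpath


def resolve_upload_directory(upload_directory, account_root):
    """Resolve an upload-directory setting for one account (illustrative sketch).

    Absolute paths are used as-is; relative paths are taken relative to the
    account's root directory.
    """
    if posixpath.isabs(upload_directory):
        return posixpath.normpath(upload_directory)
    return posixpath.normpath(posixpath.join(account_root, upload_directory))
```

For example, with a relative setting of `upload`, an account rooted at `/home/alice` would upload into `/home/alice/upload`, while an absolute setting such as `/upload` is shared by all accounts.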

Some sites use account-specific roots to guarantee isolation of users from each other. Previously, such sites had to run additional doors to allow SRM uploads. With this release, sites have an alternative: configuring dCache so that uploads are made into user-specific upload directories.

Both approaches (deploying extra doors, and using user-specific upload directories) have benefits and disadvantages. These are documented in the dcache.properties default file (see the dcache.upload-directory property). Sites are encouraged to read this carefully before configuring a relative upload directory.

Sites that do not use user-specific roots remain unaffected by this issue.

All doors and pnfsmanager need to be updated for the change to be effective. If the feature is not used, then there is no limitation on upgrades.

admin

Under certain circumstances the admin interface would hide useful information about a bug that some admin command had triggered. That information is now made available.

billing

Fix indexer’s inability to parse billing PoolHitInfoMessage events.

Improve error message if the indexer cannot parse a date argument.

Billing records issued by doors never included the file’s size. This is now fixed for all doors except NFS.

The indexer includes a work-around for the lack of 4-digit year in the billing records. Unfortunately, this work-around was broken and could lead to badly formatted dates being output. This is now fixed.

The indexer will skip unrecognized records. When the output format is YAML the raw billing line is included as a comment.

Fix an issue where the indexer would output incorrect JSON for multiple days.

nfs

Improve client cache consistency when creating files at rates of 1 kHz or greater.

Fix reported rdev value.

pool

Fix internal copying of files (triggered by the SRM copy command) so they respect the LAN port range.

Fix migration module’s random pool selection so that it does not select pools that are full.

Fix NullPointerException triggered when using migration module’s proportional pool selection and all pools are full.

Make cancelling a migration task more robust against bugs: the cancellation timer is started first so that, if there is a problem, the task will time out.

The rep ls command can calculate per-storage-class statistics. This release fixes a problem where files without a storage-info would trigger an IllegalStateException when calculating these statistics.

Fix sweeping of files without a storage-info. Previously an IllegalStateException was thrown; this could be triggered by the admin interface or by the sweeper.

Fix a problem where the pool would register free space before actually deleting the file from the file system; for a brief moment, the pool would appear to have more free space than is actually available.

Periodically, pools check that dCache’s internal accounting of total and free capacity does not exceed the OS supplied values for the partition (i.e., the output from the df command); if they do then the dCache internal accounting is adjusted to match. The dCache.org team have observed that, after a file is deleted, some seconds may elapse before the corresponding extra free capacity is reported by the OS. To protect pools from this effect, the total and free capacity check is suppressed for 60 seconds after a file was deleted.
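The suppression window can be sketched as follows. This simplified Python model, not the pool's actual code, only tracks free space, and it assumes that "adjusted to match" means clamping dCache's figure to the OS figure. `CapacityChecker` is a hypothetical name, and the injectable clock is there only to make the behaviour easy to test.

```python
import time


class CapacityChecker:
    """Sketch: clamp dCache's free-space accounting to the OS value, except
    shortly after a deletion, when the OS figure may not yet include the freed space."""

    def __init__(self, grace=60.0, clock=time.monotonic):
        self.grace = grace          # seconds to suppress the check after a delete
        self.clock = clock
        self.last_delete = None

    def file_deleted(self):
        self.last_delete = self.clock()

    def check(self, accounted_free, os_free):
        """Return the free-space value dCache should use after this check."""
        if self.last_delete is not None and self.clock() - self.last_delete < self.grace:
            return accounted_free   # suppressed: the OS value may still be stale
        return min(accounted_free, os_free)
```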

spacemanager

If a transient deadlock resolution error occurs, unconditionally retry the operation and do not log it. Such errors are part of RDBMS design when resolving concurrent updates.
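The retry policy amounts to a simple loop. In the Python sketch below, `TransientDeadlockError` is a hypothetical stand-in for the database driver's deadlock error (in PostgreSQL, SQLSTATE 40P01, "deadlock detected"), and `operation` is assumed to run one complete transaction. As described above, the retry is unconditional and the error is not logged.

```python
class TransientDeadlockError(Exception):
    """Stand-in for the RDBMS aborting one transaction to resolve a deadlock."""


def run_with_retry(operation):
    """Run `operation`, silently retrying whenever the database aborts it to break a deadlock."""
    while True:
        try:
            return operation()
        except TransientDeadlockError:
            # The database rolled back this transaction so that a concurrent one
            # could proceed; repeating the whole transaction is the expected response.
            continue
```

A production version might also cap the number of attempts; that is a design choice left out here to match the behaviour described.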

srm

Fix the srm service’s admin interface ls command so that listed jobs can optionally include only those that have failed, have completed or were cancelled.

webadmin

Webadmin periodically creates billing graphs by querying the billing service for the information it needs. Previously, if this query failed (e.g., the billing service was restarted or was not started first) then no further billing graphs were generated and a domain restart was needed. This has been fixed. As a result of this work, the properties poolqplots.refresh-interval, poolqplots.refresh-interval-unit, httpd.plots.pool-queue.refresh and httpd.plots.pool-queue.refresh.unit are no longer supported. The httpd.plots.pool-queue.min-time-step and httpd.plots.pool-queue.min-time-step.unit properties now also cover this configuration.

Changelog from 2.10.7 to 2.10.8

4ef77da: [maven-release-plugin] prepare release 2.10.8
e100b34: chimera: fix unit tests
97fa65a: pool: Fix pool size health check in case of asynchronous release of space
a11e7c6: spacemanager: Suppress transient deadlock resolution errors
7710255: pool: Fix race leading to false positices in pool size health checks
5673c4b: pool: Fix ISE in CacheEntryImpl#toString
d5851e8: pool: Fix ISE in ‘rep ls’
c6280fa: Made PoolQueuePlotData enum java 8 compatible.
e25597b: (2.10) webadmin: fix exit login in billing refresh loop
aa873e3: Removed comments.
0efdd4d: Marked refresh properties as obsolete.
f665f17: pool: Make migration task cancellation more robust
787be04: cell: Log exceptions within the correct cell context
d85629d: pool: Fix pool selection bugs in migration module
cf8768a: admin: Propopage NPE to user
86ca482: chimera: query for inode generation on readdir
c65b7fe: billing: Fix date handling for json and yaml output
261183b: billing: Add file size to request records
61002c8: billing: Improve error message on failur to parse date arguments
c0392c8: billing: Add billing format for PoolHitInfoMessage
1eb71fe: (2.10) webadmin: minor improvements to rrd4j-based pool-queue plots
403a709: pnfsmanager,doors: Add user relative upload directories
48b8e7b: rpm: include logrotate config file
921e569: srm-client: Output file level errors for sync requests
c7dd7c0: pool: Respect LAN port range for internal srmcp transfers
3d51f29: srm: Fix listing of failed, done and cancelled jobs
950f726: cell: Prevent interpreter stack overflow from killing the domain
42d2907: cell: Declassify eval failure as a bug
73ee6d8: [maven-release-plugin] prepare for next development iteration

Release 2.10.7

Fixes affecting multiple services

The nfs and srm services both cache replies from gPlazma to increase the speed of authenticating and identifying the user. Previously, transient errors when communicating with gPlazma were also cached, delaying recovery from such errors. Now, such errors are not cached.
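The distinction between a negative reply (safe to cache) and a communication failure (not safe to cache) can be sketched in Python as below. This is illustrative only: `LoginCache`, `PermissionDenied` and `TransientError` are hypothetical names, and entry expiry, which a real cache would need, is omitted.

```python
class PermissionDenied(Exception):
    """A definite negative reply from the login service: safe to cache."""


class TransientError(Exception):
    """A communication failure such as a timeout: must not be cached."""


class LoginCache:
    def __init__(self, lookup):
        self._lookup = lookup   # function: credential -> reply; may raise either error
        self._cache = {}        # credential -> ("ok", reply) or ("denied", exception)

    def login(self, credential):
        cached = self._cache.get(credential)
        if cached is None:
            try:
                cached = ("ok", self._lookup(credential))
            except PermissionDenied as denied:
                cached = ("denied", denied)   # negative replies are cached too
            # TransientError is deliberately not caught: it propagates and nothing is stored,
            # so the next attempt contacts the service again.
            self._cache[credential] = cached
        kind, value = cached
        if kind == "denied":
            raise value
        return value
```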

Fix a problem where the show pinboard command could trigger the error: java.lang.IllegalArgumentException: number to skip cannot be negative.

Update the comments within the configuration property files to reflect the new configuration property names.

Fix OpenMQ communications to use new configuration property names.

dcap

Fix opening a file when spacemanager is enabled.

ftp

Fix SRM uploads when the ftp door has a static root (the ftp.root property) and this static root is not an ancestor (i.e., a prefix) of the user’s root.

nfs

No longer trigger a NullPointerException when opening the .(parent)() dot-command file in Chimera’s root directory.

The dcache ports command now shows the NFS port based on the new configuration properties.

pool

Fix a race condition when a user starts uploading a file that should go to tape, deletes the file before the write is completed, then cancels the file upload. Previously, there was a risk that the pool would fail to delete the uploaded data.

Fix Berkeley DB usage when an HSM operation is cancelled, to avoid an error logged as:

InterruptedException may cause incorrect internal state, unable to
continue. Environment is invalid and must be closed.

Fix xrootd support for vector read. The problem was discovered with ROOT v6, which reports errors like Single readv transfer is too large.

Fix the dcap mover to use the new pool.mover.dcap.port configuration property.

httpd

Fix the statistics location to use the statistics.location configuration property.

srm

Fix how durations are reported when reporting information about jobs (e.g., the ls admin command).

Fix cleaning of upload directories for expired uploads after service restart.

xrootd

Fix problems when calculating a response to the kXR_set request. If encountered, the problem results in the following being logged, where ‘nnn’ in ‘xrootd-disk-nnn’ is some integer number:

Uncaught exception in thread xrootd-disk-nnn java.lang.IllegalStateException: null

Changelog from 2.10.4 to 2.10.7

be5d6ea: [maven-release-plugin] prepare release 2.10.7
45afb2a: ftp: Fix root path validation for upload directory
9912145: [maven-release-plugin] prepare for next development iteration
b28e2f7: [maven-release-plugin] prepare release 2.10.6
9d1b4a8: xrootd: Upgrade to xrootd4j 1.3.5
d2185fa: [maven-release-plugin] prepare for next development iteration
66b5982: [maven-release-plugin] prepare release 2.10.5
e947426: pool: Prevent interruption of replica deletion during HSM flush
5c3f6b6: pool: Fix HSM cancellation
6bfdfa8: The gPlazma cache used by NFS and SRM caches both positive and negative replies. Unfortunately it also caches failures to communicate with gPlazma. This means that a transient timeout would be cached too, thus increasing the effect of the transient error.
6ddd40b: cells: fix IAE in ‘show pinboard’
c9c84ce: commons: fix TimeUtils nanos to duration string conversion
42e2a90: pool: Fix xrootd vector read limits
0e216e4: chimera: fix NPE on ‘.(parent)()’ for root inode
f5852f9: dcap: fix interaction with Spacemanager
d6a4e2a: srm-client: Fix logback configuration
af8e546: srm: Fix asynchroneous job storage leak
e513209: configuration: adjust references to deprecated properties
235237b: unittests: increase timeouts and fix race in DiskSpaceAllocatorTest
8d749c0: [maven-release-plugin] prepare for next development iteration

Release 2.10.4

Fixes affecting multiple services

In some cases, dCache would log the wrong cause of an internal failure. This has been fixed.

The pinboard has been improved in two ways:

The timestamp now includes AM/PM
The memory and CPU usage has been decreased.

Two problems are fixed that affected those services that make use of the grid trust store (/etc/grid-security/certificates): gplazma, srm, gsiftp, webdav and xrootd:

When a CA is found not to have a signing-policy file, subsequent attempts to discover a signing-policy file for that CA will automatically fail for one hour. Previously, if any CA was removed since the service started then certain CAs (those using UTF-8 encoded subject RDNs) would be placed in this not-found cache by mistake. As a result, users with certificates from such CAs would succeed the first time they used dCache, but all subsequent attempts would fail until the cache expired or the trust store was refreshed.
A minor memory leak is fixed. The amount of memory leaked was proportional to the number of different CAs in use.

Two problems are fixed that affected services providing GSI or plain SSL authentication using the jGlobus library:

dCache now supports TLS v1.2 clients when using GSI or plain SSL using the jGlobus library.
Signal end-of-stream when the remote party sends a CLOSE notification. The remote party normally sends a CLOSE notification before terminating the TCP connection; this is typically sent from server to client. Previously, receiving such a notification was treated as an error.

Fixed thread safety of timestamp formatting, principally used in logging and generating the output of admin commands.

billing

Fixes a bug where the bytes read graph can show inflated values.

httpd

Update service to use httpd.service.loginbroker configuration property rather than the generic dcache.service.loginbroker.

Introduce the httpd.setup configuration property to expose and allow configuration of the existing mechanism to support site-specific customisation of the httpd service batch file.

nfs

Fix permission check when client requests a list of some directory’s contents.

Add support for mounting when the server-side mount-point is a sym-link.

srm

Fix the lifetime reported in the response when a client makes a successful request for a reservation with infinite lifetime.

Update the info admin command to produce scheduler information that is easier to understand and better reflects the new scheduling.

Fix incomplete restore of “ready queued” jobs. After restarting the srm service, any jobs in this state (RQUEUED) would not be handled correctly. This has been fixed.

Allow credentials to be reloaded; in previous versions of 2.10 this was ineffective and resulted in a null-pointer exception being logged instead.

Allow srm to start if there were active bring-online requests when the service was stopped.

Fix the set max ready get command in the admin interface.

Fix a problem where, if a client aborts an upload shortly after initiating it, the server blocks all activity for a timeout period. The SRM will eventually recover but, during this period, it appears deadlocked.

The SRM checks whether there is a door that satisfies a user’s upload request, taking into account the door’s root directory. Doors may be configured so that the upload directory is not visible, preventing any uploads. This problem is now logged so it is more obvious.

chimera

Speed up deletes: Chimera no longer explicitly removes non-zero levels when a file is deleted, as the database handles this automatically.

The chimera shell failed to interpret the creation time flag when choosing which time should be displayed in directory listing.

Changelog from 2.10.3 to 2.10.4

5169a28: [maven-release-plugin] prepare release 2.10.4
72dedf9: srm: Log common misconfiguration preventing upload
4fa394e: common: Fix race condition in AtomicCounter unit test
eca3c50: srm: Fix deadlock like bug
a1f3d96: billing plots: use ‘transferred’ rather than ‘size’ for bytes
dc67a13: httpd: Use service specific property for list of login brokers
8595102: srm: Fix reporting of infinite space lifetime
06d205a: chimera: Fix creation time flag in ls command
a65d0e5: Fix CacheException references in error messages
adbae8e: srm: Align SRM info output with recent changes to scheduler configuration
0d0bbd6: chimera: do not explicitly remove enties in t_level_x
95165df: cells: Improve pinboard
7085fa2: libs: update jglobus to 2.0.6-rc8.d
a75a70d: srm: Fix incomplete restore of “ready queued” jobs
65de44e: Avoid thread-unsafe use of SimpleDateFormatter
916b322: srm: Fix credential store registration
a83e308: srm: Fix restore of bring-online requests on restart
c6c3e8f: srm: Fix set max ready get command
e6245c9: libs: update to jglobus-2.0.6-rc7.d
8b3c169: libs: update to nfs4j-0.8.1
04d1b1e: [maven-release-plugin] prepare for next development iteration
6d94f34: build: Update to findbugs 3

Release 2.10.3

Utilities

Add support for an alternative ACL format for the setfacl command in the chimera command. Now, an ACL may be specified either like USER:7:rlwfx:o:ALLOW:FFFF or like USER:7:i+rlwfx:o:FFFF.
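The changelog for this release mentions a heuristic to detect the ACL format. The Python sketch below shows one possible heuristic, not the one dCache uses: it treats an explicit ALLOW or DENY field as the first form, and a ‘+’ or ‘-’ inside the third (mask) field as the second form. The function name and the returned labels are hypothetical, and no meaning is assumed for the other characters in the mask field.

```python
def detect_acl_format(ace):
    """Guess which of the two setfacl ACE formats `ace` uses (heuristic sketch).

    USER:7:rlwfx:o:ALLOW:FFFF  -> explicit ALLOW/DENY field
    USER:7:i+rlwfx:o:FFFF      -> no ALLOW/DENY field; '+' or '-' in the mask field
    """
    fields = ace.split(":")
    if any(f in ("ALLOW", "DENY") for f in fields):
        return "explicit-type"
    if len(fields) >= 3 and any(c in fields[2] for c in "+-"):
        return "signed-mask"
    raise ValueError(f"unrecognised ACE: {ace!r}")
```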

pnfsmanager

Previously, the automatic schema management for Chimera failed to create an index on the iparent column of t_dirs table. The lack of this index results in deletions becoming progressively slower as dCache stores more files. If your dCache instance was created with dCache v2.3.0 or newer then it is likely affected by this problem. Instances created with an earlier version are not affected. The pnfsmanager service will check for the index when it starts and add the index if it is missing. Alternatively, the check may be performed manually using the dcache database update command.
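The startup check amounts to "create the index if it does not already exist". The Python sketch below shows that pattern. Chimera normally runs on PostgreSQL, so SQLite is used here purely to keep the example self-contained. The index name i_dirs_iparent and the t_dirs.iparent column come from this release's changelog; the rest of the table definition in the test is a simplified assumption.

```python
import sqlite3


def ensure_iparent_index(conn):
    """Create index i_dirs_iparent on t_dirs(iparent) if missing; return True if created."""
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'index' AND name = 'i_dirs_iparent'"
    ).fetchone() is not None
    if not exists:
        conn.execute("CREATE INDEX i_dirs_iparent ON t_dirs (iparent)")
        conn.commit()
    return not exists
```

Running the check at every startup is harmless: once the index exists, the function does nothing.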

pool

This release fixes several issues with the sweeper:

Previously, the sweeper consumed a considerable amount of memory while freeing up space; in the worst case, this could result in the domain running out of memory and restarting; this has been fixed.
The sweeper ls command would make the pool unresponsive while the output is generated; now, running this command no longer prevents the pool from taking on more work.
As the sweeper purge, sweeper free and rep rmclass commands can take a long time to finish, the admin interface does not wait for them to finish; instead, a quick response is returned and the command continues in the background. Previously there was no indication when such commands had finished; now, the pool will log when the command has completed.
The pool logs regular sweeper operations at lower priority; this makes the pool logs less noisy when operating under normal conditions.

poolmanager

In previous versions, poolmanager would log a stack-trace if a user attempts to open a file for reading and no read pool could be found because a unit failed to match. Now the failure is logged as a normal message.

xrootd

The xrootd door now informs billing when a user uses the xrootd protocol to delete a file.

gplazma

The ldap plugin allows a door to request converting a uid or gid back to a username; currently, only the nfs door makes use of this functionality, when generating a directory list. In previous dCache releases, the plugin behaved incorrectly if the uid or gid is unknown; this is now fixed.

Changelog from 2.10.2 to 2.10.3

ab651cb: [maven-release-plugin] prepare release 2.10.3
577f16f: chimera: mark as-run if i_dirs_ipnfsid exists
0f1cea4: solaris: fix solaris package script and add work-around for pkgmk bug
e1dd257: acl: add heuristic to detect acl format
7f9ef6d: chimera: create missing index i_dirs_iparent
0bfaa7e: pool: Resolve high memory usage and other issues in sweeper
befad25: info: Fix HashMap ordering assumption in unit test
08f7865: Make JDK byte code verification bug workaround Java 8 compatible
fc19b7a: gplazma: Fix JVM implementation dependency in unit test
639653f: poolmanager: Suppress stack trace in case of unmatch units
92fe17a: xrootd: Add billing entry on delete
ca6c026: bugfix: fix reverseMap not throwing NoSuchPrincipalExceptions
8e7edbc: [maven-release-plugin] prepare for next development iteration

Release 2.10.2

Cell startup

Fix potential dead-lock when a dCache domain is starting. The problem is triggered when a cell receives a message while initialising. If dCache is started “from cold” then this cannot happen. However, if a domain that contains a well-known cell restarts while the rest of dCache is operational then the problem could be triggered. For example, if a door domain ran out of memory then there is a small risk that the domain would not restart correctly. The problem was present since dCache v2.8.0 and is fixed with this release.

Changelog from 2.10.1 to 2.10.2

44daad1: [maven-release-plugin] prepare release 2.10.2
333a819: cells: Fix deadlock during startup
fa3b2f1: [maven-release-plugin] prepare for next development iteration

Release 2.10.1

cell communication

The decision on which cell communication system to initialise was based on the (old) broker.scheme configuration property and ignored the (new) dcache.broker.scheme configuration property; this has been fixed. Note that configuration using the old broker.scheme continues to be supported for backwards compatibility.

billing

Update to use the billing.text.dir configuration property instead of billingLogsDir; the latter is still supported for backwards compatibility.

dcap

When the client authenticates via GSI, the dcap door will check the user certificate (generally, a voms-proxy certificate) to extract the DN and primary FQAN. To do this, it needs to know where the trusted CAs and trusted VOMS server identities are stored; these are typically /etc/grid-security/certificates and /etc/grid-security/vomsdir respectively. Previously, dcap ignored the dcap.authn.capath and dcap.authn.vomsdir properties. With this release, these properties are honoured.

pool

If a stage request has been forwarded to the nearline subsystem and the file is deleted then the stage request is now cancelled.

Nearline storage configuration is now applied atomically rather than one attribute at a time.

The maximum number of concurrent nearline operations may be adjusted using the hsm set admin command. Previously, if the concurrency was increased then additional activity was only started as more requests were received, ignoring any existing backlog. This has been fixed so increasing the concurrency limit immediately starts more activity if there is a backlog of requests.
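The essence of the fix is that raising the limit must itself trigger a drain of the backlog, rather than waiting for the next submission. The Python sketch below is a minimal model of a concurrency-limited queue, not the pool's nearline code; `NearlineQueue` and its methods are hypothetical names.

```python
from collections import deque


class NearlineQueue:
    """Sketch: run at most `limit` nearline requests at once, queueing the rest."""

    def __init__(self, limit):
        self.limit = limit
        self.backlog = deque()
        self.active = set()

    def _start_more(self):
        # Move requests from the backlog to active until the limit is reached.
        while self.backlog and len(self.active) < self.limit:
            self.active.add(self.backlog.popleft())

    def submit(self, request):
        self.backlog.append(request)
        self._start_more()

    def finished(self, request):
        self.active.discard(request)
        self._start_more()

    def set_limit(self, limit):
        self.limit = limit
        self._start_more()  # the fix: use the new capacity on the existing backlog immediately
```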

This release reduces the CPU processing required when starting a third-party HTTPS transfer.

When transferring files to some third-party service using HTTPS, the remote host must provide a trusted certificate. The trust material (CA certificates and CRLs) that are necessary for this are now refreshed without requiring the pool to be restarted.

poolmanager

Stage and pool-to-pool transfers, triggered by poolmanager, are monitored by poolmanager querying the source pool to check if the request is still active. In previous versions of dCache, for n transfers from a pool, this monitoring generated O(n*n) network traffic. In particular, poolmanager’s monitoring of bulk staging requests can result in sufficient network traffic between the pool and poolmanager that other dCache activity is impacted. With this release, the monitoring traffic is reduced in general, most prominently for large concurrent activity, such as bulk stage requests.

The rc destroy command has been removed. The implementation was incomplete and the operation was dangerous.

If the door resubmits a request to poolmanager that triggers staging of a file, poolmanager uses additional memory while waiting for the stage to complete. For bulk staging, this additional memory usage can be significant, potentially resulting in poolmanager exhausting the available memory. This release updates poolmanager so a door resubmitting a stage request does not increase the memory footprint of poolmanager.

The poolmanager has an incomplete feature called ‘clumping’. This is where poolmanager can handle multiple select-pool-for-read requests for the same file as a single request. There is a hard-coded clumping limit that prevents too many requests from being handled together; this allows poolmanager to choose an alternative pool when staging or replicating to make the file available. Once this limit is reached, subsequent open requests for the same file will fail in poolmanager. Doors handle this failure differently: the xrootd and webdav doors propagate this failure back to the client, while the dcap and ftp doors resubmit the request to poolmanager. This release increases the clumping limit from 1 to 20; it also changes the behaviour of the doors so all will retry the pool selection if the clumping limit is reached. This fixes the ‘request clumping limit reached’ failures.
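The Python sketch below illustrates the clumping limit described above. It is a simplified model, not poolmanager's implementation: concurrent read requests for the same file join one "clump" that is answered together once the file is available, and a request beyond the limit is rejected. `PoolSelector` and `ClumpingLimitReached` are hypothetical names.

```python
class ClumpingLimitReached(Exception):
    """Raised when a file already has `limit` requests waiting on it."""


class PoolSelector:
    """Sketch: merge concurrent read requests for the same file into one clump."""

    def __init__(self, limit=20):
        self.limit = limit
        self.clumps = {}  # file id -> requesters waiting for the same pool selection

    def request(self, file_id, requester):
        waiting = self.clumps.setdefault(file_id, [])
        if len(waiting) >= self.limit:
            raise ClumpingLimitReached(file_id)
        waiting.append(requester)

    def file_available(self, file_id):
        """Answer every requester in the clump at once and forget the clump."""
        return self.clumps.pop(file_id, [])
```

With a limit of 1, the second concurrent request for a file already fails. With a limit of 20, many more readers share one selection, and a door that catches `ClumpingLimitReached` and retries, as all doors now do, avoids surfacing the error to the client.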

This release fixes a race condition between two clients opening the same file. Previously, if poolmanager started processing the second open request just as it was finishing processing the first, there was a tiny chance that the second request was lost.

srm

The srm service will check the user certificate (generally a voms-proxy certificate) to extract the DN and primary FQAN. To do this, the srm service needs to know where the trusted CAs and trusted VOMS server identities are stored; these are typically /etc/grid-security/certificates and /etc/grid-security/vomsdir respectively. Previously the SRM ignored the srm.authn.capath property and there was no way to configure the trusted VOMS server identity path. With this release, the srm.authn.capath is honoured and the srm.authn.vomsdir property is introduced.

When the srm service sends an asynchronous response to the client, the client must make subsequent queries to check whether the request has completed. Each response includes a hint suggesting how long the client should wait before querying for a progress update. Prior to dCache v2.9, the srm service provided this hint, with each subsequent response suggesting a longer wait time. With v2.9 this feature was lost, resulting in SRM clients hammering the srm service. This has been fixed.

When the SRM client wants to initialise a download (using the srmPrepareToGet call) for a file that is already online there is a very small risk that the srm service will hang and provide no reply. This has been fixed.

When processing a srmBringOnline request, the server checks there is at least one transfer protocol that both the client and server support. In previous versions, this was checked when processing each file in the srmBringOnline request, failing the file if no common transfer protocol was found. This has been updated so the transfer protocols are checked once for the entire request. This new behaviour matches how the transfer protocols are checked for other types of SRM requests.

webdav

The door now uses the webdav.root configuration property instead of webdavRootPath; the latter continues to be supported for backwards compatibility.

The webdav door behaviour can now be controlled in more detail, in particular thread usage and connection handling. This allows sites to tune their door to match hardware capabilities and client activity. The following configuration properties are introduced:

webdav.limits.acceptors
webdav.limits.idle-time
webdav.limits.idle-time.unit
webdav.limits.low-resource-idle-time
webdav.limits.low-resource-idle-time.unit
webdav.limits.handshake-time
webdav.limits.handshake-time.unit
webdav.limits.backlog
webdav.limits.threads.max
webdav.limits.threads.min
webdav.limits.threads.idle-time
webdav.limits.threads.idle-time.unit
webdav.limits.queue-length
webdav.limits.graceful-shutdown
webdav.limits.graceful-shutdown.unit

See the webdav.properties defaults file for more details.

If the client reads or writes through the door and the door detects a lack of progress then it will cancel the transfer. Previously, the door did not kill the mover, leaving it to time out. With this release, the door will kill the mover when it cancels the transfer.

The door is updated so that, when a client initiates a third-party transfer, the distinguished name and primary FQAN come from the gplazma login result rather than being generated locally. This change means the node hosting the webdav door does not require a vomsdir setup.

xrootd

Previously the xrootd batch service required the user to specify a non-empty pool queue for uploading. This requirement has been relaxed.

Changelog from 2.10.0 to 2.10.1

b14aa7f: [maven-release-plugin] prepare release 2.10.1
33a7ec7: srm: srm,dcap: Use configured vomsdir and capath
cc0c6d6: build: add work-around for JDK bug and PowerMock
85e97f7: pool: Fix concurrency settings for script nearline storage
63cbc4e: poolmanager: Fix ‘request clumping limit reached’ failures
ca2afba: poolmanager: Avoid request leak
e970b90: srm: Fix race condition in srmPrepareToGet
b85329b: webdav: Fix certificate handling for HTTP third-party copy
9051f82: poolmanager: Fix excessive p2p and stage alive checks
04db77b: srm: Restore estimated wait time update
f0951c1: webdav: Kill mover in case transfer times out
87ad152: webdav: Make threading and TCP connection limits configurable
ce24ff6: pool: Fix several minor issues in nearline storage subsystem
4a2b9b5: srm,dcap: Use configured vomsdir and capath
3d3263f: Don’t use legacy properties
14dbed7: [maven-release-plugin] prepare for next development iteration

Release 2.10.0

Changes affecting several services

Libraries

Updated Spring to v4.0.5, AspectJ to v1.8.1, SMC to v6.3.0, Jetty to v8.1.15.v20140411, Wicket to v6.16.0, Guava to v17.0, HikariCP to v1.3.9, Netty to v3.9.2.Final, BerkeleyDB to v6.0.11, Apache commons-compress to v1.8.1 and Scala to v2.11.1.

Logging FTP client disconnects

The FTP protocol contains commands that a client uses to indicate that it is about to disconnect from the server. Not all clients use this; some simply disconnect the network connection, which is unfortunate as dCache cannot know whether or not this indicates a problem with the client.

The ftp door logs when the client disconnects unexpectedly (without using the QUIT command). Previous versions of dCache also logged this in pnfsmanager and spacemanager but, with v2.10, dCache no longer logs these events in either service; logging in the door continues.

Messaging

Most messages sent from one dCache cell to another request some information or ask that some action take place. These messages include an expiry date, which provides a hint of when the response message will no longer be of interest. For example, when the srm sends a message to gplazma (to log the user in), the srm will fail the request if it doesn’t receive a response from gplazma within some timeout period (3 minutes, by default). Therefore, after 3 minutes, the srm is no longer interested in the reply from gplazma.

Services can make use of this information and reject messages that they cannot answer in time. This is most useful when dCache is recovering from being overloaded: dCache can recover faster because it can discard some of the backlog of work.

In previous versions of dCache, this behaviour was limited to pnfsmanager and the pool’s migration module. With dCache v2.10, this behaviour applies to all services.
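The expiry-based rejection described above can be sketched as follows. This is a minimal illustration, not dCache code: the Message class, the process_backlog function and the fixed per-message processing cost are hypothetical, and dCache itself is written in Java. Each queued message carries an absolute expiry time, and the service skips any message whose reply would arrive after the sender has given up.

```python
import time
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical cell message carrying an absolute expiry time (seconds)."""
    payload: str
    expires_at: float


def process_backlog(messages, now=None, cost_per_message=0.0):
    """Answer messages that can still be answered in time; reject the rest.

    cost_per_message is an assumed, fixed processing time used to decide
    whether a reply would arrive after the sender stopped waiting.
    """
    now = time.time() if now is None else now
    answered, rejected = [], []
    for msg in messages:
        if now + cost_per_message > msg.expires_at:
            # The sender is no longer interested; skip the work entirely.
            rejected.append(msg.payload)
        else:
            answered.append(msg.payload)
            now += cost_per_message  # time spent handling this message
    return answered, rejected
```

For example, a service working through a backlog at time 200 would reject a login request that expired at time 100 and still answer one that expires at time 400. Skipping expired work is what lets an overloaded service catch up faster.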

Certain messages request the current status of a service, and the results of these queries often contain similar information. Previously, for certain pieces of information, new String objects were created for each message, despite there being relatively few distinct values. With 2.10, these same-valued Strings are represented by the same object, reducing the memory footprint of services, especially webadmin.
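The idea of sharing one object per distinct string value can be illustrated in Python with sys.intern. dCache is written in Java (the changelog entry “Intern several strings on message deserialization” refers to that code), so this is an analogy rather than dCache’s implementation; the deserialise helper is hypothetical.

```python
import sys


def deserialise(raw_fields, intern=True):
    """Hypothetical decoder: turn raw bytes into strings.

    With intern=True, equal strings share a single object, so many messages
    carrying the same pool name or state cost memory for only one copy.
    """
    decoded = [field.decode("utf-8") for field in raw_fields]
    return [sys.intern(s) for s in decoded] if intern else decoded
```

After interning, two fields with the same value are the same object, which is the property that reduces the memory footprint of a service such as webadmin that holds many status replies at once.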

Kerberos

dCache will now use the kerberos configuration in /etc/krb5.conf by default. The existing configuration is still supported, but the system configuration file is now taken as the default.

http third-party transfers

dCache v2.10 sees greatly improved support for HTTP third-party transfers. These are transfers between dCache and some other HTTP server, so that the traffic does not go through the client.

In keeping with gsiftp, http third-party transfers are initiated on the pool. dCache now supports both pull and push transfers. In a pull transfer, the pool fetches a file from the remote server using the HTTP GET method. In a push transfer, the pool uploads a file to the remote server using the HTTP PUT method.

A transfer can use either an unencrypted transport (http://...) or make use of SSL/TLS encryption (https://...). If an SSL/TLS transport is used then the pool can use an X.509 credential when establishing the SSL/TLS connection.

Both transfer directions (push and pull) will attempt to check the integrity of the transferred data. For pull requests, all data integrity information comes from the headers supplied in the GET response. For push requests, the pool will make a subsequent HEAD request and use the supplied information to verify the data was sent correctly.

For push requests, if a PUT request is successful but the subsequent HEAD request reveals data corruption then the pool will attempt to delete the file with an HTTP DELETE request. If this DELETE fails then the error message is updated accordingly. For pull requests, no additional cleanup steps are needed.
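The push sequence above (PUT, verifying HEAD, DELETE on corruption) can be sketched as plain control flow. This is a hypothetical illustration, not the pool’s code: the put, head and delete callables stand in for real HTTP requests, and only the content-length and checksum comparison described in this section is modelled.

```python
def push_and_verify(put, head, delete, data, local_checksums):
    """Sketch of a push transfer: PUT, then HEAD to verify, DELETE on corruption.

    put(data) -> HTTP status; head() -> (content_length, {algorithm: value});
    delete() -> HTTP status. All three are hypothetical stand-ins.
    """
    status = put(data)
    if not 200 <= status < 300:
        return f"PUT failed with status {status}"
    length, remote = head()
    corrupt = length != len(data) or any(
        local_checksums[alg] != value
        for alg, value in remote.items() if alg in local_checksums)
    if not corrupt:
        return "ok"
    # The remote copy is bad: try to remove it and report what happened.
    delete_status = delete()
    if 200 <= delete_status < 300:
        return "corrupt transfer; remote copy deleted"
    return f"corrupt transfer; DELETE failed with status {delete_status}"
```

Passing the HTTP operations in as callables keeps the sketch testable without a network and makes the ordering of the three requests explicit.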

File checksum values are discovered via the RFC 3230 extension to HTTP. Unfortunately, this is not widely supported by HTTP servers; therefore, dCache supports two levels of data integrity checking: weak and strong. Weak data integrity is satisfied when

the remote service and the dCache server agree on the content length,
none of the checksums supplied by the remote server (if any) disagree with the checksums known by dCache.

Strong data integrity is satisfied when, in addition to satisfying the weak conditions:

there is at least one checksum supplied by the remote server that agrees with a dCache-local checksum.

Weak data integrity can be satisfied by any HTTP server, whereas strong data integrity requires an HTTP server with RFC 3230 support. A transfer satisfying only weak integrity may not have had any checksum values compared; strong integrity requires that at least one checksum exists and matches.
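The weak and strong conditions can be written down as a small classifier. This is a sketch under stated assumptions, not dCache’s implementation: parse_digest reads an RFC 3230 Digest header value (algorithm=value pairs separated by commas, with case-insensitive algorithm names), and integrity_level applies the two rules listed above. Checksum values are compared as exact strings, which is a simplification.

```python
def parse_digest(header):
    """Parse an RFC 3230 Digest header value, e.g. 'ADLER32=03da0195, MD5=...'.

    Algorithm names are case-insensitive, so they are lower-cased. Only the
    first '=' splits a pair, so base64 padding in values is preserved.
    """
    checksums = {}
    for item in header.split(","):
        alg, sep, value = item.strip().partition("=")
        if sep:
            checksums[alg.strip().lower()] = value.strip()
    return checksums


def integrity_level(local_length, remote_length, local_sums, remote_sums):
    """Return 'strong', 'weak' or 'failed' following the rules above."""
    if local_length != remote_length:
        return "failed"  # weak condition 1: content lengths must agree
    common = [alg for alg in remote_sums if alg in local_sums]
    if any(remote_sums[alg] != local_sums[alg] for alg in common):
        return "failed"  # weak condition 2: no supplied checksum may disagree
    # Strong additionally needs at least one checksum that was compared and agreed.
    return "strong" if common else "weak"
```

A server that sends no Digest header, or only checksums for algorithms dCache does not know, can at best produce a weakly verified transfer.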

Third-party HTTP transfers also allow the client to specify zero or more (arbitrary) headers when requesting a file be transferred. This allows the client to customise the HTTP request the pool sends to the remote server. Such headers could include authorisation information or some other information to steer how the remote server handles the HTTP request.

Third-party transfers may be initiated by the webdav service and the srm service. See the following sections on those services for details on how to trigger third-party copying.

Changes to services

ssh1

The ssh v1 admin interface has been removed; ssh v2 is now the only way to connect to the admin interface.

webadmin

The active transfers page has been reimplemented. The new implementation focuses on faster rendering, faster issuing of commands, and a reduced memory footprint.

webdav

The WebDAV door uses the StringTemplate language for generating directory listings for web-browsers and for error pages. With previous versions of dCache, a missing or unreadable template file resulted in a NullPointerException. With dCache v2.10, there is a hard-coded “broken dCache” page that describes the problem and suggests how to fix it.

dCache v2.9 saw the introduction of an activity log that recorded SRM and FTP commands and their corresponding responses in the log file Domain.access. This allows easy investigation of client-specific problems. With v2.10, support has been extended to include HTTP/WebDAV requests. The access file now includes a single line for each HTTP/WebDAV request. Logged information includes the mapped user, the request method, the URL, the numerical response code and explanation, the client IP address, and the user-agent.

The default HTML rendering of directory listings and error messages has been updated to use the Bootstrap framework, which supports a rich set of features. This new default provides a deliberately toned-down presentation. The default page now contains two links for each file: one that hints the browser should show the content and another that hints the browser should download the file. As before, admins may customise the rendering of directory listings and error messages. To support clients with poor or no Internet connection, all required JavaScript libraries and Bootstrap files for the default layout are included with dCache. This release also fixes a rendering bug in the breadcrumbs.

third-party transfers

The webdav door now supports requesting third-party file transfers. This is an extension to the WebDAV COPY command, which is normally limited to internal copies. A client may request a file be transferred to a remote site by specifying the remote location as the Destination HTTP request header. Currently supported transports are GridFTP (destination URI starts gsiftp://) and HTTP. For HTTP both plain (http://) and with SSL/TLS (https://) are supported.

The Credential HTTP request header may be used by the client to describe whether or not the pool should use a delegated credential when transferring the file. The accepted values are none and gridsite. If none then no credential is to be used; if gridsite then a delegated credential is to be used. If the Credential header isn’t specified then a transport-specific default is used: for gsiftp and https transports the default is gridsite; for http transport the default is none. Some combinations are not supported: gsiftp with none and http with gridsite are not supported.
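The defaulting and validity rules for the Credential header can be summarised in a few lines. This is a hypothetical sketch of the rules stated above, not the door’s code; the function name and the choice to raise ValueError for unsupported combinations are assumptions.

```python
# Transport-specific defaults and unsupported combinations, as described above.
DEFAULT_CREDENTIAL = {"gsiftp": "gridsite", "https": "gridsite", "http": "none"}
UNSUPPORTED = {("gsiftp", "none"), ("http", "gridsite")}


def resolve_credential(destination, credential_header=None):
    """Decide which credential mode the pool should use for a WebDAV COPY."""
    scheme = destination.split("://", 1)[0].lower()
    if scheme not in DEFAULT_CREDENTIAL:
        raise ValueError(f"unsupported transport: {scheme}")
    mode = credential_header if credential_header is not None else DEFAULT_CREDENTIAL[scheme]
    if mode not in ("none", "gridsite"):
        raise ValueError(f"unknown Credential value: {mode}")
    if (scheme, mode) in UNSUPPORTED:
        raise ValueError(f"{scheme} with Credential: {mode} is not supported")
    return mode
```

For example, a COPY to an https:// destination without a Credential header resolves to gridsite, so the client must have delegated a credential beforehand.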

If a delegated credential is to be used, the client must delegate a credential to dCache using the delegation service, which is part of the srm service. If no useful certificate could be found, the webdav door will redirect the client to itself and include the X-Delegate-To HTTP header in its response. This header contains a space-separated list of delegation endpoints.

For ‘http’ and ‘https’ transfers, the client may choose whether to require weak or strong verification; see above for a definition of these. The RequireChecksumVerification header controls this behaviour; if this header has a value ‘true’, strong verification is required. If ‘false’, weak verification is sufficient. If the header is not specified, a configurable default value is used.

Again, for ‘http’ and ‘https’ transfers, the client may specify additional or replacement HTTP headers that the pool should use when making the transfer. These are specified as request headers whose names start with TransferHeader; the pool removes this prefix and uses the remainder as the header name. For example, to have the pool use basic authentication with userid ‘Aladdin’ and password ‘open sesame’, the client would add the following header to its request:

TransferHeaderAuthorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
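The header value in the example is ordinary HTTP Basic authentication, which is carried in the standard Authorization header. The sketch below shows how a client could build it and how the TransferHeader prefix is stripped. Both functions are hypothetical illustrations of the rule described above, and the prefix match is case-sensitive for simplicity even though HTTP header names are case-insensitive.

```python
import base64


def basic_authorization(user, password):
    """Build the value of an HTTP Basic Authorization header (RFC 7617)."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def transfer_headers(copy_request_headers, prefix="TransferHeader"):
    """Headers the pool would send: strip the prefix, drop everything else.

    Simplification: the prefix match is case-sensitive.
    """
    return {name[len(prefix):]: value
            for name, value in copy_request_headers.items()
            if name.startswith(prefix) and len(name) > len(prefix)}
```

Applied to a COPY request carrying Destination and TransferHeaderAuthorization, only the Authorization header is forwarded to the remote server.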

Once the transfer is accepted, the client will receive periodic progress information. If the client closes the TCP connection, the transfer is cancelled.

alarms

The alarms system, through the logback library, has always supported sending emails if an alert is found. However, enabling this feature required editing an XML file. With dCache v2.10, the standard dCache configuration now includes properties that allow the admin to configure this behaviour through dCache’s normal configuration properties.

space-manager

If a space reservation is over-subscribed, its free space is shown as a negative number. This broke the layout, leaving that line badly formatted. This is fixed with dCache v2.10.

dcap

The dcap door has been made more robust when processing unexpected replies from other dCache services.

admin

The prompt shown in the admin interface has changed from green to bold. This is in keeping with other dCache CLIs.

For many services, if the ‘info’ command was called after a cell started accepting commands but before initialisation had completed, a NullPointerException was produced internally and no information was provided. This has been fixed.

billing

Billing messages are no longer sent to the pinboard, so the ‘show pinboard’ admin command is now limited to showing information about the service itself.

Some of the information about a transfer is written as a colon-separated list of items. In previous versions of dCache, the protocol version and client hostname were not separated, resulting in entries like:

{Http-1.1zitpcx6184.desy.de...}

With dCache v2.10, this has been fixed for the http and xrootd transfers, so entries now are correctly separated:

{Http-1.1:zitpcx6184.desy.de...}

info

The info service, which collects and collates information from various dCache components, now supports presenting the gathered data in different formats. Previous versions of dCache only supported a dCache-specific XML format. With dCache v2.10 two additional formats are available.

There are two ways the client can choose the format. The HTTP client can specify the MIME type in the “Accept” request header; JSON-specific clients may do this automatically. Alternatively, the client can append “?format=FORMAT” to the URI, where FORMAT is the desired format; e.g., http://httpd-service.example.org:2288/info/summary/pools?format=json. There are four supported MIME types (application/xml, text/xml, application/json and text/x-ascii-art) and three supported values for ‘?format=’ (xml, json and pretty). The default format continues to be XML.
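Both ways of selecting the format can be expressed with the Python standard library. This sketch only builds the request; the host name is the example host from the text, and the info_request helper is hypothetical. Passing the result to urllib.request.urlopen would send it to a real info service.

```python
from urllib.parse import urlencode
from urllib.request import Request

INFO_BASE = "http://httpd-service.example.org:2288/info"  # example host from the text


def info_request(path, fmt=None, mime=None):
    """Build (but do not send) a request for the info service.

    fmt appends ?format=... to the URI; mime sets the Accept header.
    Either one selects the output format.
    """
    url = f"{INFO_BASE}/{path.lstrip('/')}"
    if fmt is not None:
        url += "?" + urlencode({"format": fmt})
    headers = {"Accept": mime} if mime is not None else {}
    return Request(url, headers=headers)
```

The query-parameter form is convenient from a browser, while the Accept header is what generic JSON clients typically send on their own.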

srm

dCache v2.10 sees a major overhaul in how the srm controls and limits its interaction with other dCache services. The following sections describe these changes.

Background

The SRM protocol has the concept of a client’s request not being processed immediately, with the request being queued prior to any work being conducted. This is most apparent for asynchronous requests, where the client reconnects periodically to discover what progress, if any, has taken place. The following also applies for synchronous requests (where the client waits to learn the result of the query); however, the client is oblivious of these details.

In general, a successful request will start out in a “queued” state. After dCache starts processing the request, the request will have an “in-progress” state. For transfer requests (GET or PUT), the request will be in a “ready” state when there is a TURL for the client to use, and in a “success” state after a successful transfer. For non-transfer requests (LS, BRING-ONLINE, …) the request will go into the “success” state once dCache finishes processing it, skipping the “ready” state.

To answer most client-issued requests, the srm service must trigger some activity in other dCache services. For the most part, the srm service simply issues requests to other services, collects the responses and converts them into a corresponding SRM reply.

The srm service has several execution pools to throttle processing of requests. These limits are applied to individual files so, for example, a single request to read 10 files is scheduled as if it were 10 requests, each reading a single file.

Prior to dCache v2.10, the srm throttled itself by limiting the number of requests it can process concurrently. Once this limit is reached, additional requests are placed in a queue with a limited capacity. Once this queue limit is reached, subsequent requests are rejected. The effect is to throttle the frequency of messages being sent.

In practice, a single thread could easily saturate the rest of dCache, which meant that this was not a particularly effective way to throttle the srm service.

Furthermore, the queue limits would also affect requests that had already begun processing. Thus under high load, one could observe requests being failed even if they had already started processing. At least in theory, this could lead to a starvation problem, in which a lot of processing could take place while no requests would ever succeed.

New approach

Rather than rate-limiting the delivery of internal messages, dCache v2.10 provides the ability to limit the number of requests in each of the different states mentioned above (“queued”, “in-progress” and “ready”). These limits take “in flight” requests into account.

There are now three types of limit: max-requests, max-inprogress and max-transfers.

The max-requests limit controls the total number of requests that are queued or being actively worked on. Once this limit is reached, any more requests from an SRM client are rejected.

This limit allows the srm service to protect against running out of memory. Setting a value too low will result in the srm not fully utilising the available memory; setting it too high will risk running out of memory under heavy load.

The max-inprogress limit controls the concurrent activity within dCache; once this limit is reached, subsequent requests are either queued or rejected, depending on the max-requests limit.

This allows the srm service to stop itself from overloading core dCache services when processing SRM requests. The max-inprogress limit also controls the maximum number of concurrent staging requests for GET and BRING-ONLINE requests. For COPY requests, it sets the maximum number of concurrent transfers.

Setting the value too low will result in the srm artificially limiting its performance, as requests will needlessly spend time queued. Setting the value too high will risk overloading core components when the srm service is under heavy load.

The max-transfers limit controls the number of TURLs handed out to clients. Once this limit is reached, the srm keeps subsequent requests that have a TURL in the “in-progress” state, at least from the point of view of the client. Internally, such requests are in the “ready queued” state and no longer count towards the max-inprogress limit. Once the SRM client completes transfers, those requests that have a TURL but are in the “ready queued” state can become “ready”.

This limit is mostly to allow dCache to protect pools by limiting the number of concurrent transfers. It will also provide some protection for the transfer doors (typically ftp) and poolmanager.

Since this limit type affects only transfers, setting the value too low will result in artificially poor transfer rates (with client requests spending a large amount of time in the “in-progress” state despite having a TURL) while other SRM requests are processed quickly. Setting the value too high risks overloading pools when dCache is under heavy load.
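The interaction of the three limits for one request type can be modelled as a small state machine. This is a hypothetical model for illustration, not the srm scheduler: the class and method names are invented, and counting every unfinished request (including “ready” ones) towards max-requests is an assumption.

```python
from collections import deque


class LimitScheduler:
    """Hypothetical model of one srm request type and its three limits."""

    def __init__(self, max_requests, max_inprogress, max_transfers):
        self.max_requests = max_requests
        self.max_inprogress = max_inprogress
        self.max_transfers = max_transfers
        self.queued = deque()
        self.inprogress = set()
        self.ready_queued = deque()  # has a TURL, waiting for a transfer slot
        self.ready = set()           # TURL handed out to the client

    def _total(self):
        # Assumption: every unfinished request counts towards max-requests.
        return (len(self.queued) + len(self.inprogress)
                + len(self.ready_queued) + len(self.ready))

    def _schedule(self):
        # Start queued work while below max-inprogress.
        while self.queued and len(self.inprogress) < self.max_inprogress:
            self.inprogress.add(self.queued.popleft())
        # Hand out TURLs while below max-transfers.
        while self.ready_queued and len(self.ready) < self.max_transfers:
            self.ready.add(self.ready_queued.popleft())

    def submit(self, req):
        if self._total() >= self.max_requests:
            return "rejected"
        self.queued.append(req)
        self._schedule()
        return "accepted"

    def turl_available(self, req):
        # Leaving in-progress frees a max-inprogress slot for queued work.
        self.inprogress.remove(req)
        self.ready_queued.append(req)
        self._schedule()

    def transfer_done(self, req):
        self.ready.remove(req)
        self._schedule()

    def client_state(self, req):
        """State as seen by the SRM client: ready-queued looks like in-progress."""
        if req in self.ready:
            return "ready"
        if req in self.inprogress or req in self.ready_queued:
            return "in-progress"
        if req in self.queued:
            return "queued"
        return "unknown"
```

With max-requests 3, max-inprogress 1 and max-transfers 1, a fourth submission is rejected. A second request that obtains a TURL while the first is still transferring is reported to the client as “in-progress” until the first transfer completes.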

There are request-specific properties that may be configured for each of these three types of limit. These have names like srm.request.TYPE.{max-requests, max-inprogress, max-transfers}, where TYPE is one of get, put, ls, bring-online, copy or reserve-space.

For example:

srm.request.bring-online.max-inprogress controls the maximum number of concurrent staging requests.
srm.request.get.max-transfers controls the maximum number of concurrent downloads (maximum number of TURLs handed out at any time).
srm.request.ls.max-requests controls how many file metadata and directory listing queries to allow (either actively being processed or queued) before rejecting new requests.

The max-requests limit properties (srm.request.get.max-requests, srm.request.put.max-requests, etc.) all take a common default value controlled by the srm.request.max-requests property, currently 10,000. Note that this value applies to each SRM request type (GET, PUT, BRING-ONLINE, …) independently; therefore, the overall maximum number of requests is six times the configured value.
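The “six times” figure follows directly from the six request types. The sketch below expands the srm.request.TYPE.max-requests property names and sums their values; the helper and the override example are illustrative only.

```python
REQUEST_TYPES = ["get", "put", "ls", "bring-online", "copy", "reserve-space"]


def max_requests_properties(default=10_000, overrides=None):
    """Map each srm.request.TYPE.max-requests property to its effective value."""
    overrides = overrides or {}
    return {f"srm.request.{t}.max-requests": overrides.get(t, default)
            for t in REQUEST_TYPES}


limits = max_requests_properties()
total = sum(limits.values())  # 6 types x 10,000 = 60,000
```

Lowering one type, for example ls to 500, reduces the overall ceiling to 50,500, since each type is limited independently.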

The max-inprogress properties (srm.request.get.max-inprogress, srm.request.put.max-inprogress, etc.) all take individual default values. The current default values are to allow concurrent processing of 10,000 BRING-ONLINE requests, 1,000 GET and COPY requests, 50 PUT requests, 50 LS requests, and 10 RESERVE-SPACE requests.

The two max-transfers properties (srm.request.get.max-transfers and srm.request.put.max-transfers) take pre-2.10 srm configuration properties as default values.

As with prior versions of dCache, if a request suffers a transitory failure, the srm will retry the operation after waiting a while. Prior to 2.10, such retries were treated specially. With 2.10, the retry is treated as if it were a fresh request, so it is subject to the three limits described above.

srm protecting itself

The srm service continues to protect against trying to process too many requests concurrently. As with the pre-2.10 srm, this is achieved by scheduling activity using a set of threadpools. These are controlled by the srm.request.TYPE.threads configuration properties, where TYPE is one of get, bring-online, put, copy, ls or reserve-space. These all take srm.request.threads as a default value. The threadpool sizes also accept pre-2.10 configuration (for example, srm.request.put.threads takes the value of srmPutReqThreadPoolSize as a default value).

Notes for sites upgrading

We have tried to reduce the impact of this change by continuing to support some pre-2.10 configuration. However, since this change reflects a new processing model, sites should expect that some configuration changes are needed. Admins are therefore recommended to familiarise themselves with the new scheduler options and adjust their configuration accordingly.

The dcache check-config command will describe which configuration properties are no longer supported, and so may indicate the need for alternative configuration.

HTTP third-party transfers

The srm service now supports http and https transfers in both pull (the dCache pool makes an HTTP GET request) and push (the pool makes an HTTP PUT request) operations. The SRM client can steer the transfer by supplying ExtraInfo arguments, a set of arbitrary key-value pairs that the client can specify when making the request. The currently supported ExtraInfo keys are:

verified: if set, takes a value of either true or false and controls whether weak or strong verification is required. If set to false then weakly verified transfers are successful; if set to true then a transfer must be strongly verified to be successful.
header-*: all ExtraInfo elements whose keys start with “header-” are converted into HTTP headers. The HTTP header name is the ExtraInfo key without the initial “header-”, and the HTTP header value is the ExtraInfo value. These HTTP headers are used when the pool makes a request.
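The ExtraInfo rules above amount to a simple split of the key-value pairs. This is a hypothetical sketch, not the srm code: the function name, the default used when verified is absent, and raising ValueError for values other than true or false are assumptions.

```python
def extra_info_to_transfer(extra_info, default_verified=False):
    """Split SRM ExtraInfo into (require_strong_verification, http_headers).

    default_verified is an assumed fallback when 'verified' is not supplied.
    Keys that are neither 'verified' nor 'header-*' are ignored here.
    """
    verified = default_verified
    headers = {}
    for key, value in extra_info.items():
        if key == "verified":
            if value not in ("true", "false"):
                raise ValueError(f"verified must be true or false, not {value!r}")
            verified = value == "true"
        elif key.startswith("header-") and len(key) > len("header-"):
            headers[key[len("header-"):]] = value
    return verified, headers
```

For example, ExtraInfo of verified=true and header-Authorization=Bearer abc requires strong verification and makes the pool send an Authorization header to the remote server.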

If the SRM client delegates a credential as part of the GSI connection, this credential is used to authenticate when the pool makes an SSL/TLS-based connection to the remote server. If the client doesn’t delegate then dCache will attempt the transfer without any client credential.

See the above section on HTTP/HTTPS third-party transfers for more details.

Other srm improvements

The dCache srm implementation places client requests into one of a number of internal states. Each of these internal states broadly corresponds to a single SRM state, as described above (“queued”, “in-progress”, …). With dCache v2.10 this correspondence has been made more explicit and, as a result, some incorrect mappings have been fixed.

Previously, the srm would log “illegal state transition” messages if a user aborted a request involving more than one file. This has been fixed.

If the srm service sends a message to some other service due to some client request (e.g., prepareToPut), and the response from this message takes too long, the srm service will fail the client request. It could happen that, after the client request has been failed, the srm service receives the reply message, stating the requested operation was successful. If this happens then the client will see dCache in an inconsistent state: the srm service told the SRM client its request failed, yet its request resulted in successful activity. In the case of preparing for uploading a file, pinning a file and creating a space-reservation, the srm service can safely “undo” the effect of the message by removing the temporary upload directory, unpinning the file, or releasing the space-reservation. This behaviour has been added with dCache v2.10.
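The compensating behaviour described above can be sketched as a reply handler that checks whether the client request has already been failed. Everything here is hypothetical (the operation names, the request dictionary and the on_reply function); it only illustrates the rule that a late, successful reply for an undoable operation triggers the matching undo action.

```python
import time

# Undo actions for the operations this section says can be reversed safely.
UNDO_ACTIONS = {
    "prepare-upload": "remove temporary upload directory",
    "pin": "unpin file",
    "reserve-space": "release space reservation",
}


def on_reply(request, reply_ok, now=None):
    """Handle a reply that may arrive after the client request already failed."""
    now = time.time() if now is None else now
    if now <= request["deadline"]:
        return "success" if reply_ok else "failed"
    # Too late: the client was already told its request failed.
    if reply_ok and request["operation"] in UNDO_ACTIONS:
        return "undo: " + UNDO_ACTIONS[request["operation"]]
    return "ignored"
```

Undoing the late success keeps dCache consistent with what the SRM client was told, rather than leaving behind a pin, reservation or upload directory that nobody expects.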

The error message sent to the SRM client if the srm service receives certain malformed LS requests has been improved.

The output of the admin interface ‘ls’ command (which provides an overview of the current requests and allows more detailed information about any specific request) has been improved. The timestamps are easier to read; the history also shows the time spent in each internal state. Also, a lot of “noise” has been removed, making it easier to understand what is happening.

pool

The sweeper is now woken up when a new entry is added.

Make pool-to-pool transfers more robust against two problems: busy networks during the initial phase of the internal transfer (when the mover is queued) and the source sending the URL for the transfer just as the destination was giving up.

The allocate-now (NFS-only) garbage-collection now also utilises all reclaimable space.

The classic dCache behaviour is to block when a client writes into a file and there is insufficient space to satisfy the request. For the NFS protocol, this blocking behaviour is seen by the client as the server being unresponsive and will trigger the client banning the pool and attempting to recover by sending subsequent IO to the door. For NFS v4.1 traffic, writing through the door is not supported, so the upload will fail. With dCache v2.10, when the pool runs out of space, this information is relayed back to the NFS client so it can behave correctly.

The number of files has been added to the output from the ‘info’ command.

poolmanager

The wrandom partition type has been removed. The same behaviour may be achieved with the wass partition type.

gplazma

The nis plugin will fail if no matching principal is found.

The ldap plugin default value has been updated.

The xacml plugin now supports the GUMS server returning GID values.

nfs

Whether or not the nfs door is well-known is now configurable.

The admin interface for the nfs door has been updated to include better on-line help.

The ‘pool reset id’ command has been added to the nfs door to reset a pool’s device id. This command is used to recover from unknown failures: the Linux kernel may choose to ban a pool (due to some real or imagined protocol violation) and send IO through the NFS door. This command allows the door to simulate the pool being rebooted, triggering the client to reassess its opinion of the pool.

ftp

Additional, useful information has been added to the output of the info command for the cell handling a client’s command channel.

pnfsmanager

The error message reporting a problem when a user cancels an upload has been fixed.

When querying for a file’s checksum values, the chimera pnfsmanager plugin would try each supported checksum algorithm. In most cases, a file will have only a single checksum value, so many queries gave no additional information, but placed load on the PostgreSQL database and increased latency. Now such requests are handled with a single database query.

Scripts

The dcache-star command has been updated so it will create a valid (albeit empty) StAR record if there is a database connection failure.

The dcache billing command has been extended to support the --since and --until options. These allow reports to be generated within a specific period. The output has been enhanced so search results can be provided in JSON or YAML format, in addition to the existing raw and files formats. Please be aware that the JSON and YAML formats are not finalised: future versions of dCache may make slight alterations as experience is garnered.

Changelog from 2.9 to 2.10.0

befd0f8: [maven-release-plugin] prepare release 2.10.0
f602324: srm: fix NPE when checking delegated credential
4b2812d: srm: fix deserialise of TExtraInfo
efe4315: srm: fix IllegalStateTransition for COPY requests
76a393b: nfs4: use only address string in ProtocolInfo#toString() method
d8fff55: Add support for webdav 3rd-party transfer using http and https
a150e36: webdav: allow users to choose between showing file’s content inline and downloading it
30c515b: Fix broken commit
62bb02a: srm,pool: improve support for 3rd-party HTTP copy
7c8a5ea: webdav: add support for remote copy
ed5bb82: chimera: fix liquibase changeset to fail peacefully
ab65294: fix .gitignore to work for Eclipse on MacOS X
f354005: libs: revet luquibase to 3.1.2
f28f204: Revert “Update to DataNucleus 4”
23c6d2b: srm: make timestamps easier to read
27fa708: srm: fix retry and failure messages for scheduled jobs
ff2397f: srm: fix status of all-failed container jobs
908ae5a: srm-client: by default only delegate for srmcp with 3rd-party copies
201af9e: alarms: Fix checksum alarm filter.
99e73b0: src: kill more of obsolete code
3445ee3: src: kill more obsolete code
87081b2: src: more obsolete code
caf476c: src: remove obsolete code
9e4251e: pool: Add file count to ‘info’ output
efa2de6: deb: Add python-psycopg2 as a recommended dependency
5e71072: Update to DataNucleus 4
75d3718: Update third party dependencies
38f00db: nfsv4: get client from current session for v4.1
cf399f9: delegation-shell: generate delegated credential of correct type
a8438a5: pool: add soft/hard space allocation modes
2a4efa6: pool: add Allocator#allocateNow()
d22ad32: pool: Account#allocateNow() should trigger sweeper to get some space
c22bba3: pool: Update repository javadoc
5da93bb: Suppress nfs’ file protocol from being published
ade506e: pool: Fix race condition that leads to orphaned files
87bcfb2: srm: improve error message
7fed889: dcache-webadmin: fix pool queue plot grid placeholder image
3740796: webadmin: fix regression in alarm deletion filter
d7f67a7: cells: remove System time dependency in unit-test
1722627: webadmin: Reimplement active transfers page
52d2f1c: Intern several strings on message deserialization
43ddb9f: webadmin: Simplify titel initialization of base page
5026cf9: pool: log with error if checksum scanner gets an IO error
b756c66: nfs: add a command to reset pool’s device id
4d14b40: nfs: use new admin command interface
b2575fa: nfs: remove double locking
6a6fcfd: ftp: Close passive port on failure
44ecaa3: ftp: Fix indentation
0764f2a: srm: Undo external operations if the request has expired
15fa2ee: srm: Cancel upload path if request was aborted
8b7bc70: ftp: fix LIST, NLST and MLSD cmd-channel output
f1bee28: pool: Fix regressions in new HSM subsystem
e73977e: Fix ArrayIndexOutOfBoundsException when processing multiple surls invoking srmget.
d12513c: Fix issue with incorrect determination of whether or not a string is encoded by replacing URLDecoder with import org.springframework.web.util.UriUtils as the former does not handle ‘+’ character.
40356b5: gplazma-xacml: Add handling of GID returned by GUMS in gPlazma2 XACML plugin
97bd31a: dcache alarms: add properties and fuller documentation for configuring SMTP appender
6f97e8d: system-test: fix web font corruption
cb46d8a: config: do not set default values for kdc and realm
5c63d0e: http: fix multipart response size
3fd5888: dcache-star: create valid xml even on db connection failure
93e94be: xrootd,nfs: Add support for space management
ab39955: pinmanager,srm,spacemanager,xrootd: Fix pool monitor expiration
3ac48ab: srm: Fix lock congestion
3f40abe: http: do not use depricated API
b0afa4f: chimera-provider: fix regressions introduced by API change
e671d5c: chimera: remove FileSystemProvider#getInodeChecksum
1a79834: dcache-chimera: query for all checksums in one go
0cb9af2: Update third party libs
e85f9cf: webdav: Bootstrap for directory listing
4137a00: xrootd: Quote IPv6 addresses in kXR_redirect replies
c2d5170: nfs: remove unused argument in NFSv41Door#getPool()
bd7a2fe: cells: Fix hidden NPE during initialization
1933780: xrootd: Fix error message for read patch check
39c2c0e: dcache-webadmin: fix login redirect bug which misleadingly results in Access Denied Page
cabce53: xrootd: Upgrade to xrootd4j 1.3.3
722ea6c: system-test: Make log configuration more similar to production config
8f2a7a0: cells: Minor cleaning of CellPath and CellAddressCore
41a3b9b: billing: Add date range and json/yaml output for billing search tool
c3a83d9: admin: Make prompt bold
4a838d2: scripts: allow java8 runtime
24a244f: gplazma: authzdb add test case
8f63805: dcap: do not call toString() on error object
2f2badb: dcap: ensure that we remove session on DoorTransferFinishedMessage
3f34bda: info-provider: fix works-by-accent bug when importing dCache config
bdc8f49: webdav: switch activity logging to NetLogger format
005b7bf: pom: update jglobus library
35ab3f8: srm: Use srm.enable.space-reservation rather than dcache.enable.space-reservation
85d9e29: Fix rendering of negative numbers in columm layout
4bd5947: namespace: Improve error message in upload cancellation
58192a1: srm: Fix NPE
eeb9301: ftp,dcap: Don’t export child doors
eb9212c: ftp: Add more information to info page
a966d7a: cleaner: Shut down timer thread
afac6ab: pool: Improve robustness of p2p
ed1975e: httpd: Fix a couple of bugs
39b2c11: srm: Avoid duplicate status code update on status query
0eb1af0: xrootd: split protocol name form the client IP in XrootdProtocolInfo
ee90392: chimera: Fix duplicate key supression
9bbb9dc: Revert “srm: Drop fail fast for protocol checks”
fd9d0c7: srm: Fix request cancellation due to SURL deletion
4837397: srm: Fix request expiration
45f658f: nfs: make cell export configurable
2e3034d: fix type in gplazma ldap default value
27fdd4d: Drop messages when TTL is exceeded
360a230: cells: Fix message counting in case of queue overflow
69fbf87: pool: Improve how we deal with unexpected files in pools
c34fa7e: pool: Fix error handling in staging code
c67f043: system-test: Pick up DN of dCache user cert if available
3a22b10: pool: Log P2P fatal errors with stack trace
6723151: info: add JSON serialisation
f6c4fc5: srm: Fix transition ordering for restored requests
3a27973: hsm scripts: fix command and options order
8364809: billing: do not generate billing message if file based store is disabled
47852ba: webdav: generate error page when template couldn’t be found
c688083: webdav: work-around race-condition in StringTemplate library
239a7f9: systemtest: regenerate host credentials, add user credentials
d728fc0: httpd: log bugs with stacktrace
eb82e9f: core: remove obsolete code
d90196a: srm: Fix deadlock in pinning
e0e9a51: srm: Schedule resubmit of pin request to avoid deadlock
0f5a4d5: srm: Drop fail fast for protocol checks
d078fdb: pnfsmanager: Fix expected size extraction and tag caching
b30bc4a: scripts: fix credentials script to work for non-Linux machines
b97bec9: srm: Minor cleanup of job scheduling
4b84cef: srm: Simplify scheduler configuration
608b736: ssh1: Remove service from dCache
c1e6fed: Silence log messages commonly associated with client disconnects
4b7d7a0: fhs: Use gzip compression for deb package
654b6ce: srm: Simplify unpin callbacks
7209d9f: srm: Align SRM scheduler with SRM request states
4fc36a0: srm: Report correct state for RQUEUED and TQUEUED
0194173: statistics: Fix NPE
a01653f: pool: Remove legacy hsm initialization from default pool setup
418c8ff: core: split protocol name form the client IP in HttpProtocolInfo
5cf8806: srm: Split standalone configuration from generic configuration
cbb5936: srm: fix retry of copy-pull
a4c7f34: alarms, billing: fix liquibase character varying and cleanup.
460b9e0: systemtest: add support for generating user credentials
7e6e2ef: pool: Report correct client IP to billing for passive FTP transfers
942035a: srm: Fix race condition in listing
6750fd4: poolmanager: Fix infinite loop in lru partition type
3c6da51: gplazma-nis: pulgin must fail if no matching principal found
c34be64: nfs: do not use CDC as try-with-resource
dfc809c: databases: update hikari minimum connection configuration
193cba9: nfs4: do not log stacktrace on IO errors
b937f41: nfs: make cache time unit property name guideline compliant
80ade65: poolmanager,pool: Fix precision underflow in WASS
d56c1e0: poolmanager: remove wrandom partition type
dc0188b: pool: small cleanups in sweeper
f82ced7: pool: wake-up sweeper if new entry is added
89e8063: releases: update dCache version to v2.10
