Advanced

Email on Alarm

To send an email when an alarm is raised, modify the ALARM_MAIL appender in the /opt/d-cache/etc/logback.xml file, filling in the empty <smtpHost>, <to> and <from> elements:

<appender name="ALARM_MAIL" class="ch.qos.logback.classic.net.SMTPAppender">
        <!-- this filter ensures that only events sent marked as ALARM
             are received by this appender -->
        <filter class="org.dcache.alarms.logback.AlarmMarkerFilter"/>
        <smtpHost></smtpHost>
        <to></to>
        <to></to>
        <from></from>
        <subject>dCache Alarm</subject>
        <layout class="ch.qos.logback.classic.PatternLayout">
            <pattern>%d{dd MMM yyyy HH:mm:ss} \(%X{cells.cell}\) [%X{org.dcache.ndc}] %m%n</pattern>
        </layout>
        <cyclicBufferTracker class="ch.qos.logback.core.spi.CyclicBufferTrackerImpl">
            <!-- send just one log entry per email -->
            <bufferSize>1</bufferSize>
        </cyclicBufferTracker>
    </appender>

Defining an Alarm

The file logback.xml found in the /opt/d-cache/etc directory adds an org.dcache.alarms.logback.AlarmDefinitionAppender to the root logger. This alarm appender embeds a child SocketAppender set to send events on the port specified by the property alarms.server.port to a host specified by the property alarms.server.host.

The alarms defined are listed in the section called “The Defined Alarms”.

Define additional alarms simply by including other <alarmType> elements in the <filter> element.

Example:

Extract of the definition of the SERVICE_CREATION_FAILURE and the CHECKSUM alarms in the /opt/d-cache/etc/logback.xml file.

            <alarmType>
                regex:"(.+) from ac_create",
                type:SERVICE_CREATION_FAILURE,
                level:ERROR,
                severity:CRITICAL,
                include-in-key:group1 type host domain service
            </alarmType>
            .
            .
            .
            <alarmType>
                logger:org.dcache.pool.classic.ChecksumScanner,
                regex:"Checksum mismatch detected for (.+) - marking as BROKEN",
                type:CHECKSUM,
                level:ERROR,
                severity:MODERATE,
                include-in-key:group1 type host service domain
            </alarmType>

The text of the <alarmType> element must be formed as a JSON string but without the beginning and ending braces ( ’{’, ’}’ ); this means, in essence, a comma-delimited list of NAME:VALUE pairs, with arbitrary whitespace between the pairs. The set of properties and their possible values is as follows:

Property         Possible values                                              Required
logger           Name of the logger (see example above)                       At least one of logger, regex
regex            A pattern to match the message with (see the note below)     At least one of logger, regex
match-exception  False, True                                                  NO
depth            Integer ≥ 0                                                  NO
type             An arbitrary name which will serve as the alarm’s marker     YES
level            TRACE, DEBUG, INFO, WARN, ERROR                              YES
severity         INDETERMINATE (default), LOW, MODERATE, HIGH, CRITICAL       NO
regex-flags      A string representation of the (Java) regex flag options,    NO
                 joined by the 'or' pipe symbol, e.g.,
                 CASE_INSENSITIVE | DOTALL. For a fuller explanation, see
                 the Java Tutorials on Regular Expressions.
thread           Thread name (restricts this alarm type to this thread only)  NO
include-in-key   Concatenation of key field names (see below)                 YES

Note

It is advisable to place the regex pattern in double quotes, so that the JSON parser will accept the special characters used in regular expressions: e.g., "[=].[\w]*"
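
To make the NAME:VALUE format more concrete, the following is a minimal Java sketch of how such element text could be split into properties. It is not dCache's actual parser: the class AlarmTypeText and its methods are hypothetical, and it assumes that commas only separate pairs when they appear outside double quotes and that quoted values contain no escaped quotes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of dCache. */
final class AlarmTypeText {

    /** Splits the element text on commas that lie outside double quotes. */
    static List<String> splitPairs(String text) {
        List<String> pairs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : text.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;
            }
            if (c == ',' && !inQuotes) {
                pairs.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.toString().trim().length() > 0) {
            pairs.add(current.toString());
        }
        return pairs;
    }

    /** Turns the NAME:VALUE pairs into a map, stripping surrounding quotes from values. */
    static Map<String, String> parse(String text) {
        Map<String, String> properties = new LinkedHashMap<>();
        for (String pair : splitPairs(text)) {
            // Only the first colon separates name from value; later colons belong to the value.
            int colon = pair.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Not a NAME:VALUE pair: " + pair.trim());
            }
            String name = pair.substring(0, colon).trim();
            String value = pair.substring(colon + 1).trim();
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            properties.put(name, value);
        }
        return properties;
    }

    public static void main(String[] args) {
        String text = "logger:org.dcache.pool.classic.ChecksumScanner,\n"
                + "regex:\"Checksum mismatch detected for (.+) - marking as BROKEN\",\n"
                + "type:CHECKSUM, level:ERROR, severity:MODERATE,\n"
                + "include-in-key:group1 type host service domain";
        parse(text).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

The sketch also shows why quoting the regex is advisable: an unquoted pattern containing a comma would be cut into two pairs.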

The Properties match-exception and depth

The property match-exception is False by default. If set to True, it applies the regex pattern to all embedded exception messages, recursively, until a match is found.

The property depth is to be used with the property match-exception. The default is undefined (null), meaning unbounded. Setting depth to an integer > 0 indicates the level to which the match will be applied (in terms of nested messages). Setting it to 0 is equivalent to setting match-exception to false.
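
The interplay of the two properties can be sketched in Java as follows. This is an illustration only, not the dCache implementation: the class ExceptionMatchSketch is hypothetical, and it assumes that the pattern is searched for within each message (rather than having to match it entirely) and that depth counts the exceptions in the cause chain, starting with the exception attached to the logging event.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration of match-exception and depth; not part of dCache. */
final class ExceptionMatchSketch {

    /**
     * @param depth null means unbounded; 0 means nested messages are never inspected
     */
    static boolean matches(Pattern regex, String message, Throwable thrown,
                           boolean matchException, Integer depth) {
        if (regex.matcher(message).find()) {
            return true;
        }
        if (!matchException) {
            return false;
        }
        int level = 0;
        // Walk the cause chain until a message matches or the depth limit is reached.
        for (Throwable t = thrown; t != null; t = t.getCause()) {
            if (depth != null && level >= depth) {
                break;
            }
            level++;
            String nested = t.getMessage();
            if (nested != null && regex.matcher(nested).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Pattern regex = Pattern.compile("Unable to open a test connection to the given database");
        Throwable cause = new RuntimeException(
                "Unable to open a test connection to the given database");
        Throwable wrapper = new RuntimeException("wrapper", cause);
        System.out.println(matches(regex, "query failed", wrapper, true, null));
        System.out.println(matches(regex, "query failed", wrapper, true, 1));
    }
}
```

Under these assumptions, a depth of 1 inspects only the outermost exception, so a match that sits in a nested cause is found only with a larger depth or with depth left undefined.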

Example:

Have a look at the extract of the definition of the DB_UNAVAILABLE alarm in the /opt/d-cache/etc/logback.xml file.

            <alarmType>
                regex:"Unable to open a test connection to the given database|Connections could not be acquired from the underlying database",
                match-exception:true,
                depth:1,
                type:DB_UNAVAILABLE,
                level:ERROR,
                severity:CRITICAL,
                include-in-key:type host
            </alarmType>

The property include-in-key

The alarm key (the property include-in-key) is the set of properties whose values uniquely identify the alarm instance. For example, the checksum alarm defined above does not include the timestamp in its key, as all reports of this kind of error for a given file (PNFS id is given in the message body) are to be considered as duplicates of the first such alarm. The key field names which can be used to constitute the key are those which all alarms have in common:

groupN, timestamp, message, logger, type, domain, service, host and thread.

These property names should be delimited by (an arbitrary number of) whitespace characters. Note that logger, timestamp and message derive from the logging event, host is determined by static lookup, and domain and service correspond to the cells.domain and cells.cell properties in the event’s MDC map.

The key field name groupN, where N is an integer, means that the Nth capture group of the regex (the Nth parenthesized subexpression) will be included. For N=0, group0 is identical to message, which means that the whole message string is included as an identifier.

Example:

Matching on Regex Groups. Have a look at the extract of the definition of the CHECKSUM alarm in the /opt/d-cache/etc/logback.xml file.

            <alarmType>
                logger:org.dcache.pool.classic.ChecksumScanner,
                regex:"Checksum mismatch detected for (.+) - marking as BROKEN",
                type:CHECKSUM,
                level:ERROR,
                severity:MODERATE,
                include-in-key:group1 type host service domain
            </alarmType>

Here the tag group1 in the include-in-key extracts the PNFS-ID from the message and includes only that portion of the message string as an identifier. As usual, group0 is the same as the entire message.
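
The way a key might be assembled from include-in-key can be sketched in Java. The class AlarmKeySketch, the ':' separator and the PNFS id in the sample message are made up for illustration; how dCache actually joins the values is not stated here.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of include-in-key; not part of dCache. */
final class AlarmKeySketch {

    /**
     * @param includeInKey whitespace-delimited key field names, e.g. "group1 type host"
     * @param fields       values of the common fields (type, host, domain, service, ...)
     */
    static String buildKey(String includeInKey, Pattern regex, String message,
                           Map<String, String> fields) {
        Matcher matcher = regex.matcher(message);
        boolean found = matcher.find();
        StringBuilder key = new StringBuilder();
        for (String name : includeInKey.trim().split("\\s+")) {
            String value;
            if (name.startsWith("group")) {
                int n = Integer.parseInt(name.substring("group".length()));
                if (n == 0) {
                    // As described above, group0 stands for the whole message.
                    value = message;
                } else {
                    String group = (found && n <= matcher.groupCount()) ? matcher.group(n) : null;
                    value = (group == null) ? "" : group;
                }
            } else if (name.equals("message")) {
                value = message;
            } else {
                value = fields.getOrDefault(name, "");
            }
            if (key.length() > 0) {
                key.append(':'); // separator chosen for this sketch only
            }
            key.append(value);
        }
        return key.toString();
    }

    public static void main(String[] args) {
        Pattern regex = Pattern.compile("Checksum mismatch detected for (.+) - marking as BROKEN");
        String message = "Checksum mismatch detected for 0000A1B2 - marking as BROKEN";
        Map<String, String> fields = Map.of(
                "type", "CHECKSUM", "host", "host1", "service", "pool1", "domain", "domain1");
        System.out.println(buildKey("group1 type host service domain", regex, message, fields));
    }
}
```

Because the timestamp is not part of the key in this definition, two reports for the same file produce the same key and can be treated as duplicates.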

When the appender applies this alarm definition filter, it relies on an implicit matching function: (logger, level, regex, thread) ⇒ type; hence a given alarm can be generated by more than one logger, and a logger in turn can send multiple types of alarms if these are mapped to different logging levels, thread names and/or regex patterns for the message body.
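
The implicit matching function can be pictured with the following Java sketch. It is a simplification under stated assumptions, not the appender's real code: the class AlarmMatchSketch is hypothetical, definitions are tried in order and the first match wins, the level must be equal to the event's level, and a definition that omits logger, regex or thread places no restriction on that field.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical sketch of the (logger, level, regex, thread) => type mapping. */
final class AlarmMatchSketch {

    /** One alarm definition; a null logger, regex or thread means "not restricted". */
    static final class Definition {
        final String logger;
        final String level;
        final Pattern regex;
        final String thread;
        final String type;

        Definition(String logger, String level, String regex, String thread, String type) {
            this.logger = logger;
            this.level = level;
            this.regex = (regex == null) ? null : Pattern.compile(regex);
            this.thread = thread;
            this.type = type;
        }

        boolean accepts(String eventLogger, String eventLevel, String message,
                        String eventThread) {
            return (logger == null || logger.equals(eventLogger))
                    && level.equals(eventLevel)
                    && (regex == null || regex.matcher(message).find())
                    && (thread == null || thread.equals(eventThread));
        }
    }

    /** Returns the type of the first definition that accepts the event, if any. */
    static Optional<String> typeOf(List<Definition> definitions, String logger, String level,
                                   String message, String thread) {
        for (Definition d : definitions) {
            if (d.accepts(logger, level, message, thread)) {
                return Optional.of(d.type);
            }
        }
        return Optional.empty();
    }

    /** The two definitions quoted earlier in this section. */
    static List<Definition> exampleDefinitions() {
        return Arrays.asList(
                new Definition(null, "ERROR", "(.+) from ac_create", null,
                        "SERVICE_CREATION_FAILURE"),
                new Definition("org.dcache.pool.classic.ChecksumScanner", "ERROR",
                        "Checksum mismatch detected for (.+) - marking as BROKEN", null,
                        "CHECKSUM"));
    }

    public static void main(String[] args) {
        System.out.println(typeOf(exampleDefinitions(),
                "org.dcache.pool.classic.ChecksumScanner", "ERROR",
                "Checksum mismatch detected for 0000A1B2 - marking as BROKEN", "main"));
    }
}
```

Since SERVICE_CREATION_FAILURE names no logger, any logger emitting a matching ERROR message produces that alarm, while CHECKSUM is tied to a single logger.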

Run the Logback Server Independently from dCache

In most cases, running the alarm server as a dCache service will be adequate. Nevertheless, it is always possible to run the logback server entirely independently from dCache. In that case, you must make sure that the classpath carries the dCache dependencies needed to provide the alarm appending functionality. Here is a shell script which defines a sufficient classpath based on the dCache classes directory and then starts the server on the given port (60001 by default):

#!/bin/sh
case $# in
    0)
        PORT=60001
        ;;
    1)
        PORT=${1}
        ;;
    *)
        echo "Usage: $(basename "$0") [PORT]" >&2
        exit 1
        ;;
esac
DC=/etc/dcache
CL=/usr/share/dcache/classes

CP=.
CP=${CP}:`find ${CL} -name "activation-*.jar"`
CP=${CP}:`find ${CL} -name "datanucleus-api-jdo-*.jar"`
CP=${CP}:`find ${CL} -name "datanucleus-cache*.jar"`
CP=${CP}:`find ${CL} -name "datanucleus-core*.jar"`
CP=${CP}:`find ${CL} -name "datanucleus-xml*.jar"`
CP=${CP}:`find ${CL} -name "dcache-core*.jar"`
CP=${CP}:`find ${CL} -name "guava-*.jar"`
CP=${CP}:`find ${CL} -name "jargs-*.jar"`
CP=${CP}:`find ${CL} -name "jaxb-*.jar" | tr '\n' ':'`
CP=${CP}:`find ${CL} -name "jaxrpc-*.jar" | tr '\n' ':'`
CP=${CP}:`find ${CL} -name "jdo-api-*.jar"`
CP=${CP}:`find ${CL} -name "json-*.jar"`
CP=${CP}:`find ${CL} -name "log4j-over-slf4j-*.jar"`
CP=${CP}:`find ${CL} -name "logback-classic-*.jar"`
CP=${CP}:`find ${CL} -name "logback-core-*.jar"`
CP=${CP}:`find ${CL} -name "mail-*.jar"`
CP=${CP}:`find ${CL} -name "slf4j-api-*.jar"`

java -cp ${CP} ch.qos.logback.classic.net.SimpleSocketServer ${PORT} /var/lib/dcache/alarms/logback-server.xml &