To send an email when an alarm is raised, modify the /opt/d-cache/etc/logback.xml file.
    <appender name="ALARM_MAIL" class="ch.qos.logback.classic.net.SMTPAppender">
        <!-- this filter ensures that only events sent marked as ALARM
             are received by this appender -->
        <filter class="org.dcache.alarms.logback.AlarmMarkerFilter"/>
        <smtpHost></smtpHost>
        <to></to>
        <to></to>
        <from></from>
        <subject>dCache Alarm</subject>
        <layout class="ch.qos.logback.classic.PatternLayout">
            <pattern>%d{dd MMM yyyy HH:mm:ss} \(%X{cells.cell}\) [%X{org.dcache.ndc}] %m%n</pattern>
        </layout>
        <cyclicBufferTracker class="ch.qos.logback.core.spi.CyclicBufferTrackerImpl">
            <!-- send just one log entry per email -->
            <bufferSize>1</bufferSize>
        </cyclicBufferTracker>
    </appender>
The file logback.xml found in the /opt/d-cache/etc directory adds an org.dcache.alarms.logback.AlarmDefinitionAppender to the root logger. This alarm appender embeds a child SocketAppender set to send events on the port specified by the property alarms.server.port to a host specified by the property alarms.server.host.
The alarms defined are listed in the section called “The Defined Alarms”.
Define additional alarms simply by including other <alarmType> elements in the <filter> element.
Example:
Extract of the definition of the SERVICE_CREATION_FAILURE and the CHECKSUM alarms in the /opt/d-cache/etc/logback.xml file.
    <alarmType>
        regex:"(.+) from ac_create",
        type:SERVICE_CREATION_FAILURE,
        level:ERROR,
        severity:CRITICAL,
        include-in-key:group1 type host domain service
    </alarmType>
    .
    .
    .
    <alarmType>
        logger:org.dcache.pool.classic.ChecksumScanner,
        regex:"Checksum mismatch detected for (.+) - marking as BROKEN",
        type:CHECKSUM,
        level:ERROR,
        severity:MODERATE,
        include-in-key:group1 type host service domain
    </alarmType>
The text of the <alarmType> element must be formed as a JSON string but without the beginning and ending braces ('{', '}'); this means, in essence, a comma-delimited list of NAME:VALUE pairs, with arbitrary whitespace between the pairs. The set of properties and their possible values is as follows:
Property | Possible values | Required |
---|---|---|
logger | Name of the logger (see example above) | at least one of logger, regex |
regex | A pattern to match the message with. Note: It is advisable to place the regex pattern in double quotes, so that the JSON parser will accept the special characters used in regular expressions: e.g., "[=].[\w]*" | at least one of logger, regex |
match-exception | False, True | NO |
depth | Integer ≥ 0 | NO |
type | An arbitrary name which will serve as the alarm’s marker. | YES |
level | TRACE, DEBUG, INFO, WARN, ERROR | YES |
severity | INDETERMINATE (default), LOW, MODERATE, HIGH, CRITICAL | NO |
regex-flags | A string representation of the (Java) regex flags options, joined by the 'or' pipe symbol: e.g., CASE_INSENSITIVE \| DOTALL. For fuller explanation, see the Java Tutorials on Regular Expressions. | NO |
thread | Thread name (restricts this alarm type only to this particular thread). | NO |
include-in-key | Concatenation of key field names (see below) | YES |
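To make the NAME:VALUE format and the Required column concrete, here is a minimal Python sketch that splits the text of an <alarmType> element into its properties and reports which required ones are missing. It is an illustration only and not the parser dCache uses; the function names parse_alarm_type and validate are hypothetical, and the sketch assumes that commas inside double quotes belong to the value and that quoted values contain no escaped quotes.

```python
import re

# Properties marked YES in the table above.
REQUIRED = ("type", "level", "include-in-key")


def parse_alarm_type(text):
    """Split 'name:value, name:value' text into a dict (illustrative only)."""
    # Take runs of characters up to the next comma, treating a
    # double-quoted string as a single unit so its commas are kept.
    parts = re.findall(r'(?:[^,"]|"[^"]*")+', text)
    props = {}
    for part in parts:
        part = part.strip()
        if not part:
            continue
        # Split on the first colon only; the value may contain colons.
        name, _, value = part.partition(":")
        value = value.strip()
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        props[name.strip()] = value
    return props


def validate(props):
    """Return the names of required properties that are missing."""
    missing = [name for name in REQUIRED if name not in props]
    if "logger" not in props and "regex" not in props:
        missing.append("logger or regex")
    return missing


if __name__ == "__main__":
    text = ('regex:"(.+) from ac_create", type:SERVICE_CREATION_FAILURE, '
            'level:ERROR, severity:CRITICAL, '
            'include-in-key:group1 type host domain service')
    props = parse_alarm_type(text)
    print(props)
    print(validate(props))
```

Running the sketch on the SERVICE_CREATION_FAILURE definition yields one entry per property and an empty list of missing properties.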
The property match-exception is False by default. If set to True, it applies the regex pattern to all embedded exception messages, recursively, until a match is found.
The property depth is to be used with the property match-exception. The default is undefined (null), meaning unbounded. Setting depth to an integer > 0 indicates the level to which the match will be applied (in terms of nested messages). Setting it to 0 is equivalent to setting match-exception to False.
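The interplay of match-exception and depth can be sketched in Python as follows. This is a simplified model rather than dCache code: the function name matches is hypothetical, the embedded exception messages are passed in as a plain list ordered from the outermost to the innermost, and the sketch assumes that the pattern is searched for within each message and that the event message itself is always checked first.

```python
import re


def matches(pattern, message, nested_messages,
            match_exception=False, depth=None):
    """Return True if pattern is found in the message or, when
    match_exception is set, in one of the nested exception messages.

    depth=None means unbounded; depth=0 behaves like
    match_exception=False; depth=N looks at the first N nested messages.
    """
    regex = re.compile(pattern)
    if regex.search(message):
        return True
    if not match_exception or depth == 0:
        return False
    if depth is None:
        candidates = nested_messages
    else:
        candidates = nested_messages[:depth]
    # any() stops at the first nested message that matches.
    return any(regex.search(m) for m in candidates)


if __name__ == "__main__":
    nested = ["wrapper failed",
              "Unable to open a test connection to the given database"]
    pattern = "Unable to open a test connection"
    print(matches(pattern, "query failed", nested))
    print(matches(pattern, "query failed", nested, match_exception=True))
    print(matches(pattern, "query failed", nested,
                  match_exception=True, depth=1))
```

In this model the second call finds the match in the innermost message, while the third call stops after the first nested message and therefore does not.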
Example:
Have a look at the extract of the definition of the DB_UNAVAILABLE alarm in the /opt/d-cache/etc/logback.xml file.
    <alarmType>
        regex:"Unable to open a test connection to the given database|Connections could not be acquired from the underlying database",
        match-exception:true,
        depth:1,
        type:DB_UNAVAILABLE,
        level:ERROR,
        severity:CRITICAL,
        include-in-key:type host
    </alarmType>
The alarm key (the property include-in-key) is the set of properties whose values uniquely identify the alarm instance. For example, the checksum alarm defined above does not include the timestamp in its key, as all reports of this kind of error for a given file (PNFS id is given in the message body) are to be considered as duplicates of the first such alarm. The field names which can be used to make up the key are those which all alarms have in common: groupN, timestamp, message, logger, type, domain, service, host and thread. These property names should be delimited by (an arbitrary number of) whitespace characters. Note that logger, timestamp and message derive from the logging event, host is determined by static lookup, and domain and service correspond to the cells.domain and cells.cell properties in the event’s MDC map.
The key field name groupN, where N is an integer, means that the Nth substring of the message captured by the regex (specified by parentheses in the pattern) will be included. For N=0, group0 is identical to message, which means that the whole message string should be included as an identifier.
Example:
Matching on Regex Groups. Have a look at the extract of the definition of the CHECKSUM alarm in the /opt/d-cache/etc/logback.xml file.
    <alarmType>
        logger:org.dcache.pool.classic.ChecksumScanner,
        regex:"Checksum mismatch detected for (.+) - marking as BROKEN",
        type:CHECKSUM,
        level:ERROR,
        severity:MODERATE,
        include-in-key:group1 type host service domain
    </alarmType>
Here the tag group1 in the include-in-key extracts the PNFS-ID from the message and includes only that portion of the message string as an identifier. As usual, group0 is the same as the entire message.
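How such a key could be assembled is sketched below in Python. The sketch is not taken from dCache: the function name build_key, the ":" separator, the dictionary representation of the event and the sample PNFS id and host names are all assumptions made for illustration, and the pattern is assumed to be searched for within the message.

```python
import re


def build_key(include_in_key, pattern, event):
    """Join the values of the fields listed in include-in-key.

    event is a dict holding the common alarm fields
    (message, type, host, service, domain, ...).
    """
    match = re.search(pattern, event["message"]) if pattern else None
    values = []
    for field in include_in_key.split():
        group = re.fullmatch(r"group(\d+)", field)
        if group is None:
            # Ordinary field: take its value from the event.
            values.append(str(event[field]))
            continue
        n = int(group.group(1))
        if n == 0:
            # group0 stands for the entire message.
            values.append(event["message"])
        elif match is not None:
            # groupN: the Nth parenthesized substring of the message.
            values.append(match.group(n) or "")
    return ":".join(values)


if __name__ == "__main__":
    pattern = r"Checksum mismatch detected for (.+) - marking as BROKEN"
    event = {
        "message": "Checksum mismatch detected for 0000A1B2 - marking as BROKEN",
        "type": "CHECKSUM",
        "host": "pool1.example.org",   # hypothetical host
        "service": "pool1",            # hypothetical cell name
        "domain": "pool1Domain",       # hypothetical domain name
        "timestamp": 1,
    }
    print(build_key("group1 type host service domain", pattern, event))
```

Because timestamp is not among the key fields, a second report for the same file from the same pool produces the same key in this sketch and would therefore count as a duplicate.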
When the appender applies this alarm definition filter, it relies on an implicit matching function: (logger, level, regex, thread) ⇒ type; hence a given alarm can be generated by more than one logger, and a logger in turn can send multiple types of alarms if these are mapped to different logging levels, thread names and/or regex patterns for the message body.
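The implicit matching function can be pictured with the following Python sketch. It is only a model under several assumptions that the text does not confirm: logger and thread names are compared for exact equality, the level must be equal to the one in the definition, the regex is searched for within the message, and the first matching definition wins. The function name alarm_type and the event dictionaries are hypothetical.

```python
import re


def alarm_type(definitions, event):
    """Map (logger, level, regex, thread) to an alarm type, or None."""
    for d in definitions:
        if "logger" in d and d["logger"] != event["logger"]:
            continue
        if d["level"] != event["level"]:
            continue
        if "thread" in d and d["thread"] != event["thread"]:
            continue
        if "regex" in d and not re.search(d["regex"], event["message"]):
            continue
        return d["type"]
    return None


if __name__ == "__main__":
    definitions = [
        {"regex": "(.+) from ac_create",
         "type": "SERVICE_CREATION_FAILURE", "level": "ERROR"},
        {"logger": "org.dcache.pool.classic.ChecksumScanner",
         "regex": "Checksum mismatch detected for (.+) - marking as BROKEN",
         "type": "CHECKSUM", "level": "ERROR"},
    ]
    event = {
        "logger": "org.dcache.pool.classic.ChecksumScanner",
        "level": "ERROR",
        "thread": "scanner",  # hypothetical thread name
        "message": "Checksum mismatch detected for 0000A1B2 - marking as BROKEN",
    }
    print(alarm_type(definitions, event))
```

The first definition names no logger, so in this model any logger can raise SERVICE_CREATION_FAILURE, whereas CHECKSUM is tied to a single logger.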
In most cases, running the alarm server as a dCache service will be adequate. Nevertheless, it is always possible to run the logback server entirely independently from dCache. In that case, you must be sure that the classpath carries the necessary dCache dependencies to provide the alarm appending functionality. Here is a shell script which defines a sufficient classpath based on the dCache classes directory and then starts the server:
    #!/bin/sh

    case $# in
    0)
        PORT=60001
        ;;
    1)
        PORT=${1}
        ;;
    *)
        echo "Usage: $(basename $0) [PORT]" >&2
        exit 1
        ;;
    esac

    DC=/etc/dcache
    CL=/usr/share/dcache/classes

    CP=.
    CP=${CP}:`find ${CL} -name "activation-*.jar"`
    CP=${CP}:`find ${CL} -name "datanucleus-api-jdo-*.jar"`
    CP=${CP}:`find ${CL} -name "datanucleus-cache*.jar"`
    CP=${CP}:`find ${CL} -name "datanucleus-core*.jar"`
    CP=${CP}:`find ${CL} -name "datanucleus-xml*.jar"`
    CP=${CP}:`find ${CL} -name "dcache-core*.jar"`
    CP=${CP}:`find ${CL} -name "guava-*.jar"`
    CP=${CP}:`find ${CL} -name "jargs-*.jar"`
    CP=${CP}:`find ${CL} -name "jaxb-*.jar" | tr '\n' ':'`
    CP=${CP}:`find ${CL} -name "jaxrpc-*.jar" | tr '\n' ':'`
    CP=${CP}:`find ${CL} -name "jdo-api-*.jar"`
    CP=${CP}:`find ${CL} -name "json-*.jar"`
    CP=${CP}:`find ${CL} -name "log4j-over-slf4j-*.jar"`
    CP=${CP}:`find ${CL} -name "logback-classic-*.jar"`
    CP=${CP}:`find ${CL} -name "logback-core-*.jar"`
    CP=${CP}:`find ${CL} -name "mail-*.jar"`
    CP=${CP}:`find ${CL} -name "slf4j-api-*.jar"`

    java -cp ${CP} ch.qos.logback.classic.net.SimpleSocketServer ${PORT} /var/lib/dcache/alarms/logback-server.xml &