Edit me on GitHub

Debugging

This chapter provides an information on how to exam running dCache system, identify problems and provide debug information to developers.

Java Flight recorder

When debugging an issue on a running system often we need to collect jvm performance stats with Java flight recorder. Starting from release 7.2 the Java flight recorder attach listener is enabled by default. Site admins can collect and provide developers with additional information when high CPU load, memory consumption or file descriptor leaks are observed. To enable the flight recorder jcmd command is used, which is typically provided as a part of java-11-openjdk-devel (on RHEL and clones).

To control recoding the following subcommands of jcmd available:

  • JFR.start
  • Start a recording.
  • JFR.stop

  • Stop a recording with a specific identification number.
  • JFR.dump

  • Dump the data collected so far by the recording with a specific identification number.

Example:

jcmd <pid> JFR.start filename=/tmp/dcache.jfr

The java process pid can be obtained with systemctl or jps command. There are several to limit the recording:

  • time duration based
  • Limiting by time window
  • Limiting by dump file size

Typically, a duration or time window based recordings should be used.

Duration based recording

Useful to record for a fixed period of time. The duration can be specified in in (s)econds, (m)inutes, (h)ours, or (d)ays:

jcmd <pid> JFR.start duration=60s filename=/tmp/dcache.jfr

The recording file will be written into defined file after specified time duration. The flight recorder will be automatically switched off.

Time window based limit

The recording is collected in a ring-buffer like state with a fixed time-based limit. Such recording is useful in situations when we can’t predict the point in time of the interested event.

jcmd <pid> JFR.start maxage=10m name=my-10m-records

The name=my-10m-records option allows to give a human readable identifier to the recording.

The recording can be collected as:

jcmd <pid> JFR.dump name=my-10m-records filename=/tmp/dcache.jfr

The flight recorder will stay active and new recording can be collected, if needed.

To stop the recording the following command can be used:

jcmd <pid> JFR.stop name=my-10m-records

Size based limit

The recording can be started with an explicit maximum recorded data size. The size can be specified in in (k)B, (M)B or (G)B.

jcmd <pid> JFR.start maxsize=100K name=my-100k-records

Inspecting flight recordings

Typically, the collected data should be provided to dCache development team. However, curious sysadmins can inspect the recordings with Oracle VisualVM or Oracle JMC

Perf-Tools

Linux perf tools is often used by admins to inspect CPU performance counters, kprobes, tracepoints and uprobes, and included into many distros by default. The perf tools expect symbols for code executed from unknown memory regions at /tmp/perf-.map. Starting Java17 such mapping file can be generated by JVM directly:

jcmd <pid> Compiler.perfmap

After the mapping file is generated perf tools are able to correctly display Java classes and method names, for example, the output of perf top -p 527973:

Samples: 18K of event 'cycles:P', 4000 Hz, Event count (approx.): 6797468610 lost: 0/0 drop: 0/0
Overhead  Shared Object         Symbol
   4.64%  [JIT] tid 527973      [.] vtable chunks
   2.46%  libjvm.so             [.] _ZN8ZBarrier29mark_barrier_on_oop_slow_pathEm
   2.46%  libjvm.so             [.] _ZN5ZMark15mark_and_followEP10ZMarkCache15ZMarkStackEntry
   2.09%  [JIT] tid 527973      [.] boolean javax.security.auth.Subject$SecureSet.contains(java.lang.Object)
   1.58%  [JIT] tid 527973      [.] java.security.ProtectionDomain[] javax.security.auth.SubjectDomainCombiner.combine(java.security.ProtectionDomain[], java.security.ProtectionDomain[])
   1.54%  [JIT] tid 527973      [.] ch.qos.logback.core.spi.FilterReply ch.qos.logback.classic.spi.TurboFilterList.getTurboFilterChainDecision(org.slf4j.Marker, ch.qos.logback.classic.Logge
   1.45%  [JIT] tid 527973      [.] boolean com.sun.security.auth.UnixNumericGroupPrincipal.equals(java.lang.Object)
   1.30%  libjvm.so             [.] JVM_GetStackAccessControlContext
   1.08%  [JIT] tid 527973      [.] StubRoutines (2)
   1.07%  [JIT] tid 527973      [.] vtable chunks
   1.05%  [JIT] tid 527973      [.] java.lang.Object java.util.concurrent.ConcurrentHashMap.computeIfAbsent(java.lang.Object, java.util.function.Function)
   0.93%  libjvm.so             [.] _ZNK9ZRelocate14forward_objectEP11ZForwardingm
   0.89%  libjvm.so             [.] _ZN21OopOopIterateDispatchI22ZMarkBarrierOopClosureILb0EEE5Table15oop_oop_iterateI13InstanceKlassP7oopDescEEvPS1_S7_P5Klass
   0.77%  [JIT] tid 527973      [.] void org.postgresql.core.v3.QueryExecutorImpl.processResults(org.postgresql.core.ResultHandler, int, boolean)
   0.74%  [kernel]              [k] entry_SYSCALL_64

More info on usage perf usage at perf Examples by Brendan Gregg.