Coordinating multiple sources of monitoring information in the cluster

Clusters of multi-core computing nodes have long been used in high-performance scientific computing. These workloads are typically CPU-intensive, with all nodes identical and running the same scientific application. Telecom clusters, and many online systems, instead run on heterogeneous hardware with a mix of applications, possibly with virtualization and migration, which makes the system much more complex to analyze, debug and tune. Isolated solutions exist to trace the operating system, to trace Java Virtual Machines and some applications, and to remotely debug individual processes. The LTTng kernel tracer and user-space tracing library are particularly efficient tools for tracing these components. The challenge is to combine the information available at the different levels: hardware, operating system, hardware virtualization, system libraries, Java Virtual Machine and applications. Without coordination and correlation between these sources of information, it can be impossible to understand, for example, why a request took a certain time: the explanation may be that, while the application was executing one function, the JVM ran a garbage collection and the kernel virtual machine was migrated to a different node in the cluster.
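As a minimal sketch of the kind of correlation described above, the following Python merges time-sorted event streams from several layers and lists the events that fall inside one request's time window. The Event type, the layer names and the sample timestamps are hypothetical, and the sketch assumes that all clocks are already synchronized; it does not use the LTTng trace format or API.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds, assumed to come from synchronized clocks
    layer: str        # e.g. "kernel", "jvm", "app" (illustrative names)
    name: str


def merge_streams(*streams):
    """Merge per-layer event streams, each already sorted by timestamp."""
    return list(heapq.merge(*streams, key=lambda e: e.timestamp))


def events_during(events, start, end):
    """Return the events whose timestamp lies in [start, end]."""
    return [e for e in events if start <= e.timestamp <= end]


# Hypothetical sample data for a single slow request.
app = [Event(1.0, "app", "request_begin"), Event(4.0, "app", "request_end")]
jvm = [Event(1.5, "jvm", "gc_begin"), Event(2.5, "jvm", "gc_end")]
kernel = [Event(0.5, "kernel", "sched_switch"), Event(3.0, "kernel", "vm_migrate")]

timeline = merge_streams(app, jvm, kernel)
during_request = events_during(timeline, 1.0, 4.0)
for e in during_request:
    print(f"{e.timestamp:4.1f} {e.layer:6} {e.name}")
```

Seen on a single timeline, the garbage collection and the migration both fall between request_begin and request_end, which is the connection that no single-layer trace would reveal on its own.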

With a proper tracing and debugging architecture, the tracing information (stream of events) and process state information (memory) from any application or kernel in the cluster should be available for display. Similarly, it must be possible to control the execution of each process and to enable or disable any tracepoint, combining tracing and debugging functionality. This includes the ability to start tracing on several nodes of a cluster simultaneously. Some tools have been proposed for communication between a target and a host (e.g. TCF, the GDB remote protocol) but, in the context of a cluster, more robust algorithms are required to optimize the architecture in order to support efficient remote tracing and debugging. The remote tracing and debugging infrastructure must take into account the cluster architecture (e.g. the hierarchy of nodes, boards and racks) and make appropriate use of the buffer memory and bandwidth still available on the different elements of the networks involved.
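To illustrate what taking the hierarchy into account could look like, here is a minimal sketch in which a "start tracing" command is forwarded from the frontend down a tree of racks, boards and nodes. The Element class, the start_tracing function and the element names are assumptions made for this example; a real system would send the command over the network rather than set a flag in memory.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """One element of the cluster hierarchy (rack, board or node)."""
    name: str
    children: list = field(default_factory=list)
    tracing: bool = False


def start_tracing(element):
    """Forward the command down the hierarchy.

    Each intermediate element relays the command to its children, so the
    frontend only talks to the root. Returns the names of the leaf nodes
    on which tracing was enabled.
    """
    if not element.children:      # leaf: an actual computing node
        element.tracing = True
        return [element.name]
    started = []
    for child in element.children:
        started.extend(start_tracing(child))
    return started


# Hypothetical two-rack cluster.
cluster = Element("cluster", [
    Element("rack0", [Element("board0", [Element("node0"), Element("node1")])]),
    Element("rack1", [Element("board1", [Element("node2")])]),
])

started = start_tracing(cluster)
print(started)
```

Relaying through intermediate elements is only one possible design; it is chosen here because it mirrors the rack, board and node hierarchy mentioned above.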

The objective of this track is to develop efficient algorithms and techniques that provide a single frontend to control the tracing, profiling and debugging of all processes in the cluster, whether they run on computing nodes or network equipment, at both the logical and physical levels, thus taking into account virtualization, migration and redundancy.

Team members

Béchir Bani, École Polytechnique de Montréal, Intern

Documents and presentations
