Integrated Tracing, Profiling and Debugging for Tuning Large Heterogeneous Clusters

The communication and computing infrastructure is rapidly evolving towards mobile clients accessing a large number of servers in the cloud. The complexity of that infrastructure has been increasing exponentially over the last several decades, with the processors in the clients, servers and networking nodes now running several cores in parallel, often with application-specific integrated circuit (ASIC) coprocessors for specialized tasks. Adding to this complexity is the increasing reliance on virtualization and transparent migration for both computing nodes and networking. Moving services from one physical node to another, or from one communication link and switch to another, may be useful for resource optimization, scalability or fault tolerance.
As a result, even a simple request such as initiating a phone call or making a Web search can involve several parallel processors on several servers. Moreover, the same request issued a few seconds later may be served in a different way by different physical servers. Understanding the performance of these services has therefore become extremely difficult, and the tools for that purpose are severely lacking [1]. Debugging tools can follow the state of a program, but with significant overhead. Profiling tools characterize the CPU usage of the different sections of a running program with minimal overhead, but provide little or no information about interactions with the operating system or with the remaining parts of a distributed system. Tracing tools have the potential to provide, with low overhead, all the needed information about the different parts of a system; however, they lack the integration and correlation needed to link the events originating from different layers and different pieces of the distributed system puzzle. This lack of integration is especially visible between the low-level tools used by hardware designers (hardware support for tracing and debugging) and the higher-level software tools used by software engineers, which prevents either group from seeing the big picture and working efficiently on cross-cutting issues.
The objectives of this project are: i) to complement the data gathering tools in order to cover all the elements of a server cluster, ii) to integrate and correlate the information coming from all the pieces and all the layers of the distributed system formed by the server cluster and its clients, iii) to accumulate all that information in a system model of the current state (and the events leading to that state), and iv) to provide analysis and visualization modules. Achieving this will require a number of new algorithms and techniques, both to extract that information with low overhead, minimizing the perturbation of the system being analyzed, and to efficiently analyze online the gigabytes of data produced by such tracing tools. This monitoring and analysis infrastructure will help system engineers quickly understand the behavior and performance of these complex distributed systems.
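As a purely illustrative sketch of objectives ii) and iii), the following Python fragment merges per-node event streams into one timeline and replays it into a simple per-request model. It is not the project's design: the Event record, its field names, the function names and the sample data are all hypothetical, and it assumes that the timestamps of the different nodes have already been synchronized to a common clock, which is itself one of the hard problems in practice.

```python
import heapq
from collections import namedtuple

# Hypothetical event record; real tracers emit much richer payloads.
Event = namedtuple("Event", ["timestamp", "node", "kind", "request_id"])


def merge_traces(*per_node_traces):
    """Lazily merge traces that are each already sorted by timestamp.

    Assumes timestamps from all nodes share a common, synchronized clock.
    """
    return heapq.merge(*per_node_traces, key=lambda ev: ev.timestamp)


def build_model(events):
    """Replay merged events into a model: request_id -> ordered list of hops."""
    model = {}
    for ev in events:
        model.setdefault(ev.request_id, []).append(
            (ev.node, ev.kind, ev.timestamp)
        )
    return model


def request_latency(model, request_id):
    """Elapsed time between the first and last event seen for a request."""
    hops = model[request_id]
    return hops[-1][2] - hops[0][2]


# Made-up sample data: one request travelling client -> server -> client.
client_trace = [
    Event(1.0, "client", "send", 42),
    Event(5.0, "client", "recv", 42),
]
server_trace = [
    Event(2.0, "server", "recv", 42),
    Event(4.0, "server", "send", 42),
]

model = build_model(merge_traces(client_trace, server_trace))
path = [node for node, _kind, _ts in model[42]]
latency = request_latency(model, 42)
```

Because the merge is lazy, each trace can be consumed as a stream rather than loaded entirely in memory, which matters when traces reach gigabytes; the model here is deliberately minimal and tracks only the path of each request.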
For Canadian industry, the availability of these tools will simplify the design of these new complex clusters with many-core processors and ASICs. Designers will be able to quickly understand system behavior and performance and to optimize operation, leading to faster design of more efficient products with fewer defects. Operators of these complex systems will similarly benefit from these tools, which will help them diagnose problems quickly and better tune their systems. This is particularly important as the complexity of these systems continues to rise, with both computing and networking being subject to virtualization and live migration. In particular, track 4 of the project is dedicated to studying the specific problems encountered on the clusters operated by the industrial sponsors, Ericsson and Revolution Linux, ensuring that the proposed tools cover all their needs, and developing dedicated modeling and analysis modules adapted to their specific architecture, resource consumption and operation optimization criteria.
The project will also be extremely beneficial to the highly qualified personnel in training. The graduate students in the project will get direct exposure to these advanced software development tools and to state-of-the-art industrial clusters and distributed applications. The project will also further improve the expertise of Ecole Polytechnique and Ecole de Technologie Superieure in many-core processors and ASICs, software development tools, and virtualized server clusters. This will in turn benefit all their undergraduate and graduate students taking courses in these areas, and therefore the Canadian industry recruiting them.