|
ASDS - Autonomic Survivable Distributed Systems
|
Next Generation Data Center
Next-generation data centers (NGDCs) face unprecedented system complexity due to the scale, diversity, and dynamics of the applications running inside them. In Internet services, Web 2.0 applications (YouTube, Facebook) rely on a dependable data center to serve millions of users with data generated by an equally massive pool of content providers. In cloud computing, on-demand computing infrastructures (Amazon EC2, Google App Engine) offer virtual data centers to application providers that continually arrive and leave. In enterprise IT services, server consolidation technologies (VMware, Xen) motivate IT departments to centralize diverse services into one data center to improve management and resource usage. While NGDCs are expected to present a simple interface to the outside, the complexity remains inside and poses challenges to the dependability, manageability, and efficiency of these systems. To address these challenges, we seek to develop technologies for building a Data Center Operating System (DCOS) that allows a warehouse of servers and services to be managed as simply as multiple programs on a single computer. The basic idea is to build a suite of highly scalable technologies that serve as building blocks for managing and coordinating activities and for sharing resources among the application instances running in a data center. The project is currently working on the building blocks for resource scheduling, data management, and process monitoring.
IP Service Management
As the Internet experiences explosive growth in scale and capacity, it is also witnessing a revolutionary migration of a substantial fraction of computing activities (e.g., software) and communication services (e.g., telephony and TV) from their traditional media (e.g., local host, PSTN, and cable) to a unified platform based on the Internet Protocol (IP). However, because these applications are highly heterogeneous in their traffic patterns and QoS requirements, and because the underlying Internet has a vastly complex structure, this transition poses an unprecedented challenge to service management over large-scale heterogeneous IP networks. In the IP Service Management project, we seek to design intelligent and automated technologies for network operators at different stages of service management, including network structure inference, system performance monitoring, and anomaly detection, localization, and prediction.
System Data Mining and Analysis
Distributed networking and computing systems are becoming increasingly complex and hard to manage due to the interactions among workload, software structure, hardware, traffic conditions, and other factors. Meanwhile, large amounts of monitoring data can be collected during system operation, providing valuable evidence about the system's operational status. Software log files, system audit events, and network traffic statistics are typical examples of such measurements. The goal of this project is to create advanced system management technology by analyzing, mining, and modeling this vast amount of system measurements. We have already developed an approach to extracting the hidden dependencies between system attributes, called "system invariants", which can significantly facilitate many system management tasks such as fault detection and diagnosis and capacity planning. We are also working on solutions for other system management topics such as resource management, performance tuning and debugging, and configuration management. |
|