Advanced Network Telemetry for AI-Driven Network Optimization in Ultra Ethernet and InfiniBand Interconnects
Abstract
The rapid proliferation of terascale artificial intelligence (AI) workloads (across disk storage, Hadoop, and NoSQL environments), such as distributed deep learning and model inference pipelines, has put unprecedented pressure on data center interconnects. To meet these demands, modern infrastructures increasingly adopt high-performance network technologies such as Ultra Ethernet and InfiniBand, which offer ultra-low latency and high bandwidth. Conventional telemetry systems, however, lack the granularity and real-time responsiveness needed to tune network dynamics under such loads. This paper develops advanced network telemetry combined with AI-based optimization to improve performance, identify anomalies, and mitigate congestion in high-speed interconnects. We propose a telemetry-driven architecture built on programmable data planes, in-band telemetry, and high-bandwidth monitoring engines that emit highly granular, low-latency data streams. These streams feed AI/ML models, such as unsupervised anomaly detectors and predictive congestion algorithms, which dynamically adjust routing and resource allocation. Our findings indicate that this approach is effective at improving network utilization, latency, and proactive management of network health. The paper contributes a scalable design for real-time, intelligent network management in next-generation AI systems and identifies considerations for future research at the intersection of telemetry, AI, and high-speed networking.
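As a rough illustration of the telemetry-to-model step described above, the following minimal sketch flags anomalous per-hop latency samples with a rolling z-score. It is not the detector used in the paper: the rolling z-score is a deliberately simple stand-in for an unsupervised anomaly detector, and every name and parameter here (`RollingZScoreDetector`, `flag_anomalous_hops`, the `(hop_id, latency_ns)` record shape, the window size and threshold) is a hypothetical choice made for illustration only.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScoreDetector:
    """Flags values that deviate strongly from a sliding window of recent values.

    Hypothetical stand-in for an unsupervised anomaly detector.
    """

    def __init__(self, window=32, threshold=3.0, min_samples=8, min_sigma=1.0):
        self.window = deque(maxlen=window)  # recent "normal" samples
        self.threshold = threshold          # z-score above which a sample is flagged
        self.min_samples = min_samples      # warm-up length before flagging starts
        self.min_sigma = min_sigma          # floor so a flat window cannot divide by zero

    def observe(self, value):
        """Return True if `value` is anomalous relative to the current window."""
        anomalous = False
        if len(self.window) >= self.min_samples:
            mu = mean(self.window)
            sigma = max(pstdev(self.window), self.min_sigma)
            anomalous = abs(value - mu) / sigma > self.threshold
        # Learn only from normal samples so a burst does not shift the baseline.
        if not anomalous:
            self.window.append(value)
        return anomalous


def flag_anomalous_hops(records, **detector_kwargs):
    """Run one detector per hop over (hop_id, latency_ns) telemetry records.

    Returns a list of (record_index, hop_id, latency_ns) for flagged samples.
    """
    detectors = {}
    flagged = []
    for index, (hop_id, latency_ns) in enumerate(records):
        if hop_id not in detectors:
            detectors[hop_id] = RollingZScoreDetector(**detector_kwargs)
        if detectors[hop_id].observe(latency_ns):
            flagged.append((index, hop_id, latency_ns))
    return flagged


if __name__ == "__main__":
    # Synthetic latencies (nanoseconds) for one hypothetical switch hop.
    latencies = [1000, 1010, 990, 1005, 995, 1000, 1008, 992, 1003, 5000, 1001]
    stream = [("sw1", ns) for ns in latencies]
    print(flag_anomalous_hops(stream))
```

In a fuller system, the flagged samples would be handed to whatever component adjusts routing or resource allocation; that control loop is outside the scope of this sketch.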