Downtime is so expensive for a supercomputer operator that a network failure can cost a million dollars in lost productivity in half a day.
So a new level of artificial intelligence that predicts and prevents operational issues, heads off network failures and catches hackers early can quickly pay for itself.
Processing giant Nvidia has unveiled a new artificial intelligence (AI)-driven security system, which aims to minimise downtime in InfiniBand data centres by using analytics to detect and anticipate problems.
The NVIDIA Mellanox UFM has been used to manage InfiniBand systems for a decade. It applies AI to learn a data centre’s operational cadence and network workload patterns. Drawing on both real-time and historic telemetry and workload data, it creates a baseline of what is normal and acceptable. It then tracks the system’s health and network modifications, and detects performance degradation and changes in usage and profile.
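As a rough illustration of the baseline idea described above, the sketch below summarises historic readings as a mean and standard deviation, then flags live readings that stray too far from that norm. It is a generic z-score check, not Nvidia's method or the UFM API: the function names, the threshold of three standard deviations and the link-utilisation figures are all assumptions made up for the example.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Summarise historic samples as (mean, standard deviation)."""
    return mean(history), stdev(history)


def find_anomalies(samples, baseline, threshold=3.0):
    """Return (index, value) pairs lying more than `threshold`
    standard deviations away from the baseline mean."""
    mu, sigma = baseline
    if sigma == 0:
        # A perfectly flat history: any different value is abnormal.
        return [(i, x) for i, x in enumerate(samples) if x != mu]
    return [
        (i, x)
        for i, x in enumerate(samples)
        if abs(x - mu) / sigma > threshold
    ]


# Hypothetical link-utilisation readings (percent), invented for illustration.
history = [40, 42, 41, 39, 43, 40, 41, 42]
live = [41, 40, 95, 42]

baseline = build_baseline(history)
print(find_anomalies(live, baseline))  # prints [(2, 95)]
```

A production system would presumably learn far richer patterns across many metrics at once; the point here is only that "normal" is derived from past data rather than set by hand.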
In June Nvidia added a third element to the UFM family, the UFM Telemetry platform. This tool captures real-time network telemetry data, which is streamed to an on-premises or cloud-based database to monitor network performance and validate the network configuration.
This means the new system can spot abnormal system and application behaviour. It can also predict potential system failures and nip these threats in the bud by taking corrective action.
Supercomputers are often high-value targets for sophisticated hackers attempting to host undesired applications, such as cryptocurrency mining. Catching such intrusions and faults early reduces data centre downtime, which typically costs more than $300,000 an hour, according to research by ITIC.
The UFM Cyber-AI system allows system administrators to instantly detect and respond to potential security threats and prevent failures. This saves a fortune and provides the continuity of service that keeps everyone in a job, according to Gilad Shainer, senior vice president of marketing for Mellanox networking at NVIDIA.
‘It determines a data centre’s unique vital signs and uses them to identify performance degradation, component failures and abnormal usage patterns,’ said Shainer.
Douglas Johnson, associate director of the Ohio Supercomputer Center, has used the UFM platform for years in his employer’s InfiniBand data centres. ‘UFM and the expertise from the Mellanox networking team have been fundamental ingredients in the management of our network and the stability we’ve achieved,’ said Johnson.
The UFM Cyber-AI platform complements the UFM Enterprise platform, which manages networks, performance and security.