Anomaly Detection and Root Cause Analysis in Cloud-Native Environments Using Large Language Models and Bayesian Networks

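Authors: Diego Frazatto Pedroso · Luís Almeida · Lucas Eduardo Gulka Pulcinelli · William Akihiro Alves Aisawa · Inês Dutra · Sarita Mazzini Bruschi
Journal: IEEE Access
Publication date: January 2025
Technical category: Energy storage system technology
Technical tags: Energy storage systems, machine learning
Relevance score: ★★★★ 4.0 / 5.0
Keywords: cloud computing, microservices architecture, machine learning, anomaly detection, large language models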

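Chinese Abstract

Cloud computing offers scalability and performance advantages, but microservices architectures introduce complex monitoring and fault-diagnosis challenges. This paper proposes an anomaly detection and root cause analysis framework that integrates large language models with Bayesian networks. By analyzing microservice logs and metrics, the framework automatically identifies system anomalies and traces them to their root causes.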

English Abstract

Cloud computing technologies offer significant advantages in scalability and performance, enabling rapid deployment of applications. The adoption of microservices-oriented architectures has introduced an ecosystem characterized by an increased number of applications, frameworks, abstraction layers, orchestrators, and hypervisors, all operating within distributed systems. This complexity generates vast quantities of logs from diverse sources, making the analysis of these events inherently challenging, particularly in the absence of automation. To address this issue, Machine Learning techniques leveraging Large Language Models (LLMs) offer a promising approach for dynamically identifying patterns within these events. In this study, we propose a novel anomaly detection framework utilizing a microservices architecture deployed on Kubernetes and Istio, enhanced by an LLM. The model was trained on various error scenarios, with Chaos Mesh employed as an error injection tool to simulate faults of different natures, and Locust used as a load generator to create workload stress conditions. After an anomaly is detected by the LLM, we employ a dynamic Bayesian network to provide probabilistic inferences about the incident, proving the relationships between components and assessing the degree of impact among them. Additionally, a ChatBot powered by the same LLM allows users to interact with the AI, ask questions about the detected incident, and gain deeper insights. The experimental results demonstrated the model’s effectiveness: it reliably identified all error events across the test scenarios. Although it missed no anomalies, it did produce some false positives, which remained within acceptable limits.
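
The probabilistic inference step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a static two-cause Bayesian network instead of a dynamic one, the service names (`db`, `orders`) are hypothetical, and every probability is a made-up illustration value. It only shows how, once an anomaly is flagged, Bayes' rule ranks candidate root causes.

```python
from itertools import product

# Hypothetical prior fault probabilities for two upstream services.
# These numbers are assumptions for illustration, not results from the paper.
PRIOR_FAULT = {"db": 0.05, "orders": 0.10}


def p_anomaly_given(db_fault: bool, orders_fault: bool) -> float:
    """P(anomaly flagged | fault states), modeled as a noisy-OR with a leak term."""
    p_no_anomaly = 1.0 - 0.02  # leak: assumed false-positive rate with no fault
    if db_fault:
        p_no_anomaly *= 1.0 - 0.90  # assumed chance a db fault surfaces as an anomaly
    if orders_fault:
        p_no_anomaly *= 1.0 - 0.70  # assumed chance an orders fault surfaces
    return 1.0 - p_no_anomaly


def posterior_given_anomaly() -> dict:
    """Posterior P(service is faulty | anomaly observed), by exact enumeration."""
    joint = {}
    for db_fault, orders_fault in product([False, True], repeat=2):
        p = PRIOR_FAULT["db"] if db_fault else 1.0 - PRIOR_FAULT["db"]
        p *= PRIOR_FAULT["orders"] if orders_fault else 1.0 - PRIOR_FAULT["orders"]
        joint[(db_fault, orders_fault)] = p * p_anomaly_given(db_fault, orders_fault)
    evidence = sum(joint.values())  # P(anomaly observed)
    return {
        "db": sum(p for (d, _), p in joint.items() if d) / evidence,
        "orders": sum(p for (_, o), p in joint.items() if o) / evidence,
    }


if __name__ == "__main__":
    # Print candidate root causes, most probable first.
    ranked = sorted(posterior_given_anomaly().items(), key=lambda kv: -kv[1])
    for service, prob in ranked:
        print(f"{service}: {prob:.3f}")
```

Under these assumed numbers, observing the anomaly raises the fault probability of both services well above their priors, and the service with the higher posterior would be inspected first. A dynamic Bayesian network, as used in the paper, would additionally link these variables across time steps.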

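SunView In-Depth Analysis

This intelligent operations and maintenance (O&M) technology can be applied to Sungrow's (阳光电源) energy storage cloud platform and remote monitoring systems. AI-driven anomaly detection can improve fault early-warning capability and O&M efficiency for ST-series energy storage systems, reduce manual diagnosis time, and enable intelligent O&M management of large-scale energy storage plants.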