Kublos: Human-centric Monitoring of Distributed Systems using GenAI and LLMs

Conceptual overview of Kublos: an orchestration of AI-powered services for human-centric monitoring of distributed systems like Kubernetes

Introduction

Several systems already exist for monitoring distributed systems like Kubernetes. The Kublos project differs in that it leverages recent developments in Generative AI, especially Large Language Models, to provide more human-oriented monitoring of distributed systems.

Kublos is an orchestration of AI-powered services that are designed to provide human-centric monitoring of distributed systems like Kubernetes. By leveraging the latest developments in Generative AI, especially with Large Language Models (LLMs), Kublos is able to provide monitoring insights that are optimized for human consumption and can be easily understood and acted upon.

Kublos is not a conventional monitoring solution, in the sense that it does not focus on letting users view and visualize particular system metrics. Instead, it strips itself of all other functionality in favor of optimizing the delivery of its core service: providing monitoring insights optimized for human consumption.

Core Principles

Before we look at the overall architecture of Kublos, let’s first understand the core principles that drive its design:

  1. Human-centric Monitoring: The primary goal of Kublos is to provide monitoring insights that are optimized for human consumption. This means that the insights provided by Kublos are not just raw data, but are insights that are presented in a way that is easy to understand and interpret by humans.
  2. Large Language Models as Monitoring Assistants: Kublos leverages LLMs as a real-time, on-site monitoring assistant that interprets the logs of distributed systems and provides only the most relevant insights in natural language, so that they can be understood and acted upon quickly.
  3. Q/A-enabled Monitoring Assistant: Kublos is designed to be a monitoring assistant that can answer questions about the distributed systems being monitored. This means that users can ask questions about the system, and Kublos will provide answers in natural language.
  4. Low-overhead and Cost-effective: Kublos is designed to be a low-overhead and cost-effective monitoring solution. This means that Kublos is designed to be lightweight and efficient, and does not require a lot of resources to run.

Architecture Overview

Let’s start from the beginning.

Logs Collection

This aspect of monitoring distributed systems is outside the scope of Kublos. This enables clients to use their preferred log collection tools, such as Fluentd or Logstash, to collect logs from the distributed systems and store them in a centralized location.

All that Kublos needs is the ability to retrieve logs for a given time period. There are no strict restrictions on the format of the logs, which provides greater flexibility and allows Kublos to be used with a wide range of distributed systems. However, it is recommended that the logs can be easily parsed and interpreted by LLMs.
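The single requirement above, retrieving logs for a given time period, can be pictured as a very small interface. The following is only a sketch under assumptions: the names `LogRecord`, `LogSource`, `fetch`, and `InMemoryLogSource` are hypothetical and are not part of Kublos or of any log collection tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Protocol


@dataclass(frozen=True)
class LogRecord:
    timestamp: datetime
    source: str   # e.g. a pod or service name (assumed field)
    message: str  # raw log line; the format is left to the collector


class LogSource(Protocol):
    """Hypothetical contract: return records with start <= timestamp < end."""

    def fetch(self, start: datetime, end: datetime) -> list[LogRecord]: ...


class InMemoryLogSource:
    """Stand-in for a real log backend, for illustration only."""

    def __init__(self, records: list[LogRecord]) -> None:
        self._records = sorted(records, key=lambda r: r.timestamp)

    def fetch(self, start: datetime, end: datetime) -> list[LogRecord]:
        return [r for r in self._records if start <= r.timestamp < end]


# Usage: pull the last 10 minutes of logs relative to a fixed "now".
now = datetime(2024, 1, 1, 12, 0)
source = InMemoryLogSource([
    LogRecord(now - timedelta(minutes=3), "api", "request timeout"),
    LogRecord(now - timedelta(minutes=25), "db", "connection reset"),
])
recent = source.fetch(now - timedelta(minutes=10), now)
```

Any backend that can answer this one time-range query could, in principle, sit behind such an adapter, which is what keeps the choice of collection tool open.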

[Figure: Example Kubernetes monitoring solution]

Summary Generation Service

This is one of the core services of the Kublos architecture. It pulls logs from the logs collection for a specific time period (for example, the last 10 minutes) and processes them with an LLM to generate a summary for that batch of logs. You can think of it as a batch summary generation service.
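A minimal sketch of one summarization cycle might look like the following. It assumes two hypothetical callables that the article does not define: `fetch_logs(start, end)`, which returns raw log lines for a time range, and `llm(prompt)`, which returns the model's text. The prompt wording and the shape of the returned record are illustrative choices as well.

```python
from datetime import datetime, timedelta
from typing import Callable

# Hypothetical signatures, not a real library API.
FetchLogs = Callable[[datetime, datetime], list[str]]
LLM = Callable[[str], str]


def summarize_batch(
    fetch_logs: FetchLogs,
    llm: LLM,
    end: datetime,
    window: timedelta = timedelta(minutes=10),
) -> dict:
    """Summarize the logs in [end - window, end) with a single LLM call."""
    start = end - window
    lines = fetch_logs(start, end)
    if not lines:
        # Skip the LLM call entirely when the window is empty.
        summary = "No log activity in this window."
    else:
        prompt = (
            "Summarize the following logs for an on-call engineer. "
            "Mention only notable events.\n\n" + "\n".join(lines)
        )
        summary = llm(prompt)
    # Keep the window bounds so later services can query summaries by time.
    return {"start": start, "end": end, "log_count": len(lines), "summary": summary}
```

Running this function on a schedule, once per window, would produce the stream of stored batch summaries that the rest of the architecture builds on.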

[Figure: Kublos Summary Generation Service]

Why Batch Summaries?

We generate summaries of a batch of logs for two main reasons:

  1. Efficiency: Generating summaries for a batch of logs is more efficient than generating summaries for each log individually. This is because LLMs are computationally expensive, and generating summaries for a batch of logs allows us to amortize the cost of generating summaries over multiple logs, lowering the overall cost and overhead of the system.
  2. Context: Generating summaries for a batch of logs allows us to provide the LLM with more context surrounding the logs, which can help the LLM generate more accurate and relevant summaries.

Challenges

  1. Batch Size: One of the challenges of generating summaries for a batch of logs is determining the optimal batch size. A larger batch size can provide more context to the LLM, but also increases the computational cost of generating summaries. A smaller batch size reduces the computational cost, but may not provide enough context to the LLM. The batch size is controlled by the length of the time window for which the Summary Generation Service pulls logs.
  2. Long-Term Context: Let’s say the service has just processed the last 10 minutes of logs and generated a summary for them. How does it carry the context of that batch forward, so that it can accurately assess the next batch of logs? This challenge needs to be addressed to ensure that the summaries generated by the service are consistent and accurate.
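The article leaves the long-term context problem open. One possible approach, shown here purely as an illustration and not as the method Kublos uses, is to pass the previous batch's summary into the prompt for the next batch. The `llm` callable and the prompt wording are assumptions.

```python
from typing import Callable, Iterable

LLM = Callable[[str], str]  # hypothetical: prompt in, completion out


def summarize_with_carryover(batches: Iterable[list[str]], llm: LLM) -> list[str]:
    """Summarize each batch, handing the previous summary along as context."""
    summaries: list[str] = []
    previous = "None (this is the first batch)."
    for lines in batches:
        prompt = (
            "Summary of the previous batch:\n" + previous + "\n\n"
            "New logs:\n" + "\n".join(lines) + "\n\n"
            "Summarize the new logs, noting whether earlier issues "
            "persist or have been resolved."
        )
        previous = llm(prompt)
        summaries.append(previous)
    return summaries
```

The trade-off is that each prompt grows by the length of one summary, and an error in an early summary can propagate into later ones, so this would need evaluation before being relied on.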

Daily Summary Service

This core service of Kublos generates an overall summary of the state and events of the distributed system over a period of 24 hours. It gives users a high-level overview of the system’s performance and health over the course of a day, and informs them of any significant events or anomalies that occurred during that time.

This is one of the core values that Kublos delivers: giving users the ability to quickly understand, interpret, and act upon natural-language insights about their distributed systems, without having to sift through logs or metrics.

[Figure: Kublos Daily Summary Generation Service]

One important thing to note here is that this service does not directly rely upon or use the logs collection solution used by the distributed system being monitored. Instead, it uses the summaries being generated and stored by the Summary Generation Service that we discussed earlier.

This has three main advantages:

  1. Decoupling: By decoupling the daily summary generation service from the logs collection solution, we can ensure that the daily summaries are generated in a consistent and reliable manner, regardless of the logs collection solution being used by the distributed system.
  2. Better Results: The Summary Generation Service is already processing system logs to generate the most relevant summaries and insights. Using these summaries helps the Daily Summary Service provide a more accurate and relevant daily summary of the system’s performance and health.
  3. Reduced Costs: By using the summaries generated by the Summary Generation Service, we can avoid the need to process the entire set of logs for the day, which can be computationally expensive. Instead, we can generate the daily summary by aggregating the summaries generated by the Summary Generation Service.
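To make the cost argument concrete: with 10-minute windows, a full day yields at most 24 × 6 = 144 batch summaries, so the daily prompt contains at most 144 short texts rather than every raw log line. A rough sketch of the aggregation step follows, assuming each stored batch summary is a dictionary with hypothetical `start` and `summary` keys and that `llm` is a prompt-to-text callable.

```python
from datetime import datetime, timedelta
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: prompt in, completion out


def daily_summary(batch_summaries: list[dict], llm: LLM, day_end: datetime) -> str:
    """Build a 24-hour summary from stored batch summaries, not raw logs."""
    day_start = day_end - timedelta(hours=24)
    in_range = sorted(
        (b for b in batch_summaries if day_start <= b["start"] < day_end),
        key=lambda b: b["start"],
    )
    if not in_range:
        return "No batch summaries were recorded for this period."
    # One timestamped line per batch keeps the chronology visible to the model.
    timeline = "\n".join(f"[{b['start']:%H:%M}] {b['summary']}" for b in in_range)
    prompt = (
        "Write a concise daily report from these chronological batch "
        "summaries. Highlight significant events and anomalies.\n\n" + timeline
    )
    return llm(prompt)
```

Note that this function never touches the log backend, which is the decoupling described in the first advantage above.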

QA Service (Question-Answering Service)

This service is actually a collection of two separate services:

  1. QA Frontend Service: This service provides a frontend interface that allows users to ask questions about the distributed system being monitored. The frontend service then sends these questions to the QA Backend Service.
  2. QA Backend Service: This service processes the questions received from the QA Frontend Service, and generates answers using LLMs. The answers are then sent back to the QA Frontend Service, which presents them to the user in a human-readable format.

[Figure: Kublos QA Service]

The QA Service is designed to provide users with the ability to ask questions about the distributed system being monitored, and receive answers in natural language. This can be especially helpful in situations where users need to quickly understand and interpret certain events or anomalies in the system, and take appropriate actions, such as in the event of an incident or outage.

Retrieval Augmented Generation

The core part of the QA Service is the Retrieval Augmented Generation (RAG) pipeline. Core components of this pipeline include:

  • Log Puller: This component pulls the logs from the logs collection for a specified time period.
  • Embeddings Generator: This component generates embeddings for the pulled logs using a pre-trained embeddings model.
  • Vector Store Provider: This component stores the embeddings generated previously into an efficient vector store for retrieval.
  • Retriever: This component performs a similarity search through the vector store to retrieve the most relevant logs for the given question.
  • Prompt Builder: This component builds a prompt using the retrieved logs (i.e., additional context) and the question.
  • Response Generator: This component generates the response to the questions using the built prompt.
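The components listed above can be sketched end to end in a few dozen lines. This is a toy illustration under stated assumptions, not the Kublos implementation: `embed` uses simple word counts in place of a pre-trained embeddings model, `VectorStore` is a plain in-memory list instead of a real vector database, and `llm` is a hypothetical prompt-to-text callable. All names are invented for the example.

```python
import math
import re
from collections import Counter
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: prompt in, completion out


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system would call a model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """In-memory stand-in for the Vector Store Provider and Retriever."""

    def __init__(self) -> None:
        self._items: list[tuple[Counter, str]] = []

    def add(self, logs: list[str]) -> None:
        self._items.extend((embed(line), line) for line in logs)

    def search(self, question: str, k: int = 3) -> list[str]:
        q = embed(question)
        ranked = sorted(self._items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [line for _, line in ranked[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Prompt Builder: retrieved logs as context, followed by the question."""
    return (
        "Answer using only the logs below.\n\nLogs:\n"
        + "\n".join(context)
        + "\n\nQuestion: " + question
    )


def answer(question: str, store: VectorStore, llm: LLM, k: int = 3) -> str:
    """Response Generator: retrieve, build the prompt, call the LLM."""
    return llm(build_prompt(question, store.search(question, k)))
```

A usage pass would pull logs for a time window, call `store.add(logs)`, and then call `answer(question, store, llm)` for each user question. Swapping `embed` and `VectorStore` for a real embeddings model and vector database should leave the rest of the flow unchanged.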

Additional components can be added to this pipeline, for example for response evaluation, LLM hallucination detection, PII leakage detection, and more. However, it’s essential to weigh the computational cost and overhead of each additional component against the value it actually delivers for this specific use case.

Challenges and Future Work

While Kublos is a promising approach to monitoring distributed systems, there are several challenges that need to be addressed, and areas for future work that need to be explored:

  1. Performance Optimization: One of the key challenges of Kublos is optimizing the performance of the LLMs used in the system. LLMs are computationally expensive, and optimizing their performance is essential to ensure that Kublos is cost-effective and low-overhead.
  2. Accuracy Improvement: Another challenge is improving the accuracy of the summaries and answers generated by the system. This involves fine-tuning the LLMs used in the system, and ensuring that they are able to generate accurate and relevant insights.
  3. Multi-language Support: Providing support for generation of summaries and insights in multiple languages can enhance the accessibility of the system, and make it more useful for users who speak languages other than English.

In addition to these challenges, there are several areas for future work that need to be explored, such as:

  1. Incident Detection and Response: Developing capabilities for detecting incidents and anomalies in the distributed system, and providing users with actionable insights to respond to these incidents.
  2. Enhanced Visualization: Developing capabilities for visualizing the insights and summaries generated by Kublos, to provide users with a more intuitive and interactive monitoring experience.
  3. Open Source and Collaboration: Open-sourcing Kublos and collaborating with the community to further develop and improve the system, and to ensure that it remains up-to-date with the latest developments in AI and monitoring technologies.
  4. Documentation and Support: Providing comprehensive documentation and support for Kublos, to help users get started with the system, and to troubleshoot any issues they may encounter.

Conclusion

Kublos is an orchestration of AI-powered services that are designed to provide human-centric monitoring of distributed systems like Kubernetes. By leveraging the latest developments in Generative AI, especially with Large Language Models, Kublos is able to provide monitoring insights that are optimized for human consumption and can be easily understood and acted upon.

The core services of Kublos include the Summary Generation Service, the Daily Summary Service, and the QA Service. These services work together to provide users with real-time insights, daily summaries, and the ability to ask questions about the distributed system being monitored, and receive answers in natural language.

This approach to monitoring distributed systems is designed to provide users with a more intuitive and efficient way to monitor their systems, and to quickly understand and interpret the state and events of the system, without having to sift through logs or metrics.

This article provided a conceptual overview of the Kublos architecture. Kublos is still in its early stages of development, and there are several challenges that need to be addressed, such as optimizing the performance of the LLMs, improving the accuracy of the summaries and answers generated by the system, ensuring that the system is cost-effective and low-overhead, and providing support for multiple languages to enhance accessibility.

However, Kublos has considerable potential to change the way we monitor distributed systems, and to provide users with a more human-centric and intuitive monitoring experience.

The image below provides a high-level overview of the Kublos architecture with all the core services and components:

[Figure: Kublos architecture overview]
