ClusterAPI (CAPI) is vital in managing Kubernetes clusters in our fast-paced cloud-native landscape. On the flip side, organizations often encounter significant challenges when debugging and troubleshooting Kubernetes Clusters as their infrastructure grows. This complexity can restrain the consistency and reliability of Kubernetes operations and highlight the need for clear strategies and advanced tooling to navigate production environments.
This session addresses Day 2 hurdles by introducing “Observatio”, a cutting-edge open-source monitoring platform shaped for ClusterAPI-based scenarios. Observatio revolutionizes Kubernetes cluster assessment through its sophisticated data aggregation and gathering platform, a dedicated MCP server backed by an LLM backend and integrated AI agents, which provides advanced analytical insights and enables DevOps teams to apply consistent troubleshooting with easy remediation across distributed CAPI clusters.
The project focuses on monitoring Kubernetes clusters managed ty ClusterAPI, providing tools and solutions to enhance visibility and efficiency. By collecting and consolidating data from diverse sources, it offers comprehensive insights into cluster performance and health. Equipped with advanced dashboards and real-time visualization, the project enables users to swiftly identify and address issues, improving operational reliability and reducing downtime. This solution empowers organizations to maintain optimal cluster functionality, streamline troubleshooting efforts, and ensure robust management of their cloud-native environments.


The cluster health screen serves as a centralized dashboard for monitoring the operational status of your entire vSphere Cluster API environment, presenting both quantitative metrics and visual health indicators in an at-a-glance format. This view is crucial for vSphere administrators as it aggregates health data across multiple layers - from cluster-level availability down to individual machine lifecycle states The accompanying circular health visualization provides an intuitive radial representation where each ring corresponds to different infrastructure layers, with green segments indicating healthy components and red segments highlighting problematic areas that require immediate attention. For vSphere environments, this health summary is particularly valuable during troubleshooting scenarios, as it allows operators to quickly identify whether issues are isolated to specific machine provisioning problems (potentially vSphere resource constraints) or represent broader cluster-wide failures that might indicate management plane issues, networking problems, or underlying vSphere infrastructure degradation affecting multiple workload clusters simultaneously.

The topology screen provides a comprehensive visual representation of your vSphere-based Cluster API infrastructure, displaying the hierarchical relationships between management and workload clusters as interconnected nodes. In the context of vSphere monitoring, this view is particularly valuable as it shows the real-time status and dependencies between different cluster components - from the top-level management cluster down to individual machine deployments and their underlying vSphere virtual machines. The color-coded boxes (blue for infrastructure components, orange for control plane elements, green for worker nodes, and teal for deployments) allow administrators to quickly identify the health and operational state of each component, making it easier to trace issues from the Kubernetes layer down to the vSphere infrastructure. This topology visualization is essential for debugging scenarios where problems might cascade through the stack - for instance, if a vSphere datastore issue affects specific machines, you can immediately see which workload clusters and applications are impacted by following the connection lines between the affected infrastructure components and their dependent resources.
Problem: A Kubernetes operator notices that a worker node machine nx2-workload-md-0-dtphw-s2vbx-bl7gc has been stuck in a failing state for 25 minutes and is unable to join the cluster. The machine shows a red error indicator and multiple failed conditions, but the root cause is unclear from basic cluster monitoring. Available Information: The operator has access to the machine’s object status conditions, which reveals a cascade of failure states including :

The AI assistant explains that the machine deployment is waiting for at least 2 control plane nodes to be ready before proceeding, suggesting this is likely a control plane scaling issue rather than a worker node-specific problem. This automated analysis helps the operator quickly pivot from investigating worker node issues to examining control plane availability and scaling configurations, significantly reducing mean time to resolution in complex vSphere Cluster API environments.
Visit the repository: https://github.com/knabben/observatio
Ensure your management cluster is accessible via ${HOME}/.kube/config compile the bundled frontend in the go binary
and run the server.
make build && ./output/observatio serve
Both API and frontend are accessible via port TCP 8080.
cd webserver
go mod tidy
make run-backend what=serve
make run-tests-backend
The backend server will start and listen for WebSocket connections. By default, it runs on port 8080.
cd front
pnpm install
make run-frontend
make run-tests-frontend
The frontend development server will start and be available at http://localhost:3000.