Gaurav Behere is a Senior Technical User Interface Architect designing and architecting the Integrated Business Planning Spaces for the Blue Yonder Cognitive Supply Chain Platform and its offerings. Today, he explains, in layperson’s terms, a Generative AI approach to end-to-end traceability across the technical matrix of a cloud offering.

Preface

Imagine building a complex web application backed by many microservices working together. It is a pure SaaS-native web application with many elements that can go wrong. To provide 99.99% uptime and high availability, all the components and services must be up and running at any given point in time.

Sounds Challenging?

It is indeed. Now let’s put ourselves in the shoes of a user of this web application who tries to edit some data in a table shown in a widget on the screen, and the edit fails. There are two questions to answer here.

  • How does the application operations team know that something is going wrong and take a preventive/proactive step to correct the state of the system?
  • Now that the error has happened, how is the user made aware of what went wrong and why, and what can the user do next to recover from the error?

This problem is not new, and we already use many sophisticated mechanisms to answer both questions. In this article, let us see how Generative artificial intelligence (AI) can help us answer them better.

Observability & AI — Benefits

Observability has evolved into a critical aspect of modern application architectures, demanding advanced tools and methodologies to ensure efficient monitoring and troubleshooting. Traditional monitoring tools often struggle to keep up with the complexity of distributed systems, leading to a need for innovative solutions.

Dynamic Baseline Establishment

Establishing a baseline for normal system behavior is crucial for effective observability. Generative AI can dynamically adapt to changes in application behavior and adjust baselines accordingly. This adaptability is essential in dynamic environments where traditional static baselines may not accurately represent the system’s normal state. By continuously learning and updating baselines, Generative AI ensures that observability tools remain effective in the face of evolving application architectures.

Example: Consider a web application that experiences traffic variations throughout the day. Traditional baselines might fail to adapt to these changes. Generative AI continually learns from the application’s behavior, adjusting baselines dynamically to accommodate fluctuations and ensuring accurate anomaly detection.
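To make the idea of an adapting baseline concrete, here is a minimal sketch. It deliberately swaps in a plain exponentially weighted moving average rather than a generative model, so treat it as an illustration of the baseline concept only. The class name RollingBaseline, the smoothing factor alpha, and the three-standard-deviation threshold are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class RollingBaseline:
    """Exponentially weighted baseline for one metric (hypothetical helper)."""

    alpha: float = 0.2  # weight given to the newest observation (assumed value)
    mean: float = 0.0
    variance: float = 0.0
    count: int = 0

    def update(self, value: float) -> None:
        """Fold a new observation into the running mean and variance."""
        if self.count == 0:
            self.mean = value
        else:
            delta = value - self.mean
            self.mean += self.alpha * delta
            # Standard exponentially weighted variance update.
            self.variance = (1 - self.alpha) * (self.variance + self.alpha * delta * delta)
        self.count += 1

    def is_anomalous(self, value: float, threshold: float = 3.0) -> bool:
        """Flag values more than `threshold` standard deviations from the mean."""
        if self.count < 2:
            return False  # not enough history to judge
        std = self.variance ** 0.5
        if std == 0:
            return value != self.mean
        return abs(value - self.mean) > threshold * std


baseline = RollingBaseline()
for requests_per_minute in [100, 102, 98, 101, 99]:
    baseline.update(requests_per_minute)
```

Because every call to update shifts the mean toward recent values, a sustained change in traffic gradually becomes the new normal instead of raising alerts forever, which is the behavior a static threshold cannot provide.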

Automated Anomaly Detection

Generative AI excels at pattern recognition and anomaly detection. By training models on historical data and expected behaviors, AI algorithms can automatically identify deviations from normal patterns. In the context of observability, this means the ability to detect anomalies in application metrics, logs, and traces in real time. This automated anomaly detection reduces the time it takes to identify and respond to issues, improving overall system reliability.

Example: Consider an e-commerce platform experiencing a sudden surge in traffic during a flash sale. Generative AI, trained on historical data, can identify this unusual spike in user activity as an anomaly. The system generates alerts in real time, allowing the operations team to investigate and scale resources accordingly.
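The flash-sale case can be sketched with a simple statistical check that compares the current reading against history for the same time window. This is a stand-in for a trained model, not a description of one. The function names, the sample history, and the threshold of three standard deviations are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev


def window_zscore(history: list[float], current: float) -> float:
    """Return how many standard deviations `current` sits from the historical mean."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        if current == baseline:
            return 0.0
        return float("inf") if current > baseline else float("-inf")
    return (current - baseline) / spread


def describe_traffic(history: list[float], current: float, threshold: float = 3.0) -> str:
    """Summarize whether the current reading deviates from the usual level."""
    score = window_zscore(history, current)
    status = "anomalous" if abs(score) > threshold else "normal"
    return f"{status}: {current:.0f} active sessions vs. typical {mean(history):.0f}"


# Made-up session counts for the same time window on earlier days.
same_window_history = [11_800, 12_100, 12_300, 11_900, 11_900]
summary = describe_traffic(same_window_history, 18_000)
```

A check like this would only be the trigger for an alert. Deciding whether the spike is a flash sale or an attack still needs the context that the rest of this article discusses.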

Let’s take a hypothetical interaction:

Support Engineer > I got an alert for the spike in incoming traffic. How many active sessions have we had in the past two hours?

Gen AI > There is an uptick in the incoming traffic. The average number of users active during 10 PM - 12 PM on the application is 12K. During the same time window today, 18K users had active sessions.

Support Engineer > Add another node to serve the traffic from the U.S. region and register it in the load balancer.

Gen AI > Done.

Natural Language Processing for Human-Centric Observability

Generative AI, equipped with Natural Language Processing (NLP), can facilitate human-centric observability. By transforming raw data into human-readable insights, NLP-powered AI systems make it easier for developers and system administrators to interpret complex metrics and logs. This enhanced accessibility accelerates issue resolution and fosters collaboration among cross-functional teams.

Example: In a scenario where developers need to quickly understand the impact of an API change, Generative AI with NLP transforms raw logs and metrics into human-readable insights. Developers can easily grasp the context, which accelerates debugging.

Contextual Log Analysis

Traditional log analysis tools often generate overwhelming amounts of data, making it challenging to identify relevant information during troubleshooting. Generative AI can help by providing contextual analysis of logs, extracting meaningful patterns, and correlating events across multiple logs. This contextual understanding allows for quicker root cause analysis, reducing downtime and improving the efficiency of incident response teams.

There could be two consumers of log analysis.

  • End-user
  • Technical user

Example: In a microservices architecture, logs from different services can be overwhelming. Generative AI, using contextual analysis, identifies correlations between logs, helping teams quickly pinpoint the root cause of issues. For instance, a sudden increase in error logs might be linked to a specific service or component failure.
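Before any model can reason about correlated logs, the logs have to be stitched together. The sketch below shows one simple way to do that: group records by a shared trace identifier and treat the earliest error in each trace as the likely root cause. The field names (trace_id, timestamp, level, error_code), the sample records, and the earliest-error heuristic are assumptions for illustration, not a prescribed log schema.

```python
from collections import defaultdict
from typing import Optional


def group_by_trace(records: list[dict]) -> dict[str, list[dict]]:
    """Collect log records per trace and order each trace chronologically."""
    traces: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        traces[record["trace_id"]].append(record)
    for entries in traces.values():
        # ISO 8601 timestamps in one format sort correctly as strings.
        entries.sort(key=lambda entry: entry["timestamp"])
    return dict(traces)


def first_error(entries: list[dict]) -> Optional[dict]:
    """Heuristic: the earliest ERROR in a trace is the likely root cause."""
    for entry in entries:
        if entry["level"] == "ERROR":
            return entry
    return None


# Made-up records from two hypothetical services.
SAMPLE_RECORDS = [
    {"timestamp": "2024-01-15T14:30:02Z", "service": "ui", "trace_id": "t-1",
     "level": "ERROR", "error_code": "SAVE_FAILED"},
    {"timestamp": "2024-01-15T14:30:01Z", "service": "operational-data-management",
     "trace_id": "t-1", "level": "ERROR", "error_code": "DB_POOL_123"},
    {"timestamp": "2024-01-15T14:30:00Z", "service": "ui", "trace_id": "t-1",
     "level": "INFO", "error_code": None},
    {"timestamp": "2024-01-15T14:31:00Z", "service": "ui", "trace_id": "t-2",
     "level": "INFO", "error_code": None},
]

root_cause = first_error(group_by_trace(SAMPLE_RECORDS)["t-1"])
```

In this sample the UI reports the failure last, but the earliest error in the trace comes from the backend service. That is the kind of correlation a model would then explain in plain language.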

Let’s take a hypothetical interaction:

Technical User

Support Engineer > User with id “abc123” says that at 2:30 PM PST he was not able to save data into the component “Data Table.” What went wrong?

Gen AI > After analyzing the logs for the user with id “abc123” around 2:30 PM PST, it looks like the Operational Data Management Service was failing with the error: <Error description>. Based on the linked logs and trace IDs, it seems the database connection pool was exhausted. Information Source: “Error Code: DB_POOL_123.” The proposed resolution is to increase the pool size.

End User

User > I am not able to save data into the component “Data Table.” What is wrong?

Gen AI > The error code is DB_POOL_123 from Service A. There is a recovery mechanism associated with this error code. Do you want to run it?

User > Yes please.

Gen AI > It is done, please try again.

User > I can use the widget now.
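The end-user exchange above implies a lookup from error code to a recovery step that runs only after the user agrees. A minimal sketch of that control flow follows. The registry RECOVERY_ACTIONS, the function names, and the placeholder recovery step are hypothetical. A real recovery action would call into the affected service.

```python
from typing import Callable, Dict

RecoveryAction = Callable[[], bool]


def resize_connection_pool() -> bool:
    """Stand-in for a real recovery step; always reports success here."""
    return True


# Hypothetical registry mapping standardized error codes to recovery steps.
RECOVERY_ACTIONS: Dict[str, RecoveryAction] = {
    "DB_POOL_123": resize_connection_pool,
}


def run_recovery(error_code: str, user_confirmed: bool) -> str:
    """Run the registered recovery step only when the user has agreed to it."""
    action = RECOVERY_ACTIONS.get(error_code)
    if action is None:
        return "No recovery mechanism is registered for this error."
    if not user_confirmed:
        return "Recovery was not run."
    if action():
        return "It is done, please try again."
    return "Recovery failed; please contact support."


reply = run_recovery("DB_POOL_123", user_confirmed=True)
```

Keeping the confirmation check outside the recovery step itself means the bot can never trigger a change to the system that the user did not approve.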

AI Readiness

All of the human-friendly bot interactions above, along with features like auto-healing and auto-scaling, require changes to the way services are designed and to the way we log errors and context in the messages we push to log servers from the UI and services.

  • There should be a single thread tying a whole transaction together, e.g., a traceID.
  • The services should be AI-ready, and there should be a health check and scaling/healing mechanism for each service that can be invoked in a controlled way.
  • The more you help AI, the better it can help you. The richer and better structured our logs are, the more meaningful and human-friendly the answers that Large Language Models (LLMs) can give bot users. Error codes should be standardized, and each error code should have a recovery, cause, and origin associated with it.
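The points above can be sketched as a small error-code catalog plus a structured log record that carries a trace ID. This is one possible shape under stated assumptions, not a standard. The ErrorCode fields mirror the recovery, cause, and origin mentioned above, while the catalog entry, the JSON key names, and the helper functions are hypothetical.

```python
import json
import uuid
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ErrorCode:
    """One standardized error code with its origin, cause, and recovery."""

    code: str
    origin: str
    cause: str
    recovery: str


# Hypothetical catalog; the entry reuses the example error from this article.
CATALOG = {
    "DB_POOL_123": ErrorCode(
        code="DB_POOL_123",
        origin="Operational Data Management Service",
        cause="Database connection pool exhausted",
        recovery="Increase the pool size",
    ),
}


def new_trace_id() -> str:
    """Create an ID that every service in the transaction should log."""
    return uuid.uuid4().hex


def build_log_record(trace_id: str, service: str, error_code: str, message: str) -> str:
    """Serialize one structured log line; raises KeyError for unknown codes."""
    entry = CATALOG[error_code]
    record = {"traceID": trace_id, "service": service, "message": message, **asdict(entry)}
    return json.dumps(record, sort_keys=True)


trace_id = new_trace_id()
log_line = build_log_record(trace_id, "operational-data-management", "DB_POOL_123", "Save failed")
```

Rejecting unknown codes at logging time is a deliberate choice in this sketch: it keeps free-form error strings out of the logs, so whatever reads them later can rely on every error having a documented cause and recovery.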

Conclusion

Generative AI offers a multifaceted approach to improving observability, as illustrated by the examples and block diagrams. Automated anomaly detection, predictive analysis, contextual log analysis, dynamic baseline establishment, and natural language processing collectively empower technical architects to build resilient and efficient systems. The integration of Generative AI into observability practices not only streamlines troubleshooting but also fosters a proactive approach to managing application performance, ensuring the reliability and optimal functioning of modern applications in dynamic environments. As technology advances, the marriage of Generative AI and observability is poised to redefine how we monitor and maintain complex systems.

While the outcomes are promising, there are always data security and privacy concerns that go hand in hand with LLMs and Generative AI.

Here is an interesting read I want to leave you with before I end this article: https://stackoverflow.blog/2023/10/23/privacy-in-the-age-of-generative-ai/