Capabilities of an observable system

An observable system provides the tools to accomplish the following types of tasks:

  • Logging information about events that occur in the system

  • Obtaining metrics on the performance of the system, both at a single point in time and over a time period

  • Tracing requests as they flow through the system

  • Making system operators aware of the state of the system

  • Predicting future performance of the system

  • Discovering the root cause of observed problems in the system

Telemetry solutions enable you to capture logs, metrics, and traces. Monitoring solutions improve awareness of telemetry data by providing dashboards and by alerting administrators when the system is operating outside predefined thresholds. A fully observable system extends telemetry and monitoring solutions to enable prediction and root cause analysis.
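
As a minimal illustration of threshold-based alerting, the Python sketch below compares a metric reading against a predefined limit and notifies an administrator when the limit is breached. The metric names, threshold values, and notify_admin function are hypothetical placeholders rather than part of any particular monitoring product.

    # Minimal sketch of threshold-based alerting; metric names and thresholds are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class MetricReading:
        name: str       # for example, "service/response_time_ms"
        value: float    # latest observed value
        timestamp: str  # ISO 8601 time of the observation

    # Predefined thresholds, chosen here for illustration only.
    THRESHOLDS = {"service/response_time_ms": 2000.0, "host/cpu_percent": 90.0}

    def notify_admin(message: str) -> None:
        # Placeholder: a real monitoring solution would send email, chat messages, or pages.
        print(f"ALERT: {message}")

    def check_thresholds(reading: MetricReading) -> None:
        limit = THRESHOLDS.get(reading.name)
        if limit is not None and reading.value > limit:
            notify_admin(f"{reading.name} = {reading.value} exceeds {limit} at {reading.timestamp}")

    check_thresholds(MetricReading("service/response_time_ms", 3150.0, "2024-05-01T14:05:00Z"))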

Prediction

Prediction involves using information you have about the past and current performance of a system to determine its likely future performance. Prediction can help you make decisions about changes to your system. For example, if you predict a substantial increase in the number of requests being made to map services, you may decide to add another machine to your ArcGIS Server site before users experience degraded performance due to resource constraints.

Observing patterns or trends in telemetry data can help you make predictions. For example, cyclical patterns in metrics enable you to predict when the peaks and troughs of those metrics will occur. Strong upward or downward trends enable you to predict future values under the assumption that the trend will continue.
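
One simple way to act on a trend is to fit a line to recent observations and project it forward. The Python sketch below uses NumPy's polyfit on a hypothetical series of hourly request counts; it is illustrative only, and real forecasting would also account for seasonality and uncertainty.

    # Illustrative trend extrapolation: fit a line to recent request counts and project it forward.
    import numpy as np

    hourly_requests = [1200, 1260, 1310, 1405, 1490, 1555, 1620, 1700]  # hypothetical sample data
    hours = np.arange(len(hourly_requests))

    slope, intercept = np.polyfit(hours, hourly_requests, deg=1)  # least-squares linear fit

    # Project 24 hours ahead, assuming the trend continues unchanged.
    future_hour = len(hourly_requests) + 24
    projected = slope * future_hour + intercept
    print(f"Projected requests in 24 hours: {projected:.0f} (trend: {slope:.1f} requests per hour)")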

Predictive analysis can be done by administrators, but increasingly sophisticated artificial intelligence (AI) capabilities create the possibility of a system that can analyze itself and respond appropriately. AI agents with access to telemetry, permission to alter the system, and a robust training model for determining appropriate actions may be able to keep a system stable with minimal direct human intervention. As with any type of AI, exercise care before implementing an automated response system to ensure it remains aligned with your priorities and values.

Root cause analysis

If your system is unhealthy, you need to know the root cause of the problem in order to implement the appropriate solution. For example, an increase in latency for a service might require an expensive infrastructure upgrade to resolve, or it might be fixed with a simple service reconfiguration.

There are six steps involved in root cause analysis:

  1. Detect the issue. For example, your monitoring solution may alert you that the response time for a service has exceeded the predetermined threshold. For problems that your monitoring solution did not anticipate, you might instead receive reports from users that the system is not operating as expected.

  2. Triage the problem. Determine whether the issue is isolated to a single service or is a systemic issue that affects multiple services. Monitoring tools such as dashboards, service maps, and health checks are useful at this stage for determining the scope of the issue (see the health check sketch after this list).

  3. Investigate the problem using telemetry data.

    • Metrics can help you identify anomalies or threshold breaches, such as a sudden spike in 500 errors or a drop in throughput.

    • Search logs for error messages, stack traces, and warnings. Structured logging helps you correlate log entries with specific request IDs or timestamps, which narrows the search for relevant log messages (see the log-filtering sketch after this list).

    • Traces can pinpoint where requests experienced latency or failure. For example, a trace may show a slow database query or a failed API call.

  4. Correlate and contextualize data. Combine metrics, logs, and traces to build a timeline of observed events (see the timeline sketch after this list). For example, a slow database query shown in a trace may have occurred at the same time as a spike in the CPU utilization metric on the database machine. Third-party tools like OpenTelemetry, Jaeger, or Datadog APM can be useful for correlating different data sources.

  5. Identify the root cause. Look for changes in your system that could explain the correlations you observed. Some common causes include:

    • Recent deployment or configuration changes

    • Resource exhaustion, such as a memory leak

    • Failures in external dependencies, such as a failed response from a third-party API

    • Network issues, such as a change in firewall rules

  6. Resolve the issue. Implement the appropriate fix based on your identification of the root cause. For example, you may need to roll back a recent configuration change, patch your software, or scale up your services. Simply fixing the current issue, however, is not sufficient. Document the problem, your analysis, and the solution so that others can more easily address similar issues in the future. Add alerts, dashboards, or automated tests to help prevent the issue from recurring.
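
To illustrate step 2, the Python sketch below polls a set of health check endpoints to gauge whether a problem is isolated to one service or affects several. It uses the requests library, and the service names and URLs are placeholders; substitute whatever health or status endpoints your services actually expose.

    # Triage sketch: poll health check endpoints to see whether one service or many are affected.
    import requests

    # Placeholder endpoints; replace with the health check URLs your services expose.
    HEALTH_ENDPOINTS = {
        "map-service": "https://example.com/server/rest/info/healthCheck?f=json",
        "geocode-service": "https://example.com/geocode/health",
    }

    def triage() -> dict:
        results = {}
        for name, url in HEALTH_ENDPOINTS.items():
            try:
                response = requests.get(url, timeout=5)
                results[name] = "healthy" if response.ok else f"unhealthy (HTTP {response.status_code})"
            except requests.RequestException as exc:
                results[name] = f"unreachable ({exc.__class__.__name__})"
        return results

    for service, status in triage().items():
        print(f"{service}: {status}")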
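
For step 3, structured logs (for example, one JSON object per line) make it straightforward to pull every entry related to a single request. The sketch below assumes hypothetical requestId, timestamp, level, and message fields; match these to your own log schema.

    # Log-filtering sketch: collect structured (JSON-lines) log entries for one request ID.
    import json

    def entries_for_request(log_path: str, request_id: str) -> list[dict]:
        matches = []
        with open(log_path, encoding="utf-8") as log_file:
            for line in log_file:
                try:
                    entry = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip lines that are not valid JSON
                if entry.get("requestId") == request_id:
                    matches.append(entry)
        return matches

    # Hypothetical log file name and request ID.
    for entry in entries_for_request("service.log", "c1f4e2"):
        print(entry.get("timestamp"), entry.get("level"), entry.get("message"))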
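
For step 4, one simple way to correlate sources is to merge events from metrics, logs, and traces into a single time-ordered list. The sketch below sorts hypothetical in-memory events by timestamp; tools such as OpenTelemetry, Jaeger, or Datadog APM automate this kind of correlation at scale.

    # Timeline sketch: merge events from different telemetry sources and sort them by time.
    from datetime import datetime

    # Hypothetical events; real data would come from your metric, log, and trace stores.
    events = [
        {"time": "2024-05-01T14:02:15Z", "source": "log",    "detail": "HTTP 500 returned for /map/export"},
        {"time": "2024-05-01T14:02:10Z", "source": "metric", "detail": "CPU utilization reached 95% on the database machine"},
        {"time": "2024-05-01T14:02:12Z", "source": "trace",  "detail": "database query took 8.4 seconds"},
    ]

    timeline = sorted(events, key=lambda e: datetime.fromisoformat(e["time"].replace("Z", "+00:00")))

    for event in timeline:
        print(f'{event["time"]}  [{event["source"]}]  {event["detail"]}')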