Member post by Fredric Newberg, CTO and Co-Founder of Embrace
OTel spans are very powerful for gaining an understanding of the performance of mobile applications. However, because the OTel specification does not allow spans to be exported until they have completed, there is a blind spot that is especially significant for mobile applications.
What happens when you need to be aware of a span having started before it has completed?
If you primarily work in backend systems, you might be curious what value uncompleted spans provide in terms of observability. Covering all the differences between backend and mobile observability is beyond the scope of this post (you can check out this article to learn more), but we’ll touch briefly on a few key differences here.
In this post, we’ll cover the following:
- How spans differ between backend and mobile applications
- How Embrace approached the problem of in-progress spans
- How Embrace created span snapshots as a solution
How spans differ between backend and mobile applications
There are several key differences between backend and mobile applications that make visibility into ongoing spans a much more significant issue in the mobile environment than it is in the backend environment.
Mobile applications operate in both foreground and background states
Unlike backend applications, which tend to operate in a single operational state, mobile applications operate in two very different states: foreground and background. There are restrictions placed on behavior when in the background in both Android and iOS*. Most critical for this discussion is the importance of limiting network usage in the background.
Mobile application tasks cross the foreground/background lifecycle boundary
Users background apps at arbitrary points, often interrupting what app developers consider key tasks. For example, a user could load a list of products before backgrounding the app to send a text message, then return to complete a purchase. Loading the product list typically does not take long, but with millions of daily sessions in an application, most key tasks in the app will inevitably be interrupted many times.
Network requests are made over networks that the app developer does not control, so even trivial requests may take several seconds to complete for a subset of users. Some other examples of key tasks that may take a while in a mobile application are:
- Uploading a video
- Cleaning up cached data
- Rendering complex views on low-end devices
Spans that cover low-level functions, such as network requests, are not the only ones affected: the parent spans that contain them will also often extend across lifecycle boundaries. This is uncommon in a backend service, where all tasks related to an API request are typically completed by the time the response to the request is sent.
Given the network usage restrictions, the time when OTel span data can be safely delivered from a device is limited to when the app is in the foreground or shortly after it has gone into the background.** Thus, unlike backend applications, which have no ecosystem enforcing these types of restrictions, mobile applications constrain when observability vendors can deliver span data.
Now you might be thinking, “What’s the problem?” After all, it seems like there are two main situations to contend with:
- If the app completes all spans while it’s foregrounded, then the SDK can send these spans to an OTel collector.
- If the app has spans that do not complete in a foreground session, the SDK can just wait until they complete in a subsequent foreground session, then send these spans to an OTel collector.
The second bullet point gets into another key difference between mobile and backend observability data. In backend observability, you are collecting telemetry in the form of individual signals like metrics, traces, and logs. In mobile observability, while you are collecting these same signals individually, you also collect telemetry in the form of user sessions.
Since OpenTelemetry does not currently have the concept of a session – beyond being a high cardinality attribute of a signal – every vendor defines it differently. At Embrace, we define a session as what happens when the application is in the foreground. When the app is backgrounded, we deliver the technical and behavioral data, which includes the associated span data, to the backend as part of this session payload. In addition to being a practical way to deliver the data in a timely and bandwidth-efficient manner for most applications, we have found that the notion of a foreground session maps well to how most developers reason about understanding and debugging their applications.
How Embrace approached the problem of in-progress spans
To get a complete view of what happened in a session, it is important to include tasks that are in progress. How can we achieve this if these tasks are being monitored using spans? To work around this, we considered the following options:
- Ignore the in-progress spans and just report them when they end – We rejected this approach since we considered the loss of visibility into events that had started but not finished in a session to be unacceptable. We have customers who experienced crashes because users started a large data upload in the foreground right before backgrounding the app and the OS killed the app process before the upload completed. Leaving customers blind to issues like this was not tolerable.
- Send the spans without end times – We want the spans that our SDKs emit to adhere to the OTel specification. While we could have our backend system work around missing end times, we would be compromising the portability that OTel provides, so this option was rejected.
- Set the end time to the time they are sent – This is such a bad idea that I won’t attempt to enumerate all reasons, but suffice it to say that getting the same span with multiple end times as it made its way towards completion would have made the people developing the systems to process and analyze this data question why they were using an SDK that did this.
- Take snapshots of in-progress spans and deliver them separately from completed spans – To maintain compatibility with the OTel specification while also providing the data we need to show complete sessions, we make copies of in-progress spans, called span snapshots, that do not include an end time. This data can be delivered separately from the regular, completed spans. When these in-progress spans end, they will be delivered as OTel-compliant spans.
Given the title of this post, it will surprise no one that the last option is the path we ultimately chose. We are able to have our SDKs deliver OTel-compliant spans to OTel collectors, and we are able to augment the data sent to our backend with the span snapshot data that provides a clearer picture of what happened on a mobile device.
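To make the chosen option concrete, here is a minimal sketch of splitting spans into completed spans and snapshots when the app is backgrounded. It is not the Embrace SDK code: the `Span` class, the `build_session_payload` function, and the payload keys are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span record (not the OTel SDK type)."""
    trace_id: str
    span_id: str
    name: str
    start_time_ns: int
    end_time_ns: Optional[int] = None  # None while the span is in progress


def build_session_payload(session_id: str, spans: list[Span]) -> dict:
    """Split spans at backgrounding time into completed spans and snapshots."""
    completed = [s for s in spans if s.end_time_ns is not None]
    # Snapshots are copies. The live span keeps running and is exported
    # through the normal, OTel-compliant path once it ends.
    snapshots = [replace(s) for s in spans if s.end_time_ns is None]
    return {
        "session_id": session_id,
        "spans": completed,
        "span_snapshots": snapshots,
    }


spans = [
    Span("t1", "s1", "load-products", 1_000, 1_800),
    Span("t2", "s2", "upload-video", 1_500),  # still running at background time
]
payload = build_session_payload("session-1", spans)
```

In this sketch the completed `load-products` span lands under `spans`, while the unfinished `upload-video` span appears only under `span_snapshots`, with no end time.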
How Embrace created span snapshots as a solution
You can review the code for how we capture span snapshots in our Android and iOS SDKs. Our span snapshot format is derived from the OTel Span Primitive with the following changes:
- The endTime is expected to be nil or not present
- The status field is also expected to be nil, not present, or explicitly “unset”
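The two rules above can be expressed as a small check. This is only a sketch: the dictionary layout, the `is_span_snapshot` name, and the string value `"unset"` for the status are assumptions made for illustration, not the format used by the Embrace SDKs.

```python
from typing import Any, Mapping


def is_span_snapshot(record: Mapping[str, Any]) -> bool:
    """Return True if a span-like record matches the two snapshot rules."""
    end_time_ok = record.get("endTime") is None  # nil or not present
    status_ok = record.get("status") in (None, "unset")  # nil, absent, or unset
    return end_time_ok and status_ok


# Hypothetical records: one in-progress span and its completed counterpart.
in_progress = {
    "traceId": "t2",
    "spanId": "s2",
    "name": "upload-video",
    "startTime": 1_500,
    "status": "unset",
}
completed = {**in_progress, "endTime": 2_400, "status": "ok"}
```

Here `in_progress` passes the check, while `completed` fails it because it carries both an end time and a final status.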
Since we deliver span snapshots as part of a session, we may have snapshots that are uploaded multiple times if the operations they are tracking cross multiple session boundaries. This means the normal (TraceId, SpanId) tuple is not enough to act as a primary key if storing the spans in a database. Instead, the tuple (SessionId, TraceId, SpanId) would need to be used. We do not use the snapshot data for aggregation purposes – only completed spans contribute to aggregates – which simplifies how we process and store this data. We only use the snapshots to provide additional context for individual sessions.
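As a rough illustration of why the session ID has to be part of the key, the sketch below uses an in-memory dictionary as a stand-in for a database table. The store, the function names, and the `bytes_sent` attribute are hypothetical.

```python
# Key layout: (session_id, trace_id, span_id)
SnapshotKey = tuple[str, str, str]

# In-memory stand-in for a database table keyed by the full tuple.
snapshot_store: dict[SnapshotKey, dict] = {}


def store_snapshot(session_id: str, snapshot: dict) -> None:
    """Store one snapshot per (session, trace, span) combination."""
    key = (session_id, snapshot["trace_id"], snapshot["span_id"])
    snapshot_store[key] = snapshot


def snapshots_for_session(session_id: str) -> list[dict]:
    """Return the snapshots that give context for a single session."""
    return [snap for (sid, _, _), snap in snapshot_store.items() if sid == session_id]


# The same span is still in progress across two sessions, so it is
# snapshotted twice, with attributes that changed in between.
store_snapshot("session-1", {"trace_id": "t2", "span_id": "s2",
                             "attributes": {"bytes_sent": 1_000}})
store_snapshot("session-2", {"trace_id": "t2", "span_id": "s2",
                             "attributes": {"bytes_sent": 5_000}})
```

Keyed by (TraceId, SpanId) alone, the second snapshot would overwrite the first. With the session ID in the key, both are kept and each session retains its own view of the span.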
It is also important to remember that the span snapshots are just that – snapshots in time of a span. The attributes, events, and links can change between snapshots and the final non-snapshot version.
We also use the snapshot approach for recovery of in-progress spans when a crash occurs. Having ongoing spans when an app crashes is a common scenario, and we do not want to lose this span information since it may provide helpful context on why the crash occurred. Having snapshots means we can easily recover these in-progress spans on the next application launch, and turn those snapshots into failed spans before we export them.
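A recovery step along these lines might look like the sketch below. The `SpanRecord` type and `recover_after_crash` function are hypothetical, and two details are assumptions rather than something this post specifies: that the crash time is used as the end time, and that a failed span is marked with an `"error"` status.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SpanRecord:
    """Hypothetical span record; a snapshot has no end time and an unset status."""
    trace_id: str
    span_id: str
    name: str
    start_time_ns: int
    end_time_ns: Optional[int] = None
    status: str = "unset"


def recover_after_crash(snapshots: list[SpanRecord], crash_time_ns: int) -> list[SpanRecord]:
    """Turn persisted snapshots into completed, failed spans ready for export."""
    return [
        replace(
            snap,
            # Assumption: end the span at the crash time, never before its start.
            end_time_ns=max(crash_time_ns, snap.start_time_ns),
            status="error",  # Assumption: failed spans carry an error status.
        )
        for snap in snapshots
    ]


persisted = [SpanRecord("t2", "s2", "upload-video", 1_500)]
recovered = recover_after_crash(persisted, crash_time_ns=2_000)
```

Because the recovered spans now have an end time and a final status, they can be exported like any other completed span.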
We’re hoping that the span snapshot concept can eventually be considered for inclusion into the OpenTelemetry spec. That way, observability vendors can better visualize the technical details that cross multiple foreground and background user experiences.
In this post, we wanted to share how we overcame a unique mobile observability challenge while adhering to the OpenTelemetry spec. We’d love to hear feedback on our open source repos, or you can join our Slack community if you’d like to learn more.
* It is possible to get exemptions from these restrictions for certain types of applications, but the vast majority of applications will not have them. Since we are ultimately not in control of the environment the app is running under, we have to prepare for the worst case.
** iOS is especially strict when it comes to this, with applications being subjected to even harsher limits if they do not abide by the restrictions that are imposed. Once an app is in the background, it is at the mercy of the OS: whether it’s thread scheduling, GC, or simply killing the app when it needs to recover resources, nothing is guaranteed when your app is in the background. In such a hostile environment, “waiting for it to finish” simply isn’t good enough.