Why does OpenTelemetry work differently on mobile versus backend apps?

Posted on November 29, 2024 by Jamie Lynch


Member post by Jamie Lynch, Senior Software Engineer at Embrace

OpenTelemetry has historically been adopted mainly on backend systems, where it’s a great solution for gaining insight into what’s happening in production by gathering telemetry via an open standard. This avoids the dreaded costs of vendor lock-in: as long as a provider supports the OTel data format, you can easily switch providers and keep control of your own data.

Up until now, OTel’s adoption on mobile has not been quite as widespread. However, there are early signs that this is rapidly changing and that engineers are adopting the standard for much the same reasons as for backend observability. Mobile presents some unique challenges for gathering telemetry compared to backend development, and in this post we will highlight what those challenges are as well as a few ways of addressing them.

A primer on mobile challenges

Before we cover some of the ways OTel is different on mobile, it’s worth comparing mobile development to backend development, as mobile devices have unique constraints that affect how telemetry is collected.

Hardware is lower spec on mobile devices

Backend servers typically have plenty of CPU and memory, whereas most mobile devices run on lower-spec hardware.

Furthermore, if your backend app runs into performance issues, you can usually just provision more servers with beefier hardware. When (not if) your mobile app runs into performance issues, shipping your users better devices is usually not an option! So on mobile, you’re stuck supporting a cohort of potentially thousands of underpowered Android/iOS device models.

Battery life is sacrosanct on mobile

You’ve probably experienced the frustration of running out of charge on a mobile device. Power consumption matters far more on mobile than on the backend, where everything is plugged into the mains. That’s one reason mobile CPUs tend to run at lower clock frequencies: they draw less power.

The OS itself is also much more aggressive about prolonging battery life on mobile. Approaches that might be viable on the backend, such as polling for data every second, are rarely an option on mobile. Mobile operating systems will eagerly kill processes that use excessive resources, and it’s usually impossible to keep a process running continuously in the background. In contrast, this is fairly simple to do on the backend.

Network connectivity

Backend servers have consistent network connections with great bandwidth and low latency. Mobile devices do not enjoy these luxuries! Their network connection is usually high latency, may have prolonged periods with no connection (e.g., long-haul flights with airplane mode enabled), and bandwidth can be low.

Process lifecycle differences

An application server that responds to HTTP requests will run continuously. That’s not the case on mobile. A user may switch between dozens of apps in a short period, and to save limited battery and compute resources, the OS may terminate these processes at any time without warning. While this may happen on the backend in extreme scenarios such as memory pressure, on mobile this is a daily fact of life.

Transactions versus user experience

Backend applications typically have short transactions – an HTTP request comes in, some operation occurs, and a response is returned to the client. On mobile, a user might open an app for a few minutes, but they might also use it for hours, performing hundreds or even thousands of interactions with the application during a single session. The data, context, and duration of traces captured can therefore be vastly different between backend and mobile applications.

Mobile runs in a single process

The majority of mobile apps run in a single process, which means OTel collectors and exporters run in the same process as the app itself. This is quite different from the backend, where these components typically run in separate processes. If the process terminates on mobile due to a crash or an OS kill, telemetry will be lost unless additional work is done to persist it.

How do these constraints affect OTel on mobile?

Sending telemetry over poor network connection

Network connectivity is situational on mobile devices, making it necessary to plan for the worst case where telemetry cannot be delivered to your backend of choice. Even if you’re lucky enough to have a connection 95% of the time, that still means you could be missing 1 in 20 requests if you take the “fire and forget” approach. For mobile, where your app may be running on thousands of different devices, this can leave a substantial hole in your observability. It’s therefore essential to persist data before sending it.
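To make the “persist before sending” idea concrete, here is a minimal, language-neutral sketch in Python rather than real mobile SDK code. The `PersistentSender` class, its one-file-per-payload layout, and the `send` callable are all assumptions made for illustration. The point is only the ordering: write to disk first, attempt delivery second, and delete the file only once delivery is confirmed.

```python
import json
import os
import tempfile
import uuid


class PersistentSender:
    """Writes each payload to disk first, then tries to deliver it later."""

    def __init__(self, directory, send):
        self.directory = directory
        # Hypothetical callable: returns True once the backend accepted the payload.
        self.send = send
        os.makedirs(directory, exist_ok=True)

    def enqueue(self, payload):
        # Persist first, so the payload survives a failed send or a killed process.
        path = os.path.join(self.directory, f"{uuid.uuid4().hex}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(payload, f)
        return path

    def flush(self):
        """Attempts delivery of every stored payload; returns how many were sent."""
        sent = 0
        for name in os.listdir(self.directory):
            path = os.path.join(self.directory, name)
            with open(path, encoding="utf-8") as f:
                payload = json.load(f)
            try:
                delivered = self.send(payload)
            except OSError:
                delivered = False  # treat network errors as "try again later"
            if delivered:
                os.remove(path)  # delete only after confirmed delivery
                sent += 1
        return sent
```

With this shape, a payload that fails to send simply stays on disk, and a later call to `flush()` (for example, on the next app launch or when connectivity returns) picks it up again.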

Persisting data may sound straightforward – it’s just writing a bunch of data to disk, right? Unfortunately, on mobile, this simple act introduces a whole bunch of complexity. First, the average mobile device does not have much free disk space, so the amount of telemetry that can be persisted needs to be limited in some way. This requires picking a strategy to delete telemetry data. Common strategies involve prioritizing the newest and most important types rather than stale data.
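One way such a deletion strategy could look is sketched below, again in Python as an illustration. The `StoredPayload` fields and the integer priority scale are hypothetical; a real SDK would choose its own categories and limits. The function evicts the least important payloads first, and within the same priority it evicts the oldest first, until the remainder fits the storage budget.

```python
from dataclasses import dataclass


@dataclass
class StoredPayload:
    name: str
    size_bytes: int
    captured_at: float  # seconds since epoch
    priority: int       # hypothetical scale: higher means more important


def select_for_deletion(payloads, max_bytes):
    """Returns the payloads to delete so the remainder fits within max_bytes.

    Lowest priority goes first; within a priority, the oldest goes first.
    """
    total = sum(p.size_bytes for p in payloads)
    doomed = []
    for p in sorted(payloads, key=lambda p: (p.priority, p.captured_at)):
        if total <= max_bytes:
            break
        doomed.append(p)
        total -= p.size_bytes
    return doomed
```

Under this ordering, a crash report captured yesterday would outlive a low-priority log captured a minute ago, which matches the idea of keeping the newest and most important data rather than stale data.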

Second, it’s necessary to deal with I/O errors, potential schema changes depending on how the data is persisted, and data being sent from a different process (and maybe even a different day) from when it was captured. If engineers forget to deal with this complexity, then subtle bugs can creep into your data pipeline that impact your observability.
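A small sketch of how stored records might defend against these problems follows. The envelope fields (`schema_version`, `captured_at`) and the version 1 to version 2 migration are invented for this example and do not describe any particular SDK’s on-disk format. The idea is to stamp every record with a version and capture time when writing, and to read defensively so that a corrupted or unrecognized record is skipped instead of breaking the pipeline.

```python
import json

CURRENT_SCHEMA_VERSION = 2  # hypothetical version number for this sketch


def encode(payload, captured_at):
    # Store the capture time and schema version alongside the payload, since
    # the sender may be a later process running a newer version of the app.
    return json.dumps(
        {
            "schema_version": CURRENT_SCHEMA_VERSION,
            "captured_at": captured_at,
            "payload": payload,
        }
    )


def decode(text):
    """Returns the stored record as a dict, or None if it cannot be used."""
    try:
        record = json.loads(text)
    except json.JSONDecodeError:
        return None  # truncated or corrupted write, e.g. process killed mid-write
    if not isinstance(record, dict):
        return None
    version = record.get("schema_version")
    if version == 1:
        # Hypothetical migration: version 1 used "timestamp" instead of "captured_at".
        record["captured_at"] = record.pop("timestamp", None)
        record["schema_version"] = CURRENT_SCHEMA_VERSION
    elif version != CURRENT_SCHEMA_VERSION:
        return None  # unknown (possibly newer) format: skip rather than guess
    return record
```

Keeping `captured_at` inside the record matters because the time of sending can be hours or days after the time of capture, and the backend needs the latter.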

Handling process termination (crashes or OS kills)

If a process terminates on mobile, it’s necessary to save any captured telemetry to disk immediately rather than attempt to send it. This is partly due to poor network connections, as discussed previously, but also because blocking the UI thread with HTTP requests during process termination can lead to freezes and ANRs (Application Not Responding errors on Android).

For a crash or OS kill, it’s generally not possible to predict when a process is going to terminate, and once it has happened, the amount that can be done is fairly limited. For example, it’s possible to install a signal handler ahead of time that runs when a C signal is raised by a crash, but the implementation must be async-signal-safe. These constraints make it impossible to send HTTP requests, and very hard to do anything other than store telemetry for later processing.

To retain as much telemetry as possible, one option is to persist it periodically so that an up-to-date “snapshot” of the captured data can be read the next time the application launches. We’ll elaborate on this a bit later.

Conserving limited device resources

There’s no silver bullet for conserving resources such as battery and memory on mobile. The first step as an engineer working on an app is to be judicious in what telemetry is captured. For example, polling the OS every minute for memory data might be acceptable on the backend, but on mobile devices it would be preferable to rely on OS callbacks instead for significant events.

Profiling the impact of telemetry code in hot paths, such as application startup, also becomes more crucial on mobile. This is something that at Embrace we do as SDK vendors on our own code, but it’s also something that should be considered for your own application, as every mobile app may behave differently out in the wild.

Supporting long-running spans

Long-running spans are a challenge in OpenTelemetry for mobile because user sessions can run for much longer times than a typical backend HTTP request. This means they can accumulate many events, making the span payloads quite large. Spans also can’t be sent until they are completed, so there’s the potential for data loss if the process terminates halfway through a span.

Embrace solves this problem on mobile with “span snapshots.” In this approach a JSON representation of all non-completed spans is stored on disk periodically, and if the process does terminate unexpectedly, then on next launch the application is able to send these to an OTel-compliant backend.
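The sketch below illustrates the general shape of the snapshot technique in Python. It is not Embrace’s actual implementation: the `SpanSnapshotter` class, the span dictionary layout, and the file name handling are assumptions made for the example. It keeps open spans in memory, writes them all to a JSON file whenever `snapshot()` is called (in a real app, on a timer or after significant events), and reads that file back on the next launch.

```python
import json
import os
import tempfile


class SpanSnapshotter:
    """Keeps non-completed spans in memory and snapshots them to disk."""

    def __init__(self, path):
        self.path = path
        self.open_spans = {}  # span_id -> span dict

    def start_span(self, span_id, name, start_time):
        self.open_spans[span_id] = {
            "span_id": span_id,
            "name": name,
            "start_time": start_time,
            "events": [],
        }

    def add_event(self, span_id, name, time):
        self.open_spans[span_id]["events"].append({"name": name, "time": time})

    def end_span(self, span_id, end_time):
        # A completed span leaves the snapshot set and goes through normal export.
        span = self.open_spans.pop(span_id)
        span["end_time"] = end_time
        return span

    def snapshot(self):
        # Write to a temporary file and rename it into place, so a process kill
        # in the middle of the write never leaves a half-written snapshot.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(list(self.open_spans.values()), f)
        os.replace(tmp, self.path)


def recover_snapshot(path):
    """On next launch: returns the spans that were open at the last snapshot."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        spans = json.load(f)
    os.remove(path)
    return spans
```

The trade-off is the snapshot interval: a shorter interval loses less data on an unexpected termination but costs more disk I/O and battery, so it needs tuning against the resource constraints described earlier.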

Are there additional differences between mobile and backend OTel?

Semantic conventions are less developed

OTel’s semantic conventions are agreed-upon conventions on how telemetry should be captured by an application. These are great because using conventions means that OTel implementations can assume knowledge about what telemetry data contains.

For example, rather than just displaying an OTel span for a network request, a backend solution could process a span containing HTTP call information and add opinionated logic on top of standard OTel that reveals deeper insights into network performance. As OTel has historically had traction on the backend, more semantic conventions have been agreed upon for backend concepts, such as HTTP requests and cloud events, than for mobile concepts such as user sessions.

Backend is always on, while mobile has user sessions

The other key difference between backend and mobile is that the backend is usually running 24/7, 365 days a year. In contrast, a mobile messaging app might have very short user sessions of a few seconds, or a movie-streaming app might have user sessions that potentially run for hours. Problems can develop over time and across endless combinations of user, device, and app conditions. The uncontrolled nature of the mobile environment is certainly more complicated than the standard OTel paradigm.

How is OTel likely to evolve on mobile in the future?

In Embrace’s view there are three key points where the OTel community can improve support for mobile.

Network connectivity handling

Currently, the OTel implementation for mobile doesn’t fully account for the differences between the backend and mobile, and it’s too easy for data to get lost due to unexpected process termination or long-running spans. We expect this to change as more observability vendors adopt OTel and agree on common solutions.

More semantic conventions will be established

Semantic conventions are agreed-upon conventions for how telemetry should be captured using basic OTel data types such as spans and events. For example, an Android phone entering battery saver mode and then exiting it when a user plugs in a charger could be modeled as a span as it has a start and end time.
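As a rough illustration of the battery saver example, the Python sketch below turns on/off notifications into a span with a start and end time. The `Span` dataclass is a simplified stand-in rather than the OpenTelemetry API, and the span name `device.battery_saver` is made up for this example; it is not an established semantic convention. The `on_battery_saver_changed` method stands in for whatever callback the platform provides.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    start_time: float
    end_time: Optional[float] = None
    attributes: dict = field(default_factory=dict)

    def duration(self):
        # Only meaningful once the span has ended.
        return None if self.end_time is None else self.end_time - self.start_time


class BatterySaverTracker:
    """Models each battery saver period as one span."""

    def __init__(self):
        self.current = None
        self.completed = []

    def on_battery_saver_changed(self, enabled, now):
        if enabled and self.current is None:
            # Hypothetical span name, not an agreed-upon convention.
            self.current = Span("device.battery_saver", start_time=now)
        elif not enabled and self.current is not None:
            self.current.end_time = now
            self.completed.append(self.current)
            self.current = None
```

Duplicate “enabled” notifications are ignored, so one continuous battery saver period always maps to exactly one span.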

If an OTel-compliant backend implementation supports a semantic convention for battery saver mode, then it could perform extra processing on the telemetry data to surface hidden trends that correlate with the presence of this span. This data is important from a mobile engineer’s perspective because battery saver mode indicates that the OS will be more willing to constrain background jobs and reduce the amount of system resources available, which in turn affects the performance of an app.
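To give a feel for the kind of extra processing a backend might do, here is a hedged Python sketch. It assumes spans and battery saver periods arrive as plain dictionaries with `start_time` and `end_time` keys, which is a simplification chosen for the example and not an OTel data format. It splits spans by whether they overlapped a battery saver period and compares their mean durations.

```python
from statistics import mean


def overlaps(span, window):
    # Two time ranges overlap when each one starts before the other ends.
    return (
        span["start_time"] < window["end_time"]
        and window["start_time"] < span["end_time"]
    )


def compare_by_battery_saver(spans, saver_windows):
    """Returns (mean duration during battery saver, mean duration otherwise).

    Either value is None when no spans fall into that group.
    """
    during, outside = [], []
    for span in spans:
        duration = span["end_time"] - span["start_time"]
        if any(overlaps(span, w) for w in saver_windows):
            during.append(duration)
        else:
            outside.append(duration)
    return (
        mean(during) if during else None,
        mean(outside) if outside else None,
    )
```

A consistent gap between the two means, for example on app startup spans, would be one signal that battery saver mode is affecting performance, although a real analysis would need to control for device model and other conditions.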

The OTel ecosystem will expand to mobile

There is already a rich ecosystem of OTel instrumentation for backend libraries and technologies. The same doesn’t exist (yet) on mobile, but we believe that as more and more mobile engineers implement OTel and more SDK vendors become OTel-compliant, that will change.

Hopefully, we can move towards a future where telemetry and instrumentation really only need to be written once for commonly shared libraries in the mobile ecosystem. If you’d like to explore OpenTelemetry for mobile today, check out Embrace’s open source, OTel-compliant SDKs and join our Slack community to learn more about how to modernize your mobile observability.
