Roadmap Review: Sync and Storage

Firefox should wow users with a smart, personalized experience that’s available seamlessly across services, apps, and devices. User data storage and syncing are vital to this vision.

We propose to reframe user profile data as a pillar of the Firefox ecosystem, rather than continuing to allow our storage and syncing story to emerge piecemeal from the implementation details of each new product feature.

User data becomes a deliberately designed service layer on which many different product experiences are built. We use design reviews to reinforce the long-term importance of user data. And we invest technically in a more unified, extensible, flexible, cross-platform suite of storage and sync solutions that support a coherent set of approaches to storage.

Review question

Is this roadmap proposal congruent with Firefox’s strategy?

Lay Summary

Firefox, and the other new experiences that we want to create, rely on easily recording, combining, syncing, extending, and repurposing user data.

Desktop Firefox’s data is spread across 45+ different stores using 10+ different storage technologies, and we keep adding more.

In general, these stores and technologies are inconsistent with one another, largely undocumented, and not designed with syncing, extensibility, or reuse in mind.

Firefox Sync itself is incomplete, in every sense: it covers only some of our data, it behaves differently across our five client implementations, and it remains a lower-quality feature than we want.

Our current technologies are ill-suited to even our current sync and extensibility needs, let alone our predicted future needs.

Moreover, we have no off-the-shelf solution to adopt for new products or features that want to avoid these issues: the status quo is maintained. Teams either do significant amounts of redundant work to bootstrap new storage for each feature and for each platform, to add sync support, and to reach a baseline level of functionality, or they neglect platforms and functionality.

These problems are already limiting factors in product development. Recent examples:

Stakeholders are:

Brief

Altering a course set over 10+ years will be a multi-step process.

We propose expending effort to build replacement tools and systems that meet current and known future needs, supporting in-development Firefox features, and migrating existing Firefox data stores as it makes sense to do so.

Rather than just building single-platform MVPs, we hope to encourage engineers and product managers to consider the total cost of cross-platform features and integration, and to shift incentives towards consolidation: make the right thing easy, so that features get syncing, backup, extensibility, and fast querying without significant new work.

Mozilla needs a culture that places importance on user data: more forward-looking data modeling, implementations built to sync across platforms, and data that can be repurposed. One lever for creating that culture is the architecture review process itself.

We expect technical work to include designing and building a durable and scalable end-to-end sync and storage system for user data. This system should handle evolving data without the involvement of Sync engineers, without locking out clients on other platforms or releases, and without losing server data. Log-structured data is the industry standard for distributed systems with growing data and multiple consumers, so we expect this to be a key part of the solution.
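To make the log-structured approach concrete, here is a minimal sketch in Rust (the names and record shapes are invented for illustration, not a committed design): clients append immutable change records to a per-user log, and any consumer derives its current state by replaying that log.

```rust
use std::collections::BTreeMap;

/// One immutable entry in the change log. A real design would also
/// carry device ids, server-assigned sequence numbers, and (per the
/// encryption discussion below) ciphertext rather than cleartext.
#[derive(Clone, Debug)]
struct LogEntry {
    seq: u64,
    entity: String,    // e.g. "bookmark:abc123"
    attribute: String, // e.g. "title"
    value: String,
}

/// An append-only log plus one materialized view of current state.
#[derive(Default)]
struct LogStore {
    log: Vec<LogEntry>,
    // (entity, attribute) -> latest value; derived, never canonical.
    view: BTreeMap<(String, String), String>,
}

impl LogStore {
    fn append(&mut self, entity: &str, attribute: &str, value: &str) {
        let entry = LogEntry {
            seq: self.log.len() as u64,
            entity: entity.to_string(),
            attribute: attribute.to_string(),
            value: value.to_string(),
        };
        // The log is the source of truth; the view is merely derived.
        self.view.insert(
            (entry.entity.clone(), entry.attribute.clone()),
            entry.value.clone(),
        );
        self.log.push(entry);
    }

    /// A consumer added later rebuilds its own view from the same log,
    /// without coordinating with the writers that came before it.
    fn replay(&self) -> BTreeMap<(String, String), String> {
        let mut view = BTreeMap::new();
        for e in &self.log {
            view.insert((e.entity.clone(), e.attribute.clone()), e.value.clone());
        }
        view
    }
}

fn main() {
    let mut store = LogStore::default();
    store.append("bookmark:abc123", "title", "Mozilla");
    store.append("bookmark:abc123", "title", "Mozilla Home");
    assert_eq!(store.replay(), store.view);
}
```

The property that matters is visible in `replay`: because the log, not any one view, is canonical, new consumers and new data shapes can be added without locking out older clients, and server data need never be destructively migrated.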

We’ve captured some detailed requirements for sync and storage, and will continue to do so.

This is a relatively large piece of work. Mozilla’s needs are unusual: e.g., client-side crypto, disconnected writes, years of activity data stored on the client, and working without an account. This means that although the conceptual framework has been well understood for many years, off-the-shelf solutions tend not to suit our needs. (Chrome Sync’s implementation is similar to that of Firefox for iOS or of 2017/2018 desktop Firefox, and Google is presumably staffed to match the costs of this high-overhead approach; Google also defaults to non-private approaches.)

Getting this right unlocks the future of user data.

Technical work will involve several threads:

The Sync and Browser Architecture teams have already begun prototyping parts of this puzzle incrementally: in Q4, building a proof of concept of event observation and capture infrastructure, materializing a view for querying. We expect this POC to inform our implementation plan.

We also expect to build cross-platform storage and sync implementations partly or wholly in Rust, to build a future in which products throughout the Firefox ecosystem can share fast, reliable code instead of reinventing the wheel. With current staffing levels we have a hard time keeping our multiple codebases interoperable, let alone growing features at a healthy rate. Emily has already made progress on the groundwork for cross-platform Rust libraries.
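As a flavor of what shared components could look like (the exported names here are hypothetical), a single Rust crate can expose a C ABI that Swift on iOS, Kotlin on Android, and C++ in Gecko can all bind to; this sketch compiles as a library rather than an executable:

```rust
use std::ffi::CString;
use std::os::raw::c_char;

// One Rust implementation, exposed over a C ABI so that the same
// compiled library can be bound from every platform we ship on.

#[no_mangle]
pub extern "C" fn storage_component_version() -> *mut c_char {
    // Ownership of the string transfers to the caller, which must
    // hand it back to storage_string_free below.
    CString::new("0.1.0-prototype")
        .expect("static string contains no interior NUL")
        .into_raw()
}

#[no_mangle]
pub extern "C" fn storage_string_free(s: *mut c_char) {
    if s.is_null() {
        return;
    }
    // Reclaim the CString allocated on the Rust side of the boundary.
    unsafe {
        drop(CString::from_raw(s));
    }
}
```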

We will continue work on documenting Firefox’s user data and publishing our findings. In the course of preparing this proposal we discovered that essentially none of our storage systems are documented, sometimes to the point of not knowing what a file or database was for, and often not being able to tell whether a file was obsolete. We don’t even know which prefs will exist in about:config. It is hard to make good decisions without information.

We will continue to help guide new data-centric systems — including Lockbox and ET work like Foxy — towards a cooperative future that builds capabilities for our ecosystem strategy.

And we will guide the incremental consolidation of existing Firefox data stores into more capable systems. This might begin by building or buying replacements or abstractions, so the number of stores will grow before it shrinks. For example, libpref is ill-suited for storing data for front-end features; we should provide a durable, syncable store that meets their needs, and migrate features away from libpref, ultimately allowing it to be replaced by a simpler and more performant system that better meets its “pref-shaped” needs. It’s not yet reasonable to pin a number on how many stores we should have, but it’s almost certainly fewer than 45.
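As an illustration of the kind of store we might offer front-end features in libpref’s place (the API here is a sketch of ours, not a design), the essential difference is that every write is durable and tracked for sync by construction:

```rust
use std::collections::HashMap;

/// A change record for the sync layer to pick up later. A real store
/// would persist these durably alongside the values themselves.
#[derive(Debug)]
struct Change {
    key: String,
    timestamp: u64,
}

/// Hypothetical home for feature state that today lands in libpref.
#[derive(Default)]
struct FeatureStore {
    values: HashMap<String, String>,
    pending: Vec<Change>, // outgoing changes awaiting upload
    clock: u64,
}

impl FeatureStore {
    fn set(&mut self, key: &str, value: &str) {
        self.clock += 1;
        self.values.insert(key.to_string(), value.to_string());
        // Change tracking happens on every write, so features get
        // syncing without doing any per-feature sync work.
        self.pending.push(Change {
            key: key.to_string(),
            timestamp: self.clock,
        });
    }

    fn get(&self, key: &str) -> Option<&str> {
        self.values.get(key).map(String::as_str)
    }

    /// The sync engine drains tracked changes when it uploads.
    fn take_pending(&mut self) -> Vec<Change> {
        std::mem::take(&mut self.pending)
    }
}

fn main() {
    let mut store = FeatureStore::default();
    store.set("activity-stream.layout", "grid");
    assert_eq!(store.get("activity-stream.layout"), Some("grid"));
    assert_eq!(store.take_pending().len(), 1);
}
```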

Some of this work will involve measurement work on existing systems or product features; Firefox currently lacks a holistic view of its storage footprint and performance characteristics. We expect to work with product management and engineering to improve this.

We expect initial staffing for these efforts to be low: research and prototyping work (much of it not parallelizable) needs to be done for new storage, more of Firefox’s existing data landscape needs to be explored and documented, and the groundwork for Rust components needs to be completed. 3–4 FTEs is appropriate.

In 3–9 months we should be in a position to evaluate a number of more concrete directions, gradually committing more headcount to accelerate progress as opportunities become actionable.

We can call out some representative product end states of this effort:

Effort subsequently invested in consolidation should gradually reduce ongoing cross-platform maintenance burden, which — after the initial expense — should free up engineering resources for more useful product work.

Alternatives

Do Nothing

The main alternative to this roadmap proposal is to do nothing.

There’s no immediate additional cost to doing so, beyond, as Alex puts it, “shipping shitty products, really slowly!”

We will continue to bear the high cost of maintaining and extending Sync, we will continue to have a broad array of undocumented and inconsistent storage systems, Sync will continue to be a lower-quality feature than we want, and as a result we will continue to offer inconsistent and incomplete experiences across our products, harming mobile adoption.

We will be largely unable to offer Context Graph-like features on top of existing user data. Telemetry data and Pocket will thus be the foundation of Context Graph. Activity Stream will soon face significant difficulties in storing and syncing new data; ultimately they will end up building an ad hoc event storage system, and will collapse under the effort of replicating that work on iOS and Android where staffing is limited. New agent projects and mobile browsers will reinvent their own non-syncing storage, slowing development and convergence. Prototyping and experimentation will continue to be costly, particularly when integrating with or extending existing data.

Bottom-Up

Another alternative is to allow this work to happen bottom-up. We are skeptical of this possibility: evidence suggests that engineers have neither the incentive nor the leverage to tackle cross-component work like this. Conway’s Law applies. We are trying to make our architecture decision-making clearer, not push it under the radar. Worse, the bottom-up route encourages unsustainable heroism: if we think improvement is necessary, we should create a supportive environment for it to occur.

Rebuild Whole Products

A more ambitious alternative is to construct new products in the right way — up to and including constructing a new browser. Mozilla has no plans to ship a Servo-based browser as a replacement for Firefox, so this strategy certainly doesn’t fly on desktop. It doesn’t avoid cross-platform costs, as discussed above. And — more significantly — simply building a new product is no guarantee that the work will be done differently. Servo is a very modern renderer, but without incentives to make other choices, it has already started down Firefox’s path with its prefs storage. Zerda/Firefox Rocket’s initial data stores are bespoke and not syncable, despite having access to Fennec’s code, because that team’s incentives and the costs of reuse don’t align.

Improve Firefox Sync 1.5

We could attempt to invest in Firefox Sync 1.5 to try to make it incremental, extensible, durable, etc., rather than exploring alternative solutions. In the last 7+ years there’s been enough architectural motivation to design Sync 2.0, Kinto, QueueSync and TreeSync (the latter two during the short-lived PiCL project). This, and recent experiences with adding data to Sync 1.5, suggest that we’ve improved the old Weave architecture as far as it can go. Additionally, we are not staffed to grow all five Sync client implementations in parallel.

Competitive analysis and simplifying opportunities

Some of the requirements we’ve captured, and some aspects of the situation in which we find ourselves, might be malleable.

Some of these requirements constitute strategic choices: e.g., end-to-end encryption offers some strong privacy guarantees, but impacts recoverability and access by other services. The set of such strategic choices makes up a space of solutions. We expect the outcome of this roadmap to make progress towards defining a complementary set of points in this space.

In this section we briefly examine some of the choices available to us.

Encryption and servers

Storing data in a way that prevents Mozilla from seeing cleartext means that the server can’t help much with conflict resolution, versioning, data access, etc.

Chrome makes a different strategic choice, and is thus able to show your history on the web. Realm, a cross-platform database that competes with Core Data, offers automatic synchronization of its ‘realms’ between clients via a central Realm Object Server, with all data in cleartext.

A solution that could rely on cleartext data would be able to shift logic from the client to a single server infrastructure, which would increase agility and simplify the client.

The end state of this shift is something like Pancake or Facebook: a thin client with a rich server application and a server-canonical data model. This avoids synchronization entirely.

A number of systems use encryption at rest, but not end-to-end encryption. This allows for simpler web access, sharing, and recovery. Dropbox uses this model for files. This approach is not resistant to subpoena or intrusion.

This proposal assumes that end-to-end client-side encryption is table-stakes for some or all of Mozilla’s applications, and so we can’t disregard this requirement.
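A minimal sketch of what that table-stakes requirement implies for the system’s shape: records are sealed on the client with a key the server never holds, so the server can only store and serve opaque blobs. The XOR “cipher” below is a dependency-free stand-in, loudly not real cryptography; an actual client would use an AEAD such as AES-GCM with keys derived from the Firefox Account.

```rust
/// Placeholder "encryption": XOR against a repeating key. This stands
/// in for a real AEAD purely to keep the sketch self-contained.
fn seal(key: &[u8], plaintext: &[u8]) -> Vec<u8> {
    plaintext
        .iter()
        .zip(key.iter().cycle())
        .map(|(p, k)| p ^ k)
        .collect()
}

fn open(key: &[u8], ciphertext: &[u8]) -> Vec<u8> {
    seal(key, ciphertext) // XOR is its own inverse
}

/// The only thing the server ever holds.
struct OpaqueBlob(Vec<u8>);

fn main() {
    let client_key = b"never-leaves-the-client";
    let record = br#"{"url":"https://example.com","title":"Example"}"#;

    // What crosses the wire and lands in server storage:
    let uploaded = OpaqueBlob(seal(client_key, record));

    // The server cannot index, validate, or merge this; only a device
    // holding the key can recover the record. That is the cost (and
    // the point) of end-to-end encryption.
    assert_eq!(open(client_key, &uploaded.0), record.to_vec());
}
```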

Data-rich client applications

If we were to commit to storing less data on clients, and changing the set of stored data less frequently, then the cost-benefit analysis of this problem would shift, perhaps to the point that significant engineering investment should be avoided. To an extent this is similar to the alternative “Do Nothing”: we currently try to avoid growing or changing the data we store, and we change Sync very infrequently.

Chrome is an example of this: there is no evidence of Chrome moving towards anything more than its rudimentary history and bookmarks storage.

This proposal assumes that there is product desire to leverage more user data on clients, not less, and eventually we will have to tackle the issues outlined in this proposal.

Offline writes

If we were to make significant changes to the relationship between users and accounts, and between devices and servers, we could simplify the problem of syncing. For instance, if we linked all profile data to a Firefox Account from the moment Firefox first opens, repeated large-scale merges could be avoided. If we required all devices to be connected in order to record data (providing high availability for server storage), we could turn synchronization into a distributed write. These are still non-trivial engineering problems, but they’re different problems.

Most database systems we’ve examined, including Realm, Dropbox Datastore, CouchDB/PouchDB, and others, allow for offline writes.

This proposal assumes that our ability to tie writes to server resources is limited, for both technical and product experience reasons, but we have been thinking about this possibility.

The definition of ‘data’

Storage systems might store some or all of documents, small blobs (e.g., icons), large blobs (e.g., downloaded files), independent structured data (e.g., emails), interrelated structured data (e.g., social graphs or bookmark trees), measurements/events (e.g., sensor data), and more.

A system optimized for document-structured data will typically scale horizontally but lack features like cross-document transactions. A system optimized for graph traversal might not handle blobs well.

Choosing carefully which kinds of data to support in which systems makes it easier to meet requirements.

Tools like CouchDB and MongoDB aim to provide scale and simplicity in syncing by deferring conflict resolution, identity, and validation to application code, themselves working with a simpler data model. Dropbox’s decommissioned Datastore API treats datastores independently and merges records field-wise to resolve conflicts; that’s a good set of tradeoffs, but it leaves identity merging, data evolution, and validation to the app.
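For illustration, field-wise merging in the spirit of the Datastore approach might look like the following (our reconstruction under stated assumptions, not Dropbox’s code): a field conflicts only if both sides changed it away from the common base, and such ties are broken by a deterministic policy.

```rust
use std::collections::{BTreeMap, BTreeSet};

type Record = BTreeMap<String, String>;

/// Field-wise three-way merge: compare each field of the local and
/// remote versions against their common base.
fn merge(base: &Record, local: &Record, remote: &Record) -> Record {
    let keys: BTreeSet<&String> = base
        .keys()
        .chain(local.keys())
        .chain(remote.keys())
        .collect();
    let mut out = Record::new();
    for key in keys {
        let (b, l, r) = (base.get(key), local.get(key), remote.get(key));
        let winner = if l == b {
            r // only remote changed this field (or nobody did)
        } else if r == b {
            l // only local changed it
        } else {
            r // both changed it: arbitrary but deterministic policy
        };
        if let Some(v) = winner {
            out.insert(key.clone(), v.clone());
        }
    }
    out
}

fn main() {
    let base: Record = [("title", "Mozilla"), ("tags", "browser")]
        .iter()
        .map(|&(k, v)| (k.to_string(), v.to_string()))
        .collect();
    let mut local = base.clone();
    local.insert("title".into(), "Mozilla Home".into());
    let mut remote = base.clone();
    remote.insert("tags".into(), "browser,foss".into());

    // Edits to different fields merge cleanly: no record-level conflict.
    let merged = merge(&base, &local, &remote);
    assert_eq!(merged["title"], "Mozilla Home");
    assert_eq!(merged["tags"], "browser,foss");
}
```

As the text notes, this resolves field collisions but says nothing about identity merging, data evolution, or validation; those still land on the application.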

Strawman Roadmap

We expect to gradually firm up our understanding of concrete solutions and shipping vehicles over the course of the next two quarters: Phase 1.

Phase 1: building confidence (2017Q4–2018Q1)

We have a good idea of the characteristics of a solution for structured user data: one that allows for the data evolution, experimentation and iteration, encrypted sync, and reuse that our products need.

We have identified two main remaining puzzle pieces that the Browser Architecture team will explore: the client-side data pipeline that preserves existing query interfaces, and the concrete sync protocol that moves data around.

If we can’t tackle those, then there is no mainstream path forward for this part of the solution. If we’re wrong, we want to find out sooner rather than later.

Our first step, then, is to exercise those two parts by prototyping each, iterating as needed. If we fail, then we will reassess our approach; prototyping is the cheapest and fastest way to surface that failure.

The Sync team has a related plan for this timeframe to reduce uncertainty around Rust-based, cross-platform, dedicated storage APIs that wrap generic storage. This will build confidence in our ability to consolidate implementations across desktop and mobile while meeting the needs of product engineering teams, and will help define the developer experience that links application and service.
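One way to read “dedicated storage APIs that wrap generic storage” (all type and method names below are hypothetical): feature teams code against a typed, domain-specific interface, while durability and sync are implemented once, in the generic store underneath.

```rust
use std::collections::HashMap;

/// Generic storage: durability and sync get built against this once,
/// and every dedicated API delegates to it.
#[derive(Default)]
struct GenericStore {
    records: HashMap<String, String>, // key -> serialized record
}

impl GenericStore {
    fn put(&mut self, key: String, record: String) {
        self.records.insert(key, record);
    }
    fn get(&self, key: &str) -> Option<&String> {
        self.records.get(key)
    }
}

/// Hypothetical dedicated API for a logins feature: it owns the
/// domain model, but neither storage nor sync.
struct LoginsApi<'a> {
    store: &'a mut GenericStore,
}

impl<'a> LoginsApi<'a> {
    fn save_login(&mut self, host: &str, username: &str) {
        // A real implementation would serialize a structured record;
        // a flat string keeps this sketch dependency-free.
        self.store
            .put(format!("logins/{host}"), format!("username={username}"));
    }

    fn get_login(&self, host: &str) -> Option<&String> {
        self.store.get(&format!("logins/{host}"))
    }
}

fn main() {
    let mut generic = GenericStore::default();
    let mut logins = LoginsApi { store: &mut generic };
    logins.save_login("example.com", "emily");
    assert!(logins.get_login("example.com").is_some());
}
```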

In parallel, we can move forward with independent cleanup work: making APIs async (post 57 freedom), incrementally improving and teasing apart existing problematic stores (e.g., libpref), work being proposed elsewhere around process separation for storage, etc. These efforts will set the stage for productization, but also make the world a better place regardless of the success of this particular effort.

A successful outcome for this phase is one of: fast failure; a revised roadmap or strategy that reflects new knowledge; or a proposed set of design options that can be reviewed with an eye to tackling Phase 2.

Phase 2: attack the hardest thing, fail fast (2018Q1–Q3)

The largest single point of risk beyond basic functionality is addressing Places.

Places’ history store is a large source of value, Places-linked data is core to Sync, and it’s also a good test case: a high-volume store of structured user data with existing complex query consumers.

If we can’t tackle Places, then there’s much less value in proceeding.

We think the quickest and lowest-risk approach is to intermediate some writes to Places, turning parts of the Places database into a read-only materialized view. This stress-tests the client-side data pipeline, and allows us to explore syncing new data (e.g., containers, alternative representations for keywords and tags), recovering Places data from the log, etc., without disturbing existing consumers.
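A minimal sketch of that intermediation (event and view shapes invented for illustration): writes enter a log first, a stand-in for the Places read interface is maintained as a view derived from that log, and the view can always be rebuilt by replay, which is exactly the recovery property described above.

```rust
use std::collections::HashMap;

/// A visit as it enters the write log. Intermediating writes means
/// callers append these instead of writing to Places directly.
struct VisitEvent {
    url: String,
    at: u64, // timestamp
}

/// Stand-in for the existing Places read interface: consumers keep
/// issuing the same queries, now against a derived, read-only view.
#[derive(Default)]
struct PlacesView {
    visit_counts: HashMap<String, u32>,
    last_visit: HashMap<String, u64>,
}

impl PlacesView {
    fn apply(&mut self, e: &VisitEvent) {
        *self.visit_counts.entry(e.url.clone()).or_insert(0) += 1;
        let last = self.last_visit.entry(e.url.clone()).or_insert(0);
        *last = (*last).max(e.at);
    }
}

#[derive(Default)]
struct Pipeline {
    log: Vec<VisitEvent>, // durable, syncable source of truth
    places: PlacesView,   // read-only materialized view
}

impl Pipeline {
    fn record_visit(&mut self, url: &str, at: u64) {
        let e = VisitEvent { url: url.to_string(), at };
        self.places.apply(&e); // the view stays current...
        self.log.push(e); // ...but the log is what we sync and recover from
    }

    /// Places can be rebuilt from the log, e.g. after corruption or
    /// when a new kind of consumer appears.
    fn rebuild(&mut self) {
        self.places = PlacesView::default();
        let log = std::mem::take(&mut self.log);
        for e in &log {
            self.places.apply(e);
        }
        self.log = log;
    }
}

fn main() {
    let mut p = Pipeline::default();
    p.record_visit("https://mozilla.org/", 1);
    p.record_visit("https://mozilla.org/", 2);
    p.rebuild(); // derived state survives a full replay
    assert_eq!(p.places.visit_counts["https://mozilla.org/"], 2);
    assert_eq!(p.places.last_visit["https://mozilla.org/"], 2);
}
```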

At this stage we hope to find and fix scaling and performance issues before doing further work. If we fail here, the process has been both informative and relatively cheap.

As this phase proceeds we’ll design and build production server capacity for the sync system, begin to address fit-and-finish concerns like push notification integration, and continue to assess requirements from new product efforts.

Phase 3: consolidate (2018Q3–)

If we’re confident in our new sync implementation, can use it from apps, and can handle Places-level scale, then we can proceed with consolidation and simplification to add value to Firefox. We can fan out: pick suitable storage (e.g., data that’s high priority for syncing, an old store that’s not owned by an existing team, …), make sure it has an async API, and port it on top of the log, following the example of Places.

Early in this phase our documentation will go from adequate to great, we will place heavier emphasis on the developer experience and UI programming integration (e.g., GraphQL), and we expect to see further adoption in mobile apps and on the server as the cost of accessing Firefox data decreases.

We can also assess the urgency of exposing this kind of storage — both access to Firefox’s data, and dedicated storage — to WebExtensions.

These additional efforts will likely be assessed and planned closer to the time; it would be unrealistic to estimate resourcing or timing when priorities are likely to shift.

Resources

Requirements capture for Sync

Scoped encryption keys and access control for FxA

Vision statements

Blog posts

Thanks

Thanks to our readers: myk, asuth, mak, grisha, tspurway, snorp, overholt, ckarlof, wennie, selena, gps, trink, hildjj, and plenty more.

Questions

Are you going to require this everywhere?

No; it’s not applicable to all problems, nor does the cost-benefit calculation always play out. Decisions should be made based on suitability and value: e.g., to make some data available to other apps or platforms, to bring a store up to a feature baseline, or to integrate two data sources to support a new feature.

Does this mean we’re using Mentat?

Not necessarily. Our roadmap so far involves answering questions about how to provide capabilities that we think are important for Firefox’s strategy. It’s possible that parts of Mentat will evolve into parts of the solution — after all, it’s an embedded Rust log-structured store — but a solution will be integrated with other parts of the Firefox ecosystem, not just a database.

Does this restrict the design of Sync and Storage at Mozilla?

Not really. We expect FxA identity-attached services to proliferate in time, and we expect there to be justified divergent approaches to storage, too: here we focus mainly on structured user profile data, but elsewhere we are also helping to shape, for example, a redesign of libpref. What we hope is to provide a small spectrum of solutions that are better choices than building yet another store from scratch, and that are dramatically less restrictive than the current situation, achieving convergence at a healthy rate.

Does this help ET?

We hope so. Having reusable, portable components and data opens doors to exploration and development.

How do you plan to scope out the problem of scaling the service?

In short: by finding sets of properties that are beneficial for both the requirements we’ve captured for storage and syncing and also for management of a hosted service — including scalability — and designing a solution that reflects those properties.

This is something of a design question, but we have some idea of where we might go (we’re initially proposing log-structured storage), so we can answer in brief.

Questions from ckarlof (not a lot of time to answer these, so a little brief, but better than nothing!):

What are the outcomes we’re trying to achieve here and how do you know we’ve achieved them? What are your key metrics of success?

I see some “representative product end states”, but I feel these need to be more prominent in communicating your vision of success.

In addition to the concrete definitions of success in each phase, we’ll know the overall effort has been successful if:

Figuring out how to concretely measure our success is a task for after this review.

How many different complete options are you considering to achieve your proposed outcomes?

I feel there’s really only one under consideration, but I feel having one or two alternative approaches to reach your goal, even if they’re straw dogs, would help folks understand the merits of your favored approach better. Your competitive analysis is close to this, but some of the things you’re comparing against aren’t solutions to reach your vision. They’re compromises. Sell me on your vision.

We’re seeking agreement that the situation is as described, that our vision is desirable, and that the situation can reasonably be improved. The alternatives listed above are ‘roadmap alternatives’, rather than design alternatives, precisely because those are the different paths the organization could take other than investing in a suite of technical solutions.

We have some understanding of characteristics that we feel are important to forward-looking technical solutions, and the next steps on the roadmap are focused on exploring those characteristics. The results of those explorations will dictate the concrete approaches we consider.

For the approach you’re suggesting, what are the key choices in that approach that you want us to understand? Using Rust to implement a shared library is one such strategic choice. Are there others worth surfacing?

The proposal to use Rust is essentially an assertion of two things: that supporting multiple applications on multiple platforms is part of Mozilla’s future; and that native code is the right way to do that in this instance. Rust is, we will admit, an opinionated choice here.

There are other strategic/organizational opinions that lie beneath this document:

What would have to be true for your approach to be a winning one? What are your riskiest assumptions here?

I’m looking for a prioritized list of questions that need answers and assumptions that need to be validated, an order to tackle them in, and how you’re going to tackle them (ideally, as cheaply and as early as possible).

Phase 1 and Phase 2 capture what we think are the most significant technical risks: whether we can build a system that more closely meets the requirements we’ve captured for sync and storage, serves our strategic needs (e.g., around reusability), and maintains syncability, performance, and the developer experience (including preserving existing interfaces for as long as makes sense). Can we build it? Can we make it sync? Can we make it scale? Can we get people to understand it?

We’ll break these down into more detailed work items and questions as we go about working on those phases.

From a product perspective, most (if not all) of the following would have to be true for this to be a success and to meet the needs of the engineering teams we’ve spoken to: