Design Review: Key-Value Storage

We propose the standardization of a simple key-value storage capability, based on LMDB, that is fast, compact, multi-process-capable, and equally usable from JS, Java, Rust, Swift, and C++.

Document status

Roles

Summary

There is no one-size-fits-all solution for storage.

The best we can do to minimize engineering complexity is to cluster the varied requirements of various consumers around particular points in the solution space, reducing our large number of unmaintained ad hoc solutions into a smaller number of supported pieces of infrastructure.

The exploration effort that’s now underway as a result of 2017’s Sync & Storage Roadmap Review primarily addresses one such point: the capabilities required for structured, evolving, reusable, long-lived, syncable user data — data around which Firefox user-facing features are built.

The subject of this document is the polar opposite of that point: data that is unstructured or primitive, non-evolving (or at least with no advances on the current state of the art for managing change), special-purpose, and with no particular storage-level effort expended on reliable sync, if it syncs at all.

Components that store this kind of data often have a different set of required capabilities: a need for very fast reads, reads and writes from multiple processes, synchronous reads, control over filesystem impact, etc.

Examples of data like this are configuration data, certificate revocation lists, session storage, visited link caches, and simple extension storage.

The existing implementations of these features, if they exist, tend to use what’s available — flat text files, JSON files, Kinto, IndexedDB, libpref, or non-persisted in-process data structures repopulated via IPC or messaging [1]. In doing so they tend to work around missing capabilities in, and necessary limitations of, those tools. Some of these tools are hard to access from different languages, and most are next to impossible to use correctly in an embedding context [2].

Brief

We propose ‘buying’, not building, the core of such a solution, and wrapping it in idiomatic libraries that we can use on all platforms.

We propose that LMDB is a suitable core (see Appendix A for options considered): it is compact (32KB of object code), well-tested, professionally maintained, reliable, portable, scales well, and is exceptionally fast for our kinds of load. We have engineers at Mozilla with prior experience with LMDB, and their feedback is entirely positive.

This key-value storage system can be targeted at a number of new consumers that might otherwise need to roll their own storage, or (ab)use preferences:

and potentially displace or support a number of existing solutions:

It should provide a sensible, scalable, and efficient default choice for consumers that now use hand-rolled key-value systems.

The core wrapping library provides safety (both padding LMDB’s sharp-edged API and protecting against system limitations like NFS locking), improves developer ergonomics (e.g., tracking value types, seamlessly resizing databases, even automatic sharding), and consolidates best practices (e.g., asynchronous reads and writes, versioning).

Idiomatic per-environment APIs can then take care of platform integration on top of a shared representation, managing things like string conversions, Future/Promise/Deferred, etc.

We expect the resulting APIs to be simpler (and more opinionated) than LMDB’s, and for ongoing maintenance of language wrappers to be relatively cheap. We have recent and relevant experience with Rust wrappers for C++, Swift, and Java, so we’re not too worried about this.
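As a rough illustration of the intended shape (typed values, a small opinionated surface, none of LMDB's raw flags), here is a minimal Rust sketch; every name, type, and signature in it is hypothetical, a stand-in for the eventual wrapper rather than a description of it.

```rust
// Hypothetical shape of the opinionated wrapper: typed values, a tiny
// consumer-facing surface, and no exposure of LMDB's flags or pointers.
// A BTreeMap stands in for the storage engine; only the API shape matters.

use std::collections::BTreeMap;

/// Values are typed at the storage layer, so consumers don't invent
/// their own ad hoc encodings.
#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    Bool(bool),
    I64(i64),
    F64(f64),
    Str(String),
    Blob(Vec<u8>),
}

pub struct Store {
    data: BTreeMap<String, Value>,
}

impl Store {
    pub fn open(_path: &str, _max_size_bytes: usize) -> Result<Store, String> {
        // Real version: open or create an LMDB environment at `_path` with
        // the given maximum size; here the in-memory map stands in for it.
        Ok(Store { data: BTreeMap::new() })
    }

    pub fn put(&mut self, key: &str, value: Value) -> Result<(), String> {
        self.data.insert(key.to_string(), value);
        Ok(())
    }

    pub fn get(&self, key: &str) -> Option<&Value> {
        self.data.get(key)
    }
}

fn main() -> Result<(), String> {
    let mut store = Store::open("search-config", 1 << 20)?;
    store.put("engine.default", Value::Str("example".into()))?;
    assert_eq!(store.get("engine.default"),
               Some(&Value::Str("example".into())));
    Ok(())
}
```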

Explicit non-goals for this proposal are:

Not-yet or never goals for this proposal are:

Content process use

Particular thanks to Randell and gcp for this input.

One appealing aspect of LMDB is its relative ease of use from multiple processes, above and beyond its basic capabilities as yet-another-fast-key-value-store.

We expect to have multiple non-content processes in Firefox that could use this capability.

We also expect to have lots — potentially dozens or hundreds — of content processes. Naturally it’s not desirable from a security perspective to allow arbitrary reads and writes, and arbitrary filesystem access, from content processes. There is a tension between performance and memory usage on the one hand and a reduced attack surface on the other: we can have a locked-down content process, or efficient sharing, but not both.

LMDB uses mmap, which requires some amount of filesystem access. It requires write access to the database lock file for coordination. In ordinary use, the memory-mapped region is read-only, with only the lock file requiring writes.
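For instance, a restricted process could open the environment read-only, so the data file itself needs only read access; the lock file still needs write access so that readers can register themselves, unless we opt into MDB_NOLOCK and take on coordination ourselves. The sketch below assumes an lmdb-rkv-style Rust binding (the flag names are that binding's, not a committed part of our design).

```rust
// Sketch, assuming an lmdb-rkv-style binding: open the environment
// read-only for a restricted process. The memory map is read-only; the
// lock file is still written for reader registration unless NO_LOCK is
// used, which shifts coordination onto us.

use lmdb::EnvironmentFlags;
use std::path::Path;

pub fn open_read_only(path: &Path) -> Result<lmdb::Environment, lmdb::Error> {
    lmdb::Environment::new()
        .set_flags(EnvironmentFlags::READ_ONLY)
        .open(path)
}
```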

There are several possibilities for use from the content process.

We do not propose solving this up-front: the worst-case scenario is no different to what we do now, but with the benefit of easier coordination between non-content processes, and it’s possible to make different choices in the future.

Open and partially answered questions

32-bit systems

32-bit systems have limited address space for memory-mapped files. Will this restriction bite us? How large can we make data stores without running into these issues?

An LMDB file’s maximum size is specified when the environment is opened. Address space usage is dictated by this maximum, not by the amount of data actually stored.

64-bit users make up at least 80% of our hardware report (we under-count). On Android we don’t yet ship ARM64, but intend to do so this year.

A preliminary survey of crash-stats suggests that we’ll be fine for reasonably sized (< 256KB or < 1MB) databases.

oom-small is about 4% of our crash rate. Most of those are genuine memory pressure: 2GB or 4GB devices with 75%+ memory usage. Quite a few of the reports are nonsense: a failure to allocate 8 bytes with gigabytes of free RAM and virtual memory.

If I search for oom, < 4GB total virtual memory, memory usage < 80%, and < 30 seconds uptime (that is, a startup OOM where we shouldn’t actually have run out of memory), we get about 350 oom-small crashes per week across Firefox 58-60. About 10% of those are on Android; the rest are on Windows.

At those rates, typical LMDB usage should be lost in the noise. It might even reduce OOMs: the entire mmapped file occupies address space, but the actual resident set will be smaller than that of reading an entire JSON file into memory and causing an oom-large in the JS engine or in file-writers like ChunkedJSONWriteFunc, both of which appear in crash-stats.

Windows support

How well does LMDB work on Windows? About 85% of our desktop users are on Windows.

Our reading suggests that, as of 2016, LMDB works well on Windows, including correct use of sparse files, but we should verify this experimentally.

Android support

How well does LMDB work on our supported Android versions?

Remote filesystem support

LMDB’s documentation recommends not using it on network filesystems [4]. The principal obstacle is the lack of advisory locking.

We already have some accommodations for SQLite on NFS; it’s possible that we could apply the same approach here. Depending on exactly which issues occur (locking, syncing, or others), we might need a variety of approaches; the ultimate workaround is to put a ‘working copy’ of each LMDB file in temporary storage, and use the atomic snapshot feature to copy it back to the profile in the background at some point before crash/quit. Some notes:

Being careful

LMDB has a number of straightforward restrictions: no multiple opens of the same environment from a single process, no more than one top-level transaction open at a time within a single thread [5], careful handling of locks when processes crash, and the need to clean up stale readers afterwards. Will these restrictions be onerous? Current work suggests that a layer or two of Rust helps a lot with the sharp edges.
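One plausible way to keep the single-open-per-process rule from being onerous is a process-wide registry that hands out shared handles keyed by canonical path. The sketch below is hypothetical (the Env type stands in for the real environment), not a design commitment:

```rust
// Sketch: a process-wide registry ensuring each LMDB environment is
// opened at most once per process. `Env` stands in for the real
// environment type; all names are illustrative.

use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex, OnceLock};

pub struct Env {
    path: PathBuf,
}

impl Env {
    fn open(path: &Path) -> Env {
        // Real version: open the LMDB environment here, once per process.
        Env { path: path.to_path_buf() }
    }

    pub fn path(&self) -> &Path {
        &self.path
    }
}

fn registry() -> &'static Mutex<HashMap<PathBuf, Arc<Env>>> {
    static REGISTRY: OnceLock<Mutex<HashMap<PathBuf, Arc<Env>>>> = OnceLock::new();
    REGISTRY.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Returns the already-open environment for `path`, or opens it exactly
/// once, so callers can't violate the one-open-per-process restriction.
pub fn get_or_open(path: &Path) -> Arc<Env> {
    let canonical = path.canonicalize().unwrap_or_else(|_| path.to_path_buf());
    let mut map = registry().lock().unwrap();
    map.entry(canonical.clone())
        .or_insert_with(|| Arc::new(Env::open(&canonical)))
        .clone()
}
```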

Defaults

What kinds of file sizes and defaults give reasonable behavior on our supported platforms? How easily can these be changed later? LMDB requires some up-front tuning and explicit limits (e.g., the maximum map size and the maximum number of open named databases in an environment).
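For concreteness, with an lmdb-rkv-style builder (an assumption about the binding, not a decision), the tunables are fixed when the environment is opened:

```rust
// Sketch of up-front tuning, assuming an lmdb-rkv-style builder API.
// The map size determines how much address space is reserved, which is
// what matters on 32-bit systems; named-database and reader limits are
// also fixed here.

use std::path::Path;

pub fn open_tuned_env(path: &Path) -> Result<lmdb::Environment, lmdb::Error> {
    lmdb::Environment::new()
        .set_map_size(10 * 1024 * 1024) // 10MB maximum database size
        .set_max_dbs(8)                 // named databases per environment
        .set_max_readers(126)           // concurrent read transactions
        .open(path)
}
```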

Binary storage

How can we protect ourselves against variations in platform/runtime encoding? Endianness, string encoding, Rust repr…? Initial exploratory work used bincode for this, but we need to validate that choice.
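A sketch of what that exploratory approach looks like, assuming serde plus bincode 1.x (the exact crates and wire format are precisely what needs validating):

```rust
// Sketch of typed-value encoding with serde + bincode 1.x. bincode's
// default configuration encodes integers with a fixed width and byte
// order, which helps, but cross-platform stability still needs
// validating before we commit to it.

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug, PartialEq)]
enum Value {
    Bool(bool),
    I64(i64),
    Str(String),
    Blob(Vec<u8>),
}

fn main() -> Result<(), bincode::Error> {
    let original = Value::Str("hello".into());
    let bytes: Vec<u8> = bincode::serialize(&original)?; // value -> bytes for LMDB
    let decoded: Value = bincode::deserialize(&bytes)?;  // bytes from LMDB -> value
    assert_eq!(original, decoded);
    Ok(())
}
```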

Resilience to corruption

What level of external corruption (e.g., disk issues) or internal corruption (e.g., writing malformed data) do we need to handle? We are very careful about things like session storage (R: how careful? Do we handle a failure to parse due to corrupt JSON? Do we handle a failure to parse the compressed file due to corruption?), but many of our existing stores either fail to init, breaking the browser, or discard the corrupt file. Do we need to do better? Is LMDB more or less prone to issues?

Are customers who need recoverability from corruption adequately served by other points in the solution space, or by features of this one (e.g., LMDB’s ability to take consistent live backups)?

LMDB’s live backup seems like it could be used to build a sensible backup/migration strategy, but it would be good to understand how important we consider this requirement before investing effort in building capabilities. Even if we have some potential consumers, maybe those aren’t the first ones.

Observers/change notifications

What approaches to observers/notifications make sense? We don’t get this out of the box for most stores, particularly not cross-process; we’re breaking new ground. We can learn some lessons from observer and preference change notifications in Firefox.
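Purely as a strawman for that discussion, and covering only the easy in-process half, the registration surface might look like this (all names hypothetical):

```rust
// Hypothetical in-process observer registry: callbacks keyed by key
// prefix, invoked after a successful write commit. Cross-process
// notification (the hard part) is not addressed here.

use std::collections::HashMap;

type Callback = Box<dyn Fn(&str) + Send>;

#[derive(Default)]
pub struct Observers {
    by_prefix: HashMap<String, Vec<Callback>>,
}

impl Observers {
    pub fn observe(&mut self, prefix: &str, callback: Callback) {
        self.by_prefix.entry(prefix.to_string()).or_default().push(callback);
    }

    /// Called by the storage layer after a write transaction commits.
    pub fn notify(&self, changed_key: &str) {
        for (prefix, callbacks) in &self.by_prefix {
            if changed_key.starts_with(prefix) {
                for cb in callbacks {
                    cb(changed_key);
                }
            }
        }
    }
}
```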

Performance

Does a memory-mapped solution give us:

Can LMDB beat IPC-coordinated shmem or flat-file solutions for inter-process data sharing?

Shipping vehicle(s)

It makes sense to have at least one target consumer in mind when building out a capability. We have lots of options; which ones will have the biggest potential upside, the lowest risk, and offer the best learning opportunities?

Cross-process, cross-language use examples

Chrome shares its visited link set between the main process and content processes via a complicated shared memory scheme: the main process constructs chunks and coordinates handoff to content processes via IPC.

In Firefox for Android we use JNI to allow the Gecko main thread to read from and add to a Java-side data structure that serves a similar purpose.

GeckoView will, eventually, distinguish between the application process (started by Android, runs the enclosing Java application) and Gecko’s main process. These two processes will communicate via Binder.

In contrast to Fennec, application data — including history data — won’t be in the same process as the Gecko main process. Sharing persistent data in this situation is well-suited to LMDB: the same LMDB database could be opened from both the application (Java) process and the Gecko main process. The two processes would need a shared understanding of the data format, the file path, and the required settings. Given those, the application could add and remove entries from the database, and the Gecko main process could read (and, indeed, write) without additional coordination.
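A minimal sketch of that shared-open pattern, written here in Rust against an lmdb-rkv-style binding for illustration; in practice one side would be the Java application process (via a binding) and the other the Gecko main process, running as separate processes rather than two functions in one program:

```rust
// Sketch: both sides open the same environment with the same path and
// settings; LMDB's lock file coordinates readers and the single writer
// across processes. Assumes an lmdb-rkv-style binding.

use lmdb::{Transaction, WriteFlags};
use std::path::Path;

fn writer_process(path: &Path) -> Result<(), lmdb::Error> {
    let env = lmdb::Environment::new()
        .set_map_size(10 * 1024 * 1024)
        .open(path)?;
    let db = env.open_db(None)?; // default (unnamed) database
    let mut txn = env.begin_rw_txn()?;
    txn.put(db, &"visited:https://example.com", &"1", WriteFlags::empty())?;
    txn.commit()
}

fn reader_process(path: &Path) -> Result<bool, lmdb::Error> {
    let env = lmdb::Environment::new()
        .set_map_size(10 * 1024 * 1024)
        .open(path)?;
    let db = env.open_db(None)?;
    let txn = env.begin_ro_txn()?;
    Ok(txn.get(db, &"visited:https://example.com").is_ok())
}

fn main() -> Result<(), lmdb::Error> {
    let path = Path::new("/tmp/shared-visited");
    std::fs::create_dir_all(path).ok();
    writer_process(path)?;          // in reality: the application (Java) process
    assert!(reader_process(path)?); // in reality: the Gecko main process
    Ok(())
}
```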

Planning

Executing on this proposal consists of five overlapping phases.

  1. Evaluation. For each open question, research, measure (either through prototypes or through probes in Firefox itself), or write code in order to answer it. In particular:
    1. Windows support.
    2. Android support.
    3. NFS.
  2. Design. It’s likely that some kind of profile-linked manager will be needed to control access to storage. An API needs to be defined and documented. Design and evaluation are intertwined.
  3. Productization. Produce and document:
    1. A core library (using the stable Rust wrapper for safety and ergonomics).
    2. Build integration with mozilla-central and Gecko, and an idiomatic interface that can be consumed by C++ code.
    3. … and chrome JS code.
    4. A Swift wrapper around our core Rust library.
    5. A Java library that allows for independent use of that library, and shared data use with Gecko.
    6. Baseline performance tests to make sure we stay fast.
  4. Leverage. Use these libraries in consumers, migrating older systems as appropriate. We propose the following as the initial set of consumers, depending on staffing:
    1. XULStore (existing).
      • Addresses a concern raised by the XUL replacement work. Replaces a JS XPCOM component that’s in the hot path for window creation. Stores a manageable amount of data.
    2. Search service cache (search.json.lz4).
      • Binary data storage: opportunity for size wins. The component currently compresses to halve disk usage… but 80% of that space usage is base64 and percent-encoded image data!
      • Read asynchronously as the first thing the component does, so a good perf challenge.
      • It’s a cache: no need to address migration, and rollback is easy.
      • Opportunity to rework how the component manages state (e.g., no need to read the descriptions of each engine during init).
    3. Devtools configuration (new).
      • Meets an unmet need, and involves figuring out content process communication.
    4. Security storage (new/replacement).
      • We currently store some certificate data (revocations, whitelists, HSTS, etc.) in a mixture of Kinto (for delivery) and plain-text files.
      • This data needs a mixture of synchronous and asynchronous access.
      • Infrequently updated. Can be repopulated from the server if necessary.
      • Blob storage.
      • Two interesting needs: ideally this data is shared between profiles (no sense in having it copied), and usable on Android, iOS, and from C++, Rust (soon), and JS in Gecko.
    5. A new Android capability to support new Fennec and GeckoView work.
      • Exercises another platform, and explores how the system meets the needs of a diverse set of developers.
  5. Analysis of other consumers, existing and nascent. There are dozens of stores in Firefox on each platform. Some of these will be better suited to this kind of key-value store than their existing storage system (and some will be better suited to other capabilities that we’re building). Still others will not be worth the effort to port. Assess each consumer along axes: size, complexity, risk, technical debt, cross-platform costs and potential value. This will inform subsequent porting efforts.

We have begun building a simple production-quality Rust wrapper to implement storage of typed values and to begin to impose an opinionated and safe interface. This can be used for evaluation and as a testbed for design, ultimately being stabilized and productized.

Footnotes

Appendix A: survey of key-value stores

IndexedDB

A web-standard storage abstraction layer. In Chrome, uses LevelDB for storage. In Firefox, currently backed by SQLite. Both use remoting for multi-process access. Only accessible from JS.

SQLite

Well-tested and already ships in Firefox. Crypto available via SQLCipher. In-memory (non-persistent) storage is easy. Built-in VACUUM is more convenient than compaction in LMDB. Disk sizes are approximately equivalent to LMDB. Known good behavior on 32-bit platforms.

WAL versus COW — writes, and database opens after crashes, are less predictable due to WAL checkpointing. Reads likely to be on the order of microseconds, not nanoseconds; not suitable to back a synchronous API. Reader concurrency not quite as good as LMDB. SQLite will have a higher baseline of memory usage: real allocated memory per connection, not mapped virtual memory.

Benchmarking shows SQLite performing relatively poorly on key-value workloads compared to LMDB. On published benchmarks, LMDB has 47x SQLite’s throughput for sequential reads (I measured 80x for string keys) and 9x for random reads; sequential writes are 8x, and random writes 5.7x. Batching makes LMDB writes significantly faster thanks to the amortized cost of COW, and reading sequential keys is very fast.

Using SQLite instead of a specialized key-value store would trade performance (particularly read performance) for existing deployment experience.

SQLite is the only real contender for delivering this capability in Firefox. It is conceivable that we could use the same opinionated interface to front both SQLite and LMDB. There is negligible increase in object size to ship both, because LMDB is so small.

My own runs of benchmarks, adjusting the SQLite schema and using WAL, with key times highlighted in bold:

String keys:

SQLite:

readrandom : 8.110 micros/op 123310 ops/sec; (1000000 of 1000000 found)
**readseq : 2.586 micros/op 386687 ops/sec; 36.9 MB/s**
readreverse : 2.580 micros/op 387580 ops/sec; 37.0 MB/s

fillrandsync : 82.125 micros/op 12176 ops/sec; 1.3 MB/s (1000 ops)
fillrandom : 24.933 micros/op 40107 ops/sec; 4.4 MB/s
**fillrandbatch : 16.004 micros/op 62483 ops/sec; 6.9 MB/s**
fillseqsync : 31.004 micros/op 32253 ops/sec; 3.6 MB/s (1000 ops)
fillseq : 12.433 micros/op 80430 ops/sec; 8.9 MB/s
fillseqbatch : 2.108 micros/op 474321 ops/sec; 52.5 MB/s
overwrite : 36.793 micros/op 27179 ops/sec; 3.0 MB/s

LMDB:

readrandom : 1.070 micros/op 934143 ops/sec; (1000000 of 1000000 found)
**readseq : 0.032 micros/op 31309684 ops/sec; 3463.7 MB/s**
readreverse : 0.023 micros/op 42877969 ops/sec; 4743.4 MB/s

fillrandsync : 166.412 micros/op 6009 ops/sec; 0.7 MB/s (1000 ops)
fillrandom : 5.176 micros/op 193206 ops/sec; 21.4 MB/s

Integer keys (faster reads, slightly smaller file):

SQLite:

readrandom : 4.258 micros/op 234843 ops/sec; (1000000 of 1000000 found)
readseq : 0.281 micros/op 3560505 ops/sec; 339.6 MB/s
readreverse : 0.261 micros/op 3826081 ops/sec; 364.9 MB/s

LMDB:

readrandom : 0.548 micros/op 1825317 ops/sec; (1000000 of 1000000 found)
readseq : 0.028 micros/op 35937612 ops/sec; 3701.5 MB/s
readreverse : 0.019 micros/op 53358945 ops/sec; 5495.8 MB/s

fillseqsync : 165.802 micros/op 6031 ops/sec; 0.7 MB/s (1000 ops)
fillseq : 3.360 micros/op 297576 ops/sec; 32.9 MB/s
fillseqbatch : 0.504 micros/op 1983717 ops/sec; 219.5 MB/s
overwrite : 5.099 micros/op 196105 ops/sec; 21.7 MB/s

LevelDB

Google-sourced. Implemented in C++. Not lightweight (306KB macOS release dylib). Non-ACID, no transactions: only read-snapshots and atomic batch writes. Poor reputation for data loss and consistency bugs. Writes are automatically compressed, trading CPU for smaller on-disk writes. Coarse-grained locking. LSM trees are write-optimized. Not multiprocess. Windows support is second-class, and iOS regressions occur.

LSM-tree LevelDB derivatives (RocksDB (Facebook), HyperLevelDB, Basho’s LevelDB fork, etc.)

Dgraph’s summary:

LSM tree based stores provide high write performance by sequentially writing key-value pairs in memory in the foreground, and then arranging them in multi-tiered levels on disk in the background. This approach is not without tradeoffs though. Values are written multiple times when arranging them in the tree (known as write amplification) and a single read might need to read multiple levels in the LSM tree before finding a value (known as read amplification).

These libraries all aim to be faster or more scalable than LevelDB, typically by improving thread parallelism or altering how compaction works (e.g., RocksDB offers alternative compaction strategies). All are targeted at server workloads.

Bolt

A Go implementation of LMDB. B+tree. Not multiprocess.

MDBM

A memory-mapped hash/cache (it would be a stretch to call it persistent) implemented in C, with C++ and Perl bindings. Loses data by default (manual sync), not resistant to power loss/crash, vulnerable to data corruption. No transactions. Keys aren’t sorted, so no range queries or cursors. Building your own persistent storage/log is recommended.

https://github.com/yahoo/mdbm

https://news.ycombinator.com/item?id=8732891

Kyoto Cabinet

Reportedly very slow. Not multiprocess.

BerkeleyDB

Very long pedigree. For our purposes, essentially obsoleted in every way by LMDB.

LMDB

A distillation of experience with BerkeleyDB and use in OpenLDAP. Implemented in C.

Dgraph gives an excellent brief summary:

LMDB provides a key-value stored using B+ trees. It has ACID semantics and support for transactions. It does not do any background work and has a crash resilient design that uses MVCC instead of locking. This means readers operate on an isolated snapshot of the database and don’t block. The codebase is quite small and portable, so LMDB runs across multiple platforms including Android, BSD, and Solaris. This talk by Howard Chu at Databaseology 2015 goes into much more details about LMDB’s design.

It is generally understood that B+ tree based stores are ideal for workloads which are read intensive.

Read-optimized, very lightweight (32KB), predictable performance. Copy-on-write mmap B+tree. Zero-copy reads. Handles larger values better than log-based approaches, with less write amplification. Database sizes are limited by address space (~128TB on 64-bit machines). Single-writer multiple-reader (linear scaling in readers, and total throughput reportedly better with serialized writes).

nanj’s experience:

Some less well-known merits of LMDB:

Some lessons we’ve learnt:

https://banksco.de/p/lmdb-the-leveldb-killer.html

https://symas.com/is-lmdb-a-leveldb-killer/

ActorDB

Distributed SQL database.

https://github.com/biokoda/actordb

Built on top of SQLite and LMDB: it replaces SQLite’s pager with LMDB, and hooks into SQLite’s WAL for Raft-based replication between actors. Developed for applications such as file syncing.

nanj’s observations:

Let me share some observations on this project. Disclaimer: I’ve never used it; all my understanding of it comes from the following blog posts and from skimming its source code.

http://blog.biokoda.com/post/112206754025/why-we-built-actordb

http://blog.biokoda.com/post/133121776825/actordb-how-and-why-we-run-sqlite-on-top-of-lmdb

What’s ActorDB

In short, the ActorDB team wants to build a file synchronisation solution anyone could install and use.

“The best way we can describe ActorDB is that it is an ideal server-side database for apps. Think of running a large mail service, dropbox, evernote, etc. They all require server side storage for user data, but the vast majority of queries to the database is within a specific user”

The author believes that a distributed SQL database is the best storage solution to this.

Features

Other Observations

Although it’s unclear how well ActorDB works out in production, and there are some limitations in its design and implementation, I think it still provides some interesting points for further investigation: the approach of integrating LMDB with SQLite, and the introduction of WAL replication via Raft.

Questions for the review: