We propose the standardization of a simple key-value storage capability, based on LMDB, that is fast, compact, multi-process-capable, and equally usable from JS, Java, Rust, Swift, and C++.
There is no one-size-fits-all solution for storage.
The best we can do to minimize engineering complexity is to cluster the varied requirements of various consumers around particular points in the solution space, reducing our large number of unmaintained ad hoc solutions into a smaller number of supported pieces of infrastructure.
The exploration effort that’s now underway as a result of 2017’s Sync & Storage Roadmap Review primarily addresses one such point: the capabilities required for structured, evolving, reusable, long-lived, syncable user data — data around which Firefox user-facing features are built.
The subject of this document is the polar opposite of that point: data that is unstructured or primitive, non-evolving (or at least with no advances on the current state of the art for managing change), special-purpose, and with no particular storage-level effort expended on reliable sync, if it syncs at all.
Components that store this kind of data often have a different set of required capabilities: a need for very fast reads, reads and writes from multiple processes, synchronous reads, control over filesystem impact, etc.
Examples of data like this are configuration data, certificate revocation lists, session storage, visited link caches, and simple extension storage.
The existing implementations of these features, if they exist, tend to use what’s available — flat text files, JSON files, Kinto, IndexedDB, libpref, or non-persisted in-process data structures repopulated via IPC or messaging1. In doing so they tend to work around missing capabilities in, and necessary limitations of, those tools. Some of these tools are hard to access from different languages, and most are next to impossible to use correctly in an embedding context2.
We propose ‘buying’, not building, the core of such a solution, and wrapping it in idiomatic libraries that we can use on all platforms.
We propose that LMDB is a suitable core (see Appendix A for options considered): it is compact (32KB of object code), well-tested, professionally maintained, reliable, portable, scales well, and is exceptionally fast for our kinds of load. We have engineers at Mozilla with prior experience with LMDB, and their feedback is entirely positive.
This key-value storage system can be targeted at a number of new consumers that might otherwise need to roll their own storage, or (ab)use preferences:
and potentially displace or support a number of existing solutions:
It should provide a sensible, scalable, and efficient default choice for consumers that now use hand-rolled key-value systems.
The core wrapping library provides safety (both padding LMDB’s sharp-edged API and also protecting against system limitations like NFS locking), improves developer ergonomics (e.g., tracking value types, seamlessly resizing databases, even automatic sharding), and consolidates best practices (e.g., asynchronous reads and writes, versioning).
Idiomatic per-environment APIs can then take care of platform integration on top of a shared representation, managing things like string conversions,
We expect the resulting APIs to be simpler (and more opinionated) than LMDB’s, and for ongoing maintenance of language wrappers to be relatively cheap. We have recent and relevant experience with Rust wrappers for C++, Swift, and Java, so we’re not too worried about this.
Explicit non-goals for this proposal are:
Not-yet or never goals for this proposal are:
Particular thanks to Randell and gcp for this input.
One appealing aspect of LMDB is its relative ease of use from multiple processes, above and beyond its basic capabilities as yet-another-fast-key-value-store.
We expect to have multiple non-content processes in Firefox that could use this capability.
We also expect to have lots — potentially dozens or hundreds — of content processes. Naturally it’s not desirable from a security perspective to allow arbitrary reads and writes, and arbitrary filesystem access, from content processes. There is a tension between performance, reducing memory usage, etc. and reduced attack surface — we can have a locked-down content process, or efficient sharing, but not both.
LMDB uses mmap, which requires some amount of filesystem access. It requires write access to the database lock file for coordination. In ordinary use, the memory-mapped region is read-only, with only the lock file requiring writes.
There are several possibilities for use from the content process.
We do not propose solving this up-front: the worst-case scenario is no different to what we do now, but with the benefit of easier coordination between non-content processes, and it’s possible to make different choices in the future.
32-bit systems have limited address space for memory-mapped files. Will this restriction bite us? How large can we make data stores without running into these issues?
LMDB files’ maximum size is specified at opening time. Address space usage is dictated by this maximum size.
64-bit users make up at least 80% of our hardware report (we under-count). On Android we don’t yet ship ARM 64 , but intend to do so this year.
Preliminary survey of crashstats suggests that we’ll be totally fine for reasonable (< 1MB or < 256KB) database volumes.
oom-small is about 4% of our crash rate. Most of those are genuine memory pressure: 2GB or 4GB devices with 75%+ memory usage. Quite a few of the reports are nonsense — failure to allocate 8 bytes with gigs of free RAM and VM.
If I search for oom, <4GB total virtual memory, memory usage <80%, and < 30 seconds uptime — that is, a startup OOM where we shouldn’t actually have run out of memory — we get about 350 oom-small crashes per week across Firefox 58-60. 10% of those are Android, and the rest are Windows.
At those rates, typical LMDB usage should be lost in the noise. It might even reduce OOMs: the entire mmapped file occupies address space, but the actual resident set will be less than snarfing an entire JSON file into memory and causing an oom-large in the JS engine or in file-writers like
ChunkedJSONWriteFunc, both of which appear in crash-stats.
How well does LMDB work on Windows? About 85% of our desktop users are on Windows.
Reading suggests that as of 2016, LMDB should work just fine, even using sparse files correctly, but we should experimentally verify.
How well does LMDB work on our supported Android versions?
LMDB’s documentation recommends not using it on network filesystems4. The principal obstacle is the lack of advisory locking.
We already have some accommodations for SQLite on NFS. It’s possible that we could apply the same. Depending on exactly which issues occur — locking, syncing, or other — we might need to find a variety of approaches; the ultimate workaround is to put a ‘working copy’ of each LMDB file in temporary storage, and use the atomic snapshot feature to copy them back to the profile in the background at some point before crash/quit. Some notes:
LMDB has a number of straightforward restrictions: no multiple opens from a single process, no multiple top-level transactions open at the same time within a single thread5, careful handling of locks when processes crash, the need to clean up stale readers if they crash. Will these restrictions be onerous? Current work suggests that a layer or two of Rust helps a lot with the sharp edges.
What kinds of file sizes and defaults give reasonable behavior on our supported platforms? How changeable are these later? LMDB requires some amount of tuning and specifying limits (e.g., maximum number of open named databases in an environment).
How can we protect ourselves against variations in platform/runtime encoding issues? Endianness, string encoding, Rust repr…? Initial exploratory work used bincode for this, but we need to validate.
What level of external corruption (e.g., disk issues) or internal corruption (e.g., writing malformed data) do we need to handle? We are very careful about things like session storage (R: how careful? Do we handle a failure to parse due to corrupt JSON? Do we handle a failure to parse the compressed file due to corruption?), but many of our existing stores either fail to init, breaking the browser, or discard the corrupt file. Do we need to do better? Is LMDB more or less prone to issues?
Are customers who need recoverability from corruption adequately served by other points in the solution space, or by features of this one (e.g., LMDB’s ability to take consistent live backups)?
LMDB’s live backup seems like it could be used to build a sensible backup/migration strategy, but it would be good if we could understand to what extent we consider the requirement important before investing effort into building capabilities. Even if we have some potential consumers, maybe those aren’t the first ones.
What approaches to observers/notifications make sense? We don’t get this out of the box for most stores, particularly not cross-process; we’re breaking new ground. We can learn some lessons from observer and preference change notifications in Firefox.
Does a memory-mapped solution give us:
Can LMDB beat IPC-coordinated shmem or flat file solutions for for inter-process data sharing?
It makes sense to have at least one target consumer in mind when building out a capability. We have lots of options; which ones will have the biggest potential upside, the lowest risk, and offer the best learning opportunities?
Chrome shares its visited link set between the main process and content process via a complicated shared memory scheme: the main process constructs chunks and coordinates handoff to content processes via IPC.
GeckoView will, eventually, distinguish between the application process (started by Android, runs the enclosing Java application) and Gecko’s main process. These two processes will communicate via Binder.
In contrast to Fennec, application data — including history data — won’t be in the same process as the Gecko main process. Sharing persistent data in this situation is well-suited to LMDB: the same LMDB database could be opened from both the application (Java) process and the Gecko main process. The two processes would have a shared understanding of the data format used and the file path and settings required. Given those things, the application could add and remove entries from the database, and the Gecko main process could read (and, indeed, write) without additional coordination.
Executing on this proposal consists of five overlapping phases.
We have begun building a simple production-quality Rust wrapper to implement storage of typed values and to begin to impose an opinionated and safe interface. This can be used for evaluation and as a testbed for design, ultimately being stabilized and productized.
MDB_NOTLSis specified, tying transactions to the transaction object instead of to the thread. ‘Defaults’ and ‘Being careful’!
A web-standard storage abstraction layer. In Chrome, uses LevelDB for storage. In Firefox, currently backed by SQLite. Both use remoting for multi-process access. Only accessible from JS.
Well-tested and already ships in Firefox. Crypto available via SQLCipher. In-memory (non-persistent) storage is easy. Built-in
VACUUM is more convenient than compaction in LMDB. Disk sizes are approximately equivalent to LMDB. Known good behavior on 32-bit platforms.
WAL versus COW — writes, and database opens after crashes, are less predictable due to WAL checkpointing. Reads likely to be on the order of microseconds, not nanoseconds; not suitable to back a synchronous API. Reader concurrency not quite as good as LMDB. SQLite will have a higher baseline of memory usage: real allocated memory per connection, not mapped virtual memory.
Benchmarking shows SQLite to perform relatively poorly on KV workloads when compared to LMDB. In sequential reads on published benchmarks LMDB has 47x throughput (I measured 80x for string keys). In random reads LMDB has 9x throughput. Sequential writes, 8x; random writes 5.7x. Batching makes writes significantly faster for LMDB thanks to amortized cost of COW, and reading sequential keys is very fast.
Using SQLite instead of a specialized key-value store would trade performance (particularly read performance) for existing deployment experience.
SQLite is the only real contender for delivering this capability in Firefox. It is conceivable that we could use the same opinionated interface to front both SQLite and LMDB. There is negligible increase in object size to ship both, because LMDB is so small.
My own runs of benchmarks, adjusting the SQLite schema and using WAL, with key times highlighted in bold:
SQLite: readrandom : 8.110 micros/op 123310 ops/sec; (1000000 of 1000000 found) **readseq : 2.586 micros/op 386687 ops/sec; 36.9 MB/s** readreverse : 2.580 micros/op 387580 ops/sec; 37.0 MB/s fillrandsync : 82.125 micros/op 12176 ops/sec; 1.3 MB/s (1000 ops) fillrandom : 24.933 micros/op 40107 ops/sec; 4.4 MB/s **fillrandbatch : 16.004 micros/op 62483 ops/sec; 6.9 MB/s** fillseqsync : 31.004 micros/op 32253 ops/sec; 3.6 MB/s (1000 ops) fillseq : 12.433 micros/op 80430 ops/sec; 8.9 MB/s fillseqbatch : 2.108 micros/op 474321 ops/sec; 52.5 MB/s overwrite : 36.793 micros/op 27179 ops/sec; 3.0 MB/s LMDB: readrandom : 1.070 micros/op 934143 ops/sec; (1000000 of 1000000 found) **readseq : 0.032 micros/op 31309684 ops/sec; 3463.7 MB/s** readreverse : 0.023 micros/op 42877969 ops/sec; 4743.4 MB/s fillrandsync : 166.412 micros/op 6009 ops/sec; 0.7 MB/s (1000 ops) fillrandom : 5.176 micros/op 193206 ops/sec; 21.4 MB/s
Integer keys (faster reads, slightly smaller file):
SQLite: readrandom : 4.258 micros/op 234843 ops/sec; (1000000 of 1000000 found) readseq : 0.281 micros/op 3560505 ops/sec; 339.6 MB/s readreverse : 0.261 micros/op 3826081 ops/sec; 364.9 MB/s LMDB: readrandom : 0.548 micros/op 1825317 ops/sec; (1000000 of 1000000 found) readseq : 0.028 micros/op 35937612 ops/sec; 3701.5 MB/s readreverse : 0.019 micros/op 53358945 ops/sec; 5495.8 MB/s fillseqsync : 165.802 micros/op 6031 ops/sec; 0.7 MB/s (1000 ops) fillseq : 3.360 micros/op 297576 ops/sec; 32.9 MB/s fillseqbatch : 0.504 micros/op 1983717 ops/sec; 219.5 MB/s overwrite : 5.099 micros/op 196105 ops/sec; 21.7 MB/s
Google-sourced. Implemented in Go. Not lightweight (306KB macOS release dylib). Non-ACID, no transactions: only read-snapshots and atomic batch writes. Poor reputation for data loss and consistency bugs. Writes are automatically compressed, spending CPU to get reasonable writes. Coarse-grained locking. LSM trees are write-optimized. Not multiprocess. Windows support is second-class, and iOS regressions occur.
LSM tree based stores provide high write performance by sequentially writing key-value pairs in memory in the foreground, and then arranging them in multi-tiered levels on disk in the background. This approach is not without tradeoffs though. Values are written multiple times when arranging them in the tree (known as write amplification) and a single read might need to read multiple levels in the LSM tree before finding a value (known as read amplification).
These various libraries all aim to be faster or more scalable than LevelDB, typically by improving thread parallelism or altering how compaction works (e.g., RocksDB doesn’t use level-based compaction). All are targeted at server workloads.
A Go implementation of LMDB. B+tree. Not multiprocess.
A memory-mapped hash/cache (it would be a stretch to call it persistent) implemented in C, with C++ and Perl bindings. Loses data by default (manual sync), not resistant to power loss/crash, vulnerable to data corruption. No transactions. Keys aren’t sorted, so no range queries or cursors. Building your own persistent storage/log is recommended.
Reportedly very slow. Not multiprocess.
Very long pedigree. For our purposes, essentially obsoleted in every way by LMDB.
A distillation of experience with BerkeleyDB and use in OpenLDAP. Implemented in C.
Dgraph gives an excellent brief summary:
LMDB provides a key-value stored using B+ trees. It has ACID semantics and support for transactions. It does not do any background work and has a crash resilient design that uses MVCC instead of locking. This means readers operate on an isolated snapshot of the database and don’t block. The codebase is quite small and portable, so LMDB runs across multiple platforms including Android, BSD, and Solaris. This talk by Howard Chu at Databaseology 2015 goes into much more details about LMDB’s design.
It is generally understood that B+ tree based stores are ideal for workloads which are read intensive.
Read-optimized, very lightweight (32KB), predictable performance. Copy-on-write mmap B+tree. Zero-copy reads. Handles larger values better than log-based approaches, with less write amplification. Database sizes are limited by address space (~128TB on 64-bit machines). Single-writer multiple-reader (linear scaling in readers, and total throughput reportedly better with serialized writes).
Some less well-known merits of LMDB:
- You can get a quite good performance out of the box, without any fuss or configuration like RocksDB.
- Its cursor iterator is awesome, very powerful and flexible.
- It also supports integer key and value, dup keys, and whole bunch of other nice value types.
- You can have multiple sub-dbs in a single database, each of them could be of different type (integer key, dup key etc.)
- We’ve never run into a single case of database corruption before, even for those traffic (both write and read) intensive applications. See more in this OSDI paper: Torturing Databases for Fun and Profit
Some lessons we’ve learnt:
- You will need to choose a maximum database size upon its creation. Since LMDB uses MVCC and shadow paging to implement ACID, long running read transactions would increase the page usage quickly, potentially fill up the mmap, and the write transaction will abort and leave the database in an unwritable state until page reclaiming is performed. Resizing the database on the fly is feasible, although it needs certain amount of cooperation between readers and writers.
- LMDB by default has a maximum key size set as 512 bytes, there is a compile time configuration to change it, though a larger key may have performance impact too.
- Random write is just OK, unlike other log-based data stores.
- Lack of built-in compression usually leads to a higher disk usage.
- Fully relying on OS’s page management may block the way of certain optimization.
- Some database maintenances are still needed, LMDB provides various APIs (e.g mdb_reader_check) to clean up zombie readers (due to the improper transaction termination) so that the unused pages could be reclaimed. No downtime needed though, could be done in the background.
- The code base of LMDB is not very readable ;), at least comparing to SQLite.
- Howard Chu usually is quite responsive on bugs and feature requests, also he is a quite opinionated guy, you can see his comments on HN :)
Distributed SQL database.
Built on top of SQLite and LMDB: it replaces SQLite’s pager with LMDB, and hooks into SQLite’s WAL for Raft-based replication between actors. Developed for applications such as file syncing.
Let me share some observations on this project here. Disclaimer, I’ve never used it before, all my understanding on it is from following blog posts and skimming its source code.
In short, the ActorDB team wants to build a file synchronisation solution anyone could install and use.
“The best way we can describe ActorDB is that it is an ideal server-side database for apps. Think of running a large mail service, dropbox, evernote, etc. They all require server side storage for user data, but the vast majority of queries to the database is within a specific user”
The author believes that a distributed SQL database is the best storage solution to this.
- Server side storage with SQL including transaction support (unlike KV store or document store)
- No single point of failure (user data is replicated among multiple nodes by Raft)
- Horizontally scalable (each user has its own database, so they call that ActorDB)
- In the early versions, they use SQLite as both the SQL engine and the storage engine, and hack SQLite’s WAL to implement Raft. According to the second article, although they manage to combine all the WALs into a single one, it is not ideal since every user has a separate SQLite base file.
- Then, they choose LMDB as the storage engine to mitigate the drawback above. They again hack the SQLite’s WAL module so that it completely bypasses the SQLite’s b-tree based storage layer, instead stores the actual data (i.e. pages) and Raft logs on LMDB. They find LMDB is perfect engine for their use case, since LMDB supports duplicate keys with sorted value (i.e. dupsort key) as well as multiple sub-dbs within a single database.
- Other components are written in Erlang. Also worth pointing out that the server application, the SQL engine, and the storage engine are tightly coupled with each other in this project. To my understanding, there is no clear boundaries between subsystems. This makes ActorDB hard to extend and evolve.
- Due to its design, i.e. one db for each user, certain SQL features are not supported, like cross user table joins, ATTACH database etc.
Although it’s unclear how exactly ActorDB works out in production, and some other limitations in its design & implementation. I think it still provides us with some interesting points for further investigation, such as, the approach of integration LMDB with SQLite. Introduction of WAL replication via Raft.