Holistic Configuration Management at Facebook
Chunqiang Tang, Thawan Kooburat, Pradeep Venkatachalam, Akshay Chander,
Zhe Wen, Aravind Narayanan, Patrick Dowell, and Robert Karl
Facebook Inc.
{tang, thawan, pradvenkat, akshay, wenzhe, aravindn, pdowell, robertkarl}@fb.com
Abstract
Facebook’s web site and mobile apps are very dynamic.
Every day, they undergo thousands of online configuration
changes, and execute trillions of configuration checks to
personalize the product features experienced by hundreds
of millions of daily active users. For example, configuration
changes help manage the rollouts of new product features,
perform A/B testing experiments on mobile devices to iden-
tify the best echo-canceling parameters for VoIP, rebalance
the load across global regions, and deploy the latest machine
learning models to improve News Feed ranking. This paper
gives a comprehensive description of the use cases, design,
implementation, and usage statistics of a suite of tools that
manage Facebook’s configuration end-to-end, including the
frontend products, backend systems, and mobile apps.
1. Introduction
The software development and deployment cycle has accel-
erated dramatically [13]. A main driving force comes from
the Internet services, where frequent software upgrades are
not only possible with their service models, but also a neces-
sity for survival in a rapidly changing and competitive envi-
ronment. Take Facebook, for example. We roll facebook.com
onto new code twice a day [29]. The site’s various configura-
tions are changed even more frequently, currently thousands
of times a day. In 2014, thousands of engineers made live
configuration updates to various parts of the site, which is
even more than the number of engineers who made changes
to Facebook’s frontend product code.
Frequent configuration changes executed by a large popu-
lation of engineers, along with unavoidable human mistakes,
lead to configuration errors, which is a major source of site
outages [24]. Preventing configuration errors is only one of
the many challenges. This paper presents Facebook’s holistic
configuration management solution. Facebook uses Chef [7]
to manage OS settings and software deployment [11], which
is not the focus of this paper. Instead, we focus on the home-
grown tools for managing applications’ dynamic runtime
configurations that may be updated live multiple times a
day, without application redeployment or restart. Examples
include gating product rollouts, managing application-level
traffic, and running A/B testing experiments.
Below, we outline the key challenges in configuration
management for an Internet service and our solutions.
Configuration sprawl. Facebook internally has a large
number of systems, including frontend products, backend
services, mobile apps, data stores, etc. They impose differ-
ent requirements on configuration management. Historically,
each system could use its own configuration store and distri-
bution mechanism, which makes the site as a whole hard to
manage. To curb the configuration sprawl, we use a suite
of tools built on top of a uniform foundation to support the
diverse use cases. Currently, the tools manage hundreds of
thousands of configs (i.e., configuration files) from a cen-
tral location, and distribute them to hundreds of thousands
of servers and more than one billion mobile devices.
Configuration authoring and version control. A large-
scale distributed system often has many flexible knobs that
can be tuned live. The median size of a config at Face-
book is 1KB, with large ones reaching MBs or GBs. Manu-
ally editing these configs is error prone. Even a minor mis-
take could potentially cause a site-wide outage. We take a
truly configuration-as-code approach to compile and gener-
ate configs from high-level source code. We store the config
programs and the generated configs in a version control tool.
Defending against configuration errors. We safeguard
configs in multiple ways. First, the configuration compiler
automatically runs validators to verify invariants defined for
configs. Second, a config change is treated the same as a
code change and goes through the same rigorous code re-
view process. Third, a config change that affects the frontend
products automatically goes through continuous integration
tests in a sandbox. Lastly, the automated canary testing tool
rolls out a config change to production in a staged fashion,
monitors the system health, and rolls back automatically in
case of problems. A main hurdle we have to overcome is to
reliably determine the health of numerous backend systems.
Configuration dependency. Facebook.com is powered by a
large number of systems developed by many different teams.
Each system has its own config but there are dependencies
among the configs. For example, after the monitoring tool’s
config is updated to enable a new monitoring feature, the
monitoring configs of all other systems might need to be up-
dated accordingly. Our framework expresses configuration
dependency as source code dependency, similar to the in-
clude statement in a C++ program. The tool automatically
extracts dependencies from source code without the need to
manually edit a makefile.
Scalable and reliable configuration distribution. Our
tools manage a site that is much larger than the previously
reported configuration distribution system [30], and support
a much more diverse set of applications, including mobile.
The size of a config can be as small as a few bytes or as large
as GBs. Given the scale and the geographically distributed
locations, failures are the norm. It is a significant challenge
to distribute configs to all servers and mobile devices in a
timely and reliable manner, and not to make the availability
of the configuration management tools become a bottleneck
of the applications’ availability.
In this paper, we describe our solutions to these chal-
lenges. We make the following contributions:
• Runtime configuration management is an important problem for Internet services, but it is not well defined in the literature. We describe the problem space and the real use cases from our experience, in the hope of motivating future research in this area.
• We describe Facebook's configuration management stack, which addresses many challenges not covered by prior work, e.g., gating product rollouts, config authoring, automated canary testing, mobile config, and a hybrid subscription-P2P model for large config distribution. This is the first published solution of holistic configuration management for Internet services.
• We report the statistics and experience of operating a large-scale configuration management stack, which are made available in the literature for the first time. For example, do old configs become dormant, and how often do config changes expose code bugs?
2. Use Cases and Solution Overview
We focus on the problem of managing an Internet service’s
dynamic runtime configurations that may be updated live
multiple times a day, without application redeployment or
restart. Configuration management is an overloaded term.
We describe several real use cases to make the problem space
more concrete. Note that they are just a tiny sample set out
of the hundreds of thousands of configs we manage today.
Gating new product features. Facebook releases software
early and frequently. This practice forces us to get early feedback and
iterate rapidly. While a new product feature is still under de-
velopment, we commonly release the new code into produc-
tion early but in a disabled mode, and then use a tool called
Gatekeeper to incrementally enable it online. Gatekeeper can
quickly disable the new code if problems arise. It controls
which users will experience the new feature, e.g., Facebook
employees only or 1% of the users of a mobile device model.
The target can be changed live through a config update.
Conducting experiments. Good designs often require A/B
tests to guide data-driven decision making. For example, the
echo-canceling parameters for VoIP on Facebook Messenger
need tuning for different mobile devices because of the hard-
ware variation. Our tools can run live experiments in produc-
tion to test different parameters through config changes.
Application-level traffic control. Configs help manage the
site’s traffic in many ways. Automation tools periodically
make config changes to shift traffic across regions and per-
form load tests in production. In case of emergency, a config
change kicks off automated cluster/region traffic drain and
another config change disables resource-hungry features of
the site. During shadow tests, a config change starts or stops
duplicating live traffic to testing servers. During a fire drill, a
config change triggers fault injection into a production sys-
tem to evaluate its resilience.
Topology setup and load balancing. Facebook stores user
data in a large-scale distributed data store called TAO [5]. As
the hardware setup changes (e.g., a new cluster is brought
online), the macro traffic pattern shifts, or failure happens,
the application-level configs are updated to drive topology
changes for TAO and rebalance the load.
Monitoring, alerts, and remediation. Facebook’s monitor-
ing stack is controlled through config changes: 1) what mon-
itoring data to collect, 2) monitoring dashboard (e.g., the lay-
out of the key-metric graphs), 3) alert detection rules (i.e.,
what is considered an anomaly), 4) alert subscription rules
(i.e., who should be paged), and 5) automated remediation
actions [27], e.g., rebooting or reimaging a server. All these
can be dynamically changed without a code upgrade, e.g., as
troubleshooting requires collecting more monitoring data.
Updating machine learning models. Machine learning
models are used to guide search ranking, News Feed rank-
ing, and spam detection. The models are frequently re-
trained with the latest data and distributed to the servers
without a code upgrade. This kind of model is used for many
products. Its data size can vary from KBs to GBs.
Controlling an application’s internal behavior. This is one
of the most common use cases. A production system often
has many knobs to control its behavior. For example, a data
store’s config controls how much memory is reserved for
caching, how many writes to batch before writing to the disk,
how much data to prefetch on a read, etc.
Figure 1: Facebook’s configuration management tools. Mo-
bileConfig supports mobile apps. All the other tools support
applications running in data centers.
2.1 Tool Overview
Figure 1 shows Facebook’s configuration management tools.
They work together to support the diverse use cases.
Configerator provides all the foundational functions, in-
cluding version control, authoring, code review, automated
canary testing, and config distribution. Other tools are built
on top of Configerator and provide specialized functions.
Gatekeeper controls the rollouts of new product features.
Moreover, it can also run A/B testing experiments to find the
best config parameters. In addition to Gatekeeper, Facebook
has other A/B testing tools built on top of Configerator, but
we omit them in this paper due to the space limitation.
PackageVessel uses peer-to-peer file transfer to assist the
distribution of large configs (e.g., GBs of machine learning
models), without sacrificing the consistency guarantee.
Sitevars is a shim layer that provides an easy-to-use config-
uration API for the frontend PHP products.
MobileConfig manages mobile apps’ configs on Android
and iOS, and bridges them to the backend systems such as
Configerator and Gatekeeper. MobileConfig is not bridged
to Sitevars because Sitevars is for PHP only. MobileConfig
is not bridged to PackageVessel because currently there is no
need to transfer very large configs to mobile devices.
We describe these tools in the following sections.
3. Configerator, Sitevars, and PackageVessel
Among other things, Configerator addresses the challenges
in configuration authoring, configuration error prevention,
and large-scale configuration distribution.
3.1 Configuration Authoring
Our hypotheses are that 1) most engineers prefer writing
code to generate a config (i.e., a configuration file) instead
of manually editing the config, and 2) most config programs
are easier to maintain than the raw configs themselves. We
will use data to validate these hypotheses in Section 6.1.
Following these hypotheses, Configerator literally treats
“configuration as code”. Figure 2 shows an example. A config's data schema is defined in the platform-independent Thrift [2] language (see “job.thrift”). The engineer writes two Python files, “create_job.cinc” and “cache_job.cconf”, to manipulate the Thrift object. A call to export_if_last() writes the config as a JSON [20] file. To prevent invalid configs, the engineer writes another Python file, “job.thrift-cvalidator”, to express the invariants for the config. The validator is automatically invoked by the Configerator compiler to verify every config of type “Job”.
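For concreteness, below is a minimal, self-contained Python sketch of the flow that Figure 2 depicts. The Job fields, the priority invariant, and the function bodies are illustrative stand-ins; the real Configerator APIs operate on Thrift-generated types rather than plain dataclasses.

# A self-contained sketch of the Figure 2 flow, assuming a Job schema
# with "name" and "priority" fields; Configerator's real APIs differ.
import json
from dataclasses import dataclass, asdict

@dataclass
class Job:                          # stand-in for the Thrift-defined schema
    name: str
    priority: int

def validate_job(job: Job):         # stand-in for job.thrift-cvalidator
    assert job.name, "job must be named"
    assert 0 <= job.priority <= 10, "priority out of range"

def create_job(name: str) -> Job:   # reusable module, like create_job.cinc
    return Job(name=name, priority=5)

def export_if_last(job: Job, path: str):  # compiler writes validated JSON
    validate_job(job)
    with open(path, "w") as f:
        json.dump(asdict(job), f, indent=2)

# cache_job.cconf: the cache team only needs one line of real logic.
export_if_last(create_job(name="cache"), "cache_job.json")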
The source code of config programs and generated JSON
configs are stored in a version control tool, e.g., git [14]. In
the upper-left side of Figure 3, the engineer works in a de-
velopment server with a local clone of the git repository.
She edits the source code and invokes the Configerator com-
piler to generate JSON configs. At the top of Figure 3, config
changes can also be initiated by an engineer via a Web UI, or
programmatically by an automation tool invoking the APIs
provided by the Mutator component.
The example in Figure 2 separates create_job.cinc from cache_job.cconf so that the former can be reused as a common module to create configs for other types of jobs. Hypothetically, three different teams may be involved in writing the config code: the scheduler team, the cache team, and the security team. The scheduler team implements the scheduler software and provides the shared config code, including the config schema job.thrift, the reusable module create_job.cinc, and the validator job.thrift-cvalidator, which ensures that configs provided by other teams do not accidentally break the scheduler. The cache team generates the config for a cache job by simply invoking create_job(name=“cache”), while the security team generates the config for a security job by simply invoking create_job(name=“security”).
Code modularization and reuse are the key reasons why
maintaining config code is easier than manually editing
JSON configs. Config dependencies are exposed as code dependencies through import_thrift() and import_python(). An example is shown below.
# Python file "app_port.cinc"
APP_PORT = 8089

# Python file "app.cconf"
import_python("app_port.cinc", "*")
app_cfg = AppConfig(port = APP_PORT ...)
export_if_last(app_cfg)

# Python file "firewall.cconf"
import_python("app_port.cinc", "*")
firewall_cfg = FirewallConfig(port = APP_PORT ...)
export_if_last(firewall_cfg)
“app.cconf” instructs an application to listen on a specific port. “firewall.cconf” instructs the OS to allow traffic on that port. Both depend on “app_port.cinc”. The Dependency Service in Figure 3 automatically extracts dependencies from source code. If APP_PORT in “app_port.cinc” is changed, the Configerator compiler automatically recompiles both “app.cconf” and “firewall.cconf”, and updates
their JSON configs in one git commit, which ensures consistency. Dependency can be expressed using any Python language construct, not limited to shared constants.

Figure 2: The Configerator compiler generates a JSON configuration file from the Python and Thrift source code.
3.2 Improving Usability through UI and Sitevars
Configerator is designed as a uniform platform to support all
use cases. It must be sufficiently flexible and expressive in
order to support complex configs. On the other hand, sim-
ple configs may not benefit much from the complexity of the
Python and Thrift code in Figure 2. The Configerator UI al-
lows an engineer to directly edit the value of a Thrift config
object without writing any code. The UI automatically gen-
erates the artifacts needed by Configerator.
The Sitevars tool is a shim layer on top of Configerator
to support simple configs used by frontend PHP products. It
provides configurable name-value pairs. The value is a PHP
expression. An engineer uses the Sitevars UI to easily update
a sitevar’s PHP content without writing any Python/Thrift
code. A sitevar can have a checker implemented in PHP to
verify the invariants, similar to the validator in Figure 2.
Because PHP is weakly typed, sitevars are more prone to
configuration errors, e.g., typos. Engineers are encouraged
to define a data schema for a newly created sitevar. A legacy
sitevar may predate this best practice. The tool automatically
infers its data type from its historical values. For example, it
infers whether a sitevar’s field is a string. If so, it further
infers whether it is a JSON string, a timestamp string, or a
general string. If a sitevar update deviates from the inferred
data type, the UI displays a warning message to the engineer.
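The inference can be sketched roughly as follows in Python. The timestamp format and the category names are hypothetical assumptions; the paper does not specify the real sitevar type checker's rules.

# A sketch of the inference described above: classify a sitevar field
# from its historical string values and warn when an update deviates.
import json
from datetime import datetime

def classify(value: str) -> str:
    try:
        json.loads(value)
        return "json"
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%d %H:%M:%S")  # assumed format
        return "timestamp"
    except ValueError:
        return "string"

def inferred_type(history: list[str]) -> str:
    if not history:
        return "string"
    kinds = {classify(v) for v in history}
    return kinds.pop() if len(kinds) == 1 else "string"  # mixed => general

def check_update(history: list[str], new_value: str):
    expected = inferred_type(history)
    if expected != "string" and classify(new_value) != expected:
        return f"warning: expected a {expected} value based on history"
    return None

print(check_update(["2015-01-01 00:00:00", "2015-02-01 00:00:00"], "oops"))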
3.3 Preventing Configuration Errors
Configuration errors are a major source of site outages [24].
We take a holistic approach to prevent configuration errors,
including 1) config validators to ensure that invariants are
not violated, 2) code review for both config programs and
generated JSON configs, 3) manual config testing, 4) auto-
mated integration tests, and 5) automated canary tests. They
complement each other to catch different configuration er-
rors. We follow the flow in Figure 3 to explain them.
To manually test a new config, an engineer runs a com-
mand to temporarily deploy the new config to some pro-
duction servers or testing servers, and verifies that every-
thing works properly. Once satisfied, the engineer submits
the source code, the JSON configs, and the testing results
to a code review system called Phabricator [26]. If the con-
fig is related to the frontend products of facebook.com, in
a sandbox environment, the Sandcastle tool automatically
performs a comprehensive set of synthetic, continuous in-
tegration tests of the site under the new config. Sandcastle
posts the testing results to Phabricator for reviewers to ac-
cess. Once the reviewers approve, the engineer pushes the
config change to the remote “Canary Service”.
The canary service automatically tests a new config on a
subset of production machines that serve live traffic. It com-
plements manual testing and automated integration tests.
Manual testing can execute tests that are hard to automate,
but may miss config errors due to oversight or shortcuts under
time pressure. Continuous integration tests in a sandbox can
have broad coverage, but may miss config errors due to the
small-scale setup or other environment differences.
A config is associated with a canary spec that describes
how to automate testing the config in production. The spec
defines multiple testing phases. For example, in phase 1,
test on 20 servers; in phase 2, test in a full cluster with
thousands of servers. For each phase, it specifies the testing
target servers, the healthcheck metrics, and the predicates
that decide whether the test passes or fails. For example, the
click-through rate (CTR) collected from the servers using
the new config should not be more than x% lower than the
CTR collected from the servers still using the old config.
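A canary spec might look roughly like the following Python sketch. The field names, metrics, and thresholds are hypothetical, chosen to mirror the phased-testing and CTR examples above.

# A sketch of a canary spec as data, with hypothetical field names.
# Each phase names its targets, the health metrics to watch, and a
# predicate comparing new-config servers against the old-config baseline.
canary_spec = {
    "config": "feed/ranking_params",      # illustrative config name
    "phases": [
        {
            "target": {"servers": 20},
            "metrics": ["error_rate", "click_through_rate"],
            # pass if CTR on canary servers stays within 1% of baseline
            "predicate": lambda canary, baseline:
                canary["click_through_rate"]
                    >= 0.99 * baseline["click_through_rate"]
                and canary["error_rate"] <= 1.05 * baseline["error_rate"],
        },
        {
            "target": {"cluster": "one-full-cluster"},  # thousands of servers
            "metrics": ["error_rate", "p99_latency"],
            "predicate": lambda canary, baseline:
                canary["p99_latency"] <= 1.10 * baseline["p99_latency"],
        },
    ],
}

def run_phase(phase, collect):
    """Deploy to the phase's targets (elided) and evaluate its predicate."""
    canary, baseline = collect(phase["target"], phase["metrics"])
    return phase["predicate"](canary, baseline)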
The canary service talks to the Proxies running on the
servers under test to temporarily deploy the new config (see
the bottom of Figure 3). If the new config passes all testing
phases, the canary service asks the remote “Landing Strip”
to commit the change into the master git repository.
3.4 Scalable and Reliable Configuration Distribution
Configerator distributes a config update to hundreds of thou-
sands of servers scattered across multiple continents. In such
an environment, failures are the norm. In addition to scala-
bility and reliability, other properties important to Configer-
ator are 1) availability (i.e., an application should continue
to run regardless of failures in the configuration manage-
ment tools); and 2) data consistency (i.e., an application’s
instances running on different servers should eventually re-
ceive all config updates delivered in the same order, although
there is no guarantee that they all receive a config update
exactly at the same time). In this section, we describe how
Configerator achieves these goals through the push model.
In Figure 3, the git repository serves as the ground truth
for committed configs. The “Git Tailer” continuously ex-
tracts config changes from the git repository, and writes
them to Zeus for distribution. Zeus is a forked version of
ZooKeeper [18], with many scalability and performance en-
hancements in order to work at the Facebook scale. It runs
a consensus protocol among servers distributed across mul-
tiple regions for resilience. If the leader fails, a follower is
converted into a new leader.
Zeus uses a three-level high-fanout distribution tree,
leader → observer → proxy, to distribute configs through the
push model. The leader has hundreds of observers as chil-
dren in the tree. A high-fanout tree is feasible because the
data-center network has high bandwidth and only small data
is sent through the tree. Large data is distributed through a
peer-to-peer protocol separately (see Section 3.5). The three-
level tree is simple to manage and sufficient for the current
scale. More levels can be added in the future as needed.
Each Facebook data center consists of multiple clusters.
Each cluster consists of thousands of servers, and has mul-
tiple servers designated as Zeus observers. Each observer
keeps a fully replicated read-only copy of the leader’s data.
Upon receiving a write, the leader commits the write on
the followers, and then asynchronously pushes the write to
each observer. If an observer fails and then reconnects to the
leader, it sends the latest transaction ID it is aware of, and
requests the missing writes. The commit log of ZooKeeper’s
consensus protocol helps guarantee in-order delivery of con-
fig changes.
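The resync step can be sketched as follows, assuming the leader keeps an ordered commit log of (transaction id, path, value) entries; the real Zeus/ZooKeeper protocol is more involved.

# A sketch of observer resync against the leader's ordered commit log.
class Leader:
    def __init__(self):
        self.log = []                 # ordered commit log
        self.next_txn = 1

    def commit(self, path, value):
        self.log.append((self.next_txn, path, value))
        self.next_txn += 1

    def writes_since(self, last_seen_txn):
        # in-order delivery: replay everything after the observer's last txn
        return [e for e in self.log if e[0] > last_seen_txn]

class Observer:
    def __init__(self):
        self.data, self.last_txn = {}, 0

    def resync(self, leader):
        for txn, path, value in leader.writes_since(self.last_txn):
            self.data[path] = value
            self.last_txn = txn

leader = Leader()
leader.commit("/feed/knob", "a")
obs = Observer(); obs.resync(leader)  # catches up after reconnecting
leader.commit("/feed/knob", "b")
obs.resync(leader)                    # receives only the missing write
assert obs.data["/feed/knob"] == "b" and obs.last_txn == 2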
Each server runs a Configerator Proxy process, which
randomly picks an observer in the same cluster to connect to.
If the observer fails, the proxy connects to another observer.
Unlike an observer, the proxy does not keep a full replica
of the leader’s data. It only fetches and caches the configs
needed by the applications running on the server.
An application links in the Configerator client library to
access its config. On startup, the application requests the
proxy to fetch its config. The proxy reads the config from
Figure 3: Architecture of Configerator. It uses git for version control, Zeus for tree-structured config distribution, Phabricator for code review, Sandcastle for continuous integration tests, Canary Service for automated canary tests, Landing Strip for committing changes, Mutator for providing APIs to support automation tools, and Dependency Service for tracking config dependencies. Applications on a production server interact with the Proxy to access their configs.
the observer with a watch so that later the observer will no-
tify the proxy if the config is updated. The proxy stores the
config in an on-disk cache for later reuse. If the proxy fails,
the application falls back to read from the on-disk cache di-
rectly. This design provides high availability. So long as a
config exists in the on-disk cache, the application can ac-
cess it (though outdated), even if all Configerator compo-
nents fail, including the git repository, Zeus leader/followers,
observers, and proxy.
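The read path of the client library and proxy can be sketched as below. The proxy API, cache path, and JSON cache format are assumptions for illustration, not the real Configerator interfaces.

# A sketch of the client read path with on-disk cache fallback.
import json, os

CACHE_DIR = "/var/configerator/cache"     # illustrative path

def read_config(name: str, proxy):
    cache_file = os.path.join(CACHE_DIR, name + ".json")
    try:
        cfg = proxy.fetch(name)           # proxy reads from an observer,
        with open(cache_file, "w") as f:  # sets a watch, and caches on disk
            json.dump(cfg, f)
        return cfg
    except ConnectionError:
        # proxy (or anything upstream of it) is down: serve the possibly
        # stale on-disk copy so the application keeps running
        with open(cache_file) as f:
            return json.load(f)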
Configerator uses the push model. How does it compare
with the pull model [19, 30]? The biggest advantage of the
pull model is its simplicity in implementation, because the
server side can be stateless, without storing any hard state
about individual clients, e.g., the set of configs needed by
each client (note that different machines may run different
applications and hence need different configs). However, the
pull model is less efficient for two reasons. First, some polls
return no new data and hence are pure overhead. It is hard
to determine the optimal poll frequency. Second, since the
server side is stateless, the client has to include in each poll
the full list of configs needed by the client, which is not
scalable as the number of configs grows. In our environment,
many servers need tens of thousands of configs to run. We
opt for the push model in our environment.
3.5 Distributing Large Configs through PackageVessel
Some configs can be large, e.g., machine learning models for
News Feed ranking. As these large configs may change fre-
quently, it is not scalable to deliver them through Zeus’ dis-
tribution tree, because it would overload the tree’s internal
nodes that have a high fan out. Moreover, it is hard to guar-
antee the quality of service if the distribution paths overlap
between large configs and small (but critical) configs.
Our PackageVessel tool solves the problem by separating
a large config’s metadata from its bulk content. When a large
config changes, its bulk content is uploaded to a storage
system. It then updates the config’s small metadata stored in
Configerator, including the version number of the new config
and where to fetch the config’s bulk content. Configerator
guarantees the reliable delivery of the metadata to servers
that subscribe to the config. After receiving the metadata
update, a server fetches the config’s bulk content from the
storage system using the BitTorrent [8] protocol. Servers
that need the same large config exchange the config’s bulk
content among themselves in a peer-to-peer (P2P) fashion to
avoid overloading the centralized storage system. Our P2P
protocol is locality aware so that a server prefers exchanging
data with other servers in the same cluster. We recommend
using PackageVessel for configs larger than 1MB.
A naive use of P2P cannot guarantee data consistency.
Our hybrid subscription-P2P model does not have this limi-
tation. Zeus’ subscription model guarantees the consistency
of the metadata, which in turn drives the consistency of the
bulk content. For example, Facebook’s spam-fighting system
updates and distributes hundreds of MBs of config data to
thousands of global servers many times a day. Our statistics
show that PackageVessel consistently and reliably delivers
the large configs to the live servers in less than four minutes.
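The hybrid model can be sketched as follows. The metadata fields and the checksum verification are illustrative assumptions; the paper specifies only that the metadata carries the version number and where to fetch the bulk content.

# A sketch of the hybrid model: small metadata arrives through
# Configerator's consistent subscription channel and names the exact
# bulk content to fetch. fetch_via_p2p() stands in for the
# locality-aware BitTorrent swarm.
import hashlib

def on_metadata_update(meta, fetch_via_p2p):
    # meta is the small config kept in Configerator, e.g.:
    # {"version": 42, "torrent": "...", "sha1": "..."}  (assumed fields)
    blob = fetch_via_p2p(meta["torrent"])      # peers prefer same-cluster
    if hashlib.sha1(blob).hexdigest() != meta["sha1"]:
        raise IOError("bulk content does not match subscribed metadata")
    install(blob, version=meta["version"])     # swap in the new content

def install(blob, version):
    ...  # application-specific: e.g., load the new ML model into memory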
3.6 Improving Commit Throughput
Multiple engineers making concurrent config commits into a
shared git repository causes contention and slows down the
commit process. We explain it through an example. When
an engineer tries to push a config diff X to the shared
git repository, git checks whether the local clone of the
repository is up to date. If not, she has to first bring her local
clone up to date, which may take 10s of seconds to finish.
After the update finishes, she tries to push diff X to the shared
repository again, but another diff Y from another engineer
might have just been checked in. Even if diff X and diff
Y change different files, git considers the engineer’s local
repository clone outdated, and again requires an update.
The Landing Strip in Figure 3 alleviates the problem,
by 1) receiving diffs from committers, 2) serializing them
according to the first-come-first-served order, and 3) pushing
them to the shared git repository on behalf of the committers,
without requiring the committers to bring their local repos-
itory clones up to date. If there is a true conflict between a
diff being pushed and some previously committed diffs, the
shared git repository rejects the diff, and the error is relayed
back to the committer. Only then, the committer has to up-
date her local repository clone and resolve the conflict.
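In sketch form, the landing strip is a serialization loop; the repo.apply() call and MergeConflict exception below are hypothetical stand-ins for the underlying git operations.

# A sketch of the landing strip's first-come-first-served commit loop.
from queue import Queue

class MergeConflict(Exception): pass

def landing_strip(repo, incoming: Queue):
    while True:
        diff = incoming.get()             # first-come-first-served order
        try:
            repo.apply(diff)              # committer's clone need not be
            diff.reply("landed")          # up to date with the shared repo
        except MergeConflict as e:
            # only now must the committer rebase and resolve by hand
            diff.reply(f"rejected: {e}")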
The landing strip alleviates the commit-contention prob-
lem, but does not fundamentally solve the commit-
throughput problem, because 1) a shared git repository can
only accept one commit at a time, and 2) git operations be-
come slower as the repository grows larger. Configerator
started with a single shared git repository. To improve the
commit throughput, we are in the process of migrating to
multiple smaller git repositories that collectively serve a par-
titioned global name space. Files under different paths (e.g.,
“/feed” and “/tao”) can be served by different git reposito-
ries that can accept commits concurrently. Cross-repository
dependency is supported.
import_python("feed/A.cinc", "*")
import_python("tao/B.cinc", "*")
...
In the example above, the config imports two other configs.
The code is the same regardless of whether those configs
are in the same repository or not. The Configerator compiler
uses metadata to map the dependent configs to their reposito-
ries and automatically fetches them if they are not checked
out locally. Along with the partitioning of the git reposito-
ries, the components in Figure 3 are also partitioned. Each
git repository has its own mutator, landing strip, and tailer.
New repositories can be added incrementally. As a repos-
itory grows large, some of its files can be migrated into a
new repository. It only requires updating the metadata that
lists all the repositories and the file paths they are responsi-
ble for. The contents of the migrated files require no change.
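The metadata lookup can be sketched as a longest-prefix match over path prefixes; the repository URLs below are illustrative.

# A sketch of the repository metadata lookup. Migrating files to a new
# repository only changes this mapping, not the files themselves.
REPO_FOR_PREFIX = {
    "/feed": "git://repos/feed.git",
    "/tao":  "git://repos/tao.git",
    "/":     "git://repos/main.git",      # default catch-all
}

def repo_for(config_path: str) -> str:
    # longest matching prefix wins, so subtrees can be split out over time
    best = max((p for p in REPO_FOR_PREFIX if config_path.startswith(p)),
               key=len)
    return REPO_FOR_PREFIX[best]

assert repo_for("/feed/ranking.cconf") == "git://repos/feed.git"
assert repo_for("/ads/budget.cconf") == "git://repos/main.git"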
3.7 Fault Tolerance
Every component in Figure 3 has built-in redundancy across
multiple regions. One region serves as the master. Each
backup region has its own copy of the git repository, and re-
ceives updates from the master region. The git repository in
a region is stored on NFS and mounted on multiple servers,
with one as the master. Each region runs multiple instances
of all the services, including mutator, canary service, landing
strip, and dependency service. Configerator supports failover
both within a region and across regions.
3.8 Summary
Configerator addresses the key challenges in configuration
authoring, configuration error prevention, and configuration
distribution. It takes a configuration-as-code approach to
compile and generate configs from high-level source code.
It expresses configuration dependency as source code de-
pendency, and encourages config code modularization and
reuse. It takes a holistic approach to prevent configuration
errors, including config validators, code review, manual con-
fig testing, automated integration tests, and automated ca-
nary tests. It uses a distribution tree to deliver small con-
figs through the push model, and uses a P2P protocol to
deliver the bulk contents of large configs. It avoids commit
contention by delegating commits to the landing strip. It im-
proves commit throughput by using multiple git repositories
that collectively serve a partitioned global name space.
4. Gatekeeper
Facebook releases software early and frequently. This forces
us to get early feedback and iterate rapidly. It makes trou-
bleshooting easier, because the delta between two releases is
small. It minimizes the use of code branches that complicate
maintenance. On the other hand, frequent software releases
increase the risk of software bugs breaking the site. This
section describes how Gatekeeper helps mitigate the risk by
managing code rollouts through online config changes.
While a new product feature is still under development,
Facebook engineers commonly release the new code into
production early but in a disabled mode, and then use Gate-
keeper to incrementally enable it online. If any problem is
detected during the rollout, the new code can be disabled in-
stantaneously. Without changing any source code, a typical
launch using Gatekeeper goes through multiple phases. For
example, initially Gatekeeper may only enable the product
feature to the engineers developing the feature. Then Gate-
keeper can enable the feature for an increasing percentage
of Facebook employees, e.g., 1% → 10% → 100%. After suc-
cessful internal testing, it can target 5% of the users from a
specific region. Finally, the feature can be launched globally
with an increasing coverage, e.g., 1% → 10% → 100%.
if (gk_check("ProjectX", $user_id)) {
  // Show the new feature to the user.
  ...
} else {
  // Show the old product behavior.
  ...
}
Figure 4: Pseudocode of a product using Gatekeeper to con-
trol the rollout of a product feature.
bool gk_check($project, $user_id) {
  if ($project == "ProjectX") {
    // The gating logic for "ProjectX".
    if ($restraint_1($user_id) AND
        $restraint_2($user_id)) {
      // Cast the die to decide pass or fail.
      return rand($user_id) < $pass_prob_1;
    } else if ($restraint_4($user_id)) {
      return rand($user_id) < $pass_prob_2;
    } else {
      return false;
    }
  }
  ...
}
Figure 5: Pseudocode of Gatekeeper’s gating logic.
Figure 4 shows how a piece of product code uses a Gate-
keeper project (i.e., a specific gating logic) to enable or
disable a product feature. Figure 5 shows the project’s in-
ternal logic, which consists of a series of if-then-else state-
ments. The condition in an if -statement is a conjunction
of predicates called restraints. Examples of restraints in-
clude checking whether the user is a Facebook employee and
checking the type of a mobile device. Once an if -statement
is satisfied, it probabilistically determines whether to pass
or fail the gate, depending on a configurable probability that
controls user sampling, e.g., 1% or 10%.
The code in Figure 5 is conceptual. A Gatekeeper
project’s control logic is actually stored as a config that can
be changed live without a code upgrade. Through a Web UI,
the if-then-else statements can be added or removed (actually, it is a graphical representation, without code to write¹);
the restraints in an if -statement can be added or removed;
the probability threshold can be modified; and the parame-
ters specific to a restraint can be updated. For example, the
user IDs in the ID() restraint can be added or removed so
that only specific engineers will experience the new product
feature during the early development phase.
A Gatekeeper project is dynamically composed out of re-
straints through configuration. Internally, a restraint is stat-
ically implemented in PHP or C++. Currently, hundreds of
restraints have been implemented, which are used to compose tens of thousands of Gatekeeper projects. The restraints check various conditions of a user, e.g., country/region, locale, mobile app, device, new user, and number of friends.

¹ Code review is supported even if changes are made through the UI. The UI tool converts a user’s operations on the UI into a text file, e.g., “Updated Employee sampling from 1% to 10%”. The text file and a screenshot of the config’s final graphical representation are submitted for code review.
A Gatekeeper project’s control logic is stored as a JSON
config in Configerator. When the config is changed (e.g.,
expanding the rollout from 1% to 10%), the new config
is delivered to production servers (see the bottom of Fig-
ure 3). The Gatekeeper runtime reads the config and builds
a boolean tree to represent the gating logic. Similar to how
an SQL engine performs cost-based optimization, the Gate-
keeper runtime can leverage execution statistics (e.g., the ex-
ecution time of a restraint and its probability of returning
true) to guide efficient evaluation of the boolean tree.
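The paper does not give Gatekeeper's exact cost model, but a plausible sketch of such statistics-driven ordering for one conjunction is the classic cost/selectivity heuristic: evaluate the restraints that are cheap and likely to short-circuit the AND first.

# A sketch of one plausible optimization: within a conjunction, rank
# each restraint by cost / P(returns false), so cheap, highly selective
# restraints run first and the AND short-circuits early. The formula is
# the standard predicate-ordering heuristic, not Gatekeeper's exact one.
def order_conjunction(restraints):
    # each restraint carries measured stats: avg cost and P(returns true)
    def rank(r):
        p_false = max(1e-9, 1.0 - r["p_true"])   # avoid division by zero
        return r["cost_us"] / p_false
    return sorted(restraints, key=rank)

def evaluate(restraints, user) -> bool:
    for r in order_conjunction(restraints):
        if not r["fn"](user):
            return False                          # short-circuit
    return True

employee = {"fn": lambda u: u["is_employee"],
            "cost_us": 5, "p_true": 0.001}
device   = {"fn": lambda u: u["device"] == "iPhone",
            "cost_us": 1, "p_true": 0.3}
print(evaluate([employee, device],
               {"is_employee": False, "device": "iPhone"}))  # False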
The example in Figure 5 is similar to the disjunctive
normal form (DNF), except the use of rand() to sample a
subset of users. Sampling is inherent to feature gating, i.e.,
rolling out a feature to an increasingly larger population,
e.g., 1% → 10%. The negation operator is built inside each restraint. For example, the employee restraint can be configured to check “not an employee”. As a result, the
gating logic has the full expressive power of DNF.
Gatekeeper uses DNF of restraints and user sampling to
form the gating logic. It strikes a balance among flexibil-
ity, usability, and safety. Theoretically, it is possible not to
impose any structure on the gating logic (i.e., not limited
to the form in Figure 5), by allowing an engineer to write
arbitrary gating code in a dynamic programming language
(e.g., PHP), and immediately distributing it for execution as
a live config update. This approach offers maximum flexibil-
ity, but increases the risk of configuration errors. Moreover,
it is harder to use for most engineers, compared with sim-
ply selecting restraints from Gatekeeper’s UI without writing
any code. Finally, its additional flexibility over Gatekeeper
is limited, because Facebook rolls out PHP code twice a day
and new restraints can be added quickly.
Some gating logic is computationally too expensive to
execute realtime inside a restraint. In one example, a prod-
uct feature should only be exposed to users whose recent
posts are related to the current trending topics. This compu-
tation requires continuous stream processing. In another ex-
ample, it needs to run a MapReduce job to analyze historical
data to identify users suitable for a product feature. Gate-
keeper provides a key-value-store interface to integrate with
these external systems. A special laser() restraint invokes get(“$project-$user_id”) on a key-value store called Laser.
If the return value is greater than a configurable threshold,
i.e., get(...)>T, the restraint passes. Any system can integrate
with Gatekeeper by putting data into Laser. Laser stores data
on flash or in memory for fast access. It has automated data
pipelines to load data from the output of a stream process-
ing system or a MapReduce job. The MapReduce job can be
re-run periodically to refresh the data for all users.
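In sketch form, with a hypothetical key-value client:

# A sketch of the laser() restraint. The key format "$project-$user_id"
# and the threshold comparison follow the text above; kv_store.get() is
# a stand-in for the real Laser client.
def laser_restraint(kv_store, project: str, user_id: int,
                    threshold: float) -> bool:
    value = kv_store.get(f"{project}-{user_id}")
    # users missing from the store simply fail the gate
    return value is not None and value > threshold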
Figure 6: MobileConfig architecture. On the device, mobile app code (iOS, Android) calls a cross-platform C++ client library, which periodically pulls config changes and can receive emergency pushes. On the server side, a translation layer maps each MobileConfig field to a backend system, e.g., FEATURE_X to Gatekeeper (“ProjX”) and VOIP_ECHO to a Gatekeeper-backed experiment (“ECHO”), on top of Configerator, Gatekeeper, and other A/B testing tools.
5. MobileConfig
Configuration management for mobile apps differs from
that for applications running in data centers, because of the
unique challenges in a mobile environment. First, the mo-
bile network is a severe limiting factor. Second, mobile plat-
forms are diverse, with at least Android and iOS to support.
Lastly, legacy versions of a mobile app linger around for a
long time, raising challenges in backward compatibility.
MobileConfig addresses these challenges while maximiz-
ing the reuse of the configuration management tools already
developed for applications running in data centers. Figure 6
shows the architecture of MobileConfig. Every config ap-
pears as a context class in a mobile app’s native language.
The app invokes the context class’ getter methods to retrieve
the values of the config fields. The client library that sup-
ports the context class is implemented in C++ so that it is
portable across Android and iOS.
Because push notification is unreliable, MobileConfig
cannot solely rely on the push model for config distribution.
The client library polls the server for config changes (e.g.,
once every hour) and caches the configs on flash for later
reuse. To minimize the bandwidth consumption, the client
sends to the server the hash of the config schema (for schema
versioning) and the hash of the config values cached on the
client.² The server sends back only the configs that have
changed and are relevant to the client’s schema version. In
addition to pull, the server occasionally pushes emergency
config changes to the client through push notification, e.g.,
to immediately disable a buggy product feature. A combina-
tion of push and pull makes the solution simple and reliable.
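One poll round might look like the following sketch. The wire format and the SHA-1 choice are assumptions, but the two hashes mirror the schema hash and value hash described above.

# A sketch of one MobileConfig poll round with hypothetical wire fields.
import hashlib, json

def poll(server, schema: dict, cached_values: dict) -> dict:
    request = {
        "schema_hash": hashlib.sha1(
            json.dumps(schema, sort_keys=True).encode()).hexdigest(),
        "values_hash": hashlib.sha1(
            json.dumps(cached_values, sort_keys=True).encode()).hexdigest(),
    }
    delta = server.poll(request)   # empty if nothing changed: cheap round
    cached_values.update(delta)    # the real client persists this to flash
    return cached_values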
² One future enhancement is to make the server stateful, i.e., remembering each client’s hash values to avoid repeated transfer of the hash values.

To cope with legacy versions of a mobile app, separating abstraction from implementation is a first-class citizen
in MobileConfig. The translation layer in Figure 6 provides
one level of indirection to flexibly map a MobileConfig field
to a backend config. The mapping can change. For exam-
ple, initially VOIP
ECHO is mapped to a Gatekeeper-backed
experiment, where satisfying different if -statements in Fig-
ure 5 gives VOIP
ECHO a different parameter value to ex-
periment with. After the experiment finishes and the best pa-
rameter is found, VOIP
ECHO can be remapped to a con-
stant stored in Configerator. In the long run, all the backend
systems (e.g., Gatekeeper and Configerator) may be replaced
by new systems. It only requires changing the mapping in the
translation layer to smoothly finish the migration. To scale to
more than one billion mobile devices, the translation layer
runs on many servers. The translation mapping is stored in
Configerator and distributed to all the translation servers.
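A sketch of the translation layer's lookup, using the mapping shown in Figure 6; the backend objects below are stand-ins for the real Gatekeeper and Configerator lookups.

# A sketch of the translation layer's one level of indirection.
MAPPING = {   # itself stored as a Configerator config
    "FEATURE_X": ("gatekeeper", "ProjX"),
    "VOIP_ECHO": ("experiment", "ECHO"),   # later remappable to a constant
}

def resolve(field: str, user_id: int, backends: dict):
    backend, key = MAPPING[field]
    return backends[backend].lookup(key, user_id)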
6. Usage Statistics and Experience
We described Facebook’s configuration management stack.
Given the space limit, this section will primarily focus
on the usage statistics of the production systems and our
experience, because we believe those production data are
more valuable than experimental results in a sandbox en-
vironment. For the latter, we report Configerator’s commit-
throughput scalability test in a sandbox, because the data
cannot be obtained from production, and currently commit
throughput is Configerator’s biggest scalability challenge.
We attempt to answer the following questions:
• Does the configuration-as-code hypothesis hold, i.e., do most engineers prefer writing config-generating code?
• What are the interesting statistics about config update patterns, e.g., do old configs become dormant quickly?
• What are the performance and scale of the tools?
• What are the typical configuration errors?
• How do we scale our operation to support thousands of engineers and manage configs on hundreds of thousands of servers and a billion or more mobile devices?
• How does an organization’s engineering culture impact its configuration management solution?
6.1 Validating the Configuration-as-Code Hypothesis
Configerator stores different types of files: 1) the Python and
Thrift source code as shown in Figure 2; 2) the “compiled configs”, i.e., the JSON files generated by the Configerator compiler from the source code; 3) the “raw configs”, which are not shown in Figure 2 for brevity. Configerator allows
engineers to check in raw configs of any format. They are not
produced by the Configerator compiler, but are distributed in
the same way as compiled configs. They are either manually
edited or produced by other automation tools.
Figure 7 shows the number of configs stored in Config-
erator. The growth is rapid. Currently it stores hundreds of
thousands of configs. The compiled configs grow faster than
the raw configs. Out of all the configs, 75% of them are
Figure 7: Number of configs (compiled vs. raw) in the repository, plotted against days since the creation of the config repository; an annotation marks when Gatekeeper migrated to Configerator. The y-axis scale is removed due to confidentiality, but we keep the unit (hundreds of thousands) to show the order of magnitude.
Figure 8: Cumulative distribution function (CDF) of config size in bytes, for raw and compiled configs. Note that the x-axis is neither linear nor logscale.
compiled configs. Moreover, about 89% of the updates to
raw configs are done by automation tools, i.e., not manually
edited. This validates one important hypothesis of Configer-
ator, i.e., engineers prefer writing config-generating code to
manually editing configs. The custom automation tools are
used to generate raw configs because they suit the specific
(often simpler) jobs better, or they predate Configerator.
Figure 8 shows the CDF of config size. Many configs
have significant complexity and are not trivial name-value
pairs. The complexity is one reason why engineers prefer
not editing configs manually. Compiled configs are more
complex than raw configs. The P50’s of raw config size and
compiled config size are 400 bytes and 1KB, respectively.
The P95’s are 25KB and 45KB, respectively. The largest
configs are 8.4MB and 14.8MB, respectively. Even larger
configs are distributed through PackageVessel and only their
small metadata is stored in Configerator.
On average, a raw/compiled/source config gets updated
44/16/10 times during its lifetime, respectively. Raw configs
get updated 175% more frequently than compiled configs,
because most (89%) of the raw config changes are done by
automation tools. Compiled configs are generated from con-
fig source code, but the former changes 60% more frequently
than the latter does, because the change of one source code
file may generate multiple new compiled configs, similar to
a header file change in C++ causing recompilation of mul-
tiple .cpp files. This indicates that writing code to gener-
ate configs reduces the burden of manually keeping track of
changes, thanks to code reuse.
6.2 Config Update Statistics
This section reports config update statistics, which hopefully
can help guide the design of future configuration manage-
ment systems, and motivate future research.
Are configs fresh or dormant, i.e., are they updated re-
cently? Figure 9 shows that both fresh configs and dormant
configs account for a significant fraction. Specifically, 28%
of the configs are either created or updated in the past 90
days. On the other hand, 35% of the configs are not updated
even once in the past 300 days.
Do configs created a long time ago still get updated?
Figure 10 shows that both new configs and old configs get
updated. 29% of the updates happen on configs created in
the past 60 days. On the other hand, 29% of the updates
happen on configs created more than 300 days ago, which is
significant because those old configs only account for 50%
of the configs currently in the repository. The configs do not
stabilize as quickly as we initially thought.
The frequency of config updates is highly skewed, as
shown in Table 1. It is especially skewed for raw configs
because their commits are mostly done by automation tools.
At the low end, 56.9% of the raw configs are created but
then never updated. At the high end, the top 1% of the raw
configs account for 92.8% of the total updates to raw configs.
By contrast, 25.0% of the compiled configs are created but
then never updated, and the top 1% of the compiled configs
account for 64.5% of the total updates.
When a config gets updated, is it a big change or a small
change? Table 2 shows that most changes are very small.
In the output of the Unix diff tool, it is considered a one-
line change to add a new line or delete an existing line in
a file. Modifying an existing line is considered a two-line
change: first deleting the existing line, and then adding a
new line. Table 2 shows that roughly about 50% of the config
Figure 9: Freshness of configs (CDF over days since a config was last modified).
Figure 10: Age of a config at the time of an update (CDF over age in days).
changes are very small one-line or two-line changes. On the
other hand, large config changes are not a negligible fraction.
8.7% of the updates to compiled configs modify more than
100 lines of the JSON files.
How many co-authors update the same config? Table 3
shows that most configs are only updated by a small number
of co-authors. Specifically, 79.6% of the compiled configs
are only updated by one or two authors, whereas 91.5% of
the raw configs are only updated by one or two authors. It is
more skewed for raw configs, because most raw configs are
updated by automation tools, which are counted as a single
author. On the other hand, some configs are updated by a
large number of co-authors. For example, there is a sitevar
updated by 727 authors over its two years of lifetime. For
future work, it would be helpful to automatically flag high-
risk updates on these highly-shared configs.
Is a config’s co-authorship pattern significantly different
from that of regular C++/Python/Java code? No. This is be-
cause Facebook adopts the DevOps model, where engineers
developing a software feature also do the configuration-
related operational work in production. The “fbcode” col-
umn in Table 3 shows the breakdown for Facebook’s back-
end code repository, which is primarily C++, Python, and
Java. There is no big difference between the “compiled con-
fig” column and the “fbcode” column. One subtle but im-
No. of updates in lifetime   Compiled config   Raw config
1                            25.0%             56.9%
2                            24.9%             23.7%
3                            14.1%              5.2%
4                             7.5%              3.2%
[5, 10]                      15.9%              6.6%
[11, 100]                    11.6%              3.0%
[101, 1000]                   0.8%              0.7%
[1001, ∞)                     0.2%              0.7%

Table 1: Number of times that a config gets updated. How to read the bold cell: 25.0% of compiled configs are written only once, i.e., created and then never updated.
No. of line changes in a config update   Compiled config   Source code   Raw config
1                                         2.5%              2.7%          2.3%
2                                        49.5%             44.3%         48.6%
[3, 4]                                    9.9%             13.5%         32.5%
[5, 6]                                    3.9%              4.6%          4.2%
[7, 10]                                   7.4%              6.1%          3.6%
[11, 50]                                 15.3%             19.3%          5.7%
[51, 100]                                 2.8%              2.3%          1.1%
[101, ∞)                                  8.7%              7.3%          2.0%

Table 2: Number of line changes in a config update. How to read the bold cell: 49.5% of the updates to compiled configs are two-line changes.
No. of co-authors   Compiled config   Raw config   fbcode
1                   49.5%             70.0%        44.0%
2                   30.1%             21.5%        37.7%
3                    9.2%              5.1%         7.6%
4                    3.9%              1.4%         3.6%
[5, 10]              5.7%              1.2%         5.6%
[11, 50]             1.3%              0.6%         1.4%
[51, 100]            0.2%              0.1%         0.02%
[101, ∞)             0.04%             0.002%       0.007%

Table 3: Number of co-authors of configs. How to read the bold cell: 49.5% of the compiled configs have a single author throughout their lifetime.
Figure 11: Daily commit throughput (thousands of commits per day) of the Configerator, www, and fbcode repositories from April 2014 to January 2015; the trough marks Christmas.
portant difference is at the high end. 0.24% of the compiled
configs have more than 50 co-authors, whereas only 0.027% of the files in fbcode have more than 50 co-authors. The rela-
tive difference is large.
6.3 Configerator & Gatekeeper Performance
Figure 11 compares Configerator’s daily commit through-
put (i.e., the number of times code/configs are checked into
the git repository) with those of www (frontend code repos-
itory) and fbcode (backend code repository). (Facebook has
other code repositories not shown here.) Figure 11 high-
lights the scaling challenge we are facing. In 10 months,
the peak daily commit throughput grows by 180%. The pe-
riodic peaks and valleys are due to the weekly pattern, i.e.,
fewer commits on weekends. Configerator has a high commit
throughput even on weekends, because a significant fraction
of config changes are automated by tools and done continu-
ously. Specifically, Configerator’s weekend commit through-
put is about 33% of the busiest weekday commit throughput,
whereas this ratio is about 10% for www and 7% for fbcode.
Figure 12 shows Configerator’s hourly commit through-
put in a week. It exhibits both a weekly pattern (low ac-
tivities during the weekend) and a daily pattern (peaks be-
Figure 12: Configerator’s hourly commit throughput (hundreds of commits per hour, over the week of 11/3 to 11/9).
Figure 13: Configerator’s maximum commit throughput (commits per minute) and the corresponding commit latency (seconds), as a function of the number of files in the git repository (thousands). Measured from a sandbox instead of production.
tween 10AM-6PM). There is a steady number of commits
throughout the nights and weekends. Those are automated
commits, which account for 39% of the total commits.
Figure 13 shows Configerator’s maximum commit
throughput as a function of the git repository size. It is mea-
sured from a sandbox setup under a synthetic stress load
test. The git repository is built up by replaying Configera-
tor’s production git history from the beginning. To go be-
yond Configerator’s current repository size, we project the
repository’s future growth by generating synthetic git com-
mits that follow the statistical distribution of past real git
commits. Figure 13 shows that the commit throughput is not
scalable with respect to the repository size, because the exe-
cution time of many git operations increases with the number
of files in the repository and the depth of the git history. The
right y-axis is simply latency = 60 / throughput, which makes
the trend more obvious. This latency is just the execution
time excluding any queueing time. To improve throughput
and reduce latency, Configerator is in the process of migrating to multiple smaller git repositories that collectively
serve a partitioned global name space (see Section 3.6).
When an engineer saves a config change, it takes about
ten minutes to go through automated canary tests. This long
testing time is needed in order to reliably determine whether
the application is healthy under the new config. After ca-
nary tests, how long does it take to commit the change and
propagate it to all servers subscribing to the config? This la-
tency can be broken down into three parts: 1) It takes about 5
seconds to commit the change into the shared git repository,
because git is slow on a large repository; 2) The git tailer
(see Figure 3) takes about 5 seconds to fetch config changes
from the shared git repository; 3) The git tailer writes the
change to Zeus, which propagates the change to all subscrib-
ing servers through a distribution tree. The last step takes
about 4.5 seconds to reach hundreds of thousands of servers
distributed across multiple continents.
Figure 14: Latency (seconds) between committing a config change and the new config reaching the production servers (the week of 11/3/2014).
Figure 15: Gatekeeper check throughput (billions of checks per second, the week of 2/15/2015).
Figure 14 shows the end-to-end latency of the three steps
above. It is measured from production. The baseline latency
is about 14.5 seconds, but it increases with the load. It shows
both a daily pattern (due to the low load at nights) and
a weekly pattern (due to the low load on weekends). The
major challenge in reducing the latency is to speed up git
operations on a large repository. On the other hand, latency
is less critical for Configerator, because automated canary
tests take about ten minutes anyway.
Gatekeeper projects manage product feature rollouts.
When a user accesses facebook.com, the Gatekeeper projects
are checked in realtime to determine what features to en-
able for the user. Figure 15 shows the total number of Gate-
keeper checks across the site. Because the check throughput
is high (billions of checks per second) and some Gatekeeper
restraints are data intensive, currently Gatekeeper consumes
a significant percentage of the total CPU of the frontend
clusters that consist of hundreds of thousands of servers.
We constantly work on improving Gatekeeper’s efficiency.
On the other hand, we consider this “overhead” worthwhile,
because it enables Facebook engineers to iterate rapidly on
new product features. This is evidenced by the popularity
of Gatekeeper. In 2014, tens of thousands of Gatekeeper
projects were created or updated to actively manage the roll-
outs of a huge number of micro product features.
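A Gatekeeper check must be cheap and deterministic so that a user sees consistent behavior across requests. The sketch below illustrates one standard way to implement a percentage-rollout restraint (the function and project names are hypothetical; Gatekeeper's real restraint types are richer and data-driven):

    import hashlib

    def in_rollout(project, user_id, percent):
        # Hash (project, user) into one of 10,000 stable buckets, so the
        # same user is consistently inside or outside a given rollout,
        # and different projects sample independent sets of users.
        digest = hashlib.md5(("%s:%d" % (project, user_id)).encode()).hexdigest()
        bucket = int(digest, 16) % 10000
        return bucket < percent * 100  # percent is in [0, 100]

    # Enable a feature for 5% of users:
    if in_rollout("new_composer_ui", user_id=42, percent=5):
        pass  # serve the new feature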
6.4 Configuration Errors
To illustrate the challenges in safeguarding configs, we dis-
cuss several real examples of configuration errors as well as
a statistical analysis of configuration errors.
The first example is related to the config rollout process.
When both the client code and the config schema are updated
together, they may not get deployed to a server at the same
time. Typically, the new client code can read the old config
schema, but the old client code cannot read the new config
schema. The latter is what happened in one incident. An en-
gineer changed both the client code and the config schema.
She checked in the client code and thought it would be in-
cluded in the code deployment happening later on that day.
However, the release branch was cut earlier and her new
code unknowingly missed that release. Five days later, the
engineer committed a new config using the new schema.
The automated canary testing tool initially only deployed the
new config to 20 servers in production, and monitored their
health. It compared the error logs of those 20 servers with
those of the rest of the production servers, and detected a log
spew, i.e., rapid growth of error logs. It aborted the rollout
and prevented an outage.
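The log-spew detection that aborted this rollout can be approximated by comparing the canary group's error-log rate against the fleet baseline; the sketch below is our illustration, with hypothetical thresholds (the production tool's statistics are more sophisticated):

    def log_spew(canary_rate, baseline_rate, ratio=5.0, min_rate=1.0):
        # Rates are error-log lines per second per server. min_rate
        # suppresses false alarms when both rates are negligibly small.
        if canary_rate < min_rate:
            return False
        return canary_rate > ratio * max(baseline_rate, min_rate)

    # Abort the rollout: 20 canary servers log errors far faster
    # than the rest of the production fleet.
    assert log_spew(canary_rate=120.0, baseline_rate=3.0)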
Another incident ended less fortunately. An engineer made a config change, which the automated canary tool rejected
because it caused some instances of the application to crash.
The engineer stared at the config change. It seemed such a
trivial and innocent change that nothing could possibly go
wrong. It must be a false positive of the canary tool! She
overrode the tool’s rejection and deployed the config, which
caused more crashes. She mitigated the problem by immedi-
ately reverting the config change. It turned out that the con-
fig change itself was indeed correct, but it caused the ap-
plication to exercise a new code path and triggered a subtle
race-condition bug in the code. This incident highlights that problems can arise in unexpected ways.
Enhancing automated canary tests is a never-ending battle, as shown in the example below. An engineer introduced
a configuration error that sent mobile requests down a rare
code path to fetch data from a backend store. It put too much
load on the data store and dramatically increased the latency.
At that time, automated canary tests did not catch the prob-
lem, because the testing was done on a limited number of
servers, and the small-scale testing was insufficient to cause any load issue. The latency increase became evident only after the config was deployed site-wide. We have since added a canary phase that tests a new config on thousands of servers in a cluster, to catch cluster-level load issues, as sketched below.
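Under these assumptions, the multi-phase canary can be expressed as an ordered plan in which each phase widens the blast radius only after the previous phase passes its health checks; the phase sizes and check names below are hypothetical:

    # Hypothetical staged canary plan. Small-scale phases catch crashes
    # and log spew; the cluster-level phase also catches load issues.
    CANARY_PHASES = [
        {"name": "spot-check", "servers": 20,   "checks": ["crash", "log_spew"]},
        {"name": "cluster",    "servers": 2000, "checks": ["crash", "log_spew", "load"]},
        {"name": "site-wide",  "servers": None, "checks": ["crash", "log_spew", "load"]},
    ]

    def run_canary(deploy, healthy):
        for phase in CANARY_PHASES:
            deploy(phase["servers"])  # None means all remaining servers
            if not healthy(phase["checks"]):
                raise RuntimeError("canary failed in phase " + phase["name"])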
In addition to the examples above, we manually analyzed
the high-impact incidents during a three-month period. We
found that 16% of the incidents were related to configura-
tion management, while the rest were dominated by soft-
ware bugs. The table below shows the breakdown of the
configuration-related incidents.
Type of Config Issues                                Percentage
Type I: common config errors                         42%
Type II: subtle config errors                        36%
Type III: valid config changes exposing code bugs    22%
Type I errors are obvious once spotted, e.g., typos, out-of-
bound values, and referencing an incorrect cluster. They can
benefit from more careful config reviews. Type II errors are
harder to anticipate ahead of time, e.g., load-related issues,
failure-induced issues, or butterfly effects. The root causes
of type III issues are actually in the code rather than in the
configs. The high percentage of type III issues was a surprise
to us initially. All types of config issues can benefit from
better canary testing, which is a focus of our ongoing work.
6.5 Operational Experience
Facebook’s configuration management team adopts the De-
vOps model. A small group of engineers (i.e., the authors)
are responsible for 1) implementing new features and bug
fixes, 2) deploying new versions of the tools into production,
3) monitoring the health of the tools and resolving produc-
tion issues, and 4) supporting the engineering community,
e.g., answering questions and reviewing code related to the
use of the tools. Everyone does everything, without sepa-
ration of roles such as architect, software engineer, test en-
gineer, and system administrator. The small team is highly
leveraged, because we support thousands of engineers using
our tools and manage the configs on hundreds of thousands
of servers and more than one billion mobile devices.
Engineers on the team rotate through an oncall schedule.
The oncall shields the rest of the team so they can focus on development work. Monitoring tools escalate urgent issues to the
oncall through automated phone calls. The tool users post
questions about non-urgent issues into the related Facebook
groups; these are answered by the oncall when she is available, or by users helping each other. We strive
to educate our large user community and make them self-
sufficient. We give lectures and lab sessions in the bootcamp,
where new hires go through weeks of training.
Although Configerator's architecture in Figure 3 seems complex, we rarely have outages of the Configerator components, thanks in part to the built-in redundancy and automated failure recovery. On the other hand, outages caused by configuration errors breaking products are more common, and we strive to guard against them, e.g., by continuously enhancing automated canary tests.
The Configerator proxy is the most sensitive component,
because it runs on almost every server and it is hard to antic-
ipate all possible problems in diverse environments. Many
product teams enroll their servers in a testing environment
to verify that a new proxy does not break their products. The
proxy rollout is always done carefully in a staged fashion.
6.6 Configuration Management Culture
An Internet service’s configuration management process and
tools reflect the company’s engineering culture. At the con-
servative extreme, all config changes must go through a cen-
tral committee for approval and are carefully executed by
a closed group of operation engineers. At the moving-fast
extreme, every engineer can change any config directly on
the site and claim “test it live!”, which is not uncommon in
startups. The authors experienced both extremes (as well as
something in-between) at different companies.
Over the years, Facebook has evolved from optional diff
review and optional manual testing for config changes, to
mandatory diff review and mandatory manual testing. We
also put more emphasis on automated canary tests and au-
tomated continuous integration tests. The tools do support
access control (i.e., only white-listed engineers can change
certain critical configs), but that is an exception rather than
the norm. We empower individual engineers to use their best
judgment to roll out config changes quickly, and build var-
ious automated validation or testing tools as the safety net.
We expect Facebook’s configuration management culture to
further evolve, even in significant ways.
Facebook engineers iterate rapidly on product features
by releasing software early and frequently. This is inherent
to Facebook’s engineering culture. It requires tools to man-
age product rollouts and run A/B tests to identify promising
product features. Gatekeeper and other A/B testing tools are
developed as first-class citizens to support this engineering culture, and are widely used by product engineers.
7. Related Work
There are few publications on runtime configuration management for Internet services. The closest to our work is
Akamai’s ACMS system [30]. ACMS is a configuration stor-
age and distribution system, similar to the Zeus component
in our stack. ACMS uses the pull model whereas Zeus uses
the push model. Moreover, the scope of our work is much
broader, including configuration authoring and validation,
code review, version control, and automated testing. Our
use cases are also much broader, from incremental product
launch to testing parameters on mobile devices. In terms of
config distribution, just on the server side, our system is 10-
100 times larger than ACMS; counting mobile devices as well,
our system is 10,000-100,000 times larger.
Configuration management is an overloaded term. It has
been used to mean different things: 1) controlling an applica-
tion’s runtime behavior; 2) source code version control [9];
and 3) software deployment, e.g., Chef [7], Puppet [28],
Autopilot [19], and other tools covered in the survey [10].
ACMS and our tools fall into the first category.
Configuration-as-code has different forms of realiza-
tion. Chef [7] executes code on the target server at the de-
ployment time, whereas Configerator executes code during
the development phase and then pushes the JSON file to all
servers during the deployment phase. The difference stems
from their focuses: software installation vs. managing run-
time behavior. Configerator indeed has some special use
cases where scripts are stored as raw configs and pushed to
servers for execution, but those are very rare.
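The distinction can be illustrated with a minimal sketch of Configerator-style development-time execution: a Python program computes the config and materializes it as plain JSON, which is what actually gets distributed (the field names are hypothetical, and the Thrift-based schema validation is omitted):

    import json

    # "Configuration as code": derived values are computed once at
    # development time; servers only ever see the resulting JSON.
    REGIONS = ["us-east", "us-west", "eu-west"]

    config = {
        "feature_x_enabled": True,
        "per_region_quota": {r: 3000 // len(REGIONS) for r in REGIONS},
    }

    with open("feature_x.json", "w") as f:
        json.dump(config, f, indent=2, sort_keys=True)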
Chubby [6] and ZooKeeper [18] provide coordination
services for distributed systems, and can be used to store
application metadata. We use Zeus, an enhanced version of ZooKeeper, to store configs, and use observers to form a dis-
tribution tree. Thialfi [1] delivers object update notifications
to clients that registered their interests in the objects. It po-
tentially can be used to deliver config change notifications.
Oppenheimer et al. [24] studied why Internet services
fail, and identified configuration errors as a major source
of outages. Configuration error is a well-studied research topic [17, 21, 25, 33, 35-37]. Our focus is to prevent config-
uration errors through validators, code review, manual tests,
automated canary tests, and automated integration tests.
Configuration debugging tools [3, 31, 32] are complemen-
tary to our work. We can benefit from these tools to diagnose
the root cause of a configuration error. Spex [34] infers con-
figuration constraints from software source code. It might
help automate the process of writing Configerator validators.
Like LBFS [23] and DTD [22], MobileConfig uses hashes to detect and avoid duplicate data transfers. It is unique in that
legacy versions of an app may access the same config using
different schemas and need to fetch different data.
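The hash-based duplicate-transfer avoidance amounts to a conditional fetch: the client presents the hash of its cached config, and the server returns the body only on a mismatch. A minimal sketch (the protocol shown is our illustration, not MobileConfig's actual wire format):

    import hashlib

    def serve_config(server_config, client_hash):
        # Return (changed, payload). The config body travels over the
        # network only when the client's cached copy is stale.
        server_hash = hashlib.sha256(server_config).hexdigest()
        if client_hash == server_hash:
            return False, None  # "not modified": only hashes were exchanged
        return True, server_config

    cached = b'{"ab_test_group": 7}'
    changed, _ = serve_config(cached, hashlib.sha256(cached).hexdigest())
    assert not changed  # identical config, no duplicate bulk transfer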
Configerator not only uses git for version control, but also
uses a git push to trigger an immediate config deployment.
Many cloud platforms [4, 15, 16] adopt a similar mechanism, where a git push triggers a code deployment.
Our tools follow many best practices in software engi-
neering, including code review, verification (in the form of
validators), continuous integration [12], canary test, and de-
ployment automation. Our contribution is to apply these
principles to large-scale configuration management.
8. Conclusion
We presented Facebook’s configuration management stack.
Our main contributions are 1) defining the problem space
and the use cases, 2) describing a holistic solution, and
3) reporting usage statistics and operational experience from
Facebook’s production system. Our major future work in-
cludes scaling Configerator, enhancing automated canary,
expanding MobileConfig to cover more apps, improving the
config abstraction (e.g., introducing config inheritance), and
flagging high-risk config updates based on historical data.
The technology we described is not exclusive to large Internet services. It matters for small systems as well.
Anecdotally, the first primitive version of the Sitevars tool
was introduced more than ten years ago, when Facebook
was still small. What principles or knowledge might apply beyond Facebook? We summarize our thoughts below, but caution readers about the risk of over-generalization.
- Agile configuration management enables agile software development. For example, gating product rollouts via config changes reduces the risks associated with frequent software releases. A/B testing tools allow engineers to quickly prototype product features and fail fast. Even if these specific techniques are not suitable for your organization, consider other dramatic improvements in your configuration management tools to help accelerate your software development process.
- With proper tool support, it is feasible for even a large organization to practice open configuration management, i.e., almost every engineer is allowed to make online config changes. Although it seems risky and might not be suitable for every organization, it is indeed feasible and can be beneficial to agile software development.
- It takes a holistic approach to defend against configuration errors, including config authoring, validation, code review, manual testing, automated integration tests, and automated canary tests.
- Although the use cases of configuration management can be very diverse (e.g., from gating product rollouts to A/B testing), it is feasible and beneficial to support all of them on top of a uniform and flexible foundation, with additional tools providing specialized functions (see Figure 1). Otherwise, inferior wheels will be reinvented. At Facebook, a long history of fragmented solutions has converged onto Configerator.
- For nontrivial configs, it is more productive and less error-prone for engineers to write programs that generate the configs than to edit the configs manually.
- In data centers, it is more efficient to use the push model to distribute config updates through a tree.
- The hybrid pull-push model is more suitable for mobile apps, because push notification alone is unreliable.
- Separating the distribution of a large config's small metadata from the distribution of its bulk content (through a P2P protocol) makes the solution scalable without sacrificing the data consistency guarantee.
- A typical git setup cannot provide sufficient commit throughput for large-scale configuration management. The solution is to use multiple git repositories that collectively serve a partitioned global name space, and to delegate commits to the landing strip to avoid contention.
- The config use cases and usage statistics we reported may motivate future research. For example, our data show that old configs do get updated, and many configs are updated multiple times. It would be helpful to automatically flag high-risk updates based on past history, e.g., a dormant config that is suddenly changed in an unusual way.
Acknowledgments
We would like to thank the anonymous reviewers and our
shepherd, Shan Lu, for their valuable feedback. The authors
are the current or the most recent members of Facebook’s
configuration management team. In the past, many other
colleagues contributed to Configerator, Gatekeeper, Sitevars,
and Zeus. We thank all of them for their contributions. We
thank our summer intern, Peng Huang, for writing scripts to
replay the git history used in Figure 13.
References
[1] ADYA, A., COOPER, G., MYERS, D., AND PIATEK, M. Thialfi: a client notification service for internet-scale applications. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (2011), pp. 129-142. SOSP'11.
[2] APACHE THRIFT. http://thrift.apache.org/.
[3] ATTARIYAN, M., AND FLINN, J. Automating configuration troubleshooting with dynamic information flow analysis. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (2010), pp. 237-250. OSDI'10.
[4] AWS ELASTIC BEANSTALK. http://aws.amazon.com/elasticbeanstalk/.
[5] BRONSON, N., AMSDEN, Z., CABRERA, G., CHAKKA, P., DIMOV, P., DING, H., FERRIS, J., GIARDULLO, A., KULKARNI, S., LI, H., ET AL. TAO: Facebook's distributed data store for the social graph. In Proceedings of the 2013 USENIX Annual Technical Conference (2013). USENIX ATC'13.
[6] BURROWS, M. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (2006), pp. 335-350. OSDI'06.
[7] CHEF. http://www.opscode.com/chef/.
[8] COHEN, B. Incentives build robustness in BitTorrent. In Proceedings of the 1st Workshop on Economics of Peer-to-Peer Systems (2003), pp. 68-72.
[9] CONRADI, R., AND WESTFECHTEL, B. Version models for software configuration management. ACM Computing Surveys 30, 2 (1998), 232-282.
[10] DELAET, T., JOOSEN, W., AND VAN BRABANT, B. A survey of system configuration tools. In Proceedings of the 24th Large Installation System Administration Conference (2010). LISA'10.
[11] DIBOWITZ, P. Really large scale systems configuration: config management @ Facebook, 2013. https://www.socallinuxexpo.org/scale11x-supporting/default/files/presentations/cfgmgmt.pdf.
[12] DUVALL, P. M., MATYAS, S., AND GLOVER, A. Continuous Integration: Improving Software Quality and Reducing Risk. Pearson Education, 2007.
[13] FOWLER, M., AND HIGHSMITH, J. The agile manifesto. Software Development 9, 8 (2001), 28-35.
[14] GIT. http://git-scm.com/.
[15] GOOGLE APP ENGINE. https://appengine.google.com/.
[16] HEROKU. https://www.heroku.com/.
[17] HUANG, P., BOLOSKY, W. J., SINGH, A., AND ZHOU, Y. ConfValley: A systematic configuration validation framework for cloud services. In Proceedings of the 10th European Conference on Computer Systems (2015), p. 19. EuroSys'15.
[18] HUNT, P., KONAR, M., JUNQUEIRA, F. P., AND REED, B. ZooKeeper: wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Annual Technical Conference (2010), pp. 11-11. USENIX ATC'10.
[19] ISARD, M. Autopilot: automatic data center management. ACM SIGOPS Operating Systems Review 41, 2 (2007), 60-67.
[20] JAVASCRIPT OBJECT NOTATION (JSON). http://www.json.org/.
[21] MAHAJAN, R., WETHERALL, D., AND ANDERSON, T. Understanding BGP misconfiguration. In Proceedings of the ACM SIGCOMM 2002 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (2002), pp. 3-16. SIGCOMM'02.
[22] MOGUL, J. C., CHAN, Y.-M., AND KELLY, T. Design, implementation, and evaluation of duplicate transfer detection in HTTP. In Proceedings of the 1st Symposium on Networked Systems Design and Implementation (2004), pp. 43-56. NSDI'04.
[23] MUTHITACHAROEN, A., CHEN, B., AND MAZIERES, D. A low-bandwidth network file system. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (2001), pp. 174-187. SOSP'01.
[24] OPPENHEIMER, D., GANAPATHI, A., AND PATTERSON, D. A. Why do Internet services fail, and what can be done about it? In Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems (2003). USITS'03.
[25] PAPPAS, V., XU, Z., LU, S., MASSEY, D., TERZIS, A., AND ZHANG, L. Impact of configuration errors on DNS robustness. In Proceedings of the ACM SIGCOMM 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (2004), pp. 319-330. SIGCOMM'04.
[26] PHABRICATOR. http://phabricator.org/.
[27] POWER, A. Making Facebook self-healing, 2011. https://www.facebook.com/notes/facebook-engineering/making-facebook-self-healing/10150275248698920.
[28] PUPPET. https://puppetlabs.com/.
[29] ROSSI, C. Ship early and ship twice as often. https://www.facebook.com/notes/facebook-engineering/ship-early-and-ship-twice-as-often/10150985860363920.
[30] SHERMAN, A., LISIECKI, P. A., BERKHEIMER, A., AND WEIN, J. ACMS: the Akamai configuration management system. In Proceedings of the 2nd Symposium on Networked Systems Design & Implementation (2005), pp. 245-258. NSDI'05.
[31] SU, Y.-Y., ATTARIYAN, M., AND FLINN, J. AutoBash: improving configuration management with operating system causality analysis. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (2007), pp. 237-250. SOSP'07.
[32] WHITAKER, A., COX, R. S., AND GRIBBLE, S. D. Configuration debugging as search: Finding the needle in the haystack. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (2004), pp. 77-90. OSDI'04.
[33] WOOL, A. A quantitative study of firewall configuration errors. IEEE Computer 37, 6 (2004), 62-67.
[34] XU, T., ZHANG, J., HUANG, P., ZHENG, J., SHENG, T., YUAN, D., ZHOU, Y., AND PASUPATHY, S. Do not blame users for misconfigurations. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (2013), pp. 244-259. SOSP'13.
[35] XU, T., AND ZHOU, Y. Systems approaches to tackling configuration errors: A survey. ACM Computing Surveys 47, 4 (2015), 70.
[36] YIN, Z., MA, X., ZHENG, J., ZHOU, Y., BAIRAVASUNDARAM, L. N., AND PASUPATHY, S. An empirical study on configuration errors in commercial and open source systems. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (2011), pp. 159-172. SOSP'11.
[37] ZHANG, J., RENGANARAYANA, L., ZHANG, X., GE, N., BALA, V., XU, T., AND ZHOU, Y. EnCore: Exploiting system environment and correlation information for misconfiguration detection. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (2014), pp. 687-700. ASPLOS'14.