Principles of Site Reliability Engineering at Google

Over the last several years, the concept of “DevOps” has swept through the engineering ecosystem, but a newer concept is gaining momentum: “Site Reliability Engineering.” The concept was created by Ben Treynor at Google. In 2014, a conference called SREcon was created to bring together the growing community of like-minded engineers. Google has also released a free book on the subject. The purpose of this blog post is to describe the nine major principles of Site Reliability Engineering at Google.

The first principle is to hire coders. In practice, at Google, they often hire Systems Administrators as well as Developers for the Site Reliability Engineer (SRE) position. Nevertheless, the primary duty of an SRE is to write code. In fact, one of the main concepts of site reliability engineering is “what happens when one hires a developer to do operations?” Hopefully, the developer will attempt to automate him/herself out of a job.

As a compute cluster scales linearly to accommodate more users and as software scales by adding more features, one would expect human resources to scale linearly as well, in order to manage the additional systems and to troubleshoot the increased surface area of additional features. However, an alternative to hiring more and more engineers to accommodate linear growth is an intense focus on automation. If a small group of engineers can devote most of their time to automating manual tasks and to auto-remediation of issues, then a compute cluster can grow linearly while the engineering group remains small.

So, the first principle of site reliability engineering is to hire great coders and let them leave if they want to leave. The part about letting them leave without a penalty is also important. If the manual work continues to be overwhelming and not enough attention is being paid to automation, then let the engineer transfer back into a more traditional development role of adding features to a product.

The second principle of site reliability engineering is to hire your SREs and your developers from the same staffing pool and treat them all as developers. An SRE is a developer. But, rather than adding features, the SRE developer is working on improving the reliability of the system. At Google, it is common for a developer to do a rotational assignment as an SRE in Mission Control. If he/she likes the work, he/she can stay; if not, he/she can go back to doing traditional development.

It is also important that there is not a line of separation between SREs and developers. Rather, the developers, who are adding features, continue to share at least 5 percent of the operational on-call workload, and they handle the spillover from the SRE team.

So, the third principle of site reliability engineering is that about 5 percent of the ops work goes to the dev team, plus all overflow. The development team always remains in the operational loop. In fact, if a development team adds features that result in instability (that is, the software product produces a number of incidents in a short period of time), then the SREs can kick the product back to the development team and say that it is not ready for SRE support. In other words, if a product is not ready for production support, the developers who created it have to assume full-time support of it.

The fourth principle of site reliability engineering is to cap the SRE operational load at 50 percent (usually 30 percent). In other words, at least half of their time, SREs should be working on automation and improving reliability. One way that Google enforces this is by limiting the number of issues that an SRE is able to work on during any given shift. Typically, an issue that results in an interruption (or an alert) takes six hours to process. Of course, the fix itself typically takes minutes, but the full resolution process takes approximately six hours. The process includes a postmortem document, a postmortem review meeting and a set of action items, which are placed into a ticketing system. So, an SRE can only handle a maximum of two operational issues during a 12-hour shift. If there are more issues, these issues spill over to the development teams.
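The spillover rule can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not Google's tooling; the function name, the incident names and the idea of passing incidents as a simple list are my own assumptions.

```python
from typing import List, Tuple

# From the text above: at most two operational issues per 12-hour shift.
MAX_INCIDENTS_PER_SHIFT = 2


def route_incidents(incidents: List[str],
                    cap: int = MAX_INCIDENTS_PER_SHIFT) -> Tuple[List[str], List[str]]:
    """Split a shift's incidents into (handled by the SRE, spilled to the dev team)."""
    if cap < 0:
        raise ValueError("cap must be non-negative")
    return incidents[:cap], incidents[cap:]


# Hypothetical incident names, for illustration only.
sre, dev = route_incidents(["db-failover", "disk-full", "cert-expiry"])
print("SRE on-call:", sre)
print("Dev team overflow:", dev)
```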

The fifth principle is that an on-call team has a minimum of 8 engineers for one location (or 6 engineers in each of two locations), handling a maximum of 2 events per shift. The reason for a minimum of 8 engineers is so that each engineer is on-call two weeks out of every month with a 12-hour shift. Having enough engineers on the team results in a reasonable workload and minimizes burnout.

The sixth principle of site reliability engineering is that postmortems are blameless and focus on process and technology. The central idea is that when things go wrong, the problem is the system, the process, the environment and the technology stack. Of course, there could be some human error involved, and it is very likely that the quick remediation of the problem was a result of the outstanding talent on the SRE team. Nevertheless, the focus is on how to make things better, so the focus is on the strategy, the structure and the systems. Could our monitoring, alerting and tools be better? How can we fix the problem so that it does not happen again?

Ideally, an SRE team should not face the same problems repeatedly. The result of a postmortem is a list of action items for changing and improving the system, and there should be ample time in the schedule to work on these action items. One SRE adage is: do it once manually, and the second time, automate it. Again, the primary job of an SRE is to work on automation so as to improve the system. So, as the SRE tries to work him/herself out of a job, the cluster can grow and more features can be introduced without having to grow the size of the team.

The seventh principle is to have a written Service Level Objective (SLO) for each service and to measure performance against it. A Service Level Agreement (SLA) is a contract between a service provider and a customer. SLOs are the agreed-upon means of measuring the performance of a service provider. SLOs are composed of Service Level Indicators (SLIs). An SLI is merely something that you measure; it is a graph on your dashboard. But when you attach a threshold to an SLI and generate an alert, that threshold should be tied to your SLO. Typically, we measure the availability of a service, and the SLO is a threshold for how much unavailability will be tolerated. Is your objective to have your service available 99.9 percent of the time? If so, this means that you can tolerate 10 minutes and 5 seconds of unavailability per week (and 43 minutes and 50 seconds per month).
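The downtime figures above are simple arithmetic, and a short Python sketch makes them easy to recompute for other targets. The function names are my own, and I am assuming an average month of 365.25 / 12 days, which is what reproduces the monthly numbers quoted in this post.

```python
def allowed_downtime_seconds(slo_percent: float, period_seconds: float) -> float:
    """Seconds of unavailability that an availability SLO permits in a period."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return period_seconds * (100.0 - slo_percent) / 100.0


def as_minutes_seconds(seconds: float) -> str:
    """Format a duration, rounded to the nearest second, as 'M min S s'."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes} min {secs} s"


WEEK = 7 * 24 * 3600
MONTH = 365.25 * 24 * 3600 / 12  # average month length (assumption)

print(as_minutes_seconds(allowed_downtime_seconds(99.9, WEEK)))    # 10 min 5 s
print(as_minutes_seconds(allowed_downtime_seconds(99.9, MONTH)))   # 43 min 50 s
print(as_minutes_seconds(allowed_downtime_seconds(99.99, MONTH)))  # 4 min 23 s
```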

Different services will have different SLOs, and the SLO should guide your behavior. For example, if your customer can only tolerate 4 minutes and 23 seconds of unavailability per month (or 99.99 percent availability), then when you roll out a change, you will only roll it out to 10 percent of the systems in the cluster. Leave it running for a few hours, and then roll it out to an additional 10 percent, and so on. In other words, you will be very conservative in your deployments. But, if a service is not mission critical and you have an SLO with only 99 percent availability, then you can afford to be less controlled and less conservative in your deployment. It is important to note that “availability” can be multifaceted, but SLOs should be measurable, easily understandable and meaningful. The goal of an SLO is to guide behavior and to put guards on action.
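A staged rollout like the one described above can be sketched as a generator that hands out the fleet in 10 percent batches. This is a minimal illustration, not a deployment tool: the host names are made up, and the actual deploy and wait steps are left as a print statement.

```python
from typing import Iterator, List, Sequence


def rollout_batches(hosts: Sequence[str], fraction: float = 0.10) -> Iterator[List[str]]:
    """Yield successive batches, each covering roughly `fraction` of the fleet."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Always deploy to at least one host per step, even for tiny fleets.
    batch_size = max(1, int(len(hosts) * fraction))
    for start in range(0, len(hosts), batch_size):
        yield list(hosts[start:start + batch_size])


# Hypothetical 20-host fleet.
fleet = [f"web{i:02d}" for i in range(20)]
for step, batch in enumerate(rollout_batches(fleet), start=1):
    print(f"step {step}: deploy to {batch}, then wait and watch the SLIs")
```

For a service with a looser SLO, a larger `fraction` (say 0.5) gives the less conservative rollout mentioned above.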

The eighth principle is to use SLO budgets as your launch criteria. The best way to ensure stability of a system is not to introduce any change into the system. Of course, we want to constantly add features to software, and usage growth demands that we continually upgrade the cluster. But, your SLOs should guide you with respect to how much change to introduce and on what schedule. The idea of a “budget” is similar to the idea of a bank account. One cannot make withdrawals on a bank account that has a zero balance. Likewise, if you have used up the unavailability that your SLO allows, you must stop introducing change. I believe that Google uses a monthly SLO. So, if a service has an availability target of 99.9 percent, then that service has a budget of 43 minutes and 50 seconds of unavailability per month. So, feel free to launch new features as long as you have the budget for it. However, when you approach your budget in a given month, you must curtail adding new features and introducing change until your budget is replenished. By having an SLO budget and allowing it to dictate your behavior, you are ensuring quality and maintaining a high level of customer satisfaction.
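The bank-account analogy maps naturally onto a small class. The following Python sketch is my own illustration of the idea, not Google's implementation; the class name, its fields and the 30-day window are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_percent: float             # availability target, e.g. 99.9
    period_seconds: float          # length of the budget window
    downtime_seconds: float = 0.0  # unavailability recorded so far

    @property
    def total(self) -> float:
        """Total unavailability the SLO allows in the window."""
        return self.period_seconds * (100.0 - self.slo_percent) / 100.0

    @property
    def remaining(self) -> float:
        return self.total - self.downtime_seconds

    def record_outage(self, seconds: float) -> None:
        """Make a 'withdrawal' from the budget."""
        self.downtime_seconds += seconds

    def can_launch(self) -> bool:
        """Launches are allowed only while some budget is left."""
        return self.remaining > 0


budget = ErrorBudget(slo_percent=99.9, period_seconds=30 * 24 * 3600)
budget.record_outage(20 * 60)  # a 20-minute outage
print(budget.can_launch())     # True: about 23 minutes of budget remain
budget.record_outage(30 * 60)  # a second, 30-minute outage
print(budget.can_launch())     # False: the budget is overdrawn
```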

The ninth and final principle of site reliability engineering is “practice, practice, practice.” If you do your job correctly, then you should have a quiet system. In fact, if your system is redundant and resilient, your troubleshooting skills can get rusty and operational readiness can diminish. Netflix introduced a “Chaos Monkey” into their system, not only to test for redundancy and resiliency, but to improve operational readiness. At Google, one of the most popular SRE events is called the “Wheel of Misfortune.” The game starts with a pie chart that shows a frequency distribution of the outages that they have seen in the last month or two. Then the engineers role-play an outage from the pie chart. One engineer is selected as the on-call engineer, while another describes an outage scenario. As the two engineers do a dry run of an outage, the other engineers take notes, and there is a mini postmortem at the end. The overall goal is to cut the amount of time to resolve issues, and practice can help to dramatically reduce times to resolution.
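If you want to try the game yourself, picking a scenario from the pie chart amounts to a weighted random choice. Here is a minimal Python sketch; the outage categories and counts are made up, and I am assuming that a scenario should come up in proportion to how often it actually occurred.

```python
import random
from typing import Dict, Optional


def spin_wheel(outage_counts: Dict[str, int],
               rng: Optional[random.Random] = None) -> str:
    """Pick an outage scenario with probability proportional to its frequency."""
    if not outage_counts:
        raise ValueError("need at least one outage category")
    rng = rng or random.Random()
    categories = list(outage_counts)
    weights = [outage_counts[c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]


# Hypothetical outage history for the last two months.
recent_outages = {"bad config push": 5, "disk full": 3, "expired certificate": 1}
print("Today's scenario:", spin_wheel(recent_outages))
```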

To review, these are the nine principles of site reliability engineering:
1. Hire great coders and let them leave if they want to leave.
2. Hire your SREs and your developers from the same staffing pool and treat them all as developers.
3. About 5 percent of the ops work goes to the dev team, plus all overflow.
4. Cap the SRE operational load at 50 percent (usually 30 percent).
5. An on-call team has a minimum of 8 engineers for one location (or 6 engineers in each of two locations).
6. Postmortems are blameless and focus on process and technology.
7. Have a written Service Level Objective (SLO) for each service and measure performance against it.
8. Use SLO budgets as your launch criteria.
9. Practice and make it fun.

These nine principles of site reliability engineering are not my own. I got them from Ben Treynor’s keynote address at SREcon 2014. These principles have been developed at Google and tested over time. I hope to make use of these principles in my own work and to inform my future deliberations on the role of DevOps.

Nginx for Drupal and for Redis

Recently, I compiled nginx on CentOS 6 for a Drupal 7 installation and for a Redis installation. So, I thought that I would share my steps.

Whenever I compile nginx, I always use the nginx_syslog_patch to enable syslog logging. The only module that is required for Drupal 7 is nginx-upload-progress-module. Here is my configuration for Drupal 7.

First, I added two repositories to get the latest PHP binaries.
wget http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
rpm -Uhv epel-release-6-8.noarch.rpm
wget http://rpms.famillecollet.com/enterprise/remi-release-6.rpm
rpm -Uhv remi-release-6.rpm
cd /etc/yum.repos.d/
Enable the remi repository.

yum install git vim openssl php php-fpm php-common php-pear php-pdo php-mysql php-gd php-mbstring php-mcrypt mysql gcc pcre pcre-devel zlib zlib-devel openssl-devel make patch readline-devel

Second, get the modules:
git clone https://github.com/yaoweibin/nginx_syslog_patch
git clone https://github.com/masterzen/nginx-upload-progress-module.git

Download the latest stable release of the nginx source code.
wget http://nginx.org/download/nginx-1.2.7.tar.gz
tar -zxvf nginx-1.2.7.tar.gz
cd nginx-1.2.7
patch -p1 < /root/nginx_syslog_patch/syslog_1.2.7.patch

After nginx has been patched, you can compile and install.
./configure \
  --add-module=/root/nginx_syslog_patch \
  --add-module=/root/nginx-upload-progress-module \
  --prefix=/etc/nginx \
  --sbin-path=/usr/sbin/nginx \
  --conf-path=/etc/nginx/nginx.conf \
  --error-log-path=/var/log/nginx/error.log \
  --http-log-path=/var/log/nginx/access.log \
  --pid-path=/var/run/nginx.pid \
  --lock-path=/var/run/nginx.lock \
  --http-client-body-temp-path=/var/cache/nginx/client_temp \
  --http-proxy-temp-path=/var/cache/nginx/proxy_temp \
  --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp \
  --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp \
  --http-scgi-temp-path=/var/cache/nginx/scgi_temp \
  --user=nginx \
  --group=nginx \
  --with-http_ssl_module \
  --with-http_realip_module \
  --with-http_addition_module \
  --with-http_sub_module \
  --with-http_dav_module \
  --with-http_flv_module \
  --with-http_mp4_module \
  --with-http_gzip_static_module \
  --with-http_random_index_module \
  --with-http_secure_link_module \
  --with-http_stub_status_module \
  --with-mail \
  --with-mail_ssl_module \
  --with-file-aio \
  --with-ipv6 \
  --with-cc-opt='-O2 -g'
make
make install

Next, you can get the latest stable release of Drupal:
wget http://ftp.drupal.org/files/projects/drupal-7.21.tar.gz

You will need to configure php-fpm and nginx. I recommend starting with the following configuration bundle.
https://github.com/perusio/drupal-with-nginx
This will enable you to get up and running with Drupal in a short period of time.

I also decided to configure nginx for connecting to Redis. For me, the best module for connecting to Redis is lua-resty-redis. I recommend it. Download OpenResty from http://openresty.org/ and compile it. OpenResty is nginx with a bunch of nginx modules bundled in. After compiling OpenResty, you will have an nginx binary with many nice Lua modules installed. Below is my OpenResty configuration.
wget http://openresty.org/download/ngx_openresty-1.2.6.6.tar.gz
tar -xzvf ngx_openresty-1.2.6.6.tar.gz
cd ngx_openresty-1.2.6.6
./configure \
  --add-module=/root/nginx_syslog_patch \
  --add-module=/root/nginx-upload-progress-module \
  --without-lua_resty_memcached \
  --without-lua_resty_mysql \
  --with-luajit \
  --prefix=/etc/nginx \
  --sbin-path=/usr/sbin/nginx \
  --conf-path=/etc/nginx/nginx.conf \
  --error-log-path=/var/log/nginx/error.log \
  --http-log-path=/var/log/nginx/access.log \
  --pid-path=/var/run/nginx.pid \
  --lock-path=/var/run/nginx.lock \
  --user=apache \
  --group=apache \
  --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp \
  --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp \
  --http-scgi-temp-path=/var/cache/nginx/scgi_temp \
  --without-http_memcached_module \
  --without-http_auth_basic_module \
  --with-http_gzip_static_module \
  --with-http_random_index_module \
  --with-http_secure_link_module \
  --with-http_stub_status_module \
  --with-http_realip_module \
  --with-http_addition_module \
  --with-http_sub_module \
  --with-http_flv_module \
  --with-cc-opt='-O2 -g'

Some good resources for nginx and Lua are as follows:
https://github.com/agentzh/lua-resty-redis
http://wiki.nginx.org/HttpLuaModule
http://wiki.nginx.org/NginxHttpCoreModule
http://wiki.nginx.org/CommandLine
http://www.lua.org/manual/5.1/manual.html
http://www.kyne.com.au/~mark/software/lua-cjson-manual.html

Zend Framework 1.12.1 includes protection from cross-site request forgery

Recently, I upgraded from ZF 1.11.9 to 1.12.1, and I discovered that requests from a friendly MS IIS web server were not working. I am using ZF as an API backend and MS IIS as a friendly frontend. The problem was that ZF refused to honor the REDIRECT_URL and overrode REDIRECT_URL with HTTP_X_ORIGINAL_URL. The simple fix was to place the following code in the ZF bootstrap.

$_SERVER['HTTP_X_ORIGINAL_URL'] = $_SERVER['REDIRECT_URL'];

Red Hat reported a problem with XSS flaws (see https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2012-4451), and the problem was fixed in ZF 2 and ZF 1. However, it created a problem for me.

Kernel tuning for the TCP stack

Below are some kernel tweaks that I use for CentOS 6.2 with a 10 Gbit NIC.


#Lower syn retry rates, default is 5
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 3

# Tune IPv6
net.ipv6.conf.default.router_solicitations = 0
net.ipv6.conf.default.accept_ra_rtr_pref = 0
net.ipv6.conf.default.accept_ra_pinfo = 0
net.ipv6.conf.default.accept_ra_defrtr = 0
net.ipv6.conf.default.autoconf = 0
net.ipv6.conf.default.dad_transmits = 0
net.ipv6.conf.default.max_addresses = 1

# Increase TCP max buffer size settable using setsockopt()
# default 4096 87380 4194304
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_wmem = 4096 87380 8388608

# Increase Linux auto tuning TCP buffer limits
# min, default, and max number of bytes to use
# set max to at least 4MB, or higher if you use very high Bandwidth-delay product (BDP) paths
# Tcp Windows etc
# default 131071
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
# default 124928
net.core.rmem_default = 524287
net.core.wmem_default = 524287

# default net.core.netdev_max_backlog = 1000, set to 30000 for 10Gbit NICs
net.core.netdev_max_backlog = 32768
#
# default net.core.somaxconn = 128
net.core.somaxconn = 4096

# You might also try the following:
# default net.ipv4.tcp_max_syn_backlog = 1024
net.ipv4.tcp_max_syn_backlog = 4096

# metrics slows us down :)
net.ipv4.tcp_no_metrics_save = 1

# default net.ipv4.ip_local_port_range = "32768 61000"
net.ipv4.ip_local_port_range = 1025 65535

# default tcp_fin_timeout = 60
net.ipv4.tcp_fin_timeout = 30

# default net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_intvl = 30

# default net.ipv4.tcp_max_tw_buckets = 180000
net.ipv4.tcp_max_tw_buckets = 1440000

# default net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_time = 400

# default net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_probes = 5

# default vm.swappiness = 60
vm.swappiness = 20

# ipcs -l
# default max seg size (kbytes) = 32768
kernel.shmmax = 500000000
# default max total shared memory (kbytes) = 8388608
kernel.shmall = 4000000000
# default max queues system wide = 1024
kernel.msgmni = 2048

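# Note: tcp_tw_recycle is known to break clients behind NAT and was removed in Linux 4.12
net.ipv4.tcp_tw_recycle = 1
net.ipv4.conf.all.arp_filter = 1
# Note: this overrides the netdev_max_backlog = 32768 set earlier; the last assignment in the file wins
net.core.netdev_max_backlog = 10000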
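
The comment above about bandwidth-delay product (BDP) paths deserves a quick worked example. BDP is the link bandwidth multiplied by the round-trip time, and it tells you how many bytes must be in flight to keep the pipe full. The Python sketch below is my own illustration; the round-trip times are example values, not measurements from my network.

```python
def bdp_bytes(bandwidth_bits_per_sec: float, rtt_seconds: float) -> int:
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return round(bandwidth_bits_per_sec * rtt_seconds / 8)


TEN_GBIT = 10 * 10**9      # 10 Gbit/s link
MAX_BUFFER = 8388608       # the tcp_rmem/tcp_wmem max used in the settings above

# Example round-trip times in milliseconds (assumptions, not measurements).
for rtt_ms in (0.5, 1, 5, 50):
    size = bdp_bytes(TEN_GBIT, rtt_ms / 1000)
    verdict = "fits within" if size <= MAX_BUFFER else "exceeds"
    print(f"RTT {rtt_ms} ms -> BDP {size} bytes ({verdict} the {MAX_BUFFER}-byte max)")
```

On a LAN with sub-millisecond round trips, the 8 MB maximum is plenty; on a long-haul path, a single connection would need a larger maximum to fill a 10 Gbit link.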

Also set your per user limits higher in /etc/security/limits.conf:
* soft nofile 65536
* hard nofile 65536
* soft nproc 131072
* hard nproc 131072

How to use Pageant and Putty from Windows

Here is a terse HowTo for using putty to ssh into a remote server without using a password.

1. Download the Putty installer from the Putty download page. Make sure to grab the Windows “Installer”.
2. Install Putty.
3. Start PuttyGen from Start -> Putty -> PuttyGen.
4. Generate a new key and save it as a .ppk file without a passphrase.
5. Use Putty to log in to the server you want to connect to, and append the public key text from PuttyGen to ~/.ssh/authorized_keys. Tip: Copy and paste from the PuttyGen console.
6. Create a shortcut to your .ppk file in Start -> Startup.
7. Select the .ppk shortcut from the Startup menu to start Pageant (this will happen automatically at every startup).
8. See the Pageant icon in the system tray? Right-click it and select “New session”.
9. Enter username@hostname in the “Host name” field.
You will now log in automatically.

If your key is not accepted, check your file permissions. SSH is very sensitive to directory and file permissions.
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
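
If you would rather check the permissions before changing them, a small Python sketch run on the server can report what is wrong. The function names are my own, and the script only checks the two paths and modes mentioned above.

```python
import os
import stat


def has_mode(path: str, expected: int) -> bool:
    """True if the permission bits of `path` equal `expected` (e.g. 0o700)."""
    return stat.S_IMODE(os.stat(path).st_mode) == expected


def check_ssh_permissions(home: str) -> list:
    """Return a list of human-readable problems; an empty list means all is well."""
    problems = []
    checks = [
        (os.path.join(home, ".ssh"), 0o700),
        (os.path.join(home, ".ssh", "authorized_keys"), 0o600),
    ]
    for path, mode in checks:
        if not os.path.exists(path):
            problems.append(f"{path} is missing")
        elif not has_mode(path, mode):
            problems.append(f"{path} should have mode {mode:o}")
    return problems


for problem in check_ssh_permissions(os.path.expanduser("~")):
    print(problem)
```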

Did you get caught by the Leap-second bug?

I happened to be one of the many thousands of systems engineers who got a call at 9:00 p.m. on Saturday, June 30. All of our Linux servers with Java applications were under heavy load. Restarting the Java applications did nothing to reduce the system load. After several hours of troubleshooting, we finally decided to reboot the servers. Much to my surprise, this fixed the problem.

Some have reported that this was not a Java-related problem; rather, it was a Linux problem. See: http://www.pcworld.com/businesscenter/article/258688/linux_is_culprit_in_leapsecond_lapses_cassandra_exec.html

Well, Linux servers with Apache, MySQL, Erlang, Python, PHP, C or C++ applications did not have any problems. It was only our Java applications that had problems (e.g., Neo4j, HBase, Kafka and so on). The following is what I discovered in my troubleshooting.

Java stopped closing threads efficiently, and threads remained open in a “wait” or a “sleep” state for a long time. As a result, Java was using the maximum amount of memory allowed, and garbage collection was not working efficiently. All of our Java applications were affected. None of our other applications were affected.

Java applications were the culprit for loading the systems. I agree that the only thing that fixed the problem was resetting the Linux clock (or timer). So, the problem was certainly related to the way Linux updated its clocks during the leap second. However, Oracle needs to step up and take some responsibility. The problem may have started with the Linux clock, but the result was that Java freaked out.

I should have titled this blog entry: JAVA Is Culprit in Leap-second Lapses.

Installing ZeroMQ with node.js

Recently, I installed zeromq with node.js and zeromq.node. I had some difficulties, so I thought that I would write a brief how-to.

ZeroMQ installed without much difficulty. On Ubuntu, install the following prerequisite.
apt-get install uuid-dev
On CentOS, install:
yum install libuuid-devel uuid-devel

Then, unpack the tarball and execute: ./configure, then make
As the super user, execute: make install
This will install ZeroMQ under /usr/local (the library goes in /usr/local/lib).

Download and install the latest stable version of node.js. Just unpack the tarball and execute: ./configure, then make
As the super user, execute: make install

The node.js library for 0mq is zeromq.node, and it does not work with the latest development branch of node.js. Rather, you need to install a stable branch of node.js.

Before you install zeromq.node do the following.
1. Make sure that /usr/local/bin is in your path, since that is where node lives.
2. Create the following file: /etc/ld.so.conf.d/zmq.conf
The content of this file is a single line: ‘/usr/local/lib’.
3. Execute the command ‘ldconfig’.
4. export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
5. npm install zmq

If you install zeromq.node (now called ‘zmq’) globally as the super user (with ‘npm install -g zmq’), node.js will likely not be able to find it (in /usr/local/lib/node_modules). So, install it locally as above in your $HOME/node_modules directory.

Now, you should have a working version of zmq with node.js. You can test your installation using the example scripts (e.g., producer.js) on Justin’s github: https://github.com/JustinTulloss/zeromq.node

Resizing your root file system on Ubuntu with the logical volume manager (LVM)

I recently added some disk space to my root file system on a live, mounted file system using Ubuntu 10.10. I was using a virtual machine, so I changed the size of my disk drive in the virtual machine manager. Then, I logged into Ubuntu as root and executed the following commands.

Partitioned the new drive space with fdisk.
fdisk /dev/hda
In fdisk, I entered “n” (new partition), “p” (primary), “3” (partition number), selected all remaining space on the drive, “t” to change the partition type to “8e” (Linux LVM), and “w” to write the changes. This created a new partition, /dev/hda3.

  • partprobe
  • pvcreate /dev/hda3
  • vgextend POC-Ubuntu00 /dev/hda3
  • lvextend /dev/POC-Ubuntu00/root /dev/hda3
  • resize2fs /dev/POC-Ubuntu00/root

The volume group is “POC-Ubuntu00”. The logical volume name is “/dev/POC-Ubuntu00/root”.

Some useful commands: pvdisplay, vgdisplay, lvdisplay.

See:
http://tldp.org/HOWTO/LVM-HOWTO/extendlv.html

MongoDB versus Riak Datastore: Some Benchmarks

I was doing some Google searches, and I found some benchmarks of MongoDB and Riak.

Mr. Howe benchmarked MongoDB. He was able to get 1,750 GET requests per second from Mongo, when the keys could be stored in memory.
http://www.colinhowe.co.uk/2011/02/23/mongodb-performance-for-data-bigger-than-memor/

In contrast, the folks at Joyeur were able to get 6,650 GET requests per second on a five-node Riak cluster and 13,700 GET requests per second on a 10-node cluster. These are “highly optimized” Riak clusters, using the “protocol buffers” interface.
http://joyeur.com/2010/10/31/riak-smartmachine-benchmark-the-technical-details/

In reality, most people do not run “highly optimized” Riak clusters with “protocol buffers”; rather, they run non-optimized Riak clusters over HTTP. I did a quick and dirty informal test of my Riak cluster (Erlang: R13B04, Riak: 0.14.2). I have a three-node cluster, and I was testing from a remote server over an internal network, using node.js and the riak-js client. I have about 200,000 records in Riak, which is well within the memory capacity of the cluster. In my informal tests, I got about 250 GET requests per second.
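
For anyone who wants to repeat this kind of informal measurement, here is a minimal Python sketch of a sequential, single-client timing loop. It is not the node.js script I used; the function names are my own, and `fake_fetch` is a stand-in so that the sketch runs without a cluster. To test a real datastore, you would replace it with a function that performs one GET against your own server.

```python
import time
from typing import Callable, Iterable


def requests_per_second(fetch: Callable[[str], object], keys: Iterable[str]) -> float:
    """Call fetch(key) for every key, one at a time, and return the observed rate."""
    count = 0
    start = time.perf_counter()
    for key in keys:
        fetch(key)
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


def fake_fetch(key: str) -> str:
    """Stand-in for a real GET request; sleeps 1 ms to simulate latency."""
    time.sleep(0.001)
    return key


rate = requests_per_second(fake_fetch, (str(i) for i in range(100)))
print(f"{rate:.0f} GET requests per second")
```

Note that a sequential loop measures latency-bound throughput from a single client; a client that issues requests concurrently will report higher numbers against the same cluster.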

One interesting thing to note is that Basho, the company that supports Riak, admits that MongoDB is “more performant.” The entire quote: “Mongo is more performant because it uses memory-mapped files. The tradeoff is that it fsyncs (flushes in-memory data to disk) only every 60 seconds by default, so you run the risk of losing data if your MongoDB server goes down. The solution for increasing durability in MongoDB is to replicate.” See http://wiki.basho.com/Riak-Compared-to-MongoDB.html

The advantage of Riak over Mongo is that Riak automatically replicates and rebalances.

The advantage of MongoDB over Riak is that Mongo supports secondary indexes and a more robust query language.

Both Riak and MongoDB support MapReduce via JavaScript, and both use the SpiderMonkey JavaScript engine. However, Riak’s MapReduce framework is more powerful than MongoDB’s framework because Riak allows you to run MapReduce jobs on a filtered set of keys. By contrast, in Mongo, you have to run MapReduce jobs across an entire database.