# Benchmarking

# Installation

Bootstrapper (opens new window) includes a standardized release to evaluate build performance across hosting providers.

curl https://raw.githubusercontent.com/apisnetworks/apnscp-bootstrapper/master/bootstrap.sh | env RELEASE=benchmark bash

Check back in ~2 hours once installation completes, then run the following command:

IFS=$'\n' ; DATES=($((tail -n 1 /root/apnscp-bootstrapper.log | grep failed=0 ; grep -m 1 -P '^\d{4}-.*[u|p]=root' /root/apnscp-bootstrapper.log ) | awk '{print $1, $2}')) ; [[ ${#DATES[@]} -eq 2 ]] && python -c 'from datetime import datetime; import sys; format="%Y-%m-%d %H:%M:%S,%f";print datetime.strptime(sys.argv[1], format)-datetime.strptime(sys.argv[2], format)' "${DATES[0]}" "${DATES[1]}" || (echo -e "\n>>> Unable to verify Bootstrapper completed - is Ansible still running or did it fail? Last 10 lines follow" && tail -n 10 /root/apnscp-bootstrapper.log)

A duration will appear or the last 10 lines of /root/apnscp-bootstrapper.log if it failed. This duration tests network/IO/CPU.

A second test of backend performance once apnscp is setup gives the baseline performance between frontend/backend communication to a single vCPU. This can be tested as follows.

First update the shell with helpers from .bashrc,

exec $SHELL -i
# Then run the cpcmd helper
cpcmd scope:set cp.debug true  ; sleep 5 ; cpcmd test_backend_performance ; cpcmd scope:set cp.debug false

Enabling debug mode opens up the test module for instrumentation utilities.

# Converting to production

A server provisioned using the benchmark branch can be converted to a normal build without resetting the server. Use cpcmd to set any apnscp-vars.yml (opens new window) value; use the Customization Utility (opens new window) on ApisCP as cross-reference.

# Launch new bash shell with apnscp helper functions
exec $SHELL -i
cd /usr/local/apnscp
# Save remote URL, should be gitlab.com/apisnetworks/apnscp.git
REMOTE="$(git config --get remote.origin.url)"
git remote remove origin
git remote add -f -t master origin "$REMOTE"
git reset --hard origin/master
cpcmd scope:set cp.bootstrapper populate_filesystem_template true
# Set any other Bootstrapper values from apnscp-vars.yml...
# cpcmd config_set apnscp.bootstrapper varname varval
upcp -sb
# After Bootstrapper completes - it will take 5-30 minutes to do so
cpcmd scope:set cp.bootstrapper populate_filesystem_template auto
cpcmd auth:change-password newadminpassword
cpcmd common:set-email your@email.address

populate_filesystem_template must be enabled to update any packages that have been added/removed in ApisCP. Once everything is done, access ApisCP's interface (opens new window) to get started.

Bootstrapper project on GitHub contains a reference for provider performance (opens new window). Not all providers are created equal and often times a "CPU" on VPS is shared n-ways sometimes with very noisy neighbors.

# HTTP

Before evaluating HTTP performance, it is necessary to disable Evasive's HTTP request protection. Disabling HTTP/1.0 protection may also be necessary depending upon the benchmark utility, such as ab, which sends requests using HTTP/1.0.

By default, a 2 minute delay is imposed before applying configuration changes to the HTTP server. This may be changed by adjusting [httpd] => reload_delay (see Apache.md). For example, a value of "now" disables any reload delay, but may expose a server to unintended DoS attacks during bulk account edits.

cpcmd scope:set apache.evasive enabled false
cpcmd scope:set apache.block10 false
# Now benchmark a site
ab -n 1000 -c $(nproc) http://mydomain.com/

# Baseline benchmark

Before diving into complex benchmarks, start with a simple benchmark: static resources (index.html) and PHP (index.php). These represent the maximal throughput of resources and PHP without further tweaks. For most use cases this is adequate without optimizing .htaccess parsing that may introduce appreciable overhead.

DeleteDomain benchmark.test 2>/dev/null || AddDomain -c siteinfo,domain=benchmark.test -c crontab,enabled=1 -c crontab,permit=1 -c ssh,enabled=1 -c dns,provider=null -c mail,provider=null -c siteinfo,admin_user=benchmark-user
grep -Eq "benchmark.test\b" /etc/hosts || (echo "$(cpcmd -d benchmark.test site:ip-address) benchmark.test" >> /etc/hosts)

cpcmd scope:set apache.evasive enabled false
cpcmd scope:set apache.block10 false
sleep 120

Separate requests into static without caching and PHP without overhead parsing.

echo "hello" > /home/virtual/benchmark.test/var/www/html/index.html
ab -k -n 1000 -c $(nproc) http://benchmark.test/index.html

Likewise a simple PHP script:

echo "<?php Hello from PHP" > /home/virtual/benchmark.test/var/www/html/index.php
ab -k -n 1000 -c $(nproc) http://benchmark.test/index.php

These benchmarks represent the highest feasible throughput on an idle system. Advanced benchmarks follow, including relocating per-request .htaccess parsing in memory.

Static vs Zend VM

Static will be faster than PHP because it avoids Zend VM (opens new window) overhead to render dynamic content. Static files are memory-mapped segments (opens new window) without additional overhead beyond memory costs.

It's a deep dive for those curious.

# Fast WordPress benchmark

Creating a test account to benchmark WordPress is simple with a few CLI commands. Once your done benchmarking, run DeleteDomain benchmark.test to remove it (or keep it around for a rainy day).

DeleteDomain benchmark.test 2>/dev/null || AddDomain -c siteinfo,domain=benchmark.test -c crontab,enabled=1 -c crontab,permit=1 -c ssh,enabled=1 -c dns,provider=null -c mail,provider=null -c siteinfo,admin_user=benchmark-user
grep -Eq "benchmark.test\b" /etc/hosts || (echo "$(cpcmd -d benchmark.test site:ip-address) benchmark.test" >> /etc/hosts)
cpcmd -d benchmark.test wordpress:install benchmark.test
cpcmd scope:set apache.evasive enabled false
cpcmd scope:set apache.block10 false
sleep 120
ab -k -n 1000 -c $(nproc) http://benchmark.test/
DETAILS

Run the following commands to create a new domain named "benchmark.test". DNS and email will be disabled for the domain. Install WordPress, disable Evasive and HTTP/1.0 protection on the account. Sleep for 2 minutes for [httpd] => reload_delay to expire (at -l shows pending jobs), then run 1000 requests in serial against the domain.

nproc runs up to n parallel workers equal to the number of CPU cores available on the machine. See concurrency section below for rationale.

TIP

Overriding your hosts file (opens new window) would allow you to access the WordPress administrative portal as if it were a real, resolvable domain. Use the IP from cpcmd -d benchmark.test site:ip-address.

# Extending WordPress

Let's take this one step further, configuring a WP Redis object cache. Use redis:create to create a new Redis instance for the account, then wp-cli to install a Redis cache plugin.

# Create a new Redis instance for benchmark.test named "wp-test" listening on /tmp/redis.sock
cpcmd -d benchmark.test crontab:toggle-status 1
# PHP-FPM uses a private /tmp mount, expose Redis to shared location
# See PHP-FPM.md#privatetmp for more information
cpcmd -d benchmark.test redis:create wp-test '[unixsocket:/var/www/redis.sock,unixsocketperm:0660]'
# Switch to benchmark.test account to configure plugin
su benchmark.test
cd /var/www/html
# Install Redis object cache plugin
wp-cli plugin install --activate redis-cache
# Define Redis path
wp-cli config set WP_REDIS_PATH /var/www/redis.sock
wp-cli config set WP_REDIS_SCHEME unix

Redis configuration

Activate Redis cache and you're set!

wp-cli redis enable
# Verify it is running
wp-cli redis status

Bug in 2.0.12

wp-cli redis status will throw an error in 2.0.12. Add the following to wp-content/plugins/redis-cache/includes/ui/diagnostics.php after global $wp_object_cache;:

$roc = \Rhubarb\RedisCache\Plugin::instance();

Exit out of the subshell, run as an unrestricted user, then verify data is cached:

exit
ab -n 10 -c 4 http://benchmark.test/
echo "KEYS *" | redis-cli -s /home/virtual/benchmark.test/tmp/redis.sock

# Output cache

5 ms may be fast, but what if we want to make WordPress faster? A simple solution is to reuse previously rendered output. We can easily accomplish this using Apache's builtin cache. Moreover, we can serve optimized content that has passed through Google PageSpeed (opens new window).

Any caching plugin will work adequately for this task. We'll use W3TC as it provides additional WP-CLI commands. In continuation from the above Redis example, let's enable caching in Apache, then switch back to benchmark.test to install/configure W3TC:.

# Enable memory-backed caching
cpcmd scope:set apache.cache memory
# Require sites to opt-in with "UnsetEnv no-cache"
cpcmd scope:set apache.cachetype explicit

Now switch to benchmark.test account,

su benchmark.test
cd /var/www/html
wp-cli plugin install --activate w3-total-cache
wp-cli w3-total-cache fix_environment
wp-cli w3-total-cache option set pgcache.enabled true --type=boolean

Exit back out and reapply Fortification so wp-content/cache permissions are corrected as needed. Note if Fortification was set to min, this operation is unnecessary.

cpcmd -d benchmark.test wordpress:fortify benchmark.test

Then run ab against the site. Note it's expected to encounter some transient response size anomalies while PageSpeed optimizes the content. From a simple testing dual-core server, page throughput jumped from 116 req/second to 3782 req/second.

ab  -k -n 1000 -c 4 http://benchmark.test/

# Trimming .htaccess

One last thing we can do is put the .htaccess on a diet. Remove all of the superfluous AddType directives. These are negotiated automatically by TypesConfig (opens new window) in the server configuration. Remember, for each page request Apache must enumerate all rules in .htaccess. Shaving 250 lines can greatly improve throughput!

After removing the unnecessary directives, .htaccess shrunk by 32.8% (8960 bytes to 6014 bytes). Page throughput likewise improved to 4224.36 req/second, a gain of 11.6% just by removing superfluous directives.

.htaccess size matters
#### Removing .htaccess

Apache's biggest strength is too its biggest weakness: flexibility. Because users may override settings in .htaccess at their discretion, Apache must backtrack and check all previous directories before reaching a verdict. With SSD and NVMe SSD, the overhead of these stat() checks is greatly ameliorated, but we can achieve higher throughput in the name of benchmarks.

Taking the .htaccess one step further, let's remove it from the equation entirely and convert it to startup directives saved in memory whenever Apache boots much like NGINX. Assume that get_site_id benchmark.test returns "130" in the following example.

Copy your .htaccess from /home/virtual/benchmark.test/var/www/html/ into /etc/httpd/conf/site130/. Next, we'll change the dispatcher location and disable overrides.

mv /home/virtual/benchmark.test/var/www/html/.htaccess /etc/httpd/conf/site130/wp-test.raw

Edit wp-test surrounding the dispatcher rules in a <Directory>... </Directory> clause adding AllowOverride none and UnsetEnv no-cache inside the clause as depicted in the screenshot.

.raw suffix

When compiling modular Apache directives into a monolithic file, all paths are translated to their direct shadow layer. Append .raw to the file to prevent this translation.

CacheQuickHandler

Optionally, add CacheQuickHandler on outside <Directory>...</Directory> to bypass additional axis processing. This will further improve processing times to the values arrived at in this article at the expense of brute-force protection. CacheQuickHandler usage blocks the effects of mod_evasive, but static content has nothing to interact with. Regardless, use at your own risk.

Rule promotion in Apache

Run htrebuild, then benchmark. Performance skyrocketed from 4224 req/second to a blistering 15271 req/second!

We've switched from 4 to 8 core concurrency to ensure proper saturation. At -c4 output is ~13583 req/second. At -c8 output saturates at 15271 req/second.

# Optimizing render

ApisCP ships with PageSpeed to transparently rewrite and cache HTML. 70 microseconds is quite fast. The difference between 70 and 7 microseconds (14k and 140k pages per second) isn't appreciable to cross the just noticeable difference threshold, but what is is rendering performance that can be orders of magnitude higher. Apache has excellent integration with PageSpeed which can be combined with its cache to save rendered, optimized copies.

Edit /etc/httpd/conf/site130/wp-test as noted in the above example and add ModPagespeedRewriteLevel CoreFilters after UnsetEnv no-cache. Next, run htrebuild to rebuild the configuration.

Next, install the OceanWP (opens new window) theme for a cursory evaluation.

su benchmark.test
cd /var/www/html
wp-cli theme install --activate oceanwp
# Flush PageSpeed cache
curl -X PURGE http://benchmark.test
# Flush disk cache
wp-cli w3-total-cache flush all
sleep 5

It may take a few moments for PageSpeed to rebuild the rendered content. For comparison, the following two blocks are before and after PageSpeed's optimizer has rewritten output using Safe optimizations in Web > Site Optimizer (CoreFilters setting).

OceanWP unoptimized

OceanWP optimized

Next, we can use Lighthouse to evaluate performance before and after.

Lighthouse OceanWP unoptimized

Lighthouse OceanWP optimized

The difference is staggering: compared to decreasing page transmission time by 65 microseconds, we've decreased rendering time by 10,000 microseconds - a 153x improvement - using a single action in the control panel, enabling PageSpeed.

In 2017, Akamai found a 100 ms delay (opens new window) in website load time can decrease conversion rates by 7%. Improving page transmission by 65 microseconds has less of a net effect than improving render times. Moreover, the JND threshold in humans in approximately 50 milliseconds (opens new window) (769x more than page transmission gains). Page throughput is important, but it's not everything.

Always focus optimizations where the yield can be greatest.
### Concurrency

Benchmarks are designed to model real-world scenarios with artificial, deterministic usage patterns. It's an oxymoron to believe any such correlation exists between benchmarks and typical usage scenarios, but what benchmarks provide is the theoretical peak throughput. It's all downhill from there!

When evaluating the peak throughput do not run more than NPROC+2 instances. Linux has an intelligent scheduling algorithm to interleave parcels of work (threads). If for example a site is handling 5 concurrent requests over 250 ms, the processing is rarely contiguous due to network latency/output buffering. Benchmarking locally removes this barrier.

Let's assume a WordPress site on a two-core machine. Following this logic, benchmark figures should begin to stabilize after 3 concurrent requests. All requests are generated using ab as outlined above. The PHP-FPM pool was reconfigured from ondemand to static and the total worker count (pm.max_children) changed from 3 to 20. The table below summarizes benchmarks with various concurrency (-c) levels.

Concurrency Throughput (req/sec) % Δ Time per req (ms) % Δ
1 83.18 12.022
2 163.01 95.97% 6.135 -48.97%
3 169.12 3.75% 5.913 -3.63%
4 174.86 3.39% 5.719 -3.28%
5 174.61 -0.14% 5.727 0.14%
6 175.53 0.53% 5.697 -0.52%
7 174.73 -0.46% 5.723 0.46%
8 176.24 0.86% 5.674 -0.86%
9 175.65 -0.33% 5.693 0.33%
10 177.18 0.87% 5.644 -0.86%
15 175.89 -0.73% 5.685 0.73%
20 177.87 1.23% 5.622 -1.11%

ApisCP uses NPROC + 2 workers per site. Typically this is sufficient for optimal throughput except in high latency environments. PHP requests operate synchronously, which means the workers is only freed to handle a new request at the conclusion of the previous request. Theoretically, in the above benchmark, PHP could serve ~170 concurrent users per second with 4 PHP-FPM workers assuming a uniform distribution or 14.6 million pageviews per day. Often these figures are much lower in real-world, hampered by network latency both on the request and response.