Poor-guy's CDN (ish)

[toc]
For fun, I set up a couple of Squid proxies in reverse-proxy fashion to see how they performed. Overall, I am happy with the result. A key thought behind the idea was to provide front-end resilience for the resources they publish, and to that end I made what can best be described as a poor-guy's CDN. It's not truly a CDN in the sense of global presence, nor the ability to choose a front-end server closer to the end-user (not least because some of that functionality is patent-encumbered), but it does provide some degree of resilience.

Once you have a reverse-proxy setup (which I am not covering here), the key elements are testing the front-end servers and then updating a dynamic-DNS entry with the results. You may then also feel the need to pre-load your caches. This is all simple to achieve: some shell scripting, a tiny bit of Perl and the BIND `nsupdate` utility.

Testing front-end availability

To probe each front-end reverse-proxy we use the `GET` tool that is installed with the `LWP` package in Perl. This provides a very simple no-bells mechanism to perform an HTTP request. We configure `GET` with a proxy server, that of the reverse-proxy we wish to test, and the URL of a resource that it should succeed in fetching. Since we want to run this often, we choose something small and with minimal CPU impact. You do need to decide whether to fetch a cacheable object or one that will always force a fetch from the origin-server – this depends on how deeply you want to test the system. For this example, we'll just fetch this blog's `favicon.ico`.

#!/bin/sh
target_url="http://blog.flirble.org/favicon.ico"
target_proxies="10.1.0.1 10.2.0.2"
proxy_port=80
final_proxies=""

for proxy in ${target_proxies}; do
        echo "Testing proxy: \"${proxy}\"..."
        GET -t 10 -P -p "http://${proxy}:${proxy_port}/" -d "${target_url}" && \
                final_proxies="${final_proxies} ${proxy}"
done

# Strip leading space
final_proxies=$(echo "${final_proxies}" | sed -e 's/^ //')

echo "Final proxies: \"${final_proxies}\""

Yes, I am using fake IP addresses there.
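As an aside, the `sed` call above can be replaced with POSIX parameter expansion, which does the same strip without spawning extra processes. A minimal sketch:

```shell
#!/bin/sh
# Same effect as: echo "${final_proxies}" | sed -e 's/^ //'
final_proxies=" 10.1.0.1 10.2.0.2"
final_proxies="${final_proxies# }"   # strip one leading space, if present
echo "${final_proxies}"
```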

Fixing LWP for IPv6

Unfortunately, `LWP` doesn’t handle IPv6 very well. There are two workarounds needed to make it work. Firstly, the code that parses HTTP proxy configuration doesn’t understand the URI form `http://[2001:0db8::1]:80/` – anything with square brackets makes it croak with `Bad http proxy specification`. This is a simple regexp fix, per the diff below. Hopefully this will be fixed in CPAN sometime.

*** UserAgent-orig.pm	Tue Oct 19 10:28:43 2010
--- UserAgent.pm	Tue Oct 19 10:15:47 2010
***************
*** 914,920 ****
          my $url = shift;
          if (defined($url) && length($url)) {
              Carp::croak("Proxy must be specified as absolute URI; '$url' is not") unless $url =~ /^$URI::scheme_re:/;
!             Carp::croak("Bad http proxy specification '$url'") if $url =~ /^https?:/ && $url !~ m,^https?://\w,;
          }
          $self->{proxy}{$key} = $url;
          $self->set_my_handler("request_preprepare", \&_need_proxy)
--- 914,920 ----
          my $url = shift;
          if (defined($url) && length($url)) {
              Carp::croak("Proxy must be specified as absolute URI; '$url' is not") unless $url =~ /^$URI::scheme_re:/;
!             Carp::croak("Bad http proxy specification '$url'") if $url =~ /^https?:/ && $url !~ m,^https?://[\w\[],;
          }
          $self->{proxy}{$key} = $url;
          $self->set_my_handler("request_preprepare", \&_need_proxy)

The second problem is more fundamental, but also easier to fix. LWP doesn't know that IPv6 connections use a different module library from IPv4 (IPv6 support in Perl is, in general, approximately broken as a result). Thankfully someone has a workaround for this: the module Net::INET6Glue::INET_is_INET6 (available in Ubuntu/Debian as the package `libnet-inet6glue-perl`), which does some low-level Perl hackery to make the normal socket routines work for IPv6 too.

Armed with a modified UserAgent.pm and this INET6Glue module, we can do this:

perl -MNet::INET6Glue::INET_is_INET6 $(which GET) -t 10 -P \
        -p "http://[${proxy}]:${proxy_port}/" -d "${target_url}"

and achieve the desired result.
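If you'd rather not maintain two nearly identical invocations, the bracketing of IPv6 literals can be isolated in a small helper. This is just a sketch; the `proxy_url` function name is mine, not part of the original script:

```shell
#!/bin/sh
# Build a proxy URL, wrapping IPv6 literals in square brackets as the
# URI syntax requires. Anything containing a ':' is assumed to be an
# IPv6 address.
proxy_url() {
        case "$1" in
        *:*) echo "http://[$1]:$2/" ;;
        *)   echo "http://$1:$2/" ;;
        esac
}

proxy_url 10.1.0.1 80        # http://10.1.0.1:80/
proxy_url 2001:db8::1 80     # http://[2001:db8::1]:80/
```

Both probe loops can then pass the result to `-p`, using the INET6Glue invocation whenever the address contains a colon.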

Dynamic DNS

There are several aspects to updating a DNS record dynamically. You need a DNS server that is the primary authoritative server for the zone, and it must be configured to allow updates to a record, a set of records, or a sub-zone.
Then you can use a client utility to update DNS records with the results of the tests performed above.

DNS server configuration

My setup uses BIND (for better or for worse) and I keep my dynamic records under the zone "dyn.flirble.org". The record for this reverse-proxy setup has the name "ac".

The configuration for this zone looks a bit like this:

key "ac-key" {
    algorithm hmac-md5;
    secret "encryptedkeytexthere==";
};
...
zone "dyn.flirble.org" {
        type master;
        file "dynamic/dyn.flirble.org";
        allow-update {
                key ac-key;
        };
        also-notify {
                1.2.3.5; ...
        };
};

There are other, stronger crypto schemes, but this works for my purposes. You can generate a key simply with `dnssec-keygen` as follows:

$ dnssec-keygen -a HMAC-MD5 -b 128 -n HOST ac-key
Kac-key.+157+53816

$ cat Kac-key.+157+53816.key
ac-key. IN KEY 512 3 157 kmLKD48bOaodPm0vkUyLqQ==

$ 

Delete the two files generated when you’re done. That string `kmL…Q==` is the one you want. It’s random and you can generate it however you like.
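Since the secret really is just random base64 data, you can skip `dnssec-keygen` entirely; 16 random bytes match the 128-bit key generated above. A sketch:

```shell
#!/bin/sh
# Generate a 128-bit TSIG-style secret without dnssec-keygen.
# 16 random bytes encode to exactly 24 base64 characters.
secret=$(head -c 16 /dev/urandom | base64)
echo "${secret}"
```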

Sending in updates

We'll use another BIND tool for this, `nsupdate`. It works by collecting a batch of commands and then sending them as a unit to the name server. As a result, the operation is (probably) atomic, meaning you can simply erase the existing records and add the new set without worrying about a window in which you return no A or AAAA records.

I do this as follows:

ns_ttl=60
ns_server=1.2.3.4
ns_zone=dyn.flirble.org
ns_hostname=ac.${ns_zone}

nu_key="ac-key:encryptedkeytexthere=="
nu_cmd="/usr/bin/nsupdate -v -y ${nu_key}"
...
{
        echo server ${ns_server}
        echo zone ${ns_zone}
        echo update delete ${ns_hostname} A
        echo update delete ${ns_hostname} AAAA
        for proxy in ${final_proxies}; do
                echo update add ${ns_hostname} ${ns_ttl} A ${proxy}
        done
        for proxy in ${final_proxies6}; do
                echo update add ${ns_hostname} ${ns_ttl} AAAA ${proxy}
        done
        echo send
        echo
} | ${nu_cmd}

Putting it all together

Here’s the entire script:

#!/bin/sh

ns_ttl=60
ns_server=1.2.3.4
ns_zone=dyn.flirble.org
ns_hostname=ac.${ns_zone}

nu_key="ac-key:encryptedkeytexthere=="
nu_cmd="/usr/bin/nsupdate -v -y ${nu_key}"

target_url="http://blog.flirble.org/favicon.ico"
target_proxies="10.1.0.1 10.2.0.2"
target_proxies6="2001:0db8::1 2001:0db8::2"
proxy_port=80
final_proxies=""
final_proxies6=""

for proxy in ${target_proxies}; do
        echo "Testing proxy: \"${proxy}\"..."
        GET -t 10 -P -p "http://${proxy}:${proxy_port}/" -d "${target_url}" && \
                final_proxies="${final_proxies} ${proxy}"
done
for proxy in ${target_proxies6}; do
        echo "Testing proxy: \"${proxy}\"..."
        # libnet-inet6glue-perl
        perl -MNet::INET6Glue::INET_is_INET6 $(which GET) -t 10 -P \
                -p "http://[${proxy}]:${proxy_port}/" -d "${target_url}" && \
                final_proxies6="${final_proxies6} ${proxy}"
done

final_proxies=$(echo "${final_proxies}" | sed -e 's/^ //')
final_proxies6=$(echo "${final_proxies6}" | sed -e 's/^ //')
echo "Final proxies: \"${final_proxies}\" \"${final_proxies6}\""

if [ -z "${final_proxies}" ]; then
        # Oh dear. Just point at them all, in case they come back.
        final_proxies="${target_proxies}"
fi
if [ -z "${final_proxies6}" ]; then
        final_proxies6="${target_proxies6}"
fi

{
        echo server ${ns_server}
        echo zone ${ns_zone}
        echo update delete ${ns_hostname} A
        echo update delete ${ns_hostname} AAAA
        for proxy in ${final_proxies}; do
                echo update add ${ns_hostname} ${ns_ttl} A ${proxy}
        done
        for proxy in ${final_proxies6}; do
                echo update add ${ns_hostname} ${ns_ttl} AAAA ${proxy}
        done
        echo send
        echo
} | ${nu_cmd}

Then you just CNAME your resources to this dynamic entry. Presto, site resilience. I run this script every two minutes from `cron`.
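The crontab entry for that looks something like this (the script path here is hypothetical; use wherever you saved yours):

```
*/2 * * * * /usr/local/bin/check-proxies.sh >/dev/null 2>&1
```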

Pre-loading the cache

Finally, I also have another simple script that I use to pre-load the contents of the caches. This is as simple as recursively iterating the page structure and downloading the contents – in this case using `wget`. There are two modes to ensure the cache is loaded: a forced load, which always fetches from the origin server, or a validating load, which fetches from the origin only if the cached object is out of date or not present.

This script specifically takes only the reverse-proxies currently listed as available by the dynamic DNS method above. At the moment it skips IPv6 addresses since they point to the same servers as the IPv4 addresses and would be redundant.

#!/bin/sh

force=no
quiet=yes

args=
while [ ! -z "$1" ]; do
	case "$1" in
	--force|-f)
		force=yes
		;;
	--noforce)
		force=no
		;;
	--quiet|-q)
		quiet=yes
		;;
	--verbose|-v)
		quiet=no
		;;
	--help|-h|-*)
		cat << EOT
Usage: $0 [options] url ...
Options:
	--force   Force cache refresh
	--noforce Only validate cached objects (default)
	--quiet   Be quiet (default)
	--verbose Be noisy

EOT
		exit 1
		;;
	*)
		if [ -z "${args}" ]; then
			args="$1"
		else
			args="${args} $1"
		fi
		;;
	esac
	shift
done
set -- ${args}


if [ -z "$1" ]; then
	echo "You need to give a base url on the command line!"
	exit 1
fi

host=ac.dyn.flirble.org
proxies=$(host ${host}. | sort -u | grep -v IPv6 | awk '/address/{print $4;}')
if [ -z "${proxies}" ]; then
	echo "Can't resolve proxies, ${host} is empty!"
	exit 1
fi

echo Proxies to refresh: ${proxies}

export no_proxy=
export NO_PROXY=
export http_proxy=
export HTTP_PROXY=
export ftp_proxy=
export FTP_PROXY=

cache_opt="--header=Pragma:max-age=0 --header=Cache-Control:max-age=0"
[ "${force}" = yes ] && cache_opt="--no-cache"

quiet_opt=
[ "${quiet}" = yes ] && quiet_opt="--no-verbose --progress=dot"

other_opt="--user-agent=tfo-prefetch --recursive --no-directories --no-parent --delete-after"
for url in $*; do
	for proxy in ${proxies}; do
		export http_proxy=http://${proxy}:80/
		export HTTP_PROXY=${http_proxy}

		tmpdir=/tmp/force-cache-load
		rm -rf "${tmpdir}"
		mkdir -p "${tmpdir}"
		cd "${tmpdir}" || exit 1

		wget ${cache_opt} ${quiet_opt} ${other_opt} "${url}"
	done
done

Other thoughts

And that's my poor-guy's CDN. It does have limitations, of course. It does not replace origin-server resilience: a reverse-proxy, if it follows the RFCs, will frequently re-validate cached objects, and if it cannot reach the origin server it will say so. There are tweaks to overcome this, but they can have side effects. And of course the net bandwidth benefit from using a reverse-proxy only comes from cacheable objects; but since images are generally the larger items and generally static, the benefit should be there.
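For the record, one such tweak in Squid is a `refresh_pattern` with its HTTP-violation options; the pattern and times below are only an illustration, and these options deliberately break HTTP semantics, so use them with care:

```
# Serve matching images from cache for up to a day without revalidating,
# even when the origin's headers say otherwise. Violates HTTP.
refresh_pattern -i \.(gif|png|jpg|jpeg|ico)$ 1440 50% 1440 override-expire ignore-reload
```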

If you're using something like WordPress with one of the caching plugins, it will make pages look like static HTML to unregistered users, and you may get a caching benefit there; note, though, that this means any other plugins have to work well with a static page (server-side functionality is limited on a cached static page!).

It works for me, for now. YMMV.