When your apt-mirror is always downloading

Thursday 24 January 2008 by Bradley M. Kuhn

When I started building our apt-mirror, I ran into a problem: ubuntu.com's servers were throttling the machine, yet I had already completed much of the download (which took weeks, since I was fetching multiple distributions). I really wanted to roll out the solution quickly, particularly because the throttling triggered by the mirroring had made service from the remote servers worse than ever. But with the mirror incomplete, I couldn't simply publish the partial repositories.

The solution was simply to let Apache send users on to the real servers whenever the mirror doesn't have the file. The first order of business is to rewrite and redirect URLs when files aren't found. This is a straightforward Apache configuration:

   RewriteEngine on
   RewriteLogLevel 0
   RewriteCond %{REQUEST_FILENAME} !^/cgi/
   RewriteCond /var/spool/apt-mirror/mirror/archive.ubuntu.com%{REQUEST_FILENAME} !-F
   RewriteCond /var/spool/apt-mirror/mirror/archive.ubuntu.com%{REQUEST_FILENAME} !-d
   RewriteCond %{REQUEST_URI} !(Packages|Sources)\.bz2$
   RewriteCond %{REQUEST_URI} !/index\.[^/]*$ [NC]
   RewriteRule ^(http://%{HTTP_HOST})?/(.*) http://91.189.88.45/$2 [P]
 

Note a few things there:

  • I have to hard-code an IP address because, as I mentioned in the last post on this subject, I've faked out DNS for archive.ubuntu.com and the other sites I'm mirroring. (Note: this has the unfortunate side effect that I can't easily take advantage of round-robin DNS on the other side.)

  • I avoid taking Packages.bz2 from the other site, because apt-mirror actually doesn't mirror the bz2 files (although I've submitted a patch to it so it will eventually).

  • I make sure that index files get built by my Apache and not redirected.

  • I am using Apache proxying, which gives me Yet Another type of cache temporarily while I'm still downloading the other packages. (I should actually work out a way to have these caches used by apt-mirror itself in case a user has already requested a new package while waiting for apt-mirror to get it.)
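
To make the logic of the rewrite conditions easier to follow, here is a rough Python sketch of the same decision: serve from the mirror when possible, otherwise proxy to the upstream server. It is only an illustration, not something Apache runs. The function name `proxy_target` and the helper pattern names are hypothetical, and the sketch assumes the request path maps directly onto a path beneath the mirror directory.

```python
import re
from pathlib import Path

# Values copied from the Apache configuration above.
MIRROR_ROOT = Path("/var/spool/apt-mirror/mirror/archive.ubuntu.com")
UPSTREAM = "http://91.189.88.45"

# Hypothetical helper names; the patterns follow the RewriteCond lines.
_INDEX_BZ2 = re.compile(r"(Packages|Sources)\.bz2$")
_INDEX_FILES = re.compile(r"/index\.[^/]*$", re.IGNORECASE)


def proxy_target(request_path, mirror_root=MIRROR_ROOT, upstream=UPSTREAM):
    """Return the upstream URL to proxy to, or None to serve locally."""
    if request_path.startswith("/cgi/"):
        return None  # CGI scripts are always handled locally
    local = Path(str(mirror_root) + request_path)
    if local.is_file() or local.is_dir():
        return None  # the mirror already has this file or directory
    if _INDEX_BZ2.search(request_path):
        return None  # never take Packages.bz2 or Sources.bz2 from upstream
    if _INDEX_FILES.search(request_path):
        return None  # index files are generated locally
    return upstream + request_path
```

A request falls through to the upstream server only when every condition passes, which matches the way consecutive RewriteCond lines are ANDed together by default.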

Once I do a rewrite like this for each of the hosts I'm replacing with a mirror, I'm almost done. The problem is that if for any reason my site needs to give a 403 to the clients, I would actually like to double-check to be sure that the URL doesn't happen to work at the place I'm mirroring from.

My hope was that I could write a RewriteRule based on what the HTTP return code would be when the request completed. That seemed really hard to do, and perhaps impossible. The quickest solution I found was to write a CGI script to do the redirect. So, in the Apache config I have:

ErrorDocument 403 /cgi/redirect-forbidden.cgi

And the CGI script looks like this:

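#!/usr/bin/perl

use strict;
use warnings;
use CGI qw(:standard);

# URL of the original request that produced the 403.
my $val = $ENV{REDIRECT_SCRIPT_URI} || '';

# Split the URL into the mirror's host prefix and the request path.
# The dots are escaped so they match only literal dots.
my ($host, $path) = $val =~ m%^http://([^/]+)\.sflc\.info(/.*)$%
   or die "unexpected URL: $val\n";

# Security updates come from a different upstream server than the main archive.
if ($host eq "ubuntu-security") {
   $val = "http://91.189.88.37$path";
} else {
   $val = "http://91.189.88.45$path";
}

print redirect($val);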

With these changes, the user will be redirected to the original when the files aren't available on the mirror, and as the mirror gets more accurate, they'll get more files from the mirror.
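
For readers who want to experiment with the URL mapping outside of Apache, here is a minimal Python sketch of what the CGI script above computes. It is an illustration rather than a replacement for the script. The function name `upstream_url` and the constant names are hypothetical, and returning `None` for an unrecognized URL is my own assumption about how to handle that case.

```python
import re

# Upstream addresses copied from the CGI script above.
SECURITY_UPSTREAM = "http://91.189.88.37"
ARCHIVE_UPSTREAM = "http://91.189.88.45"

# Host prefix and path of a mirror URL, with the dots in the host name escaped.
_MIRROR_URL = re.compile(r"^http://([^/]+)\.sflc\.info(/.*)$")


def upstream_url(script_uri):
    """Map a mirror URL to the matching upstream URL, or None if unrecognized."""
    match = _MIRROR_URL.match(script_uri)
    if match is None:
        return None
    host, path = match.groups()
    # Only the security mirror has its own upstream; everything else
    # goes to the main archive server.
    base = SECURITY_UPSTREAM if host == "ubuntu-security" else ARCHIVE_UPSTREAM
    return base + path
```

Keeping the mapping in a pure function like this makes it easy to check a few URLs by hand before wiring anything into the web server.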

I still have problems if for any reason the user gets a Packages or Sources file from the original site before the mirror is synchronized, but this rarely happens since apt-mirror is pretty careful. The only time it might happen is if the user did an apt-get update when not connected to our VPN and only a short time later did one while connected.

Posted on Thursday 24 January 2008 at 13:55 by Bradley M. Kuhn.

This website and all documents on it are licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.

