Re: [mirrorbrain] How to make Squid work with mirrorbrain

From: Jack Bates <grx28t_at_nottheoilrig.com>
Date: Sat, 02 Jun 2012 01:12:42 -0700

Hello Per, this writeup is really well done, thank you for it!

The approach so far taken by the Apache Traffic Server plugin is to 
examine "Link: <...>; rel=duplicate" response headers. For example here 
are response headers from download.services.openoffice.org, which also 
uses MirrorBrain:

> $ curl -D - -o /dev/null -s http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz
> HTTP/1.1 302 Found
> Date: Sat, 02 Jun 2012 06:24:15 GMT
> Server: Apache/2.2.22 (Linux/SUSE)
> X-Prefix: 41.197.0.0/16
> X-AS: 36934
> X-MirrorBrain-Mirror: halifax.rwth-aachen.de
> X-MirrorBrain-Realm: other_country
> Link: <http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.meta4>; rel=describedby; type="application/metalink4+xml"
> Link: <http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.torrent>; rel=describedby; type="application/x-bittorrent"
> Link: <http://ftp.halifax.rwth-aachen.de/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>; rel=duplicate; pri=1; geo=de
> Link: <http://ftp5.gwdg.de/pub/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>; rel=duplicate; pri=2; geo=de
> Link: <http://ftp3.gwdg.de/pub/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>; rel=duplicate; pri=3; geo=de
> Link: <http://ftp.cc.uoc.gr/openoffice.org/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>; rel=duplicate; pri=4; geo=gr
> Link: <http://ftp.ntua.gr/pub/OpenOffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>; rel=duplicate; pri=5; geo=gr
> Digest: MD5=chZROzRjy791zYb5mUhk3A==
> Digest: SHA=nRgEtguiGxDlu8PKSxyBSc7TlGw=
> Digest: SHA-256=VO2S9pgCq1lqgTFTKssVj6amn0npNdagtjI8ziDtiRQ=
> Location: http://ftp.halifax.rwth-aachen.de/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz
> Content-Length: 395
> Connection: close
> Content-Type: text/html; charset=iso-8859-1
>
> $

If a response has a "Location: ..." header and a "Link: <...>; 
rel=duplicate" header then the Traffic Server plugin will check if the 
URLs in these headers are already cached. If the "Location: ..." URL is 
not already cached but a "Link: <...>; rel=duplicate" URL is cached, 
then the plugin will rewrite the "Location: ..." header with the cached URL.
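
To make the check above concrete, here is a rough sketch in Python. It is not the plugin's code: rewrite_location, parse_link and the is_cached callable are names made up for illustration, and is_cached only stands in for a real cache lookup. It also assumes one link per "Link: ..." header line, as in the curl output above.

```python
import re

# One Link header value: <URL> followed by ;-separated parameters.
LINK_RE = re.compile(r'\s*<([^>]*)>\s*(.*)', re.S)

def parse_link(value):
    """Split one Link header value into (url, params dict), or None."""
    m = LINK_RE.match(value)
    if not m:
        return None
    url, rest = m.groups()
    params = {}
    for part in rest.split(';'):
        part = part.strip()
        if not part:
            continue
        name, _, val = part.partition('=')
        params[name.strip().lower()] = val.strip().strip('"')
    return url, params

def rewrite_location(headers, is_cached):
    """Return the Location value to send on to the client.

    headers:   list of (name, value) pairs from the redirect response.
    is_cached: hypothetical callable, URL -> bool (assumption, stands in
               for the cache lookup).
    """
    location = None
    duplicates = []
    for name, value in headers:
        lname = name.lower()
        if lname == 'location':
            location = value.strip()
        elif lname == 'link':
            parsed = parse_link(value)
            if parsed and parsed[1].get('rel') == 'duplicate':
                duplicates.append(parsed[0])
    # Nothing to do without a Location, or if it is already cached.
    if location is None or is_cached(location):
        return location
    # Otherwise prefer the first duplicate that is already cached.
    for url in duplicates:
        if is_cached(url):
            return url
    return location
```

With the headers above and only the ftp5.gwdg.de URL in the cache, rewrite_location would return the ftp5.gwdg.de URL instead of the ftp.halifax.rwth-aachen.de one.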

This should redirect clients that are not Metalink aware to a mirror 
that is already cached. I would love any feedback on this approach.

The code so far is up on GitHub [1].

We are also thinking of examining "Digest: ..." headers. If a response 
has a "Location: ..." header that's not already cached and a "Digest: 
..." header, then the plugin would check the cache for a matching 
digest. If found then it would rewrite the "Location: ..." header with 
the cached URL.
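
Since this part is only an idea so far, the following Python is just a sketch of how the lookup might go. The digest_index dict (digest -> cached URL) is a hypothetical stand-in for however the cache would index content by digest, and the function names are made up. It assumes one digest per "Digest: ..." header line, as in the curl output above.

```python
import base64
import hashlib

# Digest header labels for a few hashlib algorithm names.
LABELS = {'md5': 'MD5', 'sha1': 'SHA', 'sha256': 'SHA-256'}

def digest_header_value(body, algorithm='sha256'):
    """Format a digest as the Digest header carries it: base64 of the raw hash."""
    raw = hashlib.new(algorithm, body).digest()
    return LABELS[algorithm] + '=' + base64.b64encode(raw).decode('ascii')

def normalize(value):
    """Uppercase the algorithm label; keep the base64 part (and its padding) as is."""
    algorithm, _, encoded = value.strip().partition('=')
    return algorithm.upper() + '=' + encoded

def find_cached_by_digest(headers, digest_index):
    """Return a cached URL whose content matches a Digest header, else None.

    digest_index: hypothetical dict mapping 'SHA-256=...' strings to cached
                  URLs (assumption, not an existing cache API).
    """
    for name, value in headers:
        if name.lower() == 'digest':
            url = digest_index.get(normalize(value))
            if url is not None:
                return url
    return None
```

The index could be filled as content enters the cache, for example digest_index[digest_header_value(body)] = url.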

This plugin is motivated by a similar problem to the one in your 
writeup. We run a caching proxy here at a rural village in Rwanda to 
improve our slow internet access. But many web sites don't predictably 
redirect users to the same download mirror, which defeats our cache.

> When you say "we're using Metalink as the mirror list", what do you
> mean?  One annoying item in my setup is the parsing of the HTML mirror
> page - you wouldn't happen to know of a way of retrieving the mirror
> list in XML format?

You can retrieve a Metalink/XML resource that includes information about 
where a file is mirrored, in XML format. I think the correct way to 
*discover* this resource is through a 'Link: <...>; rel=describedby; 
type="application/metalink4+xml"' header. Can anyone (Anthony?) confirm 
that this is the correct way?

So for example, in the above download.services.openoffice.org example: 
http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.meta4
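
As a sketch of how a script might do this instead of parsing the HTML mirror page: find the Metalink URL in the "Link: ..." headers, fetch it, and read the <url> elements. The function names are made up for illustration. The namespace and the "priority" and "location" attributes are from RFC 5854 [3]; treating a missing priority as lowest is my assumption.

```python
import xml.etree.ElementTree as ET

METALINK_NS = '{urn:ietf:params:xml:ns:metalink}'  # namespace from RFC 5854

def metalink_url(headers):
    """Return the URL of the first Link header advertising Metalink/XML, else None."""
    for name, value in headers:
        if name.lower() != 'link':
            continue
        target, _, params = value.partition('>')
        params = [p.strip().replace('"', '') for p in params.split(';')]
        if ('rel=describedby' in params
                and 'type=application/metalink4+xml' in params):
            return target.strip().lstrip('<')
    return None

def mirrors(metalink_xml):
    """Return (priority, location, url) for every <url> element, best first."""
    root = ET.fromstring(metalink_xml)
    found = []
    for url in root.iter(METALINK_NS + 'url'):
        # Assumption: a <url> without a priority attribute sorts last.
        priority = int(url.get('priority', '999999'))
        found.append((priority, url.get('location'), url.text.strip()))
    return sorted(found, key=lambda item: item[0])
```

The .meta4 document itself could be fetched with urllib.request.urlopen(metalink_url(headers)).read() and passed to mirrors().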

However, I can't seem to get these same headers from 
download.opensuse.org. Both download.services.openoffice.org and 
download.opensuse.org seem to use MirrorBrain. Does anyone know why 
download.services.openoffice.org responses include a 'Link: <...>; 
rel=describedby; type="application/metalink4+xml"' header but 
download.opensuse.org responses do not?

> $ curl -D - -o /dev/null -s http://download.opensuse.org/factory/repo/oss/suse/x86_64/CharLS-devel-1.0-2.3.x86_64.rpm
> HTTP/1.1 302 Found
> Date: Sat, 02 Jun 2012 07:22:30 GMT
> Server: Apache/2.2.12 (Linux/SUSE)
> X-Prefix: 41.197.0.0/16
> X-AS: 36934
> X-MirrorBrain-Mirror: ftp5.gwdg.de
> X-MirrorBrain-Realm: other_country
> Location: http://ftp5.gwdg.de/pub/opensuse/factory/repo/oss/suse/x86_64/CharLS-devel-1.0-2.3.x86_64.rpm
> Content-Length: 368
> Content-Type: text/html; charset=iso-8859-1
>
> $

More information on the "Link: <...>; rel=duplicate" and 'Link: <...>; 
rel=describedby; type="application/metalink4+xml"' headers is in RFC 
6249, Metalink/HTTP: Mirrors and Hashes [2]. More information on the XML 
format that includes information about where a file is mirrored is in 
RFC 5854, The Metalink Download Description Format [3].

> Switching off segmented downloading is interesting too, but I wanted an
> environment where the regular openSUSE install process would work with
> zero modifications.  For instance, imagine a student wanting to install
> a PC in the lab - grab the NET-install ISO, copy it to a USB stick and
> boot.  No need to know the proxy, no need to know about a switch for
> segmented downloading, just pop in the USB stick and go with the
> defaults.  Same goes for later updates and additional software - that
> Squid is helping out in the background should be 100% transparent.

I've only considered complete downloads so far, although I can see 
segmented downloads will be an issue for our cache also. I'm not sure 
what the current status of support for partial responses in Traffic 
Server is. I know it is an issue, it comes up on the mailing list fairly 
regularly, and some improvements to handling partial responses have 
recently been made.

It would be neat if, after the cache is aware of requests for the same 
content from different mirrors, and after it is able to cache segmented 
downloads, it could be made aware of requests for the same segment from 
different mirrors. Then after one client assembled a complete download 
from segments from possibly many different mirrors, the cache would also 
contain this complete content, and could respond to requests from 
subsequent clients for any segment from any mirror.
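
Purely to illustrate the idea (nothing like this exists in the plugin or, as far as I know, in Traffic Server), a toy version in Python might key stored segments by a content id such as a digest rather than by mirror URL. SegmentCache and its methods are hypothetical names.

```python
class SegmentCache:
    """Toy cache that stores byte ranges per content id, not per mirror URL."""

    def __init__(self):
        self.segments = {}  # content id -> {start offset: bytes}

    def store(self, content_id, start, data):
        """Remember a segment, whichever mirror it was fetched from."""
        self.segments.setdefault(content_id, {})[start] = data

    def read(self, content_id, start, end):
        """Return bytes [start, end) if stored segments cover them, else None."""
        parts = self.segments.get(content_id, {})
        out = bytearray()
        pos = start
        while pos < end:
            for seg_start, data in parts.items():
                if seg_start <= pos < seg_start + len(data):
                    # Take as much of this segment as the request needs.
                    chunk = data[pos - seg_start:end - seg_start]
                    out += chunk
                    pos += len(chunk)
                    break
            else:
                return None  # gap: no stored segment covers this offset
        return bytes(out)
```

Once clients have between them fetched every segment, read(content_id, 0, length) returns the complete content, and any smaller range can be answered too.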

Your solution to log partial downloads and then download them completely 
sounds like a good workaround.

[1] https://github.com/jablko/dedup
[2] http://tools.ietf.org/html/rfc6249
[3] http://tools.ietf.org/html/rfc5854


_______________________________________________
mirrorbrain mailing list
Archive: http://mirrorbrain.org/archive/mirrorbrain/

Received on Sat Jun 02 2012 - 11:08:14 GMT