Hello Per, this writeup is really well done, thank you for it!

The approach taken so far by the Apache Traffic Server plugin is to examine "Link: <...>; rel=duplicate" response headers. For example, here are response headers from download.services.openoffice.org, which also uses MirrorBrain:

> $ curl -D - -o /dev/null -s http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz
> HTTP/1.1 302 Found
> Date: Sat, 02 Jun 2012 06:24:15 GMT
> Server: Apache/2.2.22 (Linux/SUSE)
> X-Prefix: 41.197.0.0/16
> X-AS: 36934
> X-MirrorBrain-Mirror: halifax.rwth-aachen.de
> X-MirrorBrain-Realm: other_country
> Link: <http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.meta4>; rel=describedby; type="application/metalink4+xml"
> Link: <http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.torrent>; rel=describedby; type="application/x-bittorrent"
> Link: <http://ftp.halifax.rwth-aachen.de/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>; rel=duplicate; pri=1; geo=de
> Link: <http://ftp5.gwdg.de/pub/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>; rel=duplicate; pri=2; geo=de
> Link: <http://ftp3.gwdg.de/pub/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>; rel=duplicate; pri=3; geo=de
> Link: <http://ftp.cc.uoc.gr/openoffice.org/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>; rel=duplicate; pri=4; geo=gr
> Link: <http://ftp.ntua.gr/pub/OpenOffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>; rel=duplicate; pri=5; geo=gr
> Digest: MD5=chZROzRjy791zYb5mUhk3A==
> Digest: SHA=nRgEtguiGxDlu8PKSxyBSc7TlGw=
> Digest: SHA-256=VO2S9pgCq1lqgTFTKssVj6amn0npNdagtjI8ziDtiRQ=
> Location: http://ftp.halifax.rwth-aachen.de/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz
> Content-Length: 395
> Connection: close
> Content-Type: text/html; charset=iso-8859-1
>
> $

If a response has a "Location: ..." header and a "Link: <...>; rel=duplicate" header, then the Traffic Server plugin checks whether the URLs in these headers are already cached. If the "Location: ..." URL is not already cached but a "Link: <...>; rel=duplicate" URL is, then the plugin rewrites the "Location: ..." header with the cached URL. This should redirect clients that are not Metalink aware to a mirror that is already cached. I would love any feedback on this approach. The code so far is up on GitHub [1].

We are also thinking of examining "Digest: ..." headers. If a response has a "Location: ..." header that's not already cached and a "Digest: ..." header, then the plugin would check the cache for a matching digest. If found, it would rewrite the "Location: ..." header with the cached URL.

This plugin is motivated by a problem similar to the one in your writeup. We run a caching proxy here at a rural village in Rwanda to improve our slow internet access, but many web sites don't predictably redirect users to the same download mirror, which defeats our cache.

> When you say "we're using Metalink as the mirror list", what do you
> mean? One annoying item in my setup is the parsing of the HTML mirror
> page - you wouldn't happen to know of a way of retrieving the mirror
> list in XML format?

You can retrieve a Metalink/XML resource that includes information about where a file is mirrored, in XML format. I think the correct way to *discover* this resource is through a 'Link: <...>; rel=describedby; type="application/metalink4+xml"' header. Can anyone (Anthony?) confirm that this is the correct way? So, for example, in the above download.services.openoffice.org response:

http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.meta4

However, I can't seem to get these same headers from download.opensuse.org.
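In case it helps to see the intended logic concretely, here is a minimal sketch in Python of the Location rewrite described above (the actual plugin is C code in the Traffic Server API; the mirror URLs and the simple is_cached predicate here are hypothetical stand-ins for the real cache lookup):

```python
import re

def parse_links(link_headers):
    """Split 'Link: <url>; key=value; ...' header values into (url, params) pairs."""
    links = []
    for value in link_headers:
        match = re.match(r'\s*<([^>]+)>\s*(.*)', value)
        if not match:
            continue
        params = {}
        for part in match.group(2).split(';'):
            part = part.strip()
            if '=' in part:
                key, _, val = part.partition('=')
                params[key.strip()] = val.strip().strip('"')
        links.append((match.group(1), params))
    return links

def rewrite_location(location, link_headers, is_cached):
    """Prefer a cached rel=duplicate mirror when the Location URL is a cache miss.

    Duplicates are tried in ascending 'pri' order, following RFC 6249's
    preference semantics; the original Location is kept if nothing is cached."""
    if is_cached(location):
        return location
    duplicates = [(url, params) for url, params in parse_links(link_headers)
                  if params.get('rel') == 'duplicate']
    duplicates.sort(key=lambda item: int(item[1].get('pri', '999999')))
    for url, _ in duplicates:
        if is_cached(url):
            return url
    return location
```

The planned "Digest: ..." variant would follow the same shape, except the cache lookup would be keyed on the digest value instead of on the candidate mirror URLs.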
Both download.services.openoffice.org and download.opensuse.org seem to use MirrorBrain. Does anyone know why download.services.openoffice.org responses include a 'Link: <...>; rel=describedby; type="application/metalink4+xml"' header but download.opensuse.org responses do not?

> $ curl -D - -o /dev/null -s http://download.opensuse.org/factory/repo/oss/suse/x86_64/CharLS-devel-1.0-2.3.x86_64.rpm
> HTTP/1.1 302 Found
> Date: Sat, 02 Jun 2012 07:22:30 GMT
> Server: Apache/2.2.12 (Linux/SUSE)
> X-Prefix: 41.197.0.0/16
> X-AS: 36934
> X-MirrorBrain-Mirror: ftp5.gwdg.de
> X-MirrorBrain-Realm: other_country
> Location: http://ftp5.gwdg.de/pub/opensuse/factory/repo/oss/suse/x86_64/CharLS-devel-1.0-2.3.x86_64.rpm
> Content-Length: 368
> Content-Type: text/html; charset=iso-8859-1
>
> $

More information on the "Link: <...>; rel=duplicate" and 'Link: <...>; rel=describedby; type="application/metalink4+xml"' headers is in RFC 6249, Metalink/HTTP: Mirrors and Hashes [2]. More information on the XML format that describes where a file is mirrored is in RFC 5854, The Metalink Download Description Format [3].

> Switching off segmented downloading is interesting too, but I wanted an
> environment where the regular openSUSE install process would work with
> zero modifications. For instance, imagine a student wanting to install
> a PC in the lab - grab the NET-install ISO, copy it to a USB stick and
> boot. No need to know the proxy, no need to know about a switch for
> segmented downloading, just pop in the USB stick and go with the
> defaults. Same goes for later updates and additional software - that
> Squid is helping out in the background should be 100% transparent.

I've only considered complete downloads so far, although I can see that segmented downloads will be an issue for our cache also. I'm not sure what the current status of support for partial responses in Traffic Server is.
I know it is an issue; it comes up on the mailing list fairly regularly, and some improvements to handling partial responses have recently been made.

It would be neat if, once the cache is aware of requests for the same content from different mirrors, and once it is able to cache segmented downloads, it could also be made aware of requests for the same segment from different mirrors. Then, after one client assembled a complete download from segments fetched from possibly many different mirrors, the cache would also contain the complete content, and could respond to requests from subsequent clients for any segment from any mirror.

Your solution of logging partial downloads and then downloading them completely sounds like a good workaround.

[1] https://github.com/jablko/dedup
[2] http://tools.ietf.org/html/rfc6249
[3] http://tools.ietf.org/html/rfc5854

_______________________________________________
mirrorbrain mailing list
Archive: http://mirrorbrain.org/archive/mirrorbrain/
Note: To remove yourself from this mailing list, send a mail with the content
unsubscribe to the address mirrorbrain-request_at_mirrorbrain.org

Received on Sat Jun 02 2012 - 11:08:14 GMT
This archive was generated by hypermail 2.3.0 : Mon Jun 04 2012 - 11:47:02 GMT