On Sat, Jun 2, 2012 at 4:12 AM, Jack Bates <grx28t_at_nottheoilrig.com> wrote:
> Hello Per, this writeup is really well done, thank you for it!
>
> The approach so far taken by the Apache Traffic Server plugin is to examine
> "Link: <...>; rel=duplicate" response headers. For example here are response
> headers from download.services.openoffice.org, which also uses MirrorBrain:
>
>> $ curl -D - -o /dev/null -s http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz
>> HTTP/1.1 302 Found
>> Date: Sat, 02 Jun 2012 06:24:15 GMT
>> Server: Apache/2.2.22 (Linux/SUSE)
>> X-Prefix: 41.197.0.0/16
>> X-AS: 36934
>> X-MirrorBrain-Mirror: halifax.rwth-aachen.de
>> X-MirrorBrain-Realm: other_country
>> Link: <http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.meta4>; rel=describedby; type="application/metalink4+xml"
>> Link: <http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.torrent>; rel=describedby; type="application/x-bittorrent"
>> Link: <http://ftp.halifax.rwth-aachen.de/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>; rel=duplicate; pri=1; geo=de
>> Link: <http://ftp5.gwdg.de/pub/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>; rel=duplicate; pri=2; geo=de
>> Link: <http://ftp3.gwdg.de/pub/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>; rel=duplicate; pri=3; geo=de
>> Link: <http://ftp.cc.uoc.gr/openoffice.org/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>; rel=duplicate; pri=4; geo=gr
>> Link: <http://ftp.ntua.gr/pub/OpenOffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>; rel=duplicate; pri=5; geo=gr
>> Digest: MD5=chZROzRjy791zYb5mUhk3A==
>> Digest: SHA=nRgEtguiGxDlu8PKSxyBSc7TlGw=
>> Digest: SHA-256=VO2S9pgCq1lqgTFTKssVj6amn0npNdagtjI8ziDtiRQ=
>> Location: http://ftp.halifax.rwth-aachen.de/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz
>> Content-Length: 395
>> Connection: close
>> Content-Type: text/html; charset=iso-8859-1
>>
>> $
>
>
> If a response has a "Location: ..." header and a "Link: <...>;
> rel=duplicate" header then the Traffic Server plugin will check if the URLs
> in these headers are already cached. If the "Location: ..." URL is not
> already cached but a "Link: <...>; rel=duplicate" URL is cached, then the
> plugin will rewrite the "Location: ..." header with the cached URL
>
> This should redirect clients that are not Metalink aware to a mirror that is
> already cached. I would love any feedback on this approach
>
> The code so far is up on GitHub [1]
>
> We are also thinking of examining "Digest: ..." headers. If a response has a
> "Location: ..." header that's not already cached and a "Digest: ..." header,
> then the plugin would check the cache for a matching digest. If found then
> it would rewrite the "Location: ..." header with the cached URL
>
> This plugin is motivated by a similar problem to the one in your writeup. We
> run a caching proxy here at a rural village in Rwanda to improve our slow
> internet access. But many web sites don't predictably redirect users to the
> same download mirror, which defeats our cache
>
>
>> When you say "we're using Metalink as the mirror list", what do you
>> mean? One annoying item in my setup is the parsing of the HTML mirror
>> page - you wouldn't happen to know of a way of retrieving the mirror
>> list in XML format?
>
>
> You can retrieve a Metalink/XML resource that includes information about
> where a file is mirrored, in XML format. I think the correct way to
> *discover* this resource is through a 'Link: <...>; rel=describedby;
> type="application/metalink4+xml"' header. Can anyone (Anthony?) confirm that
> this is the correct way?

yes, Jack. and that is what I meant, Per, that you could examine the
metalink to construct a mirror list.

> So for example, in the above download.services.openoffice.org example:
> http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.meta4
>
> However I can't seem to get these same headers from download.opensuse.org.
> Both download.services.openoffice.org and download.opensuse.org seem to use
> MirrorBrain, anyone know why might download.services.openoffice.org
> responses include a 'Link: <...>; rel=describedby;
> type="application/metalink4+xml"' header but download.opensuse.org responses
> not?

yes, download.opensuse.org is running a version or 2 behind the latest
MB release probably.

>> $ curl -D - -o /dev/null -s http://download.opensuse.org/factory/repo/oss/suse/x86_64/CharLS-devel-1.0-2.3.x86_64.rpm
>> HTTP/1.1 302 Found
>> Date: Sat, 02 Jun 2012 07:22:30 GMT
>> Server: Apache/2.2.12 (Linux/SUSE)
>> X-Prefix: 41.197.0.0/16
>> X-AS: 36934
>> X-MirrorBrain-Mirror: ftp5.gwdg.de
>> X-MirrorBrain-Realm: other_country
>> Location: http://ftp5.gwdg.de/pub/opensuse/factory/repo/oss/suse/x86_64/CharLS-devel-1.0-2.3.x86_64.rpm
>> Content-Length: 368
>> Content-Type: text/html; charset=iso-8859-1
>>
>> $
>
>
> More information on the "Link: <...>; rel=duplicate" and 'Link: <...>;
> rel=describedby; type="application/metalink4+xml"' headers is in RFC 6249,
> Metalink/HTTP: Mirrors and Hashes [2]. More information on the XML format
> that includes information about where a file is mirrored is in RFC 5854, The
> Metalink Download Description Format [3]
>
>
>> Switching off segmented downloading is interesting too, but I wanted an
>> environment where the regular openSUSE install process would work with
>> zero modifications. For instance, imagine a student wanting to install
>> a PC in the lab - grab the NET-install ISO, copy it to a USB stick and
>> boot. No need to know the proxy, no need to know about a switch for
>> segmented downloading, just pop in the USB stick and go with the
>> defaults. Same goes for later updates and additional software - that
>> Squid is helping out in the background should be 100% transparent.
>
>
> I've only considered complete downloads so far, although I can see segmented
> downloads will be an issue for our cache also. I'm not sure what is the
> current status of support for partial responses in Traffic Server. I know it
> is an issue, it comes up on the mailing list fairly regularly, and some
> improvements to handling partial responses have recently been made
>
> It would be neat if, after the cache is aware of requests for the same
> content from different mirrors, and after it is able to cache segmented
> downloads, it could be made aware of requests for the same segment from
> different mirrors. Then after one client assembled a complete download from
> segments from possibly many different mirrors, the cache would also contain
> this complete content, and could respond to requests from subsequent clients
> for any segment from any mirror
>
> Your solution to log partial downloads and then download them completely
> sounds like a good workaround
>
> [1] https://github.com/jablko/dedup
> [2] http://tools.ietf.org/html/rfc6249
> [3] http://tools.ietf.org/html/rfc5854

in response to Per, curl metalink support just landed. I think zypper
supported it on top of libcurl? not sure.

what I said about segmented downloads, never mind, didn't fully
understand...I like how you're doing things transparently. much nicer!

-- 
(( Anthony Bryan ... Metalink [ http://www.metalinker.org ] ))
 Easier, More Reliable, Self Healing Downloads
_______________________________________________
mirrorbrain mailing list
Archive: http://mirrorbrain.org/archive/mirrorbrain/

Note: To remove yourself from this mailing list, send a mail with the content
 unsubscribe
to the address mirrorbrain-request_at_mirrorbrain.org

Received on Sat Jun 02 2012 - 16:50:46 GMT
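
The "Location: ..." rewriting that Jack describes in the quoted text can be sketched roughly as follows. This is only an illustration in Python, not the Traffic Server plugin's code [1]. The names parse_link and rewrite_location, the set of cached URLs, and the digest-to-URL map are all hypothetical stand-ins for whatever cache lookup the proxy really offers. Trying cached duplicates in ascending "pri" order is an assumption made here; the message does not say which cached mirror would be preferred. The sketch also assumes one link per "Link:" header, as in the curl output, and does not handle ";" inside quoted parameter values.

```python
import re

# One Link header value: <url> followed by ;-separated parameters.
_LINK_RE = re.compile(r'\s*<([^>]*)>(.*)', re.S)


def parse_link(value):
    """Split a Link header value into (url, params), or None if malformed."""
    match = _LINK_RE.match(value)
    if match is None:
        return None
    url, rest = match.group(1), match.group(2)
    params = {}
    for part in rest.split(';'):
        part = part.strip()
        if not part:
            continue
        key, _, val = part.partition('=')
        params[key.strip().lower()] = val.strip().strip('"')
    return url, params


def _priority(link):
    """Sort key: numeric pri parameter, with missing or bad values last."""
    try:
        return int(link[1].get('pri', ''))
    except ValueError:
        return 10 ** 9


def rewrite_location(headers, cached_urls, cached_digests=None):
    """Pick the Location value to send to the client.

    headers        -- list of (name, value) pairs from the redirect response
    cached_urls    -- set of URLs already in the cache (hypothetical interface)
    cached_digests -- dict mapping a Digest value such as "SHA-256=..." to a
                      cached URL (hypothetical; models the proposed Digest check)
    """
    location = None
    duplicates = []
    digests = []
    for name, value in headers:
        lowered = name.lower()
        if lowered == 'location':
            location = value.strip()
        elif lowered == 'link':
            link = parse_link(value)
            if link is not None and link[1].get('rel') == 'duplicate':
                duplicates.append(link)
        elif lowered == 'digest':
            digests.append(value.strip())

    # Nothing to do without a redirect, or if its target is already cached.
    if location is None or location in cached_urls:
        return location

    # Prefer a mirror listed as rel=duplicate that is already cached.
    for url, _params in sorted(duplicates, key=_priority):
        if url in cached_urls:
            return url

    # Otherwise fall back to a cached object with a matching digest.
    for digest in digests:
        if cached_digests and digest in cached_digests:
            return cached_digests[digest]

    return location
```

Given the first curl response above, if only the ftp5.gwdg.de URL were in cached_urls, rewrite_location would return that URL instead of the ftp.halifax.rwth-aachen.de one; with nothing cached, it returns the original "Location: ..." value unchanged.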