Re: [mirrorbrain] How to make Squid work with mirrorbrain

From: Anthony Bryan <anthonybryan_at_gmail.com>
Date: Sat, 2 Jun 2012 12:50:28 -0400
On Sat, Jun 2, 2012 at 4:12 AM, Jack Bates <grx28t_at_nottheoilrig.com> wrote:
> Hello Per, this writeup is really well done, thank you for it!
>
> The approach so far taken by the Apache Traffic Server plugin is to examine
> "Link: <...>; rel=duplicate" response headers. For example here are response
> headers from download.services.openoffice.org, which also uses MirrorBrain:
>
>> $ curl -D - -o /dev/null -s http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz
>> HTTP/1.1 302 Found
>> Date: Sat, 02 Jun 2012 06:24:15 GMT
>> Server: Apache/2.2.22 (Linux/SUSE)
>> X-Prefix: 41.197.0.0/16
>> X-AS: 36934
>> X-MirrorBrain-Mirror: halifax.rwth-aachen.de
>> X-MirrorBrain-Realm: other_country
>> Link: <http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.meta4>; rel=describedby; type="application/metalink4+xml"
>> Link: <http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.torrent>; rel=describedby; type="application/x-bittorrent"
>> Link: <http://ftp.halifax.rwth-aachen.de/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>; rel=duplicate; pri=1; geo=de
>> Link: <http://ftp5.gwdg.de/pub/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>; rel=duplicate; pri=2; geo=de
>> Link: <http://ftp3.gwdg.de/pub/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>; rel=duplicate; pri=3; geo=de
>> Link: <http://ftp.cc.uoc.gr/openoffice.org/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>; rel=duplicate; pri=4; geo=gr
>> Link: <http://ftp.ntua.gr/pub/OpenOffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>; rel=duplicate; pri=5; geo=gr
>> Digest: MD5=chZROzRjy791zYb5mUhk3A==
>> Digest: SHA=nRgEtguiGxDlu8PKSxyBSc7TlGw=
>> Digest: SHA-256=VO2S9pgCq1lqgTFTKssVj6amn0npNdagtjI8ziDtiRQ=
>> Location: http://ftp.halifax.rwth-aachen.de/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz
>> Content-Length: 395
>> Connection: close
>> Content-Type: text/html; charset=iso-8859-1
>>
>> $
>
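For anyone who wants to experiment with these headers, here is a rough Python sketch of pulling the mirror URLs and the Metalink URL out of "Link:" header values like the ones above. It is not the Traffic Server plugin's code. The function names are made up for illustration, it assumes one link-value per header line (as in the curl output) with no ";" inside quoted parameter values, and treating a missing "pri" as 999999 is an assumption of this sketch.

```python
import re


def parse_link(value):
    """Parse one Link header value into (url, params), or None if malformed."""
    match = re.match(r'\s*<([^>]*)>(.*)', value, re.S)
    if match is None:
        return None
    url, rest = match.groups()
    params = {}
    # Assumption: parameters are separated by ";" and contain no quoted ";".
    for part in rest.split(';'):
        if '=' in part:
            key, _, val = part.partition('=')
            params[key.strip().lower()] = val.strip().strip('"')
    return url, params


def duplicates(link_values):
    """Return rel=duplicate URLs, lowest pri value (highest priority) first."""
    found = []
    for value in link_values:
        parsed = parse_link(value)
        if parsed and parsed[1].get('rel') == 'duplicate':
            url, params = parsed
            # Assumption: a missing pri sorts last.
            found.append((int(params.get('pri', 999999)), url))
    return [url for _, url in sorted(found, key=lambda item: item[0])]


def metalink_url(link_values):
    """Return the first rel=describedby URL typed as Metalink/XML, if any."""
    for value in link_values:
        parsed = parse_link(value)
        if parsed is None:
            continue
        url, params = parsed
        if (params.get('rel') == 'describedby'
                and params.get('type') == 'application/metalink4+xml'):
            return url
    return None
```

Feeding it the "Link:" values from a response gives an ordered mirror list plus the URL of the .meta4 resource, which is enough to try out the ideas discussed in this thread.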
>
> If a response has a "Location: ..." header and a "Link: <...>;
> rel=duplicate" header then the Traffic Server plugin will check if the URLs
> in these headers are already cached. If the "Location: ..." URL is not
> already cached but a "Link: <...>; rel=duplicate" URL is cached, then the
> plugin will rewrite the "Location: ..." header with the cached URL.
>
> This should redirect clients that are not Metalink aware to a mirror that is
> already cached. I would love any feedback on this approach.
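
The rewrite rule described above can be sketched in a few lines of Python. This is only an illustration of the decision, not the plugin's implementation: `rewrite_location` and the `is_cached` callback are hypothetical names, and a real proxy would ask its own cache instead of a Python set.

```python
def rewrite_location(location, duplicate_urls, is_cached):
    """Return the URL to send in the Location header.

    location       -- original Location URL, or None if the header is absent
    duplicate_urls -- URLs taken from "Link: <...>; rel=duplicate" headers
    is_cached      -- hypothetical callback: is_cached(url) -> bool
    """
    # The rule only applies when both kinds of header are present.
    if location is None or not duplicate_urls:
        return location
    # If the original target is already cached, leave the redirect alone.
    if is_cached(location):
        return location
    # Otherwise prefer the first listed duplicate that is already cached.
    for url in duplicate_urls:
        if is_cached(url):
            return url
    # Nothing cached yet: keep the redirect the origin chose.
    return location
```

For example, with `cache = {'http://b.example/f'}`, calling `rewrite_location('http://a.example/f', ['http://a.example/f', 'http://b.example/f'], cache.__contains__)` returns the cached `http://b.example/f`.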
>
> The code so far is up on GitHub [1].
>
> We are also thinking of examining "Digest: ..." headers. If a response has a
> "Location: ..." header that's not already cached and a "Digest: ..." header,
> then the plugin would check the cache for a matching digest. If found then
> it would rewrite the "Location: ..." header with the cached URL.
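
A minimal Python sketch of the digest idea, under stated assumptions: the `DigestIndex` class and `parse_digest` helper are hypothetical, the index is an in-memory dict, and only SHA-256 is indexed. It relies on the "Digest:" values being "algorithm=base64" pairs, as in the curl output above.

```python
import base64
import hashlib


def parse_digest(value):
    """Split a Digest header value like 'SHA-256=<base64>' into (algorithm, bytes)."""
    algorithm, _, encoded = value.partition('=')
    return algorithm.strip().upper(), base64.b64decode(encoded.strip())


class DigestIndex:
    """Hypothetical in-memory map from content digest to a cached URL."""

    def __init__(self):
        self._urls = {}

    def add(self, url, body):
        # Assumption: index cached bodies by SHA-256 only.
        self._urls[('SHA-256', hashlib.sha256(body).digest())] = url

    def lookup(self, digest_values):
        """Return a cached URL matching any of the Digest values, else None."""
        for value in digest_values:
            url = self._urls.get(parse_digest(value))
            if url is not None:
                return url
        return None
```

A proxy following this approach would call `add` whenever it stores a response body and `lookup` when a redirect arrives whose "Location: ..." URL is not cached.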
>
> This plugin is motivated by a similar problem to the one in your writeup. We
> run a caching proxy here at a rural village in Rwanda to improve our slow
> internet access. But many web sites don't predictably redirect users to the
> same download mirror, which defeats our cache.
>
>
>> When you say "we're using Metalink as the mirror list", what do you
>> mean?  One annoying item in my setup is the parsing of the HTML mirror
>> page - you wouldn't happen to know of a way of retrieving the mirror
>> list in XML format?
>
>
> You can retrieve a Metalink/XML resource that includes information about
> where a file is mirrored, in XML format. I think the correct way to
> *discover* this resource is through a 'Link: <...>; rel=describedby;
> type="application/metalink4+xml"' header. Can anyone (Anthony?) confirm that
> this is the correct way?

Yes, Jack.

And that is what I meant, Per: you could examine the Metalink to
construct a mirror list.
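
As a rough illustration, assuming the .meta4 document follows RFC 5854 (namespace urn:ietf:params:xml:ns:metalink, with `url` elements carrying `priority` and `location` attributes), a mirror list could be extracted like this in Python. The `mirrors` function name is made up, and treating a missing priority as 999999 is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Namespace used by Metalink 4 documents (RFC 5854).
NS = {'ml': 'urn:ietf:params:xml:ns:metalink'}


def mirrors(metalink_xml):
    """Return (priority, location, url) tuples, best priority first."""
    root = ET.fromstring(metalink_xml)
    found = []
    for url in root.findall('./ml:file/ml:url', NS):
        # Assumption: a url element without a priority sorts last.
        priority = int(url.get('priority', 999999))
        found.append((priority, url.get('location'), (url.text or '').strip()))
    found.sort(key=lambda item: item[0])
    return found
```

This avoids parsing the HTML mirror page: fetch the URL from the rel=describedby header, pass the body to `mirrors`, and use the result as the mirror list.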


> So for example, in the above download.services.openoffice.org example:
> http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.meta4
>
> However, I can't seem to get these same headers from download.opensuse.org.
> Both download.services.openoffice.org and download.opensuse.org seem to use
> MirrorBrain. Does anyone know why download.services.openoffice.org
> responses might include a 'Link: <...>; rel=describedby;
> type="application/metalink4+xml"' header but download.opensuse.org responses
> do not?

Yes, download.opensuse.org is probably running a version or two behind
the latest MirrorBrain release.

>> $ curl -D - -o /dev/null -s http://download.opensuse.org/factory/repo/oss/suse/x86_64/CharLS-devel-1.0-2.3.x86_64.rpm
>> HTTP/1.1 302 Found
>> Date: Sat, 02 Jun 2012 07:22:30 GMT
>> Server: Apache/2.2.12 (Linux/SUSE)
>> X-Prefix: 41.197.0.0/16
>> X-AS: 36934
>> X-MirrorBrain-Mirror: ftp5.gwdg.de
>> X-MirrorBrain-Realm: other_country
>> Location: http://ftp5.gwdg.de/pub/opensuse/factory/repo/oss/suse/x86_64/CharLS-devel-1.0-2.3.x86_64.rpm
>> Content-Length: 368
>> Content-Type: text/html; charset=iso-8859-1
>>
>> $
>
>
> More information on the "Link: <...>; rel=duplicate" and 'Link: <...>;
> rel=describedby; type="application/metalink4+xml"' headers is in RFC 6249,
> Metalink/HTTP: Mirrors and Hashes [2]. More information on the XML format
> that includes information about where a file is mirrored is in RFC 5854, The
> Metalink Download Description Format [3].
>
>
>> Switching off segmented downloading is interesting too, but I wanted an
>> environment where the regular openSUSE install process would work with
>> zero modifications.  For instance, imagine a student wanting to install
>> a PC in the lab - grab the NET-install ISO, copy it to a USB stick and
>> boot.  No need to know the proxy, no need to know about a switch for
>> segmented downloading, just pop in the USB stick and go with the
>> defaults.  Same goes for later updates and additional software - that
>> Squid is helping out in the background should be 100% transparent.
>
>
> I've only considered complete downloads so far, although I can see segmented
> downloads will be an issue for our cache also. I'm not sure what the current
> status of support for partial responses in Traffic Server is. I know it is an
> issue: it comes up on the mailing list fairly regularly, and some
> improvements to handling partial responses have recently been made.
>
> It would be neat if, after the cache is aware of requests for the same
> content from different mirrors, and after it is able to cache segmented
> downloads, it could be made aware of requests for the same segment from
> different mirrors. Then after one client assembled a complete download from
> segments from possibly many different mirrors, the cache would also contain
> this complete content, and could respond to requests from subsequent clients
> for any segment from any mirror.
>
> Your solution of logging partial downloads and then downloading them
> completely sounds like a good workaround.
>
> [1] https://github.com/jablko/dedup
> [2] http://tools.ietf.org/html/rfc6249
> [3] http://tools.ietf.org/html/rfc5854

In response to Per: curl Metalink support just landed. I think zypper
supported it on top of libcurl, though I'm not sure.

As for what I said about segmented downloads, never mind, I didn't fully
understand. I like how you're doing things transparently. Much nicer!

-- 
(( Anthony Bryan ... Metalink [ http://www.metalinker.org ]
  )) Easier, More Reliable, Self Healing Downloads

Received on Sat Jun 02 2012 - 16:50:46 GMT