Re: [mirrorbrain] MirrorBrain handling of modified files.

From: Dr. Peter Pöml <>
Date: Wed, 8 Aug 2012 00:35:27 +0200
Am 06.08.2012 um 23:07 schrieb "Dr. Peter Pöml" <>:

> Looking around in the tree, I see several files named "Packages" or "Release", i.e. with no version indication in their name, and on the other hand package files with version info (including build or release numbering). The latter are easy for MirrorBrain to handle, because a file is assumed never to change - if a package is rebuilt, the release counter is incremented. The "version-less" files, however, are more difficult to handle: MirrorBrain doesn't store modification times or file sizes for what it finds on mirrors. (It would have been possible to implement it that way, but it was decided against for performance reasons [which might not affect some users, in fact...].) 
> There are some ways to deal with it: 
> If the files are small, simply don't redirect for them; just deliver them. You can use the MirrorBrainExcludeFileMask directive with a regexp, e.g. "\.(xml|asc)". This can save the client an additional roundtrip to the mirror. And for certain files, like crypto-hashes and signatures, it makes sense to deliver them directly as well, for security reasons.
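
As a sketch, the exclusion could look like this in the Apache configuration (the directory path and file mask are examples to adapt to your tree):

```apache
<Directory /srv/download>
    MirrorBrainEngine On
    # Deliver small, frequently changing or security-sensitive files
    # (metadata, hashes, signatures) directly instead of redirecting:
    MirrorBrainExcludeFileMask "\.(xml|asc)$"
</Directory>
```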
> For larger files, a workaround is required, so the load can be offloaded to mirrors by redirection. Say you have a file named "Packages" which is updated frequently. You could rename the file to make its name unique. It would be named e.g. "Packages-${sha1sum}", or "Packages-${timestamp}". A symlink named "Packages" would point to the current file. Whenever the file is updated, you update the symlink. This works because MirrorBrain resolves links before considering redirection. Therefore, MirrorBrain would send the client only to mirrors that have the current file. It wouldn't matter anymore if there are files named "Packages" on the mirrors; only the current uniquely named file would be downloaded. 
> is an example of this technique.
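
A minimal sketch of the rename-plus-symlink scheme (the repository path and file content are illustrative):

```shell
#!/bin/sh
# Publish a changing "Packages" file under a content-unique name,
# then point a stable symlink at it. Paths/content are examples.
set -e
repo=/tmp/mb-demo
mkdir -p "$repo"
printf 'pkg-a 1.0\n' > "$repo/Packages.new"
sum=$(sha1sum "$repo/Packages.new" | cut -d' ' -f1)
mv "$repo/Packages.new" "$repo/Packages-$sum"
# -f replaces the existing link; readers always see a complete file
ln -sfn "Packages-$sum" "$repo/Packages"
```

Since MirrorBrain resolves the symlink before deciding on redirection, clients are only sent to mirrors that already carry the current Packages-${sha1sum} file.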
> When mirrors sync and get the new uniquely named files, and you scan them, redirection to them will commence. To scan subdirectories, the "mb scan -d <dir>" command can be very useful (saving a full scan).
> If it is important to spread files quickly, to save load, mirrors either need to sync very frequently (and need to be scanned frequently), or the content needs to be pushed (which gives better control, and one knows exactly when to scan the mirrors). This little script
> is used by some people to implement content pushing to mirrors in parallel. It requires write access on mirrors. Otherwise, triggered pull syncing is another possible way to implement this (as Debian does). 
> Does this help?
> Thanks,
> Peter

By the way, it can be quite difficult to deal with the edge cases in a way that makes all users happy: mirrors with an update in transit, the master with an update in transit, and downright outdated mirrors. Unless one properly controls all the components, it is hardly possible to guarantee a 100% usable state of the system. 

On the master site, you can work with atomic updates (rsync and a little scripting). Still, the download client must be prepared to make a second try. 
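
One common way to get an atomic update on the master is to stage the new tree and then flip a symlink; paths and content below are illustrative (the staged tree would normally be filled by rsync from a build host):

```shell
#!/bin/sh
# Atomic master update: stage a complete new tree, then switch a
# "current" symlink to it in one step. Clients see either the old
# or the new tree, never a half-written mix.
set -e
root=/tmp/mb-master
mkdir -p "$root/stage-2"
printf 'release 2\n' > "$root/stage-2/Release"
# ... fill/verify the staged tree here (e.g. rsync from build host) ...
ln -sfn stage-2 "$root/current"
```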

Syncing the mirrors is never atomic; it takes considerable time, even if you actively push-sync content yourself. The most rewarding strategy is to keep sync times short by doing just that, push-syncing content. Being quick eliminates more sources of pain, or at least limits the fraction of users affected. 

The final step to robustness is to have a robust download client. The master site should offer crypto-hashes that the client uses to verify that it got the correct content. This is a good idea for security reasons anyway, but it's also extremely useful to detect that stale content was downloaded from a mirror. This can be achieved with Metalinks, with Torrents, or with crypto-hashes in HTTP headers. The latter, using HTTP headers to supply the information, is the most powerful way. See for details. Any client which sees these HTTP response headers (which are sent along with the HTTP redirect to a mirror) can use the crypto-hashes to know whether the content it downloads is fresh, and fall back to the other supplied mirrors in case it's not.
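
As a sketch of the header-based approach: a client that received an RFC 3230 style Digest header (e.g. "SHA-256=<base64>" - the exact header shape here is an assumption, not taken from this mail) could verify the downloaded body like this:

```python
import base64
import hashlib

def verify_digest(body: bytes, digest_header: str) -> bool:
    """Check a response body against an RFC 3230 style Digest header,
    e.g. 'SHA-256=<base64 of the raw sha-256 digest>'."""
    for item in digest_header.split(","):
        algo, _, value = item.strip().partition("=")
        if algo.lower() == "sha-256":
            return hashlib.sha256(body).digest() == base64.b64decode(value)
    return False  # no algorithm we understand was offered

# Simulate what a server could send alongside the redirect:
body = b"hello mirror\n"
header = "SHA-256=" + base64.b64encode(hashlib.sha256(body).digest()).decode()

print(verify_digest(body, header))           # True: content is fresh
print(verify_digest(b"stale data", header))  # False: fall back to another mirror
```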

Any serious operating system update client / download tool should do something like this, IMHO. I hope that this functionality is integrated in as many clients as possible in the future, including web browsers used by humans.

I know, this might not help so much right now. But I feel that it is important to point out possibilities and chances for the future, because we are working on the future today, right? :-)


Received on Tue Aug 07 2012 - 22:35:30 GMT

This archive was generated by hypermail 2.3.0 : Wed Aug 08 2012 - 01:17:06 GMT