Hi,

lately, I have been experimenting with ways to create dummy file trees locally. The idea is that once a file tree exists, Apache can not only serve files, but also act as a well-behaved HTTP server: it can handle "if-modified-since" requests, generate directory indexes and so on.

Now, for all of that, what matters is everything but the actual content of the files. So how could we have the files without their content? It would be possible to put that kind of metadata into a database, but that would be quite some work to implement and get right.

However, I experimented (rather successfully) with an addition to the mirror scanner, which creates all files that it sees remotely as local files filled with zeros. Not only does this save the bandwidth, it's also possible to create the files as sparse files, which means that only the metadata will occupy actual disk space. This seemed to work pretty nicely, provided that the upstream mirror offers rsync (because then all metadata is available and there are no timestamp issues to deal with).

The advantage was that this seemed to be maintainable in a very automatic way, but it left some things to be desired. In particular, the order in which files are processed is important, e.g. for setting mtimes on directories; that must happen after changes inside the directories.

The null-rsync script improves on this, and I think it is ready for prime time. It is a standalone script instead of being integrated into the mirror scanner. It uses rsync behind the scenes, and makes use of rsync's "itemize" output in a customized format.

It replicates sizes, mtimes, directory mtimes and symlinks. It doesn't recreate hardlinks, because they don't matter for the purpose. It intentionally doesn't replicate device files and world-writable directories/files. It can't copy mtimes on symlinks because of limitations of Python's os module, but those don't matter either.
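To illustrate the basic idea, here is a minimal sketch. It is not the actual null-rsync code; the function names make_sparse_file and set_dir_mtimes are made up for this example, and it assumes a filesystem that supports sparse files. It creates a file of a given size without writing any data, sets its mtime, and applies directory mtimes only after everything inside the directories has been created:

```python
import os


def make_sparse_file(path, size, mtime):
    """Create a dummy file with the given apparent size and mtime.

    No data is written. On filesystems that support sparse files,
    truncate() only records the size, so the file occupies (almost)
    no disk space. (Hypothetical helper, not taken from null-rsync.)
    """
    with open(path, "wb") as f:
        f.truncate(size)
    # os.utime takes (atime, mtime); the same value is used for both.
    os.utime(path, (mtime, mtime))


def set_dir_mtimes(dir_mtimes):
    """Set directory mtimes from a {path: mtime} mapping.

    This has to be called after all files have been created, because
    creating or deleting an entry updates the mtime of the directory
    that contains it. Deepest paths are handled first, as a
    conservative ordering.
    """
    by_depth = sorted(dir_mtimes, key=lambda p: p.count(os.sep), reverse=True)
    for path in by_depth:
        mtime = dir_mtimes[path]
        os.utime(path, (mtime, mtime))
```

A caller would first run make_sparse_file() for every file reported by the remote side, collecting directory mtimes in a dictionary on the way, and call set_dir_mtimes() once at the very end.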
Code:
http://svn.mirrorbrain.org/viewvc/mirrorbrain/trunk/tools/null-rsync?view=markup

Here's an example. Let's start with an empty directory:

mirror_at_doozer:~> l /srv/mirrors/pseudo-ue/
total 0
drwxr-xr-x 2 mirror mirror 6 2009-11-28 01:53 ./
drwxr-xr-x 10 root root 126 2009-11-20 15:54 ../

Now, we call null-rsync:

mirror_at_doozer:~> null-rsync ultimateedition.unixheads.org::UltimateEdition/ /srv/mirrors/pseudo-ue
recv rwxr-xr-x .d..t...... 2009/11/19-23:59:44 4096 ./
recv rw-r--r-- >f+++++++++ 2009/11/19-23:56:33 272 .htaccess
recv rw-r--r-- >f+++++++++ 2007/12/09-00:34:32 4143315728 ultimate-edition-1.6-gamers.iso
recv rw-r--r-- >f+++++++++ 2008/08/19-15:26:47 1773684736 ultimate-edition-1.9-x64.iso
recv rw-r--r-- >f+++++++++ 2008/08/21-18:39:17 1809027072 ultimate-edition-1.9-x86.iso
recv rw-r--r-- >f+++++++++ 2008/11/13-08:35:59 4243374080 ultimate-edition-2.0-gamers.iso
recv rw-r--r-- >f+++++++++ 2008/11/07-18:42:12 1788930048 ultimate-edition-2.0-x64.iso
recv rw-r--r-- >f+++++++++ 2008/11/10-21:30:53 1757392896 ultimate-edition-2.0-x86.iso
recv rw-r--r-- >f+++++++++ 2009/02/25-06:38:19 1579673600 ultimate-edition-2.1-x64.iso
recv rw-r--r-- >f+++++++++ 2009/02/22-07:41:32 1697480704 ultimate-edition-2.1-x86.iso
recv rw-r--r-- >f+++++++++ 2009/06/16-09:25:54 2122715136 ultimate-edition-2.2-x64.iso
recv rw-r--r-- >f+++++++++ 2009/06/12-07:07:41 2132037632 ultimate-edition-2.2-x86.iso
recv rw-r--r-- >f+++++++++ 2009/09/18-09:31:14 4049158144 ultimate-edition-2.3-gamers-x86.iso
recv rw-r--r-- >f+++++++++ 2009/07/20-21:02:59 2378180608 ultimate-edition-2.3-x64.iso
recv rw-r--r-- >f+++++++++ 2009/07/22-08:42:37 2268733440 ultimate-edition-2.3-x86.iso
recv rw-r--r-- >f+++++++++ 2009/11/07-20:22:27 2562793472 ultimate-edition-2.4-x64.iso
recv rw-r--r-- >f+++++++++ 2009/11/07-20:25:23 2534909952 ultimate-edition-2.4-x86.iso
delayed setting of mtime on '/srv/mirrors/pseudo-ue'
rsync command for validation:
rsync --no-motd -rlpt --chmod=o-w ultimateedition.unixheads.org::UltimateEdition/ /srv/mirrors/pseudo-ue -i -n

That took only a few seconds. The "rsync command for validation" is a suggestion to ask rsync what it thinks about the file tree -- does it look correct?

mirror_at_doozer:~> rsync --no-motd -rlpt --chmod=o-w ultimateedition.unixheads.org::UltimateEdition/ /srv/mirrors/pseudo-ue -i -n
mirror_at_doozer:~>

rsync has nothing to add, as we see.

The rsync module we just "synced" is 36.84G in size. However, locally it takes only a few kilobytes:

mirror_at_doozer:~> du -sch /srv/mirrors/pseudo-ue/
68K /srv/mirrors/pseudo-ue/
68K total
mirror_at_doozer:~>

Thus, it gives a great tree for testing. The 400G openSUSE tree, which has a substantially larger number of files, takes less than half a gig that way.

But there is another potential use. A MirrorBrain instance could run from the tree, provided that it never delivers the dummy files, but always redirects to some mirror. That could effectively get rid of the limitation, which we had so far, that the file tree needs to exist locally.

I haven't tested this scenario yet, but I believe that it should be straightforward to configure MirrorBrain without any exceptions for redirection. Then, there must always be at least one mirror. Otherwise the normal fallback behaviour, which is to deliver a file directly when there is no mirror, would be deleterious. This must be prevented, and I think it is best done (other than having more than enough mirrors) by implementing a config directive which defines a URL of a mirror for that purpose. A little limitation is that no local checksumming can be done, but that doesn't matter that much in that use case. Alternatively, there could be a rule that makes sure that critical files are always redirected to a certain mirror.

Looking forward to experimenting more with this...
Peter

Received on Sat Nov 28 2009 - 02:22:48 GMT