[mirrorbrain] null-rsync (pseudo file trees)

From: Peter Poeml <poeml_at_cmdline.net>
Date: Sat, 28 Nov 2009 03:22:44 +0100
Hi,

lately, I have been experimenting with ways to create dummy file trees
locally. The idea is that once a file tree exists, Apache can not only
serve files, but also act as a well-behaved HTTP server and handle
"if-modified-since" requests, can generate directory indexes and so on.
Now, everything but the actual content of the files is pretty
substantial, so how could we have the files without their content?

It would be possible to put that kind of metadata in a database, but
that would be quite some work to implement and get right.

However, I experimented (rather successfully) with an addition to the
mirror scanner, which creates all files that it sees remotely as local
files filled with zeros. Not only does this save the bandwidth, it's
also possible to create the files as sparse files, which means
that only the metadata will occupy actual disk space. This seemed to
work pretty nicely, provided that the upstream mirror offers rsync
(because then all metadata is available and there are no timestamp
issues to deal with). The advantage was that this seems to be
maintainable in a very automatic way, but there were things left to be
desired. In particular, the order in which files are processed is
important, e.g. for setting mtimes on directories, which must happen
after changes inside the directories.
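
As an illustration of the idea (not the scanner's actual code), here is a
minimal Python sketch of creating a sparse placeholder file with a given
size and mtime, and of setting the directory mtime last. The function names
and the sample values are made up for the example, and whether the file
really ends up sparse depends on the filesystem.

```python
import os
import tempfile


def make_sparse_file(path, size, mtime):
    """Create a file with the given apparent size without writing data."""
    with open(path, "wb") as f:
        # Extending a file with truncate() leaves a hole on filesystems
        # that support sparse files; reading it back yields zeros.
        f.truncate(size)
    os.utime(path, (mtime, mtime))  # (atime, mtime)


def populate(directory, entries, dir_mtime):
    """entries is a list of (name, size, mtime) tuples (hypothetical input)."""
    os.makedirs(directory, exist_ok=True)
    for name, size, mtime in entries:
        make_sparse_file(os.path.join(directory, name), size, mtime)
    # Creating files changes the directory's own mtime, so the directory
    # mtime has to be set after everything inside it is done.
    os.utime(directory, (dir_mtime, dir_mtime))


if __name__ == "__main__":
    tree = os.path.join(tempfile.mkdtemp(), "pseudo")
    populate(tree, [("dummy.iso", 64 * 1024 * 1024, 1219159607)], 1258675184)
    st = os.stat(os.path.join(tree, "dummy.iso"))
    print(st.st_size, int(st.st_mtime), int(os.stat(tree).st_mtime))
```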

The null-rsync script improves on this, and I think it is ready for
prime time. It is a standalone script instead of being integrated
into the mirror scanner. It uses rsync behind the scenes, and makes use
of rsync's "itemize" output in a customized format. It replicates sizes,
mtimes, directory mtimes, and symlinks. It doesn't recreate hardlinks
because they don't matter for the purpose, and it intentionally doesn't
replicate device files and world-writable directories/files. It can't
copy mtimes on symlinks because of limitations of Python's os module,
but those don't matter either.

Code:
http://svn.mirrorbrain.org/viewvc/mirrorbrain/trunk/tools/null-rsync?view=markup


Here's an example. Let's start with an empty directory:

mirror_at_doozer:~> l /srv/mirrors/pseudo-ue/
total 0
drwxr-xr-x  2 mirror mirror   6 2009-11-28 01:53 ./
drwxr-xr-x 10 root   root   126 2009-11-20 15:54 ../


Now, we call null-rsync:

mirror_at_doozer:~> null-rsync ultimateedition.unixheads.org::UltimateEdition/ /srv/mirrors/pseudo-ue
recv rwxr-xr-x .d..t...... 2009/11/19-23:59:44 4096 ./
recv rw-r--r-- >f+++++++++ 2009/11/19-23:56:33 272 .htaccess
recv rw-r--r-- >f+++++++++ 2007/12/09-00:34:32 4143315728 ultimate-edition-1.6-gamers.iso
recv rw-r--r-- >f+++++++++ 2008/08/19-15:26:47 1773684736 ultimate-edition-1.9-x64.iso
recv rw-r--r-- >f+++++++++ 2008/08/21-18:39:17 1809027072 ultimate-edition-1.9-x86.iso
recv rw-r--r-- >f+++++++++ 2008/11/13-08:35:59 4243374080 ultimate-edition-2.0-gamers.iso
recv rw-r--r-- >f+++++++++ 2008/11/07-18:42:12 1788930048 ultimate-edition-2.0-x64.iso
recv rw-r--r-- >f+++++++++ 2008/11/10-21:30:53 1757392896 ultimate-edition-2.0-x86.iso
recv rw-r--r-- >f+++++++++ 2009/02/25-06:38:19 1579673600 ultimate-edition-2.1-x64.iso
recv rw-r--r-- >f+++++++++ 2009/02/22-07:41:32 1697480704 ultimate-edition-2.1-x86.iso
recv rw-r--r-- >f+++++++++ 2009/06/16-09:25:54 2122715136 ultimate-edition-2.2-x64.iso
recv rw-r--r-- >f+++++++++ 2009/06/12-07:07:41 2132037632 ultimate-edition-2.2-x86.iso
recv rw-r--r-- >f+++++++++ 2009/09/18-09:31:14 4049158144 ultimate-edition-2.3-gamers-x86.iso
recv rw-r--r-- >f+++++++++ 2009/07/20-21:02:59 2378180608 ultimate-edition-2.3-x64.iso
recv rw-r--r-- >f+++++++++ 2009/07/22-08:42:37 2268733440 ultimate-edition-2.3-x86.iso
recv rw-r--r-- >f+++++++++ 2009/11/07-20:22:27 2562793472 ultimate-edition-2.4-x64.iso
recv rw-r--r-- >f+++++++++ 2009/11/07-20:25:23 2534909952 ultimate-edition-2.4-x86.iso
delayed setting of mtime on '/srv/mirrors/pseudo-ue'
rsync command for validation:
rsync --no-motd -rlpt --chmod=o-w ultimateedition.unixheads.org::UltimateEdition/ /srv/mirrors/pseudo-ue -i -n

That took only a few seconds.
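
For readers who want to process such output themselves, here is a small
Python sketch that splits lines of the shape shown above into fields. It is
only an illustration based on the sample output, not the parser used in
null-rsync; the Item class and parse_line function are names invented for
this example. The second character of the itemize string is rsync's file
type flag (f, d, L, D or S).

```python
from datetime import datetime
from typing import NamedTuple

# File type letters from the second position of rsync's itemize string.
KINDS = {"f": "file", "d": "directory", "L": "symlink",
         "D": "device", "S": "special"}


class Item(NamedTuple):
    op: str          # e.g. "recv"
    perms: str       # e.g. "rw-r--r--"
    flags: str       # e.g. ">f+++++++++"
    mtime: datetime  # naive datetime, exactly as printed
    size: int
    name: str

    @property
    def kind(self):
        return KINDS.get(self.flags[1], "unknown")


def parse_line(line):
    # Split at most 5 times so that spaces in the file name survive.
    op, perms, flags, stamp, size, name = line.rstrip("\n").split(" ", 5)
    mtime = datetime.strptime(stamp, "%Y/%m/%d-%H:%M:%S")
    return Item(op, perms, flags, mtime, int(size), name)


if __name__ == "__main__":
    sample = "recv rw-r--r-- >f+++++++++ 2009/11/19-23:56:33 272 .htaccess"
    item = parse_line(sample)
    print(item.kind, item.size, item.mtime.isoformat(), item.name)
```

Lines that don't follow this shape (such as the "delayed setting of mtime"
message) would need to be filtered out before calling parse_line.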

The "rsync command for validation" is a suggestion to ask rsync what it
thinks about the file tree -- does it look correct? 

mirror_at_doozer:~> rsync --no-motd -rlpt --chmod=o-w ultimateedition.unixheads.org::UltimateEdition/ /srv/mirrors/pseudo-ue -i -n
mirror_at_doozer:~> 


rsync has nothing to add, as we see.

The rsync module we just "synced" is 36.84G in size. However, locally it
takes only a few kilobytes:

mirror_at_doozer:~> du -sch /srv/mirrors/pseudo-ue/
68K	/srv/mirrors/pseudo-ue/
68K	total
mirror_at_doozer:~> 

Thus, it gives a great tree for testing.

The 400G openSUSE tree, which has a substantially higher number of
files, takes less than half a gig that way.
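
The difference between apparent size and space actually used can also be
checked from Python. The following is a rough sketch, assuming a POSIX
system where st_blocks is reported in 512-byte units; tree_sizes is a
name made up for this example. It counts regular directory entries only,
so the numbers will not match du exactly (du also counts the directories
themselves).

```python
import os
import tempfile


def tree_sizes(root):
    """Return (apparent_bytes, allocated_bytes) for all files below root."""
    apparent = allocated = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            apparent += st.st_size           # what "ls -l" shows
            allocated += st.st_blocks * 512  # what is really on disk
    return apparent, allocated


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # One 10 MiB placeholder file, created as a hole.
        with open(os.path.join(root, "dummy.iso"), "wb") as f:
            f.truncate(10 * 1024 * 1024)
        print(tree_sizes(root))
```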


But there is another potential use. A MirrorBrain instance could run
from the tree, provided that it never delivers the dummy files, but
always redirects to some mirror. That could effectively get rid of the
limitation, which we had so far, that the file tree needs to exist
locally.

I haven't tested this scenario yet, but I believe that it should be
straightforward to configure MirrorBrain without any exceptions for
redirection. Then, there must always be at least one mirror. Otherwise
the normal fallback behaviour would be deleterious, which is to deliver a
file directly when there is no mirror. This must be prevented, and I
think it is best done (other than having more than enough mirrors) by
implementing a config directive for this which defines a URL of a
mirror for that purpose.
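
To make the intended behaviour concrete, here is a tiny Python sketch of
the decision only. It is purely hypothetical: MirrorBrain has no such
directive at this point, and choose_redirect, the mirror list and the
example.org URLs are invented for illustration. The point is that the
dummy file is never served locally; if no mirror is known, the configured
fallback mirror is used instead.

```python
def choose_redirect(path, mirrors, fallback_base):
    """Return the URL to redirect to; never fall back to local delivery.

    mirrors: base URLs of mirrors known to have the file (may be empty).
    fallback_base: base URL of the mirror configured for this purpose.
    """
    base = mirrors[0] if mirrors else fallback_base
    return base.rstrip("/") + "/" + path.lstrip("/")


if __name__ == "__main__":
    fallback = "http://fallback.example.org/pub/"
    print(choose_redirect("/iso/a.iso", ["http://m1.example.org/pub"], fallback))
    print(choose_redirect("/iso/a.iso", [], fallback))
```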

A minor limitation is that no local checksumming can be done, but that
doesn't matter that much in that use case. Alternatively, there could be
a rule that makes sure that critical files are always redirected to a
certain mirror.

Looking forward to experimenting more with this...
Peter


Received on Sat Nov 28 2009 - 02:22:48 GMT
