Re: [mirrorbrain] thoughts about multi-instance setups

From: <>
Date: Wed, 9 Dec 2009 10:15:19 -0600 (CST)
I don't know if it helps, but have you seen the work being done with cacheboy[1]? They are taking a rather different approach to the problem.

The question that comes to mind when thinking about a federated mirror system (great idea) is the ability to automatically control _what_ gets mirrored. Just like other caches, there appears to be significant temporal and spatial locality in our file downloads. On any given day, it looks like 90% of Sugar Labs downloads come from 10% of our content. A way for mirrors to determine what to cache seems very useful.
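That 90/10 ratio is easy to measure from a download log. A minimal sketch, with an invented helper name and made-up example data (not Sugar Labs' actual logs):

```python
from collections import Counter

def locality_share(paths, top_fraction=0.10):
    """Fraction of all downloads served by the most popular
    `top_fraction` of distinct files."""
    counts = Counter(paths)
    n_top = max(1, int(len(counts) * top_fraction))
    top = counts.most_common(n_top)
    return sum(n for _, n in top) / len(paths)

# Hypothetical example: one "hot" file dominating 100 requests
downloads = ["/soas/soas-2.iso"] * 90 + [f"/other/file{i}" for i in range(10)]
print(locality_share(downloads))  # -> 0.9
```

A mirror could run something like this over its access log to decide which subset of the content is worth caching.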


On Tue, Dec 8, 2009 at 10:53 PM, Peter Poeml <> wrote:
> Hi,
> on a single host, it's possible to run multiple MirrorBrain instances,
> to take care of redirections for several hosted (or mirrored) projects.
> Each MB instance has its database with mirrors and files, as if it would
> be running on a separate host, even though they are just separate
> virtual hosts on one box.
> When such setups employ a significant number of mirrors, it becomes
> quickly apparent that many mirrors participate in several projects, so
> these "usual suspects" are present with their metadata in each MB
> instance. In each database, mirrors are stored with admin contacts and
> similar metadata; for each database, mirrors are probed, and their
> status maintained in the database.
> The same duplication happens across different machines, of course - when
> a big mirror's contact address changes, I have to change it in the
> database of the openSUSE MB setup, in the OOo MB setup and possibly in
> others. It would be interesting to think about some kind of shared
> database for that purpose (or some kind of synchronization, or at least
> notifications); however, especially when the duplication happens on a
> single box, it would be quite attractive to get rid of it.
> I had some ideas about that in the past, and the idea of this post is to
> jot them down for the archive.
> 1)
> Earlier this year, I experimented with a new database scheme, designed
> from the ground up, which covers that scenario. It also adds a facility
> of tracking events (similar to a log) about the mirrors. MirrorBrain
> "3.0", so to speak. For some reason, I never committed the code anywhere
> (because I fooled around with new technologies, and didn't really
> accomplish much). Sorry. That was the setup that I had in mind in my
> FOSDEM talk this year, as sketched here:

> ( , page 44)
> On the plus side, that would be a data model that most developers would
> be familiar with; very easy to implement, because commonplace.
> However, it also adds a whole lot of complexity that we don't have
> today. A good web frontend could probably compensate for it, but
> still...
> A file with Django models is attached to this mail.
> However, when I designed that, I had just improved the existing database
> (by migration from MySQL to PostgreSQL and a total redesign) to be
> several times smaller and faster. I had just gotten rid of all relations
> -- which contributed significantly to space usage:

> And learning from that, I don't think that's something that one wants to
> give up easily.
> A fully relational data model would definitely be a step back,
> performance-wise; and that would be bothersome because if we want to
> combine many instances into one, we might need more performance, not
> less.
> 2)
> A few days ago, when waking up in the morning, I thought about it anew.
> (And it's not thought through very far, but I thought I should write it
> down before it's lost.)
> Basically, a single file table is kept, like before, but across all
> mirrored projects. One table is added that maps mirrors to the projects
> that they mirror, and for the URLs to use.
> Elegantly, few changes would be needed in mod_mirrorbrain.
> Even better would be if it could just be a different mode of operation
> -- most people will only want to run one instance really, and they won't
> want to care about multi-instance setups. Increased complexity could
> become a serious hurdle for new users and so on; I think that's in
> important concern.
> So, the following could be done:
> - change the scanner to add the project name as a prefix to all files
>  that it stores in the file table; i.e., let it store the path
>  $PROJECT/path/to/file instead of /path/to/file.
> - depending on the virtualhost (or directory) it runs in,
>  mod_mirrorbrain will add the respective prefix in front of the
>  filename when querying the database.
> - in the existing mirror database records, remove the base URLs, but
>  keep all the rest.
> - create a new table, where each URL for a project on a mirror is
>  registered:
>  mirrorid  prjname    http_base   ftp_base     rsync_base
>      1     foo      http://....  ftp://....   rsync://....
>      1     Bar      http://....  ftp://....   rsync://....
>      2     Bar      http://....  ftp://....   rsync://....
>      3     foo      http://....  ftp://....   rsync://....
>      3     OpenBaz  http://....  ftp://....   rsync://....
>      3     Buzzz    http://....  ftp://....   rsync://....
>      4     foo      http://....  ftp://....   rsync://....
>      4     Bar      http://....  ftp://....   rsync://....
> - there are some fields stored per mirror which should likely be per
>  project then, and also be moved into the new table, like "prio" and
>  "enabled". In fact, they could exist mirror-wide and project-wide.
> - the scanner uses that table to map the URLs
> - the mirror probe only needs to probe one of the URLs; or it could
>  probe round-robin, or use KeepAlive
> - I'm not sure right now whether some mapping from path names to the
>  local filesystem would be needed.
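The steps above could be sketched roughly like this, using SQLite for illustration; all table and column names here are assumptions, not the actual MirrorBrain schema:

```python
import sqlite3

# Proposed single-database layout: one mirrors table, one file table
# across all projects, and a new per-(mirror, project) URL table.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE mirror (
    id         INTEGER PRIMARY KEY,
    identifier TEXT NOT NULL,           -- admin contact etc. kept here
    enabled    INTEGER NOT NULL DEFAULT 1  -- mirror-wide switch
);
CREATE TABLE mirror_project (
    mirror_id  INTEGER REFERENCES mirror(id),
    project    TEXT NOT NULL,
    http_base  TEXT,
    ftp_base   TEXT,
    rsync_base TEXT,
    prio       INTEGER DEFAULT 100,     -- per-project priority
    enabled    INTEGER DEFAULT 1        -- per-project switch
);
CREATE TABLE file (path TEXT PRIMARY KEY);
""")

db.execute("INSERT INTO mirror VALUES (1, 'mirror.example.org', 1)")
db.execute("INSERT INTO mirror_project VALUES (1, 'foo', "
           "'http://mirror.example.org/foo', NULL, NULL, 100, 1)")
# the scanner stores $PROJECT/path/to/file instead of /path/to/file
db.execute("INSERT INTO file VALUES ('foo/path/to/file')")

# mod_mirrorbrain, running in project foo's vhost, would prefix the
# requested path and join the two small tables:
row = db.execute("""
    SELECT m.identifier, mp.http_base
    FROM file f, mirror_project mp
    JOIN mirror m ON m.id = mp.mirror_id
    WHERE f.path = 'foo' || '/path/to/file'
      AND mp.project = 'foo' AND mp.enabled = 1 AND m.enabled = 1
""").fetchone()
print(row)  # -> ('mirror.example.org', 'http://mirror.example.org/foo')
```

The file table stays a single flat table with no joins on the hot path; only the two tiny tables (mirror and mirror_project) get joined, which matches the performance argument made above.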
> Advantages:
> + one database, not many (administrative overhead)
> + The file table stays compact; no joins; few code changes needed
>  (that's good, because when a number of projects accumulate, it could
>  really become large, and then the _real_ performance challenge begins)
> + the mirrors table and project/url table are tiny, and can be joined
>  without significant overhead
> + only one table with ASN data needed; not one per database, as it is
>  the case now. Saves about 50MB per instance.
> + In fact, one could stick the whole setup into a single virtual host.
> As indicated above, the idea might need more thinking, but maybe it has
> a future.
> Peter

mirrorbrain mailing list

Received on Wed Dec 09 2009 - 16:14:49 GMT

This archive was generated by hypermail 2.3.0 : Thu Mar 25 2010 - 19:30:56 GMT