[mirrorbrain] thoughts about multi-instance setups

From: Peter Poeml <poeml_at_cmdline.net>
Date: Wed, 9 Dec 2009 05:53:57 +0100

On a single host, it's possible to run multiple MirrorBrain instances,
to take care of redirections for several hosted (or mirrored) projects.
Each MB instance has its own database with mirrors and files, as if it
were running on a separate host, even though the instances are just
separate virtual hosts on one box.

When such setups employ a significant number of mirrors, it quickly
becomes apparent that many mirrors participate in several projects, so
these "usual suspects" are present with their metadata in each MB
instance. In each database, mirrors are stored with admin contacts and
similar metadata; for each database, mirrors are probed, and their
status maintained in the database.

The same duplication happens across different machines, of course - when
a big mirror's contact address changes, I have to change it in the
database of the openSUSE MB setup, in the OOo MB setup and possibly in
others. It would be interesting to think about some kind of shared
database for that purpose (or some kind of synchronization, or at least
notifications); however, especially when the duplication happens on a
single box, it would be quite attractive to get rid of it. 

I had some ideas about that in the past, and the idea of this post is to
jot them down for the archive.


Earlier this year, I experimented with a new database scheme, designed
from the ground up, which covers that scenario. It also adds a facility
of tracking events (similar to a log) about the mirrors. MirrorBrain
"3.0", so to speak. For some reason, I never committed the code anywhere
(because I fooled around with new technologies, and didn't really
accomplish much). Sorry. That was the setup that I had in mind in my
FOSDEM talk this year, as sketched here:
(http://www.poeml.de/~poeml/talks/free_software_CDN_vision.pdf , page 44)

On the plus side, that would be a data model that most developers would
be familiar with; very easy to implement, because commonplace. 

However, it also adds a whole lot of complexity that we don't have
today. A good web frontend could probably compensate for it, but
building one is extra effort in itself.

A file with Django models is attached to this mail.

However, when I designed that, I had just improved the existing database
(by migrating from MySQL to PostgreSQL, and a total redesign) to be
several times smaller and faster. In particular, I had just gotten rid
of all relations, which had contributed significantly to space usage.
Learning from that, I don't think that's something one wants to give up
easily.

A fully relational data model would definitely be a step back,
performance-wise; and that would be bothersome, because if we want to
combine many instances into one, we might need more performance, not
less.

A few days ago, when waking up in the morning, I thought about it anew.
(It's not thought through very far, but I thought I should write it
down before it's lost.)

Basically, a single file table is kept, like before, but shared across
all mirrored projects. One table is added that maps mirrors to the
projects that they mirror, and to the URLs to use.
Elegantly, few changes would be needed in mod_mirrorbrain.
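To make the idea concrete, here is a toy sketch in Python (all table
names, mirrors, and paths are made up for illustration; the real lookup
would of course happen in mod_mirrorbrain's SQL query, not in Python):

```python
# Toy model of the proposed single-database layout (hypothetical data).

# One file table across all projects; paths are stored with the
# project name as prefix, e.g. "foo/path/to/file".
file_table = {
    "foo/iso/foo-1.0.iso",
    "Bar/releases/bar-2.3.tar.gz",
}

# New table mapping (mirror id, project) -> base URLs for that project.
mirror_urls = {
    (1, "foo"): {"http": "http://mirror1.example.org/pub/foo"},
    (1, "Bar"): {"http": "http://mirror1.example.org/pub/bar"},
    (2, "Bar"): {"http": "http://mirror2.example.net/bar"},
}

def redirect(project, path, mirror_id, scheme="http"):
    """Return the redirect target, or None if the mirror lacks the file/project."""
    if project + "/" + path not in file_table:
        return None  # file unknown in this project
    bases = mirror_urls.get((mirror_id, project))
    if bases is None:
        return None  # mirror does not carry this project
    return bases[scheme] + "/" + path

print(redirect("Bar", "releases/bar-2.3.tar.gz", 2))
# -> http://mirror2.example.net/bar/releases/bar-2.3.tar.gz
```

The point being: the big file table needs no join at all, only a prefix
on the path; the tiny (mirror, project, URLs) table is the only new piece.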

Even better would be if it could just be a different mode of operation
-- most people will only want to run one instance, really, and they
won't want to care about multi-instance setups. Increased complexity
could become a serious hurdle for new users and so on; I think that's an
important concern.

So, the following could be done:

- change the scanner to add the project name as prefix to all files
  that it stores in the file table; i.e., let it store the path
  $PROJECT/path/to/file instead of /path/to/file.

- depending on the virtualhost (or directory) it runs in,
  mod_mirrorbrain will add the respective prefix in front of the
  filename when querying the database.

- in the existing mirror database records, remove the base URLs, but
  keep all the rest.

- create a new table, where each URL for a project on a mirror is
  stored:

  mirrorid  prjname    http_base   ftp_base     rsync_base
      1     foo      http://....  ftp://....   rsync://....
      1     Bar      http://....  ftp://....   rsync://....

      2     Bar      http://....  ftp://....   rsync://....

      3     foo      http://....  ftp://....   rsync://....
      3     OpenBaz  http://....  ftp://....   rsync://....
      3     Buzzz    http://....  ftp://....   rsync://....

      4     foo      http://....  ftp://....   rsync://....
      4     Bar      http://....  ftp://....   rsync://....

- there are some fields stored per mirror which should likely be per
  project then, and also be moved into the new table, like "prio" and
  "enabled". In fact, they could exist mirror-wide and project-wide.

- the scanner uses that table to map the URLs

- the mirror probe only needs to probe one of the URLs; or it could
  probe round-robin, or use KeepAlive

- I'm not sure right now whether some mapping from path names to the
  local filesystem would be needed.
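The prefixing and probing steps above could be sketched like this (a
hypothetical helper; function names and URLs are invented for
illustration):

```python
import itertools

def prefixed(project, path):
    # The scanner stores "$PROJECT/path/to/file" instead of "/path/to/file".
    return project + "/" + path.lstrip("/")

def probe_targets(base_urls):
    # The mirror probe only needs one URL per mirror per probe; cycling
    # through the mirror's project base URLs spreads probes round-robin.
    return itertools.cycle(base_urls)

print(prefixed("foo", "/path/to/file"))   # -> foo/path/to/file

targets = probe_targets(["http://m1.example.org/foo",
                         "http://m1.example.org/bar"])
print(next(targets))  # first probe hits the foo URL
print(next(targets))  # next probe hits the bar URL
```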


+ one database, not many (less administrative overhead)

+ The file table stays compact; no joins; few code changes needed
  (that's good, because when a number of projects accumulate, it could
  really become large, and then the _real_ performance challenge begins)

+ the mirrors table and project/url table are tiny, and can be joined
  without significant overhead

+ only one table with ASN data needed; not one per database, as is the
  case now. Saves about 50MB per instance.

+ In fact, one could stick the whole setup into a single virtual host.

As indicated above, the idea might need more thinking, but maybe it has
a future.


mirrorbrain mailing list
Archive: http://mirrorbrain.org/archive/mirrorbrain/

Note: To remove yourself from this mailing list, send a mail with the content
to the address mirrorbrain-request_at_mirrorbrain.org

Received on Wed Dec 09 2009 - 04:54:00 GMT

This archive was generated by hypermail 2.3.0 : Thu Mar 25 2010 - 19:30:55 GMT