Re: [mirrorbrain] thoughts about multi-instance setups

From: <dfarning_at_sugarlabs.org> Date: Wed, 9 Dec 2009 10:15:19 -0600 (CST)

I don't know if it helps, but have you seen the work being done with cacheboy[1]? They are taking a rather different approach to the problem.

The question that comes to mind when thinking about a federated mirror system (great idea) is how to automatically control _what_ gets mirrored. Just like with other caches, there appears to be significant temporal and spatial locality in our file downloads. On any given day it looks like 90% of Sugar Labs downloads come from 10% of our content. A way for mirrors to determine what to cache seems very useful.
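
As a rough sketch of what "determine what to cache" could mean, the snippet below picks the smallest set of most-requested paths that covers a given share of all requests. The function name hot_set(), the 0.9 default and the request log are made up for illustration; a real mirror would feed it paths parsed from its own access logs.

```python
from collections import Counter


def hot_set(requests, coverage=0.9):
    """Return the most-requested paths that together cover `coverage` of all requests."""
    counts = Counter(requests)
    total = sum(counts.values())
    chosen, covered = [], 0
    # Walk paths from most to least requested until the target share is reached.
    for path, hits in counts.most_common():
        if covered >= coverage * total:
            break
        chosen.append(path)
        covered += hits
    return chosen


# Made-up request log: two popular files and a long tail of single requests.
log = (["/a.iso"] * 85 + ["/b.iso"] * 10
       + ["/c.xo", "/d.xo", "/e.xo", "/f.xo", "/g.xo"])
print(hot_set(log))
```

With this log, two of the seven distinct files already account for more than 90% of the requests, so a mirror with limited space could carry just those.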

david 

[1] http://cacheboy.net/

On Tue, Dec 8, 2009 at 10:53 PM, Peter Poeml <poeml_at_cmdline.net> wrote:
> Hi,
>
> on a single host, it's possible to run multiple MirrorBrain instances,
> to take care of redirections for several hosted (or mirrored) projects.
> Each MB instance has its own database with mirrors and files, as if it
> were running on a separate host, even though they are just separate
> virtual hosts on one box.
>
> When such setups employ a significant number of mirrors, it quickly
> becomes apparent that many mirrors participate in several projects, so
> these "usual suspects" are present with their metadata in each MB
> instance. In each database, mirrors are stored with admin contacts and
> similar metadata; for each database, mirrors are probed, and their
> status maintained in the database.
>
> The same duplication happens across different machines, of course - when
> a big mirror's contact address changes, I have to change it in the
> database of the openSUSE MB setup, in the OOo MB setup and possibly in
> others. It would be interesting to think about some kind of shared
> database for that purpose (or some kind of synchronization, or at least
> notifications); however, especially when the duplication happens on a
> single box, it would be quite attractive to get rid of it.
>
> I had some ideas about that in the past, and the idea of this post is to
> jot them down for the archive.
>
>
> 1)
>
> Earlier this year, I experimented with a new database schema, designed
> from the ground up, which covers that scenario. It also adds a facility
> of tracking events (similar to a log) about the mirrors. MirrorBrain
> "3.0", so to speak. For some reason, I never committed the code anywhere
> (because I fooled around with new technologies, and didn't really
> accomplish much). Sorry. That was the setup that I had in mind in my
> FOSDEM talk this year, as sketched here:
> http://mirrorbrain.org/static/images/misc/mb_new_schema-fosdem09.jpg
> (http://www.poeml.de/~poeml/talks/free_software_CDN_vision.pdf , page 44)
>
> On the plus side, that would be a data model that most developers would
> be familiar with; very easy to implement, because commonplace.
>
> However, it also adds a whole lot of complexity that we don't have
> today. A good web frontend could probably compensate for it, but
> still...
>
> A file with Django models is attached to this mail.
>
> However, when I designed that, I had just improved the existing database
> (by migration from MySQL to PostgreSQL and a total redesign) to be
> several times smaller and faster. I had just gotten rid of all relations
> -- which contributed significantly to space usage:
> http://mirrorbrain.org/news/27-release-smaller-and-faster-database/
>
> And learning from that, I don't think that's something that one wants to
> give up easily.
>
> A fully relational data model would definitely be a step back,
> performance-wise; and that would be bothersome because if we want to
> combine many instances into one, we might need more performance, not
> less.
>
>
>
> 2)
>
> A few days ago, when waking up in the morning, I thought about it anew.
> (And it's not thought through very far, but I thought I should write it
> down before it's lost.)
>
> Basically, a single file table is kept, like before, but across all
> mirrored projects. One table is added that maps mirrors to the projects
> that they mirror, and to the URLs to use.
> Elegantly, few changes would be needed in mod_mirrorbrain.
>
> Even better would be if it could just be a different mode of operation
> -- most people will only want to run one instance really, and they won't
> want to care about multi-instance setups. Increased complexity could
> become a serious hurdle for new users and so on; I think that's an
> important concern.
>
> So, the following could be done:
>
> - change the scanner to add the project name as a prefix to all files
>  that it stores in the file table; i.e., let it store the path
>  $PROJECT/path/to/file instead of /path/to/file.
>
> - depending on the virtualhost (or directory) it runs in,
>  mod_mirrorbrain will add the respective prefix in front of the
>  filename when querying the database.
>
> - in the existing mirror database records, remove the base URLs, but
>  keep all the rest.
>
> - create a new table, where each URL for a project on a mirror is
>  registered:
>
>
>  mirrorid  prjname    http_base   ftp_base     rsync_base
>
>      1     foo      http://....  ftp://....   rsync://....
>      1     Bar      http://....  ftp://....   rsync://....
>
>      2     Bar      http://....  ftp://....   rsync://....
>
>      3     foo      http://....  ftp://....   rsync://....
>      3     OpenBaz  http://....  ftp://....   rsync://....
>      3     Buzzz    http://....  ftp://....   rsync://....
>
>      4     foo      http://....  ftp://....   rsync://....
>      4     Bar      http://....  ftp://....   rsync://....
>
>
> - there are some fields stored per mirror which should likely be per
>  project then, and also be moved into the new table, like "prio" and
>  "enabled". In fact, they could exist mirror-wide and project-wide.
>
> - the scanner uses that table to map the URLs
>
> - the mirror probe only needs to probe one of the URLs; or it could
>  probe round-robin, or use KeepAlive
>
> - I'm not sure right now whether some mapping from path names to the
>  local filesystem would be needed.
>
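
To make the list above concrete, here is a minimal sketch of a single file table plus the per-project URL table. Only the columns mirrorid, prjname, http_base, ftp_base, rsync_base, prio and enabled come from the list; everything else is an assumption for illustration: SQLite as a stand-in database, the layout of the mirror and file tables (one row per path and mirror, which is a simplification and not MirrorBrain's actual layout), the helper names prefixed() and urls_for(), the example hosts, and the ordering by prio.

```python
import sqlite3

# Hypothetical schema; only the mirror_project columns are taken from the mail.
SCHEMA = """
CREATE TABLE mirror (id INTEGER PRIMARY KEY, identifier TEXT, admin_email TEXT);
CREATE TABLE mirror_project (
    mirrorid   INTEGER REFERENCES mirror(id),
    prjname    TEXT,
    http_base  TEXT,
    ftp_base   TEXT,
    rsync_base TEXT,
    prio       INTEGER DEFAULT 100,
    enabled    INTEGER DEFAULT 1,
    PRIMARY KEY (mirrorid, prjname)
);
CREATE TABLE file (path TEXT, mirrorid INTEGER REFERENCES mirror(id));
"""


def prefixed(project, path):
    """Build the key stored in the file table: $PROJECT/path/to/file."""
    return project + "/" + path.lstrip("/")


def urls_for(db, project, path):
    """Return HTTP URLs of enabled mirrors that carry `path` for `project`."""
    rows = db.execute(
        """
        SELECT mp.http_base
        FROM file f
        JOIN mirror_project mp
          ON mp.mirrorid = f.mirrorid AND mp.prjname = ?
        WHERE f.path = ? AND mp.enabled = 1
        ORDER BY mp.prio DESC, mp.mirrorid
        """,
        (project, prefixed(project, path)),
    ).fetchall()
    return [base.rstrip("/") + "/" + path.lstrip("/") for (base,) in rows]


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
# Mirror metadata is stored once, no matter how many projects a mirror carries.
db.executemany("INSERT INTO mirror VALUES (?, ?, ?)",
               [(1, "mirror-one", "admin@one.example"),
                (2, "mirror-two", "admin@two.example")])
db.executemany(
    "INSERT INTO mirror_project (mirrorid, prjname, http_base) VALUES (?, ?, ?)",
    [(1, "foo", "http://one.example/pub/foo/"),
     (1, "Bar", "http://one.example/pub/bar/"),
     (2, "Bar", "http://two.example/bar/")])
# What the scanner would store: the project-prefixed path for each mirror.
db.executemany("INSERT INTO file VALUES (?, ?)",
               [(prefixed("foo", "/iso/foo-1.0.iso"), 1),
                (prefixed("Bar", "/bar-2.1.tar.gz"), 1),
                (prefixed("Bar", "/bar-2.1.tar.gz"), 2)])

print(urls_for(db, "Bar", "/bar-2.1.tar.gz"))
```

The lookup takes the project name from the caller, the way mod_mirrorbrain would derive it from the virtual host or directory, and a contact address change for a mirror touches exactly one row in the mirror table.
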
>
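
For the round-robin variant of the mirror probe mentioned in the list, a small sketch of the URL selection could look like this. The mirror ids, the base URLs and the function name next_probe_targets() are hypothetical, and the snippet only chooses which URL to probe; it makes no network requests.

```python
from itertools import cycle

# Hypothetical per-mirror base URLs, one per project the mirror carries.
bases = {
    1: ["http://one.example/pub/foo/", "http://one.example/pub/bar/"],
    2: ["http://two.example/bar/"],
}

# One endless iterator per mirror; each probe run takes the next URL in turn.
rotation = {mirrorid: cycle(urls) for mirrorid, urls in bases.items()}


def next_probe_targets():
    """Pick exactly one URL per mirror for the current probe run."""
    return {mirrorid: next(urls) for mirrorid, urls in rotation.items()}


first = next_probe_targets()
second = next_probe_targets()
third = next_probe_targets()
print(first[1], second[1], third[1])
```

Each mirror is contacted once per run, and over several runs every project URL on that mirror gets checked.
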
> Advantages:
>
> + one database, not many (administrative overhead)
>
> + The file table stays compact; no joins; few code changes needed
>  (that's good, because when a number of projects accumulate, it could
>  really become large, and then the _real_ performance challenge begins)
>
> + the mirrors table and project/url table are tiny, and can be joined
>  without significant overhead
>
> + only one table with ASN data needed; not one per database, as is
>  the case now. Saves about 50MB per instance.
>
> + In fact, one could stick the whole setup into a single virtual host.
>
>
> As indicated above, the idea might need more thinking, but maybe it has
> a future.
>
> Peter
>
