I don't know if it helps, but have you seen the work being done with
cacheboy [1]? They are taking a rather different approach to the problem.

The question that comes to mind when thinking about a federated mirror
system (great idea) is the ability to automatically control _what_ gets
mirrored. Just like other caches, there appears to be significant
temporal and spatial locality in our file downloads. On any given day it
looks like 90% of Sugar Labs downloads come from 10% of our content. A
way for mirrors to determine what to cache seems very useful.

david

[1] http://cacheboy.net/

On Tue, Dec 8, 2009 at 10:53 PM, Peter Poeml <poeml_at_cmdline.net> wrote:
> Hi,
>
> on a single host, it's possible to run multiple MirrorBrain instances,
> to take care of redirections for several hosted (or mirrored) projects.
> Each MB instance has its own database with mirrors and files, as if it
> were running on a separate host, even though they are just separate
> virtual hosts on one box.
>
> When such setups employ a significant number of mirrors, it quickly
> becomes apparent that many mirrors participate in several projects, so
> these "usual suspects" are present with their metadata in each MB
> instance. In each database, mirrors are stored with admin contacts and
> similar metadata; for each database, mirrors are probed, and their
> status maintained in the database.
>
> The same duplication happens across different machines, of course - when
> a big mirror's contact address changes, I have to change it in the
> database of the openSUSE MB setup, in the OOo MB setup and possibly in
> others. It would be interesting to think about some kind of shared
> database for that purpose (or some kind of synchronization, or at least
> notifications); however, especially when the duplication happens on a
> single box, it would be quite attractive to get rid of it.
>
> I had some ideas about that in the past, and the idea of this post is to
> jot them down for the archive.
>
>
> 1)
>
> Earlier this year, I experimented with a new database schema, designed
> from the ground up, which covers that scenario. It also adds a facility
> for tracking events (similar to a log) about the mirrors. MirrorBrain
> "3.0", so to speak. For some reason, I never committed the code anywhere
> (because I fooled around with new technologies, and didn't really
> accomplish much). Sorry. That was the setup that I had in mind in my
> FOSDEM talk this year, as sketched here:
> http://mirrorbrain.org/static/images/misc/mb_new_schema-fosdem09.jpg
> (http://www.poeml.de/~poeml/talks/free_software_CDN_vision.pdf , page 44)
>
> On the plus side, that would be a data model that most developers would
> be familiar with; very easy to implement, because commonplace.
>
> However, it also adds a whole lot of complexity that we don't have
> today. A good web frontend could probably compensate for it, but
> still...
>
> A file with Django models is attached to this mail.
>
> However, when I designed that, I had just improved the existing database
> (by migration from MySQL to PostgreSQL and a total redesign) to be
> several times smaller and faster. I had just gotten rid of all relations
> -- which contributed significantly to space usage:
> http://mirrorbrain.org/news/27-release-smaller-and-faster-database/
> And learning from that, I don't think that's something that one wants to
> give up easily.
>
> A fully relational data model would definitely be a step back,
> performance-wise; and that would be bothersome because if we want to
> combine many instances into one, we might need more performance, not
> less.
>
>
>
> 2)
>
> A few days ago, when waking up in the morning, I thought about it anew.
> (And it's not thought through very far, but I thought I should write it
> down before it's lost.)
>
> Basically, a single file table is kept, like before, but across all
> mirrored projects.
> One table is added that maps mirrors to the projects that they mirror,
> and to the URLs to use for each.
> Elegantly, few changes would be needed in mod_mirrorbrain.
>
> Even better would be if it could just be a different mode of operation
> -- most people will only want to run one instance really, and they won't
> want to care about multi-instance setups. Increased complexity could
> become a serious hurdle for new users and so on; I think that's an
> important concern.
>
> So, the following could be done:
>
> - change the scanner to add the project name as a prefix to all files
>   that it stores in the file table; i.e., let it store the path
>   $PROJECT/path/to/file instead of /path/to/file.
>
> - depending on the virtualhost (or directory) it runs in,
>   mod_mirrorbrain will add the respective prefix in front of the
>   filename when querying the database.
>
> - in the existing mirror database records, remove the base URLs, but
>   keep all the rest.
>
> - create a new table, where each URL for a project on a mirror is
>   registered:
>
>
>   mirrorid  prjname  http_base    ftp_base    rsync_base
>
>   1         foo      http://....  ftp://....  rsync://....
>   1         Bar      http://....  ftp://....  rsync://....
>
>   2         Bar      http://....  ftp://....  rsync://....
>
>   3         foo      http://....  ftp://....  rsync://....
>   3         OpenBaz  http://....  ftp://....  rsync://....
>   3         Buzzz    http://....  ftp://....  rsync://....
>
>   4         foo      http://....  ftp://....  rsync://....
>   4         Bar      http://....  ftp://....  rsync://....
>
>
> - there are some fields stored per mirror which should likely be per
>   project then, and also be moved into the new table, like "prio" and
>   "enabled". In fact, they could exist both mirror-wide and
>   project-wide.
>
> - the scanner uses that table to map the URLs
>
> - the mirror probe only needs to probe one of the URLs; or it could
>   probe round-robin, or use KeepAlive
>
> - I'm not sure right now whether some mapping from path names to the
>   local filesystem would be needed.
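
To check that I read the list above correctly, here is a minimal sketch
in Python. It is not MirrorBrain code. The names ProjectUrl,
record_scan and candidate_urls are made up for illustration, the
hostnames are placeholders, and the in-memory dict merely stands in for
the shared file table. Only http_base and "enabled" are modelled; "prio"
and the other URL columns would sit alongside them.

```python
from typing import NamedTuple


class ProjectUrl(NamedTuple):
    """One row of the proposed per-project URL table (hypothetical layout)."""
    mirror_id: int
    prjname: str
    http_base: str
    enabled: bool = True


# Hypothetical sample rows; the hostnames are placeholders.
PROJECT_URLS = [
    ProjectUrl(1, "foo", "http://mirror1.example/foo/"),
    ProjectUrl(1, "Bar", "http://mirror1.example/pub/bar/"),
    ProjectUrl(2, "Bar", "http://mirror2.example/bar/"),
    ProjectUrl(3, "foo", "http://mirror3.example/mirrors/foo/", enabled=False),
]

# Stand-in for the single shared file table:
# prefixed path -> ids of the mirrors known to have the file.
FILE_TABLE = {}


def prefixed(project, path):
    """Build the stored key, i.e. $PROJECT/path/to/file."""
    return f"{project}/{path.lstrip('/')}"


def record_scan(project, mirror_id, paths):
    """What the scanner would do: store every path under the project prefix."""
    for path in paths:
        FILE_TABLE.setdefault(prefixed(project, path), set()).add(mirror_id)


def candidate_urls(project, path):
    """What the redirector would do: add the prefix, look up the file,
    then map each mirror to its base URL for this project."""
    have = FILE_TABLE.get(prefixed(project, path), set())
    return [
        row.http_base + path.lstrip("/")
        for row in PROJECT_URLS
        if row.prjname == project and row.enabled and row.mirror_id in have
    ]
```

After record_scan("Bar", 1, ["/path/to/file"]), a call to
candidate_urls("Bar", "/path/to/file") yields mirror 1's URL for Bar
only, even if mirror 1 also carries foo. The file lookup stays a single
key access; only the tiny URL table is filtered per project.
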
>
>
> Advantages:
>
> + one database, not many (less administrative overhead)
>
> + The file table stays compact; no joins; few code changes needed
>   (that's good, because when a number of projects accumulate, it could
>   really become large, and then the _real_ performance challenge begins)
>
> + the mirrors table and project/url table are tiny, and can be joined
>   without significant overhead
>
> + only one table with ASN data needed; not one per database, as is
>   the case now. Saves about 50MB per instance.
>
> + In fact, one could stick the whole setup into a single virtual host.
>
>
> As indicated above, the idea might need more thinking, but maybe it has
> a future.
>
> Peter
>
_______________________________________________
mirrorbrain mailing list
Archive: http://mirrorbrain.org/archive/mirrorbrain/

Note: To remove yourself from this mailing list, send a mail with the content
    unsubscribe
to the address mirrorbrain-request_at_mirrorbrain.org

Received on Wed Dec 09 2009 - 16:14:49 GMT
This archive was generated by hypermail 2.3.0 : Thu Mar 25 2010 - 19:30:56 GMT