Hi,

On a single host, it's possible to run multiple MirrorBrain instances to take care of redirections for several hosted (or mirrored) projects. Each MB instance has its own database of mirrors and files, as if it were running on a separate host, even though the instances are just separate virtual hosts on one box.

When such setups involve a significant number of mirrors, it quickly becomes apparent that many mirrors participate in several projects, so these "usual suspects" are present with their metadata in each MB instance. In each database, mirrors are stored with admin contacts and similar metadata; for each database, mirrors are probed and their status maintained. The same duplication happens across different machines, of course: when a big mirror's contact address changes, I have to change it in the database of the openSUSE MB setup, in the OOo MB setup, and possibly in others.

It would be interesting to think about some kind of shared database for that purpose (or some kind of synchronization, or at least notifications); but especially when the duplication happens on a single box, it would be quite attractive to get rid of it. I had some ideas about that in the past, and the point of this post is to jot them down for the archive.

1) Earlier this year, I experimented with a new database schema, designed from the ground up, which covers that scenario. It also adds a facility for tracking events (similar to a log) about the mirrors. MirrorBrain "3.0", so to speak. For some reason, I never committed the code anywhere (because I fooled around with new technologies and didn't really accomplish much). Sorry.
That was the setup that I had in mind in my FOSDEM talk this year, as sketched here: http://mirrorbrain.org/static/images/misc/mb_new_schema-fosdem09.jpg (http://www.poeml.de/~poeml/talks/free_software_CDN_vision.pdf , page 44)

On the plus side, that would be a data model that most developers would be familiar with, and very easy to implement, because it is so commonplace. However, it also adds a whole lot of complexity that we don't have today. A good web frontend could probably compensate for it, but still... A file with Django models is attached to this mail.

However, when I designed that, I had just improved the existing database (by migrating from MySQL to PostgreSQL and a total redesign) to be several times smaller and faster. I had just gotten rid of all relations, which contributed significantly to space usage: http://mirrorbrain.org/news/27-release-smaller-and-faster-database/ Learning from that, I don't think that's something that one wants to give up easily. A fully relational data model would definitely be a step back, performance-wise; and that would be bothersome because if we want to combine many instances into one, we might need more performance, not less.

2) A few days ago, when waking up in the morning, I thought about it anew. (It's not thought through very far, but I thought I should write it down before it's lost.) Basically, a single file table is kept, like before, but shared across all mirrored projects. One table is added that maps mirrors to the projects that they mirror, and to the URLs to use. Elegantly, few changes would be needed in mod_mirrorbrain. Even better would be if it could just be a different mode of operation -- most people will only want to run one instance really, and they won't want to care about multi-instance setups. Increased complexity could become a serious hurdle for new users and so on; I think that's an important concern.
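To make the "different mode of operation" idea a bit more concrete, here is a minimal sketch of how the database lookup key could be derived per virtual host. This is only my illustration, not actual mod_mirrorbrain code; the function and parameter names are made up:

```python
def lookup_key(vhost_project, request_path, multi_instance=True):
    """Derive the key used to query the shared file table.

    Single-instance mode keeps today's behaviour (path as-is);
    multi-instance mode prefixes the path with the project that the
    current virtual host belongs to.
    """
    path = request_path.lstrip("/")
    if not multi_instance:
        return path
    return vhost_project + "/" + path

print(lookup_key("foo", "/path/to/file"))         # foo/path/to/file
print(lookup_key("foo", "/path/to/file", False))  # path/to/file
```

The point of the sketch is that the switch between modes is a single, cheap string operation per request, which supports the hope that few changes would be needed in mod_mirrorbrain.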
So, the following could be done:

- Change the scanner to add the project name as a prefix to all files that it stores in the file table; i.e., let it store the path $PROJECT/path/to/file instead of /path/to/file.
- Depending on the virtual host (or directory) it runs in, mod_mirrorbrain will add the respective prefix in front of the filename when querying the database.
- In the existing mirror database records, remove the base URLs, but keep all the rest.
- Create a new table, where each URL for a project on a mirror is registered:

      mirrorid  prjname  http_base    ftp_base     rsync_base
      1         foo      http://....  ftp://....   rsync://....
      1         Bar      http://....  ftp://....   rsync://....
      2         Bar      http://....  ftp://....   rsync://....
      3         foo      http://....  ftp://....   rsync://....
      3         OpenBaz  http://....  ftp://....   rsync://....
      3         Buzzz    http://....  ftp://....   rsync://....
      4         foo      http://....  ftp://....   rsync://....
      4         Bar      http://....  ftp://....   rsync://....

- Some fields currently stored per mirror, like "prio" and "enabled", should likely become per-project then, and also be moved into the new table. In fact, they could exist both mirror-wide and project-wide.
- The scanner uses that table to map the URLs.
- The mirror probe only needs to probe one of the URLs; or it could probe round-robin, or use KeepAlive.
- I'm not sure right now whether some mapping from path names to the local filesystem would be needed.

Advantages:

+ One database, not many (less administrative overhead).
+ The file table stays compact; no joins; few code changes needed. (That's good, because when a number of projects accumulate, it could really become large, and then the _real_ performance challenge begins.)
+ The mirrors table and the project/URL table are tiny, and can be joined without significant overhead.
+ Only one table with ASN data is needed, not one per database as is the case now. That saves about 50MB per instance.
+ In fact, one could stick the whole setup into a single virtual host.
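The pieces above can be sketched end to end with a toy in-memory database. This is a deliberate simplification -- MirrorBrain's real PostgreSQL schema stores files differently, and all table and column names here are only illustrative -- but it shows the prefix trick and the small join against the new per-project URL table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# One file table across all projects; paths carry the project prefix.
cur.execute("CREATE TABLE file (path TEXT, mirror_id INTEGER)")
# Per-mirror metadata (contacts etc.) lives in exactly one place.
cur.execute("CREATE TABLE mirror (id INTEGER PRIMARY KEY, identifier TEXT, enabled INTEGER)")
# The new table: one row per (mirror, project) with the base URLs.
cur.execute("""CREATE TABLE mirror_project_url (
                   mirror_id INTEGER, project TEXT,
                   http_base TEXT, ftp_base TEXT, rsync_base TEXT)""")

cur.execute("INSERT INTO mirror VALUES (1, 'mirror.example.net', 1)")
cur.execute("""INSERT INTO mirror_project_url
               VALUES (1, 'foo', 'http://mirror.example.net/foo',
                       'ftp://mirror.example.net/foo',
                       'rsync://mirror.example.net/foo')""")
cur.execute("INSERT INTO file VALUES ('foo/path/to/file', 1)")

def redirect_candidates(project, path):
    """Roughly what mod_mirrorbrain would do: prefix the requested path
    with the project of the current virtual host, then join the two
    tiny tables to build the redirect URLs."""
    return cur.execute(
        """SELECT u.http_base || '/' || ?
             FROM file f
             JOIN mirror m ON m.id = f.mirror_id AND m.enabled = 1
             JOIN mirror_project_url u
               ON u.mirror_id = m.id AND u.project = ?
            WHERE f.path = ?""",
        (path, project, project + "/" + path)).fetchall()

print(redirect_candidates("foo", "path/to/file"))
# -> [('http://mirror.example.net/foo/path/to/file',)]
```

Note that the file table itself is never joined against anything large: the only joins are against the two tiny tables, which is the property the advantages list relies on.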
As indicated above, the idea might need more thinking, but maybe it has a future.

Peter

_______________________________________________
mirrorbrain mailing list
Archive: http://mirrorbrain.org/archive/mirrorbrain/
This archive was generated by hypermail 2.3.0 : Thu Mar 25 2010 - 19:30:55 GMT