Feature List
Open source
MirrorBrain is available under the terms of the Apache License 2.0 (the Apache modules) and the GPLv2 (rest of the framework).
Scalability
MirrorBrain is very fast and scales extremely well, even for extraordinarily large file trees. It is perfect for high-traffic sites and has a small memory footprint. Performance-critical parts are implemented in C; no PHP or anything that bloats a web server.
Security
MirrorBrain was written with security primarily in mind. This concerns not only the server, but also its users. For a busy download server, security is a matter of the utmost importance, because thousands or even millions of users could quickly become victims. In addition, MirrorBrain has features that can protect users from rogue mirrors.
Stability
MirrorBrain has been serving all openSUSE downloads since the beginning of 2007, without any downtime that was not caused by hardware failure or human error. It is rock-solid and dependable.
Sophisticated mirror selection
Geolocation and other techniques are used to select a mirror that can be expected to work well for the particular client:
- Using the free GeoIP database (see acknowledgement) to look up the country of the client's IP address, MirrorBrain can choose a mirror from the same country. If no mirror in the same country is available, one from the same continent is picked. In addition, it is possible to define fallback mirrors that are known to be suitable for a certain country, thereby optimizing over a random choice. Those fallback mirrors are used only in reserve, when a mirror in the country itself does not exist. So, if New Zealand has a local mirror, but some content is not present on it, we can make sure that clients are sent to an Australian mirror, instead -- and not to Taiwan.
- In addition, MirrorBrain can use a companion Apache module named mod_asn to find out more about the client and to exploit network locality. mod_asn looks at global routing data, which is obtained via BGP (Border Gateway Protocol). By looking up the client's network prefix and autonomous system, MirrorBrain can match these to the mirrors. Thus, a request from a mirror's own network will go to exactly that mirror, and the same is true for the autonomous system the mirror is in.
- Additional load balancing between mirrors is achieved by weighted randomization; a priority defined for each mirror can be tweaked and determines the likelihood of its selection.
- Individual mirrors can be configured to serve only requests from their continent, their country, their autonomous system, or their network prefix. This is important for mirrors with limited bandwidth, or for countries with poor Internet connectivity.
- A maximum file size can be configured per mirror, which limits the requests that it receives. This is an easy way to prevent overly slow downloads of large files from certain mirrors: some mirrors are really useful, but their bandwidth is not sufficient for downloading a DVD.
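The selection rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of mod_mirrorbrain: the `Mirror` and `Client` classes, the `fallbacks` mapping, the order of the tiers, and the convention that `max_file_size == 0` means "no limit" are all hypothetical, and network prefixes are compared as plain strings instead of being matched against the client's IP address.

```python
import random
from dataclasses import dataclass


@dataclass
class Mirror:
    name: str
    country: str
    continent: str
    asn: int
    prefix: str
    priority: int = 100      # higher priority means picked more often
    max_file_size: int = 0   # 0 means "no limit" (hypothetical convention)


@dataclass
class Client:
    country: str
    continent: str
    asn: int
    prefix: str


def candidates(mirrors, client, file_size, fallbacks=None):
    """Return the narrowest non-empty group of mirrors for this client."""
    fallbacks = fallbacks or {}
    # Drop mirrors whose configured size limit is below the requested file.
    usable = [m for m in mirrors
              if m.max_file_size == 0 or file_size <= m.max_file_size]
    # Assumed order: from the closest network match to "any mirror".
    tiers = [
        lambda m: m.prefix == client.prefix,
        lambda m: m.asn == client.asn,
        lambda m: m.country == client.country,
        lambda m: m.name in fallbacks.get(client.country, ()),
        lambda m: m.continent == client.continent,
        lambda m: True,
    ]
    for matches in tiers:
        group = [m for m in usable if matches(m)]
        if group:
            return group
    return []


def pick(mirrors, client, file_size, fallbacks=None, rng=random):
    """Weighted random choice within the best group, or None if it is empty."""
    group = candidates(mirrors, client, file_size, fallbacks)
    if not group:
        return None
    weights = [m.priority for m in group]
    return rng.choices(group, weights=weights, k=1)[0]
```

With a New Zealand client, a New Zealand mirror that only accepts small files, and `fallbacks={"nz": {"au1"}}`, a request for a large file would be sent to the Australian mirror `au1` in this sketch.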
RFC compliance
HTTP redirection works transparently with any client that conforms to HTTP standards, like web browsers, download managers, wget, ... If desired, MirrorBrain can also redirect HTTP requests to FTP servers, but note that such a "protocol switch" is discouraged by RFCs. MirrorBrain aims to be fully RFC-compliant.
Exceptions for special clients, special files, or special (security) reasons
Redirects can be made optional, depending on criteria like filenames matching a pattern, file size, MIME type, user agent, request origin, or others. This is useful for
- security reasons: deliver crucial things yourself, like PGP signature files, MD5/SHA1 hashes; things that you don't want to give up control on. (And these files are typically small anyway.)
- optimization: it doesn't make sense to reply with a redirect for a small file which may not be larger than the whole redirect reply. It would just increase the latency for the client. So, just send the file. (This also saves a database lookup.)
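The two reasons above boil down to a simple decision per request. The following sketch shows the idea for two of the criteria (filename pattern and file size). The function name, the threshold, and the patterns are hypothetical values chosen for illustration; in MirrorBrain itself such rules are part of the Apache configuration.

```python
import fnmatch

# Hypothetical settings, for illustration only.
MIN_REDIRECT_SIZE = 4096                       # bytes
NEVER_REDIRECT = ("*.asc", "*.md5", "*.sha1")  # signatures and hashes


def should_redirect(filename, size,
                    min_size=MIN_REDIRECT_SIZE, exclude=NEVER_REDIRECT):
    """Return True if the request may be redirected to a mirror."""
    # Security: files matching an exclude pattern are always served locally.
    if any(fnmatch.fnmatchcase(filename, pattern) for pattern in exclude):
        return False
    # Optimization: a redirect for a tiny file only adds latency.
    if size < min_size:
        return False
    return True
```

In this sketch a large ISO image is redirected, while its `.asc` signature and a 512-byte text file are delivered directly.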
File level granularity
MirrorBrain operates at the file level - not based on directories or "file sets" - for several reasons:
- To play well with partial mirrors. Partial mirrors are commonplace nowadays when several reasons come together:
  - content with high turnover rate which might even change before it's fully synced
  - large file trees that are difficult to keep in sync across mirrors due to their sheer size (there is always a sync lag)
  - very large file trees have a tendency to not find complete mirrors anyway. Instead, there might be lots of mirrors that host a popular subtree, and a few that host more.
- to be a fully capable web server, and not a "dumb" redirector. You can expect MirrorBrain to do things like setting the Last-Modified header and handling If-Modified-Since requests 100% correctly. Furthermore, there may be files that you don't want to or can't mirror, and MirrorBrain will simply deliver files itself which are not on any mirror.
- it's handy for security reasons, because you can define exactly which files you want to be served by mirrors, and which ones you would rather deliver yourself (or only by a handful of trusted machines)
Mirror list generation
Mirror lists can be generated in many flavors.
- A basic, but often very useful type of mirror list can be generated per file, as shown in this example. Note that this list
  - is made in realtime, and always reflects the current state of mirror status
  - is sorted according to the mirror selection algorithm, thereby respecting both the location of the client as well as the priorities that each mirror is assigned.
- The provided Apache module mod_autoindex_mb, a variant of the stock mod_autoindex, can "spice up" Apache's autogenerated index with user-visible links to mirror lists per file as well as metalinks; see the mod_autoindex_mb example. Note that the module just adds simple links, which don't slow anything down.
- It is trivial to generate mirror lists from the database. These could just list all the mirrors, showing their location and such, but they can alternatively do more fine-grained reports showing which "file set" a mirror has, by defining "marker files". You can look at the auto-updated lists that the openSUSE project generates in this way at mirror lists.
- Per-file mirror lists that mod_mirrorbrain produces can be used to run it behind a frontend server. If you want a frontend server to take care of the user interface, showing a list of mirrors for the user to choose from, or similar things, MirrorBrain can be used as a backend for mirror selection. Through the per-file mirror lists, the frontend servers get everything they need. Metalinks are perfect for this; see below.
Metalink generation
MirrorBrain is also a very powerful Metalink generator. It generates both old-style ("v3") Metalinks, as well as Meta4 Metalinks as standardized in RFC 5854. A Metalink, in essence, is a machine-readable mirror list to be used by metalink clients (a very clever, next-generation, download regime). But it's more than just another type of mirror list. Metalinks can contain cryptographic hashes to give the client a way to verify downloaded content, and more. Here's a list of supported features:
- include MD5, SHA1 and SHA256 hashes for verification, both as full and as segmental hashes. The piece-wise hashes allow metalink clients to verify the content while it is still in transit. Since hashes cannot practically be generated on the fly, at least not for huge files, they are generated offline and stored on disk. When a request comes in, the stored snippet with the hashes is injected into the metalink. If a request comes in for which no hashes exist yet, it will get a functional metalink without hashes.
- automatic embedding of links to .torrent files, for hybrid clients that can download both via HTTP/FTP and P2P. This feature will require Apache to stat() on a file named somefile.torrent, but it is possible to enable the check only for files matching a pattern, like .iso, for instance.
- automatic embedding of PGP signatures. If a file named .asc exists, its content is injected into the metalink. This does not require Apache to check for the existence of these files. Instead, the signature file is picked up during the creation of verification hashes and later injected into the metalink together with them.
- sorted metalink, with mirrors closer to the client at the top.
- transparent negotiation of metalinks: a key feature which puts both the server-side and the client-side in full control. It works like this: The client indicates that it would accept a metalink. A non-MirrorBrain server wouldn't notice, and it and the client would talk standard HTTP. MirrorBrain however may return a metalink - if it decides to do so. It may decide to not return a metalink for certain files - again, for security reasons, just as it decides to not redirect certain requests to mirrors.
- works in real-time.
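To make the combination of full hash, piece-wise hashes, and sorted mirror list more concrete, here is a minimal sketch that computes the hashes for a stream and wraps them into a Meta4 document as described by RFC 5854. It is not the output format of mod_mirrorbrain: the function names, the piece size, and the choice of SHA-256 for the full hash and SHA-1 for the pieces are assumptions made to keep the example short.

```python
import hashlib
import xml.etree.ElementTree as ET

META4_NS = "urn:ietf:params:xml:ns:metalink"  # namespace from RFC 5854


def hash_stream(stream, piece_size=262144):
    """Return (full SHA-256 hex digest, list of per-piece SHA-1 hex digests)."""
    full = hashlib.sha256()
    pieces = []
    while True:
        piece = stream.read(piece_size)
        if not piece:
            break
        full.update(piece)
        pieces.append(hashlib.sha1(piece).hexdigest())
    return full.hexdigest(), pieces


def build_meta4(name, size, full_hash, piece_size, piece_hashes, mirrors):
    """Build a Meta4 document; mirrors is a list of (url, country), best first."""
    root = ET.Element("metalink", xmlns=META4_NS)
    entry = ET.SubElement(root, "file", name=name)
    ET.SubElement(entry, "size").text = str(size)
    ET.SubElement(entry, "hash", type="sha-256").text = full_hash
    pieces = ET.SubElement(entry, "pieces",
                           length=str(piece_size), type="sha-1")
    for digest in piece_hashes:
        ET.SubElement(pieces, "hash").text = digest
    # In Meta4, a lower priority number marks a more preferred URL.
    for priority, (url, country) in enumerate(mirrors, start=1):
        ET.SubElement(entry, "url",
                      location=country, priority=str(priority)).text = url
    return ET.tostring(root, encoding="unicode")
```

Because the hashing step reads the whole file, it fits the offline generation described above; only `build_meta4` would have to run at request time, with the stored hashes and the mirror list for the current client as input.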
Mirror monitoring
Mirrors are monitored for content and functionality.
- Vitality probing. Mirrors are probed every 60 seconds to see whether they are alive.
- Scanning. MirrorBrain comes with a powerful scanner to scan mirrors via rsync, FTP or HTTP. The list of files is stored in a well-packed database. This doesn't require scripts to run on the mirrors and allows mirrors to be integrated even if you never manage to establish contact with their operators.
- Functionality testing. Tools are available to check the integrity of files on mirrors, by doing sample downloads and verifying the content. MirrorBrain does not rely on this being done in an automated way and is designed to support the secure delivery of content with verification hashes and signatures. But MirrorBrain supports everything which is needed to easily implement such an automatic check.
- Assessing large file support. During scanning, mirrors are automatically tested to see whether they can correctly deliver files which are larger in size than the magical limit of 2 or 4 GB. The checking is done efficiently by downloading just a few bytes from around those limits; if a mirror is found to be broken in this regard, it will still be fully used for all other files. As of the beginning of 2009, about one third of all mirrors are not capable of serving large files.
- Permission checking. While scanning, the scanner takes note of funky permissions that will likely prevent serving files. In addition, there are tools for probing files and checking MD5 hashes, and for spreading files for testing purposes.
- Integration with mirmon. MirrorBrain integrates easily with mirmon, a popular tool written by Henk P. Penning that monitors mirror freshness. MirrorBrain creates the mirror list that mirmon needs, directly from the database. Thus, no separate mirror list needs to be maintained.
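For HTTP mirrors, "downloading just a few bytes from around those limits" could be done with byte-range requests. The sketch below only computes the `Range` header values such a probe might send; whether the scanner works exactly this way is an assumption, and the function name and window size are hypothetical.

```python
GIB = 1024 ** 3
LIMITS = (2 * GIB, 4 * GIB)  # the 2 GB and 4 GB boundaries


def probe_ranges(file_size, window=8, limits=LIMITS):
    """Return Range header values that straddle each limit the file exceeds."""
    headers = []
    for limit in limits:
        # Only probe if the file extends far enough past the limit.
        if file_size >= limit + window:
            first = limit - window
            last = limit + window - 1   # byte ranges are inclusive
            headers.append(f"bytes={first}-{last}")
    return headers
```

A mirror that answers such a request with the expected 16 bytes handles offsets beyond the limit correctly; a mirror that fails would simply not be used for large files.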
Tools for mirror database maintenance
In a system like MirrorBrain, metadata about mirrors needs to be collected and kept up to date.
- Metadata export for archival/auditing. There are tools to export metadata periodically, for instance in order to commit them into an SVN repository and send changes to a mailing list.
- Commandline tools. There are tools for creating mirrors, editing their metadata, listing mirrors and their contents, probing files and much more.
- [PLANNED] Web frontend. A web frontend for these maintenance tasks is planned. For many tasks, commandline tools work so much better, but a web frontend can usually lower the entry barrier, especially for the more casual user of the system. Normally, the system runs quite effortlessly, so when you need to make a small change once a month, chances are high that you are no longer familiar enough with the tools. So a web frontend would be a nice addition, and furthermore could allow for nice and colorful status overviews. Contributions are very welcome. See the development status page.
Integration with other delivery networks
If there is another download redirector, or a commercial content delivery network (CDN) to be integrated, requests can be passed on to it by defining it as a "catch-all" mirror. This makes it possible to effectively bypass the mirror selection criteria and to redirect a portion of the requests (or all of them) to the other system.
Flexibility
MirrorBrain is highly configurable and modular. The configuration is done within Apache. This means that everything which Apache can do is available to you, like authentication for instance, should you need it. Furthermore, this gives you
- Apache's automatic per-directory configuration merging
- support for multiple instances running in one Apache - one per virtual host
- flexible logging options. You can log the mirror that a request is sent to, the client's location (continent, country, autonomous system, network prefix), the criterion which led to the mirror selection, and the actual size of the requested file. It's also possible to log files that no mirror was found for.
- a debug mode can be enabled directory-wise: even under high load, it is possible to get a detailed debug log for only a dedicated part of the directory tree (which doesn't get as many requests). Thus, trouble-shooting is compatible with running in production.
- for diagnostic purposes, it is possible to override the client country or autonomous system (which determines the mirror choice)
- optional stickiness of the client-mirror association (implemented via a memcache daemon), for cases where it is useful to keep clients on one mirror rather than letting them change mirrors.
If you miss a feature, please get in contact, and we'll see what we can do.