Feature List

Overview

  • Open Source
  • scalable, secure, stable & RFC-compliant
  • mirror selection by
    • network topology (same network, same autonomous system)
    • geolocation by country
    • geolocation by estimated geographical distance of the mirrors to the client
  • configurable load balancing, giving mirrors different weight
  • ability to limit requests for a mirror to its own network or country
  • ability to NOT redirect certain requests, for security reasons
  • file level granularity (mirrors don't have to mirror the full file tree - they can choose what they want)
  • reliably assess large file support of mirrors (do they correctly serve files > 2 or 4 GB?)
  • content on mirrors can be protected by URL signing (clients can only download from mirrors if they successfully authenticated with the MirrorBrain server)
  • more than a redirector
    • serve automatically generated cryptohashes (MD5, SHA1, SHA256)
    • generation of RFC 5854 Metalinks
    • support for RFC 3230 - Instance Digests in HTTP
    • support for RFC 6249 - Metalink/HTTP: Mirrors and Hashes
    • generation of Torrents (including the closest mirrors as seeds)
    • support for zsync
    • support for native Yum mirror lists, compatible with Fedora and CentOS
  • commandline tools and Python module for maintenance tasks
  • mirror monitoring (integrated with mirmon)
  • mirror list generation for overview
  • support for running behind a load balancer, using e.g. the X-Forwarded-For header for the client's IP address
  • IPv6 support (provided a recent version of mod_geoip is used)
  • integrated into Apache's module API, for compatibility with numerous other existing Apache modules, e.g. SSL
  • multiple instances can run in one Apache (one per virtual host)
  • flexible logging

Open source

MirrorBrain is available under the terms of the Apache License 2.0 (the Apache modules) and the GPLv2 (rest of the framework).

Scalability

MirrorBrain is fast and scales well, even for extraordinarily large file trees. Made for high-traffic sites, it has a very small memory footprint. Performance-critical parts are implemented in C. No PHP that bloats the web server.

Security

MirrorBrain was written with security primarily in mind. This concerns not only the server, but also its users. For a busy download server, security is a matter of the utmost importance, because thousands or even millions of users could quickly become victims. In addition, MirrorBrain has features that can protect users from rogue mirrors.

Stability

As of summer 2010, MirrorBrain has been serving all openSUSE.org downloads since the beginning of 2007, without any downtime that was not caused by hardware failure or human error. Since February 2010, OpenOffice.org has been another high-volume user. MirrorBrain has also proved solid and dependable for other organizations.

Sophisticated mirror selection

Selecting the most suitable mirror is the core business of MirrorBrain. Geolocation and other techniques are used to select a mirror that can be expected to work well for the particular client:

  • Using the free GeoIP database (see acknowledgement) to look up the country of the client's IP address, MirrorBrain can choose a mirror from the same country. If no mirror in the same country is available, one from the same continent is picked. In addition, it is possible to define fallback mirrors that are known to be suitable for a certain country, thereby optimizing over a random choice. Those fallback mirrors are used only in reserve, when a mirror in the country itself does not exist. So, if New Zealand has a local mirror, but some content is not present on it, we can make sure that clients are sent to an Australian mirror, instead — and not to Taiwan.
  • In addition, MirrorBrain can use a companion Apache module named mod_asn to find out more about the client and to exploit network locality. mod_asn looks at global routing data, which is obtained via BGP (Border Gateway Protocol). By looking up the client's network prefix and autonomous system, MirrorBrain can match these to the mirrors. Thus, a request from a mirror's own network will go to exactly that mirror, and the same is true for the autonomous system the mirror is in.
  • Furthermore, geographical distance to the mirrors is considered, using a computationally lightweight approximation. This provides important refinement for large countries (like the U.S.), and countries with many mirrors. It also helps for countries that have no mirror themselves, where only a random mirror from the continent could be selected otherwise.
  • Additional load balancing between mirrors is achieved by a weighted randomization; a priority defined for each mirror can be tweaked and determines the likelihood of its selection.
  • Individual mirrors can be configured to serve only requests from their continent, their country, their autonomous system, or their network prefix. This is important for mirrors with limited bandwidth, or for countries with poor Internet connectivity. It also allows for private mirrors that don't serve the public.
  • A maximum file size can be configured per mirror, limiting the requests that it receives. This is an easy way to prevent overly slow downloads of large files from certain mirrors: some mirrors are really useful, but their bandwidth is not sufficient for downloading a DVD image.
  • In cases of regions that don't have mirrors, and where a geographical choice would lead to insufficient results, fallback mirrors can be defined that are known to handle that region well.
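
The selection steps listed above can be illustrated with a minimal Python sketch. It is not MirrorBrain's actual implementation (the performance-critical parts are written in C); the Mirror and Client classes, the field names and the tier order are assumptions made for illustration, and distance refinement and fallback mirrors are left out. The sketch narrows the mirrors to the closest tier that matches the client, then applies a weighted random choice based on each mirror's priority.

```python
import random
from dataclasses import dataclass


@dataclass
class Mirror:
    name: str
    prefix: str    # network prefix the mirror lives in (hypothetical field)
    asn: int       # autonomous system number
    country: str   # country code
    region: str    # continent code
    priority: int  # higher value means a larger share of requests


@dataclass
class Client:
    prefix: str
    asn: int
    country: str
    region: str


def candidates(client, mirrors):
    """Return the mirrors of the closest tier that has at least one match."""
    tiers = (
        lambda m: m.prefix == client.prefix,    # same network prefix
        lambda m: m.asn == client.asn,          # same autonomous system
        lambda m: m.country == client.country,  # same country
        lambda m: m.region == client.region,    # same continent
    )
    for matches in tiers:
        found = [m for m in mirrors if matches(m)]
        if found:
            return found
    return list(mirrors)  # nothing nearby: consider every mirror


def pick_mirror(client, mirrors, rng=random):
    """Weighted random choice within the closest tier (priorities must be > 0)."""
    pool = candidates(client, mirrors)
    weights = [m.priority for m in pool]
    return rng.choices(pool, weights=weights, k=1)[0]
```

With two mirrors of priority 100 and 50 in the same tier, the first one would receive roughly two thirds of the requests in this sketch.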

Support for running behind load balancers

Running as a backend behind load balancers or reverse proxies is fully supported. Incoming connections will not come from the original client IP, but the client IP can be passed to MirrorBrain in a standard way using an HTTP header. MirrorBrain can return detailed information to the frontend via XML.
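
As an illustration of the idea (not of MirrorBrain's own code or configuration), the following Python sketch shows how a backend might recover the client address from an X-Forwarded-For header. The function name and the trusted_proxies parameter are assumptions; the header is only honoured when the connection comes from a proxy that is known to be trusted, and all header entries are assumed to be valid IP addresses.

```python
import ipaddress


def client_ip(remote_addr, forwarded_for, trusted_proxies):
    """Return the client address, honouring X-Forwarded-For only from trusted proxies."""
    trusted = [ipaddress.ip_network(net) for net in trusted_proxies]

    def is_trusted(addr):
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in trusted)

    # A direct connection, or a header sent by an unknown peer, is not believed.
    if not forwarded_for or not is_trusted(remote_addr):
        return remote_addr

    # Walk the proxy chain from right to left and stop at the first
    # address that is not one of our own proxies.
    hops = [h.strip() for h in forwarded_for.split(",") if h.strip()]
    for hop in reversed(hops):
        if not is_trusted(hop):
            return hop
    return hops[0] if hops else remote_addr
```

Ignoring the header from untrusted peers matters here because the client address drives mirror selection, and a forged header could otherwise steer requests.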

RFC compliance

HTTP redirection works transparently with any client that conforms to HTTP standards, like web browsers, download managers, wget, ... MirrorBrain can also redirect HTTP requests to FTP servers (even though such a "protocol switch" is not really encouraged by Internet standards). MirrorBrain aims to be fully RFC-compliant.

Flexible integration into Apache

MirrorBrain is integrated in Apache as an Apache module (mod_mirrorbrain). It can be used seamlessly together with standard modules such as mod_rewrite, mod_deflate, mod_headers, mod_proxy and mod_limitipconn. It can also be combined with a multitude of available authentication mechanisms.

Exceptions for special clients, special files, or special (security) reasons

Redirects can be made optional, depending on criteria like filenames matching a pattern, file size, MIME type, user agent, request origin, or others. This is important because of

  • security reasons: deliver crucial things yourself, like PGP signature files and MD5/SHA1 hashes; things that you don't want to give up control over. (And these files are typically small anyway.)
  • optimization: it doesn't make sense to reply with a redirect for a small file which may not be larger than the whole redirect reply. It would just increase the latency for the client. So, just send the file. (This also saves a database lookup.)
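
The two reasons above can be expressed as a small decision function. This is only a Python sketch of the logic: in MirrorBrain such exceptions are set up in the Apache configuration, and the pattern list, the size threshold and the function name below are assumptions chosen for illustration.

```python
import fnmatch

# Hypothetical examples of files that should always be delivered directly.
NO_REDIRECT_PATTERNS = ("*.asc", "*.sig", "*.md5", "*.sha1", "*.sha256")

# Assumed threshold in bytes; below it, a redirect costs more than it saves.
MIN_REDIRECT_SIZE = 4096


def should_redirect(filename, size):
    """Return True if the request may be sent to a mirror."""
    # Security: signatures and hashes stay under the origin server's control.
    if any(fnmatch.fnmatchcase(filename, p) for p in NO_REDIRECT_PATTERNS):
        return False
    # Optimization: small files are served directly to avoid the extra round trip.
    return size >= MIN_REDIRECT_SIZE
```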

File level granularity

MirrorBrain operates on file level — not based on directories or "file sets" — for several reasons:

  • To play well with partial mirrors. Partial mirrors are commonplace nowadays when several reasons come together:
    • content with high turnover rate which might even change before it's fully synced
    • large file trees that are difficult to keep in sync across mirrors due to their sheer size (there is always a sync lag)
    • very large file trees have a tendency to not find complete mirrors anyway. Instead, there might be lots of mirrors that host a popular subtree, and a few that host more.
  • To be a fully capable web server, and not a "dumb" redirector. You can expect MirrorBrain to do things like setting the Last-Modified header and handling If-Modified-Since requests 100% correctly. Furthermore, there may be files that you don't want to or can't mirror, and MirrorBrain will simply deliver files itself which are not on any mirror.
  • It's handy for security reasons, because you can define exactly which files you want to be served by mirrors, and which ones you would rather deliver yourself (or only through a handful of trusted machines).

Mirror list generation

Mirror lists can be generated in many flavors.

  • A basic, but often very useful type of mirror list can be generated per file, as shown in this example. Note that this list
    • is made in realtime, and always reflects the current state of mirror status
    • is sorted according to the mirror selection algorithm, thereby respecting both the location of the client as well as the priorities that each mirror is assigned.
  • The mirror lists can be styled individually by specifying an HTML header and footer.
  • The provided Apache module mod_autoindex_mb, a variant of the stock mod_autoindex, can "spice up" Apache's autogenerated index with user-visible links to mirror lists per file as well as metalinks; see the mod_autoindex_mb example. Note that the module just adds simple links, which don't slow anything down.
  • It is trivial to generate mirror lists from the database. These could just list all the mirrors, showing their location and such, but they can alternatively do more fine-grained reports showing which "file set" a mirror has, by defining "marker files". You can look at the auto-updated lists that the openSUSE project generates in this way at mirror lists.
  • Per-file mirror lists that mod_mirrorbrain produces can be used to run it behind a frontend server. If you want a frontend server to take care of the user interface, showing a list of mirrors for the user to choose from, or similar things, MirrorBrain can be used as a backend for mirror selection. Through the per-file mirror lists, the frontend servers get everything they need. Metalinks are perfect for this; see below.

Serving metadata and cryptographic hashes

There is a tool to efficiently collect cryptographic hashes and other metadata from the files, like:

  • file size and modification time
  • SHA256 hash
  • SHA1 hash
  • MD5 hash
  • BitTorrent infohash
  • link to Metalink
  • link to Torrent
  • zsync link
  • Magnet link (needs testing)
  • link to PGP signature (if available)

This metadata is cached, and can be served in a variety of ways:

  • Hashes can be included and shown on a "details" page per file (see the example here)
  • Individual hashes can be requested directly from the server by appending an extension like .md5 or .sha256 to a URL.
  • All hashes are used in Metalinks, Torrents, and Magnet links.
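
To show what collecting several hashes efficiently can look like, here is a minimal Python sketch that computes MD5, SHA1 and SHA256 in a single pass over a file. It is not the tool that ships with MirrorBrain; the function name and chunk size are assumptions. A client could compare the result with the value returned for a URL with .sha256 appended.

```python
import hashlib


def file_hashes(path, chunk_size=1 << 20):
    """Compute MD5, SHA1 and SHA256 of a file while reading it only once."""
    digests = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    with open(path, "rb") as f:
        # Read in chunks so that large files do not need to fit into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}
```

Reading the file once matters for large file trees, where disk I/O rather than hashing tends to dominate.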

Support for URL signing

The redirector can generate redirection URLs that supply a signature with temporary validity. The signature can be checked on the mirror servers to restrict access to authenticated clients. This means that authentication and authorization of clients can be validated centrally (by the redirector), and the content on mirror servers protected thereby. See Configuring URL signatures for details.
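
The general technique can be sketched in Python as follows. This is a generic HMAC-based scheme with an expiry time, shown only to illustrate the principle; the parameter names (expires, sig), the message layout and the shared secret are assumptions and do not describe MirrorBrain's actual signature format.

```python
import hashlib
import hmac
import time


def sign_url(path, secret, ttl=300, now=None):
    """Redirector side: append an expiry time and a signature to the path."""
    expires = int(now if now is not None else time.time()) + ttl
    message = f"{path}:{expires}".encode()
    sig = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{path}?expires={expires}&sig={sig}"


def verify_url(path, expires, sig, secret, now=None):
    """Mirror side: accept the request only if it is unexpired and correctly signed."""
    now = int(now if now is not None else time.time())
    if now > int(expires):
        return False
    message = f"{path}:{int(expires)}".encode()
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, sig)
```

In such a scheme the redirector and the mirrors share the secret, so the mirrors can check signatures without contacting the redirector for each request.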

Torrent generation

MirrorBrain has a torrent generator embedded. Torrents are generated in realtime (from hashes cached in the database). See Generating Torrents for details.

Yum-style mirror list generation

MirrorBrain can natively serve Yum mirror lists, compatible with the way in which Fedora or CentOS use them. See Serving Yum-style mirror lists for details.

Experimental zsync support

EXPERIMENTAL. MirrorBrain has experimental zsync support. The zsync distribution method is rsync over HTTP, so to speak, and MirrorBrain can generate zsync files on-the-fly. MirrorBrain supports the simpler variant which doesn't look into compressed content. It is compatible with the latest zsync releases (0.6.1 and 0.6.2).

See Configuring zsync support for details.

Mirror monitoring

Mirrors are monitored for content and functionality.

  • Vitality probing. Mirrors are probed every 60 seconds to see whether they are alive.
  • Scanning. MirrorBrain comes with a powerful scanner to scan mirrors via rsync, FTP or HTTP. The list of files is stored in a well-packed database. This doesn't require scripts to run on the mirrors, and makes it possible to integrate mirrors even if you never manage to establish contact with their operators.
  • Functionality testing. Tools are available to check the integrity of files on mirrors, by doing sample downloads and verifying the content. MirrorBrain does not rely on this being done in an automated way and is designed to support the secure delivery of content with verification hashes and signatures. But MirrorBrain supports everything that is needed to easily implement such an automatic check.
  • Assessing large file support. During scanning, mirrors are automatically tested if they can correctly deliver files which are larger in size than the magical limit of 2 or 4 GB. The checking is done efficiently by downloading just a few bytes from around those limits; if a mirror is found to be broken in this regard, it will still be fully used for all other files. As of the beginning of 2009, about one third of all mirrors are not capable of serving large files.
  • Permission checking. While scanning, the scanner takes note of funky permissions that will likely prevent serving files. In addition, there are tools for probing files and checking MD5 hashes, and for spreading files for testing purposes.
  • Integration with mirmon. MirrorBrain integrates easily with mirmon, a popular tool written by Henk P. Penning that monitors mirror freshness. MirrorBrain creates the mirror list that mirmon needs, directly from the database. Thus, no separate mirror list needs to be maintained.
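
The large file check described above can be illustrated by computing the byte ranges that a probe would request. The following Python sketch is an assumption about how such a probe could be built, not the scanner's actual code; the function name and the window of eight bytes are chosen for illustration. It returns HTTP Range header values that straddle the 2 GiB and 4 GiB boundaries for files large enough to cross them.

```python
GIB = 1 << 30


def probe_ranges(file_size, window=8):
    """Return HTTP Range header values straddling the 2 GiB and 4 GiB boundaries."""
    ranges = []
    for limit in (2 * GIB, 4 * GIB):
        if file_size > limit:
            # Request a few bytes on each side of the boundary.
            start = limit - window // 2
            end = min(limit + window // 2, file_size) - 1
            ranges.append(f"bytes={start}-{end}")
    return ranges
```

A mirror that answers such a request with the expected bytes handles offsets beyond the 32-bit limits; one that fails would only be excluded for files above the corresponding size.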

Tools for mirror database maintenance

In a system like MirrorBrain, metadata about mirrors needs to be collected and kept up to date.

  • Metadata export for archival/auditing. There are tools to export metadata periodically, for instance in order to commit them into an SVN repository and send changes to a mailing list.

  • Commandline tools. There are tools for creating mirrors, editing their metadata, listing mirrors and their contents, probing files and other tasks:

    • creation of a new mirror in the database
    • adding and editing comments about them (so one can keep notes)
    • triggering scans
    • functional tests of mirrors
    • calculating hashes
    • listing mirrors per country, per region, list disabled mirrors, ...
    • listing, adding, deleting files in the database
    • creating mirror lists for web pages
    • exporting data for backup, reports or migrations
    • etc.

    The commandline tool is written in Python in a modularized way and comes with a Python library that can easily be used from other scripts. The Python module is planned to be the basis for a future web frontend.

    The database could easily be accessed directly by your own scripts or a web application.

  • [PLANNED] Web frontend. A web frontend for these maintenance tasks is planned. For many tasks, commandline tools work much better, but a web frontend can usually lower the entry barrier, especially for the more casual user of the system. Normally, the system runs quite effortlessly, so when you need to make a small change once a month, chances are high that you are no longer familiar enough with the tools. So a web frontend would be a nice addition, and could furthermore allow for nice and colorful status overviews. Contributions are very welcome. See the development status page.

Integration with other delivery networks

If there is another download redirector, or a commercial content delivery network (CDN) to be integrated, requests can be passed on to it by defining it as a "catch-all" mirror. This makes it possible to effectively bypass the mirror selection criteria and to redirect a portion of the requests (or all of them) to the other system.

Flexibility

MirrorBrain is highly configurable and modular. The configuration is done within Apache. This means that everything which Apache can do is available to you, like authentication for instance, should you need it. Furthermore, this gives you

  • the benefit of Apache's automatic per-directory configuration merging
  • support for multiple instances in one Apache, one per virtual host
  • flexible logging options. You can log the mirror that a request is sent to, the client's location (continent, country, autonomous system, network prefix), the criterion which led to the mirror selection, and the actual size of the requested file. It's also possible to log files that no mirror was found for.
  • a debug mode that can be enabled per directory: even under high load, it is possible to get a detailed debug log for only a dedicated part of the directory tree (which doesn't get as many requests). Thus, troubleshooting is compatible with running in production.
  • the ability to override the client country or autonomous system (which determines the mirror choice) for diagnostic purposes
  • optional stickiness of the client-mirror association (implemented via a memcache daemon), in case it is useful to prevent clients from changing mirrors and have them stick to one instead.

If you miss a feature, please get in contact, and we'll see what we can do.