Maintaining the mirror database

Concepts – the mb command

mb is a commandline tool for maintaining the mirror database: creating mirrors, editing them, working with files, and other tasks.

It has several subcommands, and it is typically used in one of the following forms:

mb <command>
mb <command> <identifier>

A typical example would be:

mb edit <identifier>

Note the first argument (after edit), which is the mirror identifier. It serves as a name that uniquely identifies a single mirror. It helps if these identifiers are easy for a human to remember.

For all mb commands where a mirror (or several) needs to be specified, you can abbreviate the identifier by typing part of it. For instance, instead of:

mb show <full identifier>

you could just type:

mb show uib

as long as uib uniquely identifies one mirror among the others.
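
The abbreviation rule can be pictured as a lookup that accepts a partial identifier only when it matches exactly one mirror. The following is a minimal sketch of that idea, not mb's actual implementation; the function name resolve_identifier, the use of substring matching, the rule that an exact match wins, and the sample identifiers are all assumptions made for illustration.

```python
def resolve_identifier(partial, identifiers):
    """Return the single identifier containing `partial`, or raise ValueError."""
    # Assumption: an exact match wins, even if it is part of other identifiers.
    if partial in identifiers:
        return partial
    matches = [ident for ident in identifiers if partial in ident]
    if not matches:
        raise ValueError("no mirror matches %r" % partial)
    if len(matches) > 1:
        raise ValueError("%r is ambiguous: %s" % (partial, ", ".join(sorted(matches))))
    return matches[0]


# Hypothetical identifiers, for illustration only.
mirrors = ["mirror.uib.example", "ftp5.example", "ftp.example"]
print(resolve_identifier("uib", mirrors))
```

In this sketch, "ftp" would be rejected as ambiguous because it occurs in two identifiers, while "ftp5" resolves to a single mirror.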

The mb command is extensible. See the developer documentation for instructions (yet to be written).

Built-in help

mb has reference documentation built-in. If you just call mb or mb -h or mb help, it will print out the list of known subcommands:

 % mb
    mb COMMAND [ARGS...]
    mb help [COMMAND]

    --version           show program's version number and exit
    -h, --help          show this help message and exit
    -d, --debug         print info useful for debugging
    -b BRAIN_INSTANCE, --brain-instance=BRAIN_INSTANCE
                        the mirrorbrain instance to use. Corresponds to a
                        section in /etc/mirrorbrain.conf which is named the
                        same. Can also specified via environment variable MB.

    commentadd     add a comment about a mirror
    db (vacuum)    perform database maintenance
    delete         delete a mirror from the database
    dirs           show directories that are in the database
    disable        disable a mirror
    edit           edit a new mirror entry in $EDITOR
    enable         enable a mirror
    export         export the mirror list as text file
    file           operations on files: ls/rm/add
    help (?)       give detailed help on a specific sub-command
    instances      list all configured mirrorbrain instances
    iplookup       lookup stuff about an IP address
    list           list mirrors
    markers        show or edit marker files
    mirrorlist     generate a mirror list
    new            insert a new mirror into the database
    probefile      list mirrors on which a given file is present by probing...
    rename         rename a mirror's identifier
    scan           scan mirrors
    score          show or change the score of a mirror
    show           show a mirror entry
    test           test if a mirror is working
    update         update mirrors network data in the database

By typing mb <command> -h or mb help <command>, help for the individual command will be printed:

 % mb help list
list: list mirrors

    mb list [IDENTIFIER]
    -h, --help          show this help message and exit
    -r XY               show only mirrors whose region matches XY (possible
                        values: sa,na,oc,af,as,eu)
    -c XY               show only mirrors whose country matches XY
    -a, --show-disabled
                        do not hide disabled mirrors
    --disabled          show only disabled mirrors
    --prio              also display priorities
    --asn               also display the AS
    --prefix            also display the network prefix
    --region            also display the region
    --country           also display the country
    --other-countries   also display other countries that a mirror is
                        configured to handle

Creating a new mirror

The necessary ingredient is a set of mirror servers. They need to serve content via HTTP or FTP. To be scanned, they need to run rsync, FTP or HTTP. rsync is the most efficient for this; FTP is the second choice. As a last resort, HTTP may be used, but it will work only if the HTTP server provides a reasonably “standard” directory index.

To make a new mirror known to the database, you use the mb command, specifically the mb new subcommand. An example would be the following:

mb new -H \
                       -F \
                       -R rsync://

This creates a new entry in the mirror database with the data provided on the commandline.

Because providing a lot of data on the commandline can be tiresome, and incremental changes are often needed to get the data right, there is a command to edit the data later: mb edit.

A new mirror created this way is disabled in the beginning, because it needs to be scanned first before it can be useful.

Enabling a mirror

Enabling a mirror, or more correctly enabling redirections to a mirror, can be done with the command mb enable.

Before doing this for the first time, the mirror needs to be scanned to be useful; see below (Scanning mirrors).

Another way to enable a mirror is to edit its database record directly with mb edit (see Editing a mirror below).

Disabling a mirror

Using the mb disable command, a mirror can be disabled, and MirrorBrain will immediately stop sending requests to it.

Another way to disable a mirror is to use mb edit to edit its database record and change the enabled field to False or 0. At the same time, a note about the reason can be left in the comment field.

Disabled mirrors are not scanned. Thus, it is usually advisable to scan a mirror before re-enabling it after a prolonged period of inactivity, using mb scan -e.

A mirror will also effectively be disabled if the score is set to 0.
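
To see why a score of 0 has the same effect as disabling a mirror, it helps to think of the score as a selection weight. The sketch below is only an assumption about how such weighting could work, not MirrorBrain's actual selection algorithm; pick_mirror and the sample records are hypothetical.

```python
import random


def pick_mirror(mirrors, rng=random):
    """Pick one mirror at random, weighted by score; return None if none qualifies."""
    # A mirror takes part only if it is enabled and has a positive score.
    candidates = [m for m in mirrors if m["enabled"] and m["score"] > 0]
    if not candidates:
        return None
    weights = [m["score"] for m in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical records, for illustration only.
mirrors = [
    {"identifier": "a", "score": 150, "enabled": True},
    {"identifier": "b", "score": 50, "enabled": True},
    {"identifier": "c", "score": 0, "enabled": True},     # never chosen
    {"identifier": "d", "score": 100, "enabled": False},  # never chosen
]
counts = {}
rng = random.Random(0)
for _ in range(1000):
    chosen = pick_mirror(mirrors, rng)["identifier"]
    counts[chosen] = counts.get(chosen, 0) + 1
print(counts)
```

With weights like these, mirror "a" is picked roughly three times as often as "b", and "c" and "d" never appear in the counts.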

Deleting a mirror

A mirror is deleted with the mb delete command. This command is an exception to the rule of abbreviating mirror identifiers; here, the full and exact identifier of the mirror to be deleted must be specified. This guards against deleting the wrong mirror through a typo.

A deleted mirror is permanently pruned from the database upon completion of the command.

Displaying details about a mirror

mb show will print out the metadata of a mirror. Example:

 % mb show uib
identifier     :
operatorName   : UiB - University of Bergen, IT services
operatorUrl    :
baseurl        :
baseurlFtp     :
baseurlRsync   : rsync://
region         : eu
country        : no
asn            : 224
prefix         :
regionOnly     : False
countryOnly    : False
asOnly         : False
prefixOnly     : False
otherCountries :
fileMaxsize    : 0
publicNotes    :
score          : 100
enabled        : True
statusBaseurl  : True
admin          : X, Y, ...
adminEmail     :
---------- comments ----------
Added - Wed May  6 14:36:10 2009

*** scanned and enabled at Wed May  6 14:47:56 2009.

Gave stage access.
poeml, Mon May 11 16:11:56 CEST 2009

Adjusted FTP URL after they switched to stage. (appended "opensuse").
rsync down at the moment.
poeml, Mon May 11 17:18:06 CEST 2009
---------- comments ----------

A mirror record explained

Field          : Explanation
identifier     : This is the unique identifier of the mirror server. In the table shown by mb edit, this is the only field that cannot be edited. To rename an identifier, you can use the mb rename command.
operatorName   : The real name of the mirror operator. This could be a person, the organization running the mirror, or a sponsor. If the mirror list is exposed in some way, this field could be used to give the operator some visibility. Otherwise, it has no significance beyond being informational.
operatorUrl    : A contact or informative URL.
baseurl        : The root HTTP URL of the mirrored file tree on the mirror. Used by the redirector to redirect requests via HTTP. If a mirror doesn’t offer HTTP, but only FTP, an FTP URL can be entered here as well.
baseurlFtp     : The root FTP URL of the mirrored file tree on the mirror. Used by the scanner to retrieve the file list if rsync isn’t available.
baseurlRsync   : The root rsync URL used by the scanner to find the files via rsync. It’s possible to use URLs with credentials, like rsync://<username>:<password>@<hostname>/module. rsync is the preferred method of scanning, so it is beneficial if rsync access exists. If it doesn’t, the scanner falls back to FTP or HTTP.
region         : The region code specifying the continent the mirror server is located in. See also regionOnly. If you create a new mirror, mb new tries to fill in this field and the next one for you; both can be edited later, though.
country        : The country code for the server. See also countryOnly.
asn            : Optional. The number of the autonomous system the mirror is located in. It may serve as a more specific “network location” than the country, and is filled in automatically when a mirror is created. If you don’t use the autonomous system database together with MirrorBrain, the value will be zero and will be ignored by MirrorBrain. It can also be edited manually, or updated from looked-up data via mb update --asn <identifier>. Only meaningful if MirrorBrain is used together with mod_asn.
prefix         : Like asn, this value is optional, and if present, it is used for a possibly finer-grained mirror selection. It is filled in automatically, and can be edited like asn. Use mb update --prefix <identifier> to fill in data from a routing table lookup.
regionOnly     : If true, only clients from the same region (continent) as the mirror are redirected to this mirror.
countryOnly    : If true, only clients from the same country as the mirror are redirected to this mirror.
asOnly         : If true, the mirror will only get requests from clients that are located within the same network autonomous system (using the value in asn).
prefixOnly     : If true, the mirror will only get requests from clients that are located within the same network prefix (using the value in prefix).
otherCountries : List of other countries that should be sent to this mirror server. This overrides the country and region choice, and can be used to fine-tune mirror selection. The list of country IDs specified here is given in the form of comma-separated two-letter codes. Apache does a simple string match on these; a value that would make sense is ca,mx,ar,bo,br,cl,co,ec,fk,gf,gy,pe,py,sr,uy,ve, jp, for instance.
fileMaxsize    : Maximum file size the server can deliver without problems (some servers have problems with files > 2GB, for example). MirrorBrain automatically checks HTTP servers for correct delivery, so there is no need to define this value for that reason. It can be used, however, to cause only “small” requests to go to certain mirrors that are known to have too little bandwidth to deliver large files. If you set a threshold here (in bytes), the mirror will only get requests for files that are smaller.
publicNotes    : Notes that could be added to an HTML page listing all mirrors. The field may be used to store information separately from private notes taken in the comments field. The data isn’t exposed, though, unless you take care of it.
score          : The score (priority) of the server. Higher scored servers are used more often than lower scored servers. Default is 100. A server with score=150 will be used more often than a server with score=50.
enabled        : Whether a mirror gets requests. Use this to enable redirects to a mirror, or switch them off. Can also be set with mb enable/disable <identifier>.
statusBaseurl  : This field is edited by the mirror probe each time it runs (which normally is done frequently via cron). If it’s true, the mirror probe found that the mirror was alive the last time it looked.
admin          : Name of an admin or contact person for the mirror.
adminEmail     : Contact email address.
comments       : Free text field for additional comments. Use it in any way that suits you. It lends itself to taking notes about communication with mirrors, for instance.
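
The restriction fields (regionOnly, countryOnly, asOnly, prefixOnly and fileMaxsize) can be read as a chain of filters applied to each request. The following sketch illustrates that reading; it is not the redirector's actual code, and the function name is_eligible, the client dictionary and the sample prefixes are hypothetical. It treats fileMaxsize = 0 as "no limit", which is an assumption based on the value shown in the mb show example. The otherCountries override is left out for brevity.

```python
def is_eligible(mirror, client, file_size):
    """Return True if `mirror` may serve a file of `file_size` bytes to `client`."""
    if not mirror["enabled"]:
        return False
    if mirror["regionOnly"] and client["region"] != mirror["region"]:
        return False
    if mirror["countryOnly"] and client["country"] != mirror["country"]:
        return False
    if mirror["asOnly"] and client["asn"] != mirror["asn"]:
        return False
    if mirror["prefixOnly"] and client["prefix"] != mirror["prefix"]:
        return False
    # Assumption: 0 means "no size limit"; otherwise only smaller files qualify.
    if mirror["fileMaxsize"] and file_size >= mirror["fileMaxsize"]:
        return False
    return True


# region, country and asn follow the mb show example; everything else is made up.
mirror = {
    "enabled": True, "region": "eu", "country": "no", "asn": 224,
    "prefix": "192.0.2.0/24", "regionOnly": False, "countryOnly": True,
    "asOnly": False, "prefixOnly": False, "fileMaxsize": 0,
}
client_no = {"region": "eu", "country": "no", "asn": 64500, "prefix": "198.51.100.0/24"}
client_de = {"region": "eu", "country": "de", "asn": 64501, "prefix": "203.0.113.0/24"}
print(is_eligible(mirror, client_no, 10**9))
print(is_eligible(mirror, client_de, 10**9))
```

Because countryOnly is set in the sample record, only the client from the mirror's own country passes the check.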

Editing a mirror

A mirror (in the mirror database) can be edited with the mb edit command.

The command will bring up an editor with the mirror’s metadata. The EDITOR and VISUAL environment variables are respected, and the editor defaults to vim.

For fields where a Boolean is expected, you can type the value (while editing) in the form of 0/1 instead of true/false (shorter to type).
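
The accepted spellings can be pictured as a small normalization step applied when the edited text is read back. This is a sketch under assumptions, not mb's actual parser: the function name parse_bool is hypothetical, and so are the case-insensitive matching and the error raised for anything else.

```python
def parse_bool(text):
    """Interpret 0/1 or true/false (case-insensitive) as a Boolean."""
    value = text.strip().lower()
    if value in ("1", "true"):
        return True
    if value in ("0", "false"):
        return False
    raise ValueError("not a Boolean: %r" % text)


print(parse_bool("1"), parse_bool("False"))
```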

When you save the text and close the editor, you’ll be asked whether to save the data to the database.

Editing a mirror’s network location

For some fields in the mirror record, manual editing makes little sense. These are:

  • country,
  • region,
  • autonomous system number,
  • network prefix,
  • geographical coordinates.

When a mirror is created (using mb new), all these fields are filled in automatically. This requires a working DNS lookup and a GeoIP database.

The lookup of the autonomous system number and network prefix requires mod_asn to be configured.

The geographical coordinates require the GeoIP database to be the GeoIP city (lite) version. The smaller database versions don’t contain the coordinates.

The data can be updated later with the mb update command. Regularly running this command (say, once a month) is a good idea because the data sometimes might change over time. However, this also means that manual edits will be overwritten.

To update all network data for all mirrors, simply run:

% mb update -A --all-mirrors

The command can also be used for individual mirrors, and to update only some data:

 % mb update --coordinates --asn --prefix ftp5
updating geographical coordinates for (0.000 0.000 -> 53.083 8.8)

Or it can be applied to all active mirrors:

 % mb update --coordinates --asn --prefix
updating geographical coordinates for (0.000 0.000 -> 36.0 138.0)
updating network prefix for ( ->

Listing mirrors

mb list lists mirrors, with more or less detail. In its simplest form, the command prints the identifiers of all enabled mirrors. mb list -a also includes the disabled mirrors.

It is more useful to add filters or to display more data.

Example of filtering by country code (here: Bulgaria, bg):

 % mb list -c bg

Example of filtering by region (here: Oceania, oc), and also displaying the value of the otherCountries field for each mirror:

 % mb list -r oc --other-countries
               nz
               nz
               nz
               nz
               nz
               nz
               nz

Example of listing all mirrors in Portugal and showing their score (their priority):

 % mb list -c pt --prio
               100
                50
                50
                 3

Showing priority, network prefix and autonomous system of Chinese mirrors:

 % mb list -c cn --prio --as --prefix
               100  4134
                30  9389
               100 17816

When not filtering the output, the --country and --region commandline options are useful, because they add that data to the output. An example would be listing all mirrors with the command mb list --prio --as --prefix --country --region.

Scanning mirrors

Mirrors need to be scanned for their file lists. This is done with the mb scan command. The program will try rsync, if available, FTP if not, or HTTP if it’s the only option.
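
The order of preference can be expressed as a lookup over the URL fields of a mirror record. The sketch below is an illustration only, not the scanner's actual logic; choose_scan_url is a hypothetical name, and the sample URLs are made up. The field names are those shown by mb show.

```python
def choose_scan_url(record):
    """Return (method, url) for the first available scan method, or None."""
    preference = (
        ("rsync", "baseurlRsync"),  # preferred: most efficient
        ("ftp", "baseurlFtp"),      # second choice
        ("http", "baseurl"),        # last resort; note that baseurl may hold an FTP URL
    )
    for method, field in preference:
        url = record.get(field)
        if url:
            return method, url
    return None


# Hypothetical record without rsync access.
record = {
    "baseurl": "http://mirror.example/pub/",
    "baseurlFtp": "ftp://mirror.example/pub/",
    "baseurlRsync": "",
}
print(choose_scan_url(record))
```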

An individual mirror can be scanned like this:

 % mb scan roxen
Fri Jul 31 21:31:50 2009 starting
Fri Jul 31 21:31:51 2009 total files before scan: 17248
Fri Jul 31 21:31:59 2009 scanned 17248 files (1935/s) in 8s
Fri Jul 31 21:31:59 2009 files to be purged: 0
Fri Jul 31 21:32:00 2009 total files after scan: 17248
Fri Jul 31 21:32:00 2009 purged old files in 1s.
Fri Jul 31 21:32:00 2009 done.
Completed in 9 seconds

A newly created mirror is disabled at first. A typical workflow is to create it, scan it, and then enable redirection. The mb scan command can be used with the -e/--enable option to do the last two steps at once. If the scan succeeds, the mirror is enabled afterwards:

 % mb scan -e tuwien
Fri Jul 31 21:50:45 2009 starting
Fri Jul 31 21:50:45 2009 total files before scan: 712
Fri Jul 31 21:50:46 2009 scanned 712 files (511/s) in 1s
Fri Jul 31 21:50:46 2009 files to be purged: 0
Fri Jul 31 21:50:46 2009 total files after scan: 712
Fri Jul 31 21:50:46 2009 purged old files in 0s. now enabled.
Fri Jul 31 21:50:46 2009 done.
Completed in 1 seconds

To scan all enabled mirrors in parallel, use the -a/--all option together with the -j/--jobs=N option, which specifies the number of scanners to start in parallel:

% mb scan -j 16 -a

This is likely what you would configure to be done periodically by cron.

To scan only a subdirectory on the mirrors, the -d option can be used. This can be useful when it is known that content has been added or removed in particular places of large trees. The following example shows this with a single mirror only:

 % mb scan -d repositories/Apache ftp5
Checking for existance of 'repositories/Apache' directory
Scheduling scan on:
Completed in 0 seconds
Fri Jul 31 21:41:37 2009 starting
Fri Jul 31 21:41:38 2009 files in 'repositories/Apache' before scan: 780
Fri Jul 31 21:41:40 2009 scanned 780 files (636/s) in 1s
Fri Jul 31 21:41:40 2009 files to be purged: 0
Fri Jul 31 21:41:42 2009 total files after scan: 760122
Fri Jul 31 21:41:42 2009 purged old files in 2s.
Fri Jul 31 21:41:42 2009 done.
Completed in 4 seconds

For debugging purposes, the -v option is useful. It can be repeated several times to enable more output.

Listing files

Files known to the database can be listed with the mb file ls command. When specifying a path name, the leading slash is optional and not relevant. (Internally, the filenames are stored without it.)


 % mb file ls /distribution/11.1/repo/oss/suse/ppc/tcsh-6.15.00-93.3.ppc.rpm
as th  100 ok       ok
eu at  100 disabled dead
eu at  100 ok       ok
eu de  100 ok       ok
eu hu  100 ok       ok

Globbing can be used. To see the matching filenames in addition to the list of mirrors, use the -u/--url option:

 % mb file ls \*.iso -u
as th  100 ok       ok          
as th  100 ok       ok          
as th  100 ok       ok          
as th  100 ok       ok          
eu de  100 ok       ok          
eu hu  100 ok       ok    

In addition to just listing what’s known to the database, the command can also do probing. The number is the HTTP return code (200 for OK):

 % mb file ls /distribution/11.1/repo/oss/suse/ppc/tcsh-6.15.00-93.3.ppc.rpm --probe
as th  100 ok       ok                    200
eu at  100 disabled dead
eu at  100 ok       ok                 200
eu de  100 ok       ok                    200
eu hu  100 ok       ok              200

When used with probing, there is the additional option to actually download the content and display a checksum of what was returned:

 % mb file ls --probe /distribution/11.1/repo/oss/suse/ppc/tcsh-6.15.00-93.3.ppc.rpm --md5
as th  100 ok       ok                    200 50dc50b20a97783a51ff402359456e3a
eu at  100 disabled dead
eu at  100 ok       ok                 200 50dc50b20a97783a51ff402359456e3a
eu de  100 ok       ok                    200 50dc50b20a97783a51ff402359456e3a
eu hu  100 ok       ok              200 50dc50b20a97783a51ff402359456e3a

To be usable with lots of mirrors, the probing is done in parallel.
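
The idea of probing many mirrors at once, and comparing checksums of what they return, can be sketched with a thread pool. This is not how mb implements it; probe_all, fake_fetch and the URLs are hypothetical, and the fetch function is a stand-in so that the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib


def probe_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL in parallel; return {url: (status, md5 or None)}."""
    def probe(url):
        status, body = fetch(url)
        digest = hashlib.md5(body).hexdigest() if body is not None else None
        return url, (status, digest)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(probe, urls))


def fake_fetch(url):
    """Stand-in for a real HTTP request: returns (status code, body or None)."""
    if "dead" in url:
        return 404, None
    return 200, b"example content"


urls = [
    "http://mirror-a.example/file",
    "http://mirror-b.example/file",
    "http://dead.example/file",
]
results = probe_all(urls, fake_fetch)
for url in urls:
    print(url, *results[url])
```

Mirrors that return the same content show the same checksum, so a mirror serving a stale or corrupted file stands out at a glance.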

The mb file command can also be used as mb file add and mb file rm to manipulate the database. See the help output of the command for details.

Exporting mirror lists

The mb export command can export data from the mirror database in several different formats, for different purposes.

Exporting in mirmon format

mirmon is a program written by Henk P. Penning which monitors the status of mirrors. The format “mirmon” exports a list of mirrors in a text format that can be read by mirmon.

With this, it is straightforward to deploy mirmon and automate it to use the mirrors from the database. Thus, no separate list of mirrors needs to be maintained for it.

The command mb export --format=mirmon generates the list that mirmon needs, which basically looks like this:

 % mb export --format=mirmon | head
de  <...@...>
de   <...@...>
de      rsync://     <...@...>
us    <...@...>
us     <...@...>
us      rsync://       <...@...>

To give a full example, here’s what an actual mirmon config file could look like. Note the mirror_list line, which pulls in the generated list:

mirror_list /usr/bin/mb export --format=mirmon |
web_page /var/www/
icons icons
probe /usr/bin/wget -q -O - -T %TIMEOUT% -t 1 %URL%timestamp.txt
state /home/mirrorbrain/mirmon/state
countries /usr/local/mirmon-2.3/countries.list
list_style plain
timeout 20

The cron job to create the list and run mirmon would look like this:

30 * * * *   mirrorbrain    perl /usr/local/mirmon-2.3/mirmon -q -get update -c /etc/mirmon.conf

Note: when mirmon is run for the first time, the state file needs to be touched, or the script will not run.

The icons which are included in the resulting HTML page need to be made available by Apache:

Alias /mirmon/icons /usr/local/mirmon-2.3/icons
<Directory /usr/local/mirmon-2.3/icons>
    Options None
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>

Further tips:

  1. If your mirmon is configured with list_style apache instead of list_style plain, a different mirror list format is needed; use mb export --format=mirmon-apache in that case.

  2. If you prefer to run mb export under a different user id than mirmon, you can write the mirror list to an intermediate file, and configure mirmon to use the file like this:

    mirror_list /path/to/mirmon/mirrorlist-export
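
When the list is written to an intermediate file, it is worth making sure that mirmon never reads a half-written file. A minimal sketch of one way to do that follows; export_to_file is a hypothetical helper, not part of MirrorBrain. It writes to a temporary file and renames it into place, and it makes the result world-readable so that a different user can read it. The demo uses a stand-in command so that it runs without mb installed.

```python
import os
import subprocess
import sys
import tempfile


def export_to_file(command, target):
    """Run `command` and atomically replace `target` with its standard output."""
    output = subprocess.run(command, check=True, capture_output=True).stdout
    directory = os.path.dirname(os.path.abspath(target))
    # Write next to the target, then rename: readers see the old or the new list.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(output)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file readable by its owner only
        os.replace(tmp_path, target)
    except BaseException:
        os.unlink(tmp_path)
        raise


# Demo with a stand-in command. With MirrorBrain installed, the command would be
# ["mb", "export", "--format=mirmon"] and the target the path given to mirror_list.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_target = os.path.join(demo_dir, "mirrorlist-export")
    export_to_file([sys.executable, "-c", "print('demo line')"], demo_target)
    with open(demo_target) as handle:
        print(handle.read().strip())
```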

Exporting to a Version Control System (VCS)

Exporting data in text format is a dead easy way to keep a history of changes that happen in the mirror database — and mail them around, so everybody involved is kept updated. At the same time, it serves archival purposes.

The idea is to export snapshots of the data in text format. The resulting files are put into a standard version control system, and standard post-commit hook scripts can be used to trigger certain actions (e.g. email).

The resulting archive of changes is all human-readable (much more useful than raw database backups). The changes can actually be mailed around in the form of a diff, showing some context.
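
To illustrate why text snapshots are convenient here, the following sketch produces a unified diff between two versions of a record, using Python's standard difflib module. It only demonstrates the principle; in the setup described in this section the diff is produced by Subversion's tooling. The function name snapshot_diff, the mirror name and the record layout (borrowed from the mb show output) are assumptions.

```python
import difflib


def snapshot_diff(old_text, new_text, name):
    """Return a unified diff between two text snapshots of one mirror record."""
    return "".join(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=name + " (before)",
        tofile=name + " (after)",
    ))


# Hypothetical snapshots of one record, an hour apart.
old = "score          : 100\nenabled        : True\n"
new = "score          : 50\nenabled        : True\n"
print(snapshot_diff(old, new, "example-mirror"))
```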

A different way to implement a notification system for mirror changes would be to notify about each and every change done to the database — however, often changes have to be done incrementally and this would be a noisy method when working on a mirror’s configuration.

Instead, an hourly snapshot is normally sufficient to keep others informed, and shouldn’t be too noisy.

Subversion is the only version control system supported at the moment, but should hopefully be ubiquitous enough.

To set this up, first a repository needs to be created:

doozer:~ # su - mirrorbrain
mirrorbrain@doozer:~> svnadmin create mirrors-svn-repos
mirrorbrain@doozer:~> svn co file://$PWD/mirrors-svn-repos mirrors-svn
Checked out revision 0.

Then, set up a cron job to run every hour, calling mb export with the --format=vcs and the --commit=svn options. The latter automatically runs svn commit after the export (taking into account files that have been deleted, or occur for the first time):

 # export mirrordb contents to SVN and send commit mails
7 * * * *      mirrorbrain   mb export --format vcs --target-dir ~/mirrors-svn --commit=svn

Finally, what is still missing is the post-commit hook script, which takes care of sending mails. Create and edit it as follows:

mirrorbrain@doozer:~> touch mirrors-svn-repos/hooks/post-commit
mirrorbrain@doozer:~> chmod +x mirrors-svn-repos/hooks/post-commit
mirrorbrain@doozer:~> vi mirrors-svn-repos/hooks/post-commit

/usr/share/subversion/tools/hook-scripts/mailer/ commit "$REPOS" "$REV" /etc/mailer.conf

The path to the script likely needs adjustment. The configuration (/etc/mailer.conf) could look like this:

mail_command = /usr/sbin/sendmail

diff = /usr/bin/diff -u -L %(label_from)s -L %(label_to)s %(from)s %(to)s
generate_diffs = add copy modify
show_nonmatching_paths = yes

for_repos = /home/mirrorbrain/mirrors-svn-repos
from_addr = mirrorbrain@...
to_addr = admin@foo bar@...
commit_subject_prefix = [mirrordb]
propchange_subject_prefix = [mirrordb]

Exporting in PostgreSQL format

The format “postgresql” creates SQL INSERT statements that can be run on a PostgreSQL database. This can e.g. be used to migrate the data into another database.

The resulting dump could be loaded into a mirrorbrain instance like this:

mb db shell < db.dump

Exporting in Django format

This is experimental stuff — intended for hacking on the Django web framework. Data is exported in the form of Django ORM objects, and the export routine will very likely need modification for particular purposes. The existing code has been used to experiment with. Get in contact if you are interested in hacking on this!

Performing database maintenance

The mb db command offers some helpful functionality regarding database maintenance. It has several subcommands.

Regular cleanups with mb db vacuum

This command cleans up unreferenced files from the mirror database.

This should be done once a week for a busy file tree. Otherwise it should be rarely needed, but can possibly improve performance if it is able to shrink the database.

When called with the -n option, only the number of files to be cleaned up is printed, so it’s purely for information. No cleanup is performed.

The recommended cron job looks like this:

# Monday: database clean-up day...
30 1 * * mon              mirrorbrain   mb db vacuum

Note: This functionality is not to be confused with PostgreSQL’s internal vacuuming, which typically happens automatically these days (8.x), but was a manual process at some time in the past.

Database shell with mb db shell

With this command, you can conveniently open a database shell:

 % mb db shell
psql (8.4.1)
Type "help" for help.


...ready to enter commands in psql, the PostgreSQL interactive terminal.

Database size info with mb db sizes

The command mb db sizes prints the size of each database relation. (In PostgreSQL speak, a relation is a table or an index.) This provides insight for appropriate database tuning and planning. Here’s an example:

 % mb db sizes
Size(MB) Relation
464.5    filearr
532.9    filearr_path_key
 74.3    filearr_pkey
 23.8    pfx2asn
 30.1    pfx2asn_pfx_key
 19.9    pfx2asn_pkey
  0.0    pg_foreign_server
  0.0    pg_foreign_server_name_index
  0.0    pg_foreign_server_oid_index
  0.0    pg_user_mapping_user_server_index
  0.2    server
  0.0    server_enabled_status_baseurl_score_key
  0.0    server_identifier_key
  0.0    server_pkey
  0.0    sql_sizing_profiles
Total: 1145.9

This example shows a very large database, containing nearly 3 million (!) files. It uses a good gigabyte of disk space.

filearr contains the file names and associations to the mirrors. filearr_path_key is the index on the file names. filearr_pkey is the primary key. These will be the largest things in a database filled with millions of files.
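
For planning purposes it can be handy to post-process this output, for example to see how much space the file-related relations take. The following is a small sketch that parses output in the layout shown above; parse_sizes is a hypothetical helper, and the sample is an excerpt of the example output.

```python
def parse_sizes(output):
    """Parse `Size(MB) Relation` lines into a {relation: size_in_mb} dict."""
    sizes = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        try:
            sizes[parts[1]] = float(parts[0])
        except ValueError:
            continue  # header line or "Total:" line
    return sizes


# Excerpt of the example output above.
sample = """\
Size(MB) Relation
464.5    filearr
532.9    filearr_path_key
 74.3    filearr_pkey
  0.2    server
Total: 1145.9
"""
sizes = parse_sizes(sample)
file_related = sum(size for name, size in sizes.items() if name.startswith("filearr"))
print("%.1f MB in filearr relations" % file_related)
```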

The pfx* relations are only present when mod_asn is installed. The size they use is always the same.