Title geographic distance ordering for finer mirror selection
Priority wish Status resolved
Superseder Nosy List poeml, theuni
Assigned To poeml Keywords

Created on 2009-12-11.15:56:04 by poeml, last changed by poeml.

msg104 (view) Author: poeml Date: 2009-12-11.15:56:03
In some countries (US, Germany) there can be a wealth of mirrors. A specifically 
suited mirror can be picked when there's on in the same AS or in the same 
network as the client. But otherwise, any mirror from the country will be 
picked, which could be from the different end of the country (East coast, West 
coast in the us; north/south in Germany). 

Another scenario is that no mirror is found in the client's country, but there 
are several in its continent. However, which one to choose? Currently, it is a 
random choice, which means that, for instance, any European country could be 
sent to any European mirror; in most cases, it would probably be better to use a 
neighbouring country or the closest one that can be found.

Here's how a finer mirror selection could be achieved in both these cases.

The GeoLite city(!) database provides geographical coordinates. The coordinates 
of the mirrors could be filled into the database's mirror records when mirrors 
are created, or later, similar as their network and AS data.

When clients' IP addresses are looked up via GeoIP during request processing, 
their geographical coordinates would (should) be available as well.

Using a planar approximation of the Haversine formula, the distance to available 
mirrors can be calculated, and mirrors sorted by their closeness to the client.
A simple approximation, that is not only easy to implement but also should incur 
very low possible overhead, would be described by the following formular:

distance = sqrt((x2 - x1)^2 + (y2 - y1)^2)

Thus, the available mirrors can be ordered according to their distance to the 

References: (Haversine formula) (simple approximation)

(The simple approximation should be suitable insofar the 180° border is pretty 
much between Alaska and Russia, and other than islands in the Pacific Ocean all 
countries should be usefully covered by it. About the Pacific Ocean I don't know 
much, but it is very likely that those clients resolve to satellite links 
anyway, and need to be treated specially.) 

There is one problem, though: We use a weighted randomization to assign mirrors 
that would be equally suitable. That is useful (and used) for load balancing. 
With introducing ordering by geographic distance, the question arises how to use 
that together with mirror priorities. Alternatives / thoughts:

- Should we let geographic distance override priorities? Can the latter become 
obsolete? I don't really think so.

- Should we use geographic ordering only in certain scenarios: when no country 
mirror is available, and when dealing with certain countries (US)?

- Should we use just stratify according to geographic distances, instead of 
sorting by distance? That is, select the 3 or 5 mirrors closest by distance, and 
then do the weighted randomization on them to pick one?

- Should the configured priority value and the measured geographic distance be 
mixed together into a single number, that's then used for mirror 

- Should we do geographic distance ordering only within a set of mirrors that 
shares the same priority?

- At any rate, the static assignment of fallback mirrors (through the 
other_countries field) should have precedence. For instance, Russian clients 
should still be sent to German mirrors according to configured country-level 
fallback, instead of sending them to China, because it is closer, while China 
and Russia entertain few connectivity.

- Since geographic distance itself is only a rough approximation of suitability 
of a mirror for a client, certain analogies that one could think of, like 
"radius to serve", are relatively moot, and not worth implementing.

This needs to be discussed.
msg111 (view) Author: poeml Date: 2009-12-21.19:12:21
I just discovered that the database scheme already contains fields for longitude 
and latitude -- I totally had forgotten that. They are just not shown by 'mb', 
because they are not used. That's cool because no schema migration is needed...
msg184 (view) Author: poeml Date: 2010-04-23.03:11:34
Geographical coordinates are meanwhile stored in the database for mirrors.

(Was implemented around r7936-r7939).
msg224 (view) Author: poeml Date: 2010-09-06.00:11:32
I haven't received any comments on this, but I'd still regard geographical 
distance as a very useful addition to the current mirror selection algorithm.
msg302 (view) Author: poeml Date: 2010-11-04.19:32:15
Something that I didn't realize earlier is that the city version of the GeoIP database 
also contains "regions" on a level below country. Here are some examples:

 # geoiplookup_city -f /var/lib/GeoIP/GeoLiteCity.dat.updated
Continent: NA
Country:   US
Region id: CA
Region:    California
City:      Redwood City
Latitude:  37.491402
Longitude: -122.210999

It works not only for US states, but also for German Bundesländer and French 

 # geoiplookup_city -f /var/lib/GeoIP/GeoLiteCity.dat.updated
Continent: EU
Country:   DE
Region id: 02
Region:    Bayern
City:      Gunzenhausen
Latitude:  49.099998
Longitude: 10.750000

 # geoiplookup_city -f /var/lib/GeoIP/GeoLiteCity.dat.updated
Continent: EU
Country:   FR
Region id: A8
Region:    Ile-de-France
City:      Paris
Latitude:  48.866699
Longitude: 2.333300

So another approach would be to first introduce this additional level in mirror 
selection. While this is a geographical approach, it may not give optimal results e.g. 
in Germany, where there is at least one huge autonomous system (680) which comprises 
most universities, but geographically spans whole Germany. 

Another aspect is that when geographical mirror selection becomes too fine-grained, 
then mirror priorization becomes less effective: while it is currently possible to 
assign only a few requests to certain mirrors, they would then mandatorily get all 
traffic from their "region".

But in the US, adding the GeoIP region (state level) would likely be beneficial.
msg303 (view) Author: poeml Date: 2010-11-04.19:43:40
Two arguments for geographical distance:

1) Mirror selection by geographical distance could make the "region" level in the 
US superfluous, achieve the same, and it would work also for locations where there 
is no mirror in the exact same state. By geographical distance, it would be easy to 
pick a mirror in a neighbour state, instead of falling back to country level.

2) In addition, we could artificially "increase" distances of mirrors with low 
priority, to attract more requests to farther, but more powerful mirrors -- and 
thereby keeping excessive traffic away from small mirrors.

On the other hand, when we introduce "state" level in mirror selection, we would 
also add a state_only flag that restricts mirror use to that state. Maybe also 
other_states to add more states to deal with.

Also, an AS match would always take precedence, which helps if a mirror is in an AS 
that spans many states.
msg304 (view) Author: poeml Date: 2010-11-05.14:10:47
Experimenting with geographical distance is promising. Here's the findings for a client in Belgium, with no 
mirror in the country:

[Fri Country 'BE', Continent 'EU'
[Fri Nov 05 14:49:42 2010] [warn] [client] [mod_mirrorbrain] AS '5400', Prefix '', 
lat/lng 50.833302,4.333300 state id 11, state 'Brussels Hoofdstedelijk Gewest'
same region:                  (score  100) (rank    6777729) (dist 19.69) 
same region:                    (score   80) (rank    8786956) (dist 7.08)  
same region:       (score  100) (rank    6322872) (dist 5.27)  
same region:                  (score  100) (rank    8061858) (dist 4.86)  
same region:                  (score  100) (rank    8365641) (dist 25.94) 
same region:                 (score  100) (rank     502272) (dist 16.17) 
same region:             (score  100) (rank    2606190) (dist 8.58)  
same region:                    (score  100) (rank     671658) (dist 2.80)  
same region:           (score  100) (rank   10110186) (dist 3.18)  
same region:        (score  100) (rank    3443310) (dist 16.17) 
same region:               (score  100) (rank    2029689) (dist 5.46)  
same region:            (score  100) (rank    3410610) (dist 6.79)  
same region:                     (score  100) (rank    3767694) (dist 14.25) 
same region:              (score  100) (rank    2693499) (dist 10.16) 
same region:             (score  100) (rank   10623903) (dist 7.79)  
same region:                  (score   50) (rank   19739735) (dist 10.87) 
same region:             (score  100) (rank    8726322) (dist 2.92)  
same region:                 (score  100) (rank    8012808) (dist 2.80)  
same region:                   (score   80) (rank   12016420) (dist 16.07) 
same region:             (score  100) (rank    5669853) (dist 12.09) 

By geographical distance, a mirror in France or Netherlands would have been chosen, which sounds better than e.g. 
one in Portugal.

Example of a client in Germany:

Country 'DE', Continent 'EU'
AS '3320', Prefix '', lat/lng 50.666698,12.616700 state id 13, state 'Sachsen'
same country:       (score  100) (rank    4746078) (dist 3.92)
same country:             (score  100) (rank     951243) (dist 0.34)
same country:           (score  100) (rank    2549292) (dist 5.19)
same country:            (score  100) (rank    1006179) (dist 1.94)

Here, the mirror in Chemnitz (right around the corner) would have been preferred.

An example with another client in Germany, this time one in the huge AS 680 which sports a lot of mirrors (only 3 
in my test setup):

Country 'DE', Continent 'EU'AS '680', Prefix '', lat/lng 51.966702,7.633300 state id 07, state 
same AS:             (score  100) (rank    7522635) (dist 5.40)
same AS:           (score  100) (rank    2116017) (dist 0.64)
same AS:            (score  100) (rank    3404724) (dist 4.12)
same country:       (score  100) (rank    4702914) (dist 3.56)

Here, the mirror in Hagen is indeed much closer than the other two mirrors in the AS.
msg305 (view) Author: poeml Date: 2010-11-05.22:42:58
Too useful to not be implemented. Committed to trunk:

To be released with the upcoming version 2.14.0.
msg306 (view) Author: theuni Date: 2010-11-06.03:23:58
Looks very useful indeed peter, thanks!

msg309 (view) Author: poeml Date: 2010-11-11.16:23:54
The feature is in production at a few sites since a few days, and I got no 
catastrophic failure reports so far. 

On the centos-mirror mailing list, some interesting scenarios have been discussed 
where the new feature is just what's needed!

Closing as "done"!
Date User Action Args
2010-11-11 16:23:55poemlsetstatus: testing -> resolved
messages: + msg309
2010-11-06 03:23:59theunisetnosy: + theuni
messages: + msg306
2010-11-05 22:42:58poemlsetstatus: chatting -> testing
messages: + msg305
2010-11-05 14:10:47poemlsetmessages: + msg304
2010-11-04 19:43:40poemlsetmessages: + msg303
2010-11-04 19:32:15poemlsetmessages: + msg302
2010-09-06 00:11:32poemlsetmessages: + msg224
2010-09-05 23:53:25poemlsetassignedto: poeml
2010-04-23 03:11:34poemlsetmessages: + msg184
2009-12-21 19:12:21poemlsetmessages: + msg111
2009-12-11 15:56:04poemlcreate