Re: [mirror discuss] Re: mirrorbrain for sugar labs

From: Peter Pöml <poeml_at_cmdline.net>
Date: Fri, 25 Sep 2009 13:14:57 +0200
Hi!

On 25.09.2009, at 02:40, David Farning wrote:

> Peter asked me to continue a private thread on this mailing list.  Also
> CCing Matthew Zeier from mozilla infrastructure.  I was looking for
> Mozilla's solution and he pointed me in the direction of mirrorbrain.
>
> On Thu, Sep 24, 2009 at 5:25 PM, Peter Pöml <poeml_at_cmdline.net> wrote:
>> Hi David!
>>
>> thank you for writing. Interesting to learn about Sugar. It sounds exciting!
>>
>> Would you mind me resending my reply with the MirrorBrain mailing list
>> Cc'ed, and continue discussion there? I think it would be great material for
>> the list and could provide insight to others. It's also great to see some
>> activity there :-) (If not, no problem at all.)
>
> done

Thank you. (I actually had the other mailing list in mind,  
mirrorbrain@, but it doesn't matter much, the same people are  
subscribed there - the discuss list was meant more for discussion of  
mirror issues regardless of MirrorBrain; I should have said that. But  
I guess it doesn't matter that much!)

>> On 24.09.2009, at 22:33, David Farning wrote:
>>>
>>> I am looking at using mirrorbrain as the CDN for wki.sugarlabs.org .
>>> We are still pretty small; we generally have 200G per day but peak at
>>> 32000G per day during releases.
>>
>> That's not nothing ;) I'd say it is an amount where a carefully set up
>> infrastructure with mirrors makes sense. Also, it sounds like there would be
>> a lot of users that one wants to keep happy, and who would benefit from
>> every improvement. And from looking
>
> We are currently running our infrastructure from the FSF's colocation
> facility.  So I include keeping our generous host happy pretty high on
> the list.


Yes, understandably.


>>> On a normal day the majority of our traffic comes from
>>> activities.sugarlabs.org . a.sl.o is based off of mozilla's amo so
>>> anything we do here helps them.
>>
>> I see, http://activities.sugarlabs.org/ is very similar to
>> https://addons.mozilla.org/, and it offers download links to lots of .xo
>> files, and redirects to
>> http://download.sugarlabs.org/sources/activities/ from where the files are
>> downloaded.
>> For now, I only note the redirection to d.sl.o, and no further redirection
>> from there.
>>
>> I also see other downloads, like
>> http://wiki.sugarlabs.org/go/Sugar_on_a_Stick which links to some  
>> mirrors.
>>
>>> We have a small collection of mirrors that help us during releases.
>>> But, the user must manually choose between mirrors. Agggg.
>>
>> Okay, so my guess at this stage is that d.sl.o could redirect
>> to the mirrors, instead of delivering the .xo files all by itself; correct?
>> That would be exactly where MirrorBrain could step in.
>
>
> Yes, the two main pieces are the sugar on a stick images and the .xo files.


Okay.


>>> My questions are:
>>> 1. Is it worth it to use mirrorbrain at this stage?  Particularly
>>> around releases.
>>
>> Yes, definitely, the only thing to keep in mind is that deploying it costs
>> time, but I would think that it is worth the effort. If you have very few
>> mirrors, it can be the life-saver for the releases -- and if you gradually
>> get more mirrors, it will improve the service quality for the end users
>> because they can usually be routed to a better mirror.
>
> Yes,  this is particularly important because many of our large
> deployments are in remote regions.  Something like 80% of our .xo
> traffic is from Uruguay.


I see.


>> The effort in deployment is mainly in building and installing the software
>> and its different components. This is certainly doable and I'm happy to help
>> with it. If you run, say, purely on a CentOS5 based shop with aged Apache
>> and complicated deployment procedures, it can be difficult, but d.sl.o
>> rather seems to run Apache/2.2.11 on Ubuntu, which means that Apache is new
>> enough, and everything else will be available as well I guess. I would
>> actually like to build MirrorBrain packages on Ubuntu, and that might be a
>> reason to do that maybe?
>
> Everything except the build farm is Ubuntu.  Ubuntu packages would be
> nice.  But I am willing to build from scratch.


Which Ubuntu version specifically? In the openSUSE build service, I
can build for 8.04, 8.10, and 9.04. It would also be interesting for
me to become a real Debian package maintainer, but using the openSUSE
build service might be the quicker route for now. I managed to build
mod_asn for Debian and Ubuntu already (see
http://download.opensuse.org/repositories/Apache:/MirrorBrain/xUbuntu_9.04/),
and I'm confident that I could do the same for mod_mirrorbrain and
the other things that you would need. Those packages would then be
updated from a single source together with the various RPM packages
that are built, which would be of great convenience later.

Most of the needed dependencies should already be available for a modern
Debian/Ubuntu system. One thing that may need to be double-checked
is mod_geoip. It seems that this module is very outdated:
http://packages.qa.debian.org/liba/libapache2-mod-geoip.html has a
1.1.x version, and there is a newer package waiting in
http://mentors.debian.net/debian/pool/main/l/libapache2-mod-geoip/
but even that is already 1.5 years old.


>>> 2. How will mirrorbrain interact with a.sl.o (AMO)?  Will new
>>> activities just be served from the primary node until mirrorbrain runs
>>> a scan to verify that the new activity has been rsynced to a mirror
>>> node?
>>
>> MirrorBrain needs the file tree locally and can work off it as a normal
>> Apache. If it doesn't know a mirror for a file, Apache will just deliver it
>> as normal; if a mirror is known, Apache will redirect to it. Therefore,
>> publishing new files is just a matter of putting them into the file tree.
>> Later, mirrors will catch up, and as soon as they are scanned, Apache will
>> know about the presence on the mirrors and redirect to them.
>
> Ok great, so then we can modify the rsync so that only popular files
> are mirrored.  a.sl.o keeps every version of an activity in the main
> tree for historical purposes.  But there is no reason to keep copies
> on the mirrors.


Yes, this makes sense.
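
To make the serve-or-redirect behaviour described above a little more concrete, here is a rough Python sketch. It is only a toy model and not how mod_mirrorbrain is implemented: the class name, method names and the example mirror URL are made up for illustration, and mirror selection is reduced to "take the first known mirror", whereas a real redirector would choose a suitable mirror for the client.

```python
from dataclasses import dataclass, field


@dataclass
class Redirector:
    """Toy model of the serve-or-redirect decision (hypothetical, not mod_mirrorbrain)."""

    # Paths present in the local file tree of the origin server.
    local_files: set[str]
    # path -> base URLs of mirrors that the last scan found the file on.
    mirror_index: dict[str, list[str]] = field(default_factory=dict)

    def record_scan(self, mirror_base_url: str, paths_found: list[str]) -> None:
        """Remember which locally published files a scanned mirror carries."""
        for path in paths_found:
            if path in self.local_files:
                mirrors = self.mirror_index.setdefault(path, [])
                if mirror_base_url not in mirrors:
                    mirrors.append(mirror_base_url)

    def handle(self, path: str) -> tuple[int, str]:
        """Return (HTTP status, location or local path) for a request."""
        if path not in self.local_files:
            return 404, ""
        mirrors = self.mirror_index.get(path)
        if mirrors:
            # Simplification: always pick the first mirror known to have the file.
            return 302, mirrors[0].rstrip("/") + "/" + path
        # No mirror known yet (e.g. freshly published file): deliver it locally.
        return 200, path
```

A freshly published file is answered with 200 from the local tree; once a scan has recorded it on a mirror, the same request is answered with a 302 to that mirror. Files that are deliberately not mirrored (such as old activity versions) simply keep being served locally.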


>> If large amounts of content are published at once, it can be useful (or even
>> needed) to first publish them only for the mirrors, by putting them into a
>> stage area that they can access, and later update Apache's file tree, when
>> they are distributed enough. Another regime (useful if the file tree is
>> large and gets frequent, small updates) could be to push-sync files as soon
>> as they come in, and directly scan after each push.
>
> Ok, we can figure that out.  It would be cool if a.sl.o could trigger
> the push whenever a new activity is added.


I started working on some kind of framework for this purpose, because  
the same need arose at openSUSE in the past, and there it was  
implemented with some simple (and hard to maintain) shell scripts. I  
am thinking of a Django web app to configure the pushes for mirrors,  
and a little job queue that runs the push syncs, and which is  
triggered by e.g. an XML-RPC or REST interface, or by inotify events
directly from the filesystem.

The web frontend part I have almost implemented, and I've put some  
screenshots here to make the idea a little visible:

http://www.poeml.de/~poeml/MirrorSync/mirrors.png
http://www.poeml.de/~poeml/MirrorSync/modules.png
http://www.poeml.de/~poeml/MirrorSync/excludes.png

This is not of much practical use yet, but it might be an interesting  
path to go in the future. It's definitely something that other people/ 
projects also have a need for, so a reusable and simple framework  
could be useful I thought.

(The code is in a private SVN repository so far, just because I was  
experimenting with live data and needed to have passwords in the  
database)
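
As a rough illustration of the "little job queue that runs the push syncs" idea, a minimal Python sketch could look like the following. This is not the code from the private repository; the names Mirror, PushQueue and build_push_command, the rsync targets, and the choice of rsync options (-a, --delete, --exclude) are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Mirror:
    name: str
    rsync_target: str  # hypothetical, e.g. "rsync://mirror.example.org/upload/"


def build_push_command(source_dir: str, mirror: Mirror, excludes: Iterable[str]) -> list[str]:
    """Build the rsync command line that pushes the tree to one mirror."""
    cmd = ["rsync", "-a", "--delete"]
    cmd += [f"--exclude={pattern}" for pattern in excludes]
    # A trailing slash on the source makes rsync copy the directory contents.
    cmd += [source_dir.rstrip("/") + "/", mirror.rsync_target]
    return cmd


class PushQueue:
    """Queue one push per mirror whenever new content is announced."""

    def __init__(self, source_dir: str, mirrors: list[Mirror], excludes: Iterable[str] = ()):
        self.source_dir = source_dir
        self.mirrors = mirrors
        self.excludes = list(excludes)
        self._pending: deque[Mirror] = deque()

    def trigger(self) -> None:
        """Entry point for a web hook or filesystem watcher; avoids duplicate jobs."""
        for mirror in self.mirrors:
            if mirror not in self._pending:
                self._pending.append(mirror)

    def run_pending(self, runner: Callable[[list[str]], object]) -> list[str]:
        """Run all queued pushes; return the mirror names that should be scanned next."""
        pushed = []
        while self._pending:
            mirror = self._pending.popleft()
            runner(build_push_command(self.source_dir, mirror, self.excludes))
            pushed.append(mirror.name)
        return pushed
```

In real use the runner could be something like `lambda cmd: subprocess.run(cmd, check=True)`, and the returned mirror names would be handed to a scan so that the redirector learns about the new files right after each push.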


>> Maybe there is even an existing release infrastructure that one could
>> integrate with.
>
> We are not that fancy yet.
>
>>> 3. How does mirrorbrain work with mysql? Do the admin framework and
>>> tool set work with mysql yet?
>>
>> At the beginning of this year, I abandoned MySQL support in all the tools,
>> but the core (the mod_mirrorbrain Apache module) will work. The tools to
>> maintain the mirror database won't work, and while this could probably be
>> fixed, I can say that when the list of mirrors is not long, and one is
>> proficient in the mysql commandline, it is certainly possible to maintain
>> the mirror data manually with the mysql client. I did so for a long time in
>> fact, before I finally started to write some tools.
>>
>> I would recommend using PostgreSQL because that will result in a setup that
>> is clean and as documented, and also the database will be self-contained and
>> low-maintenance enough that it would not matter much to anyone which
>> database is used underneath.
>>
>> However, mod_mirrorbrain will happily use MySQL as file database. I am
>> *quite* sure that the scanner script also still works with MySQL, but I
>> can't promise, as I haven't tested it since I did the switch to PostgreSQL.
>>
>> I decided to switch to PostgreSQL because Apache's DBD framework cannot use
>> two different databases in one vhost yet, and I needed a special datatype in
>> PostgreSQL to implement mod_asn (which you won't need with only few mirrors;
>> don't bother to install it). I was aware that it might put off some people
>> that are more familiar with MySQL, but I can speak very positively about
>> PostgreSQL, it is a great piece of software and it was a pleasurable
>> experience to me to get acquainted with it. I am happy to help with that;
>> it's not difficult, just a little different.
>
> Using PostgreSQL is not a blocker.  So we can worry about that later.
>
>> It would of course be an option to re-implement MySQL support alongside
>> PostgreSQL, but my time has been too scarce so far to even consider
>> this, as there are other things that seem more important, such as the
>> lack of a web interface, that I would like to tackle.
>>
>>
>> Does this help further?
>
> So, I guess my next steps are:
> 1. Set up an openSUSE VM and install mirrorbrain to see how it is
> supposed to work.


I once created a VirtualBox image based on openSUSE 11.1, which may be  
the quickest way to have a look:
http://mirrorbrain.org/news/mirrorbrain-eval-virtualbox-appliance/
It contains a complete install and one or two (Firefox) mirrors set  
up, and it should allow you to immediately play with Apache as well as  
with the "mb" admin tool (see http://mirrorbrain.org/docs/mirrors/).

You could adjust the path to the file tree in the Apache configuration  
(see /etc/apache2/vhosts.d/*.conf), rsync a copy of the file tree into  
the image, add your mirrors to the database, scan them and you should  
have a working redirector then.

> 2. Set up an Ubuntu VM matching the sugar labs infrastructure and
> install mirrorbrain.
>
> I'll try to do that this weekend.  I am sure I will have questions.


As happy as I would be to directly assist you with it, I'll be away  
for the weekend unfortunately (and leave now). But I'm back on Monday!

Peter


Received on Fri Sep 25 2009 - 11:15:09 GMT
