[mirror discuss] Re: [distro-dev] progress with MirrorBrain

From: Peter Poeml <poeml_at_cmdline.net>
Date: Thu, 16 Jul 2009 01:10:07 +0200

Hi Andrea!

Thanks a lot for your feedback, it's really appreciated.

On Wed, Jul 15, 2009 at 11:49:13PM +0200, Andrea Pescetti wrote:
> Peter Poeml wrote:
> > if the question is: will we be able to implement good, live,
> > detailed logging in _conjunction_ with MirrorBrain? Then the answer is a
> > clear yes.
> 
> Good, thanks!
> 
> > I have written up a concept for collecting download statistics here:
> > http://mirrorbrain.org/download-statistics/
> 
> Nice document. I agree with most of it and find it a very reasonable
> solution. I just don't find how you plan to tackle point 6 in your list,
> i.e.,
>   There is the odd client which goes wild and issues the same request
>   over and over again, which can skew numbers very much.
> This is indeed a significant problem in the case of OOo and it would be
> nice to be able to set a "threshold" (say, 10 downloads per day) valid
> for each IP address and ignore, in statistics, all downloads exceeding
> it.
> 
> Currently, as far as I know, this is managed semi-manually through
> interpolation of data from the previous days (but we don't have IP
> addresses in the available data, so there is some guessing involved
> too). Anyway, if the IP address is not lost in stored data and is made
> available for processing, we could compute this correction at a later
> stage.

Good point. One way to tackle it (live, on the download server collecting
the numbers) could be to keep state about which files each IP address has
accessed, much in the same spirit as the Apache module mod_ip_count does
(to protect server resources and provide some DoS protection). I use a
patched version that uses mod_memcache, with good success, on a mirror:
http://en.opensuse.org/Mirror_Setup_Howto#mod_ip_count
That's very lightweight, and it wouldn't be a problem to do this per
file with a reasonably low TTL.
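
To make the per-file state idea concrete, here is a minimal sketch in
Python (not an Apache module, and not how mod_ip_count is implemented).
The class name, the one-hour TTL and the in-process dict are assumptions
for illustration only; a real setup would more likely keep the keys in
memcached and let them expire there.

```python
import time


class DownloadDeduper:
    """Count a download only if (ip, path) was not seen within the TTL."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.expires = {}  # (ip, path) -> time at which the entry expires
        self.counts = {}   # path -> number of counted downloads

    def record(self, ip, path):
        """Return True if the request was counted, False if it was a repeat."""
        now = self.clock()
        key = (ip, path)
        expiry = self.expires.get(key)
        if expiry is not None and expiry > now:
            # Same client, same file, still inside the TTL window:
            # the request is served as usual but not counted again.
            return False
        self.expires[key] = now + self.ttl
        self.counts[path] = self.counts.get(path, 0) + 1
        return True
```

Nothing is blocked here; record() only decides whether the statistics
counter moves. Expired entries are never purged in this sketch, which is
the part that memcached's own expiry would take care of.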

Blocking requires careful adjustment because of corporate networks and web
caches (many legitimate requests originating from the same IP).

But given that we wouldn't actually use the state to block any accesses,
but rather to keep counters from going through the roof, we could do this
a little more aggressively.

At the same time, it might make sense to look for X-Forwarded-For
headers and give those requests some headroom.
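
As a rough sketch of what "headroom" could mean (Python; the function
name and both default numbers are made up for illustration, not tuned
values): requests that carry an X-Forwarded-For header are allowed more
counted hits per TTL window than requests that don't.

```python
def allowed_hits(headers, base_limit=1, proxy_headroom=5):
    """Return how many hits per TTL window to count for one source IP.

    headers is a mapping of request header names to values.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    names = {name.lower() for name in headers}
    if "x-forwarded-for" in names:
        # Probably a proxy or web cache with several clients behind it.
        return base_limit * proxy_headroom
    return base_limit
```

The header is trivially forged, of course, which matters less here
because it only influences statistics and not access control.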

The X-Forwarded-For check doesn't cover corporate networks, but if,
instead of the URL, we store a hash of IP address, URL, user agent and
referer, it should work pretty well.
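
A minimal sketch of such a key, in Python. The function name, the field
order, the NUL separator and the choice of SHA-256 are all assumptions;
any stable hash over the four fields would serve the same purpose.

```python
import hashlib


def request_key(ip, url, user_agent="", referer=""):
    """Build a fixed-length state key from the four request properties."""
    fields = (ip, url, user_agent, referer)
    # Join with NUL so that ("a", "bc") and ("ab", "c") cannot collide.
    raw = "\0".join(fields).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()
```

Two clients behind the same corporate NAT that fetch the same file with
different user agents then get different keys and are counted
separately, while one client repeating the identical request keeps
hitting the same key.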

Does this make sense?

Do you see this phenomenon a lot? I actually saw only maybe one such
client during (the busiest time of) each major openSUSE release.

Peter
