Author: poeml
Date: Tue Nov 24 20:31:01 2009
New Revision: 21

URL: http://svn.mirrorbrain.org/viewvc/mod_stats?rev=21&view=rev
Log:
dlcount prototype: add comments to the header

Modified:
    trunk/tools/dlcount.py

Modified: trunk/tools/dlcount.py
URL: http://svn.mirrorbrain.org/viewvc/mod_stats/trunk/tools/dlcount.py?rev=21&r1=20&r2=21&view=diff
==============================================================================
--- trunk/tools/dlcount.py (original)
+++ trunk/tools/dlcount.py Tue Nov 24 20:31:01 2009
@@ -1,31 +1,61 @@
 #!/usr/bin/python
-# Analyze Apache logfiles without hogging memory
-#
-# This script uses Python generators, which means that it doesn't allocate memory
-# It rather works like a Unix pipe.
-#
-# It transparently opens uncompressed, gzip or bzip2 compressed files.
-#
-# The implementation is based on David Beazley's PyCon UK 08 great talk about
-# generator tricks for systems programmers.
-#
-#
-#
 # Copyright 2008,2009 Peter Poeml
 #
-# This program is free software; you can redistribute it and/or
-# modify it under the terms of the GNU General Public License version 2
-# as published by the Free Software Foundation;
-#
-# This program is distributed in the hope that it will be useful,
-# but WITHOUT ANY WARRANTY; without even the implied warranty of
-# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
-# GNU General Public License for more details.
-#
-# You should have received a copy of the GNU General Public License
-# along with this program; if not, write to the Free Software
-# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License version 2
+# as published by the Free Software Foundation;
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA
+#
+#
+#
+# Analyze Apache logfiles in order to count downloads
+#
+#
+# This script parses a MirrorBrain-enhanced access_log and does the following:
+# - a little ring buffer filters requests recurring within a sliding time window (keyed by ip+url+referer+user-agent)
+# - strip trailing http://... cruft
+# - remove duplicated slashes
+# - remove accidental query strings
+# - remove a possible .metalink suffix
+# - remove the /files/ prefix
+#
+# It applies filtering by
+# - status code being 200 or 302
+# - requests must be GET
+# - bouncer's IP which keeps coming back to download all files (from OOo)
+#
+# It also captures the country where the client requests originate from.
+#
+# This script uses Python generators, which means that it doesn't allocate
+# memory according to the log size. It rather works like a Unix pipe.
+# (The implementation of the generator pipeline is based on David Beazley's
+# PyCon UK 08 great talk about generator tricks for systems programmers.)
+#
+#
+# I baked a first regexp which is able to parse most (OpenOffice.org) requests
+# from /stable and /extended. There are some exceptions (language code with 3
+# letters) and I didn't take care of /localized yet.
+#
+# The script should serve as model implementation for the Apache module which
+# does the same in realtime.
+#
+#
+# Usage:
+# ./dlcount.py /var/log/apache2/download.services.openoffice.org/2009/11/download.services.openoffice.org-20091123-access_log.bz2 | sort -u
+#
+# Uncompressed, gzip or bzip2 compressed files are transparently opened.
+#
+#
+#
 __version__='0.9'

_______________________________________________
mirrorbrain-commits mailing list
Archive: http://mirrorbrain.org/archive/mirrorbrain-commits/

Note: To remove yourself from this list, send a mail with the content
 unsubscribe
to the address mirrorbrain-commits-request_at_mirrorbrain.org

Received on Tue Nov 24 2009 - 19:31:04 GMT
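
For readers who want to see what the URL clean-up steps listed in the new header comment could look like, here is a minimal sketch. It is not the code from dlcount.py, which this commit does not show. The function name normalize_url, the order of the steps, and the choice to keep a leading slash after removing the /files/ prefix are all assumptions made for illustration.

```python
import re


def normalize_url(url):
    """Apply the clean-up steps from the header comment to a request path.

    Hypothetical helper; the step order and details are assumptions.
    """
    # Strip trailing "http://..." cruft glued onto the path. The search
    # starts at index 1 so a path is never emptied completely.
    pos = url.find('http://', 1)
    if pos != -1:
        url = url[:pos]

    # Remove duplicated slashes ("//stable///x" becomes "/stable/x").
    url = re.sub(r'/{2,}', '/', url)

    # Remove an accidental query string.
    url = url.split('?', 1)[0]

    # Remove a possible .metalink suffix.
    if url.endswith('.metalink'):
        url = url[:-len('.metalink')]

    # Remove the /files/ prefix (assumption: the leading slash is kept).
    if url.startswith('/files/'):
        url = url[len('/files'):]

    return url


if __name__ == '__main__':
    print(normalize_url('/files//stable/3.1.1/example.exe.metalink?foo=bar'))
```

The cruft is stripped before the slashes are collapsed, because collapsing first would turn "http://" into "http:/" and the later search would miss it.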
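
The filtering described in the header comment (GET only, status 200 or 302, and suppression of requests that recur within a sliding time window) fits naturally into a generator pipeline. The sketch below is an illustration under stated assumptions, not the script's actual implementation: requests are assumed to be already parsed into dicts with the hypothetical keys time (seconds), ip, url, referer, user_agent, method and status, to arrive in chronological order, and the window length of 300 seconds is made up. A deque stands in for the "little ring buffer".

```python
from collections import deque


def filter_countable(requests):
    """Pass on only GET requests answered with status 200 or 302."""
    for req in requests:
        if req['method'] == 'GET' and req['status'] in (200, 302):
            yield req


def filter_recurring(requests, window=300):
    """Drop requests whose key was already seen within `window` seconds.

    The key is ip+url+referer+user-agent, as in the header comment.
    Assumes input sorted by time; the window length is an assumption.
    """
    recent = deque()  # (timestamp, key) pairs, oldest first
    seen = set()      # keys currently inside the window
    for req in requests:
        now = req['time']
        key = (req['ip'], req['url'], req['referer'], req['user_agent'])
        # Expire entries that have slid out of the time window.
        while recent and recent[0][0] <= now - window:
            seen.discard(recent.popleft()[1])
        if key in seen:
            continue  # same client fetched the same file again too soon
        recent.append((now, key))
        seen.add(key)
        yield req


def count_pipeline(requests, window=300):
    """Chain both stages; nothing is held in memory except the window."""
    return filter_recurring(filter_countable(requests), window)
```

Because each stage is a generator, memory use depends on the number of distinct keys inside the window rather than on the size of the log, which is the property the header comment points out.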
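
The header comment also says that uncompressed, gzip or bzip2 compressed files are opened transparently. One common way to do that is to pick the opener by file suffix and feed the lines into the pipeline lazily. The following is a sketch only: the names open_log and read_lines are hypothetical, detection by suffix is an assumption (the script may detect the format differently), and the code targets current Python 3 rather than the Python 2 of the original script.

```python
import bz2
import gzip


def open_log(path):
    """Open a plain, gzip or bzip2 log file in text mode.

    The format is chosen by file suffix, which is an assumption.
    Undecodable bytes are replaced so odd log lines do not abort the run.
    """
    if path.endswith('.gz'):
        return gzip.open(path, 'rt', errors='replace')
    if path.endswith('.bz2'):
        return bz2.open(path, 'rt', errors='replace')
    return open(path, 'r', errors='replace')


def read_lines(paths):
    """Yield log lines from all files, one at a time, like `cat` in a pipe."""
    for path in paths:
        with open_log(path) as logfile:
            for line in logfile:
                yield line
```

A caller could then write something like `for line in read_lines(sys.argv[1:])` and pass each line on to the parsing and filtering stages.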