[mirrorbrain-commits] [mod_stats] r45 - /trunk/tools/dlcount.py

From: <poeml_at_mirrorbrain.org> Date: Thu, 26 Nov 2009 14:48:22 -0000

Author: poeml
Date: Thu Nov 26 15:48:22 2009
New Revision: 45

URL: http://svn.mirrorbrain.org/viewvc/mod_stats?rev=45&view=rev
Log:
work on the comment header.

Modified:
    trunk/tools/dlcount.py

Modified: trunk/tools/dlcount.py
URL: http://svn.mirrorbrain.org/viewvc/mod_stats/trunk/tools/dlcount.py?rev=45&r1=44&r2=45&view=diff
==============================================================================

--- trunk/tools/dlcount.py (original)
+++ trunk/tools/dlcount.py Thu Nov 26 15:48:22 2009
@@ -21,41 +21,37 @@
 #
 #
 # This script parses a MirrorBrain-enhanced access_log and does the following:
-#   - a little ring buffer filters requests recurring within a sliding time window (keyed by ip+url+referer+user-agent)
-#   - strip trailing http://... cruft
-#   - remove duplicated slashes
-#   - remove accidental query strings
-#   - remove a possible .metalink suffix
-#   - remove the /files/ prefix
+#   - select the lines on which the log analysis is supposed to run
+#     (StatsLogMask directive, which defaults to a regexp suitable for a MirrorBrain logfile)
+#     The expression also selects data from the log line, for example the
+#     country from which a client request originated.
+#   - a little ring buffer filters requests recurring within a sliding time
+#     window (keyed by ip+url+referer+user-agent)
+#     length of the sliding window: StatsDupWindow
+#   - arbitrary log lines can be ignored by regexp (StatsIgnoreMask)
+#   - IP addresses can be ignored by string prefix match (StatsIgnoreIP)
+#   - apply prefiltering to the request (regular expressions with substitution) 
+#     with one or more StatsPrefilter directives
+#   - parse the remaining request url into the values to be logged
+#     (StatsCount directive)
+#   - apply optional post-filtering to the parsed data (StatsPostfilter)
+#
 # 
-# It applies filtering by
-#   - status code being 200 or 302
-#   - requests must be GET
-#   - bouncer's IP which keeps coming back to download all files (from OOo)
+# The script should serve as model implementation for the Apache module which
+# does the same in realtime.
+#
+#
+# Usage: 
+# ./dlcount.py /var/log/apache2/download.services.openoffice.org/2009/11/download.services.openoffice.org-20091123-access_log.bz2 | sort -u
+#
+# Uncompressed, gzip or bzip2 compressed files are transparently opened.
 # 
-# It also captures the country where the client requests originate from.
-#
+# 
 # This script uses Python generators, which means that it doesn't allocate
 # memory according to the log size. It rather works like a Unix pipe.
 # (The implementation of the generator pipeline is based on David Beazley's
 # PyCon UK 08 great talk about generator tricks for systems programmers.)
 #
-# 
-# I baked a first regexp which is able to parse most (OpenOffice.org) requests
-# from /stable and /extended. There are some exceptions (language code with 3
-# letters) and I didn't take care of /localized yet.
-# 
-# The script should serve as model implementation for the Apache module which
-# does the same in realtime.
-#
-#
-# Usage: 
-# ./dlcount.py /var/log/apache2/download.services.openoffice.org/2009/11/download.services.openoffice.org-20091123-access_log.bz2 | sort -u
-#
-# Uncompressed, gzip or bzip2 compressed files are transparently opened.
-# 
-#
-# 
 
 
 __version__='0.9'
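
The stages listed in the new comment header can be sketched as a small generator pipeline. This is an illustration only, not the code in dlcount.py: the log mask, the prefilter rules, the 60-second window and all function names below are hypothetical stand-ins for whatever StatsLogMask, StatsPrefilter and StatsDupWindow are actually set to, and a deque plus a set is used here in place of the script's ring buffer. The StatsIgnoreMask, StatsIgnoreIP, StatsCount and StatsPostfilter steps are left out.

```python
import bz2
import gzip
import re
from collections import deque
from datetime import datetime

# Hypothetical mask for a combined-format log line (not the StatsLogMask default).
# It selects GET requests with status 200 or 302 and captures the fields used below.
LOG_MASK = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "GET (?P<url>\S+) HTTP/[\d.]+" '
    r'(?P<status>200|302) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical prefilters: (pattern, replacement) pairs, applied in order.
PREFILTERS = [
    (re.compile(r"\?.*$"), ""),        # remove an accidental query string
    (re.compile(r"/{2,}"), "/"),       # remove duplicated slashes
    (re.compile(r"\.metalink$"), ""),  # remove a .metalink suffix
]


def open_log(path):
    """Open an uncompressed, gzip or bzip2 log as text, chosen by file suffix."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", errors="replace")
    return open(path, "r", errors="replace")


def select(lines, mask):
    """Yield the named groups of every line that matches the mask."""
    for line in lines:
        match = mask.match(line)
        if match:
            yield match.groupdict()


def drop_duplicates(records, window_seconds):
    """Skip requests whose key was already seen within the sliding window."""
    recent = deque()  # (timestamp, key) pairs in arrival order
    seen = set()      # keys currently inside the window
    for rec in records:
        ts = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z").timestamp()
        # Expire entries that have fallen out of the window.
        while recent and ts - recent[0][0] > window_seconds:
            seen.discard(recent.popleft()[1])
        key = (rec["ip"], rec["url"], rec["referer"], rec["agent"])
        if key in seen:
            continue
        recent.append((ts, key))
        seen.add(key)
        yield rec


def prefilter(records, rules):
    """Apply each regexp substitution to the request URL."""
    for rec in records:
        url = rec["url"]
        for pattern, replacement in rules:
            url = pattern.sub(replacement, url)
        yield dict(rec, url=url)


def pipeline(lines, window_seconds=60):
    """Chain the stages; nothing is read until the result is iterated."""
    records = select(lines, LOG_MASK)
    records = drop_duplicates(records, window_seconds)
    return prefilter(records, PREFILTERS)
```

Because every stage is a generator, something like `for rec in pipeline(open_log(path)): print(rec["url"])` pulls one line at a time through the chain, so memory use does not grow with the size of the log.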



