[mirrorbrain-commits] [mod_stats] r74 - /trunk/tools/tdf.conf

From: <poeml_at_mirrorbrain.org>
Date: Tue, 05 Oct 2010 00:10:32 -0000
Author: poeml
Date: Tue Oct  5 02:10:31 2010
New Revision: 74

URL: http://svn.mirrorbrain.org/viewvc/mod_stats?rev=74&view=rev
Log:
add configuration for download.documentfoundation.org

Added:
    trunk/tools/tdf.conf

Added: trunk/tools/tdf.conf
URL: http://svn.mirrorbrain.org/viewvc/mod_stats/trunk/tools/tdf.conf?rev=74&view=auto
==============================================================================
--- trunk/tools/tdf.conf (added)
+++ trunk/tools/tdf.conf Tue Oct  5 02:10:31 2010
@@ -1,0 +1,118 @@
+# pretreatment (filtering, fixups), applied in order
+# the syntax is meant to be suitable for inclusion into Apache config
+# regexps must be in double quotes. Double quotes can be backslash-quoted.
+
+
+# good reference about Python regexp: http://www.amk.ca/python/howto/regex/regex.html
+#
+# short intro to things that may be special to Python: 
+#   (?:   )         non-capturing group
+#   (?P<foo>    )   named group
+# (FIXME: need to check if all these are supported in Apache)
+#
+
+
+
+# This directive applies only to the "offline parsing" script. Apache doesn't see
+# the log line before it constructs and writes it at the end of request processing.
+# Thus, Apache ignores this directive.
+#
+# It serves to 
+#   1) split a line of the log file into the relevant fragments
+#   2) ignore log lines that don't match
+#
+# The expression needs to result in the following seven match groups:
+# (IP, timestamp, url, status, referer, ua, country)
+# FIXME: country should be optional, because it occurs only in a MirrorBrain logfile
+# 
+# 123.123.123.123 - - [23/Nov/2009:18:19:14 +0100] "GET /files/stable/3.1.1/OOo_3.1.1_MacOSXIntel_install_en-US.dmg HTTP/1.1" 302 399 "http://download.openoffice.org/all_rc.html" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 1.1.4322; .NET CLR 3.5.30729; .NET CLR 3.0.30618)" ftp.astral.ro r:country 913 844 EU:RO ASN:9050 P:92.81.0.0/16 size:24661382 -
+# 70.191.83.63 - - [04/Oct/2010:01:45:27 +0200] "GET /libreoffice/testing/LO_3.3.0-beta1_Win_x86_install_en-US.exe HTTP/1.1" 302 235 "http://www.documentfoundation.org/download/" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8" want:file give:redirect r:country tdf.unixheads.org NA:US ASN:22773 P:70.191.80.0/20 499 649 size:144783373 -
+
+#
+# 200 is returned for files that are not on mirrors, and for metalinks
+#
+StatsLogMask "^(\S+).+\[(.*?)\] \"GET (\S*) HTTP.*\" (200|302) [^\"]+ \"([^\"]*)\" \"([^\"]*)\".* .* \w\w:(\w\w) ASN:"
+#StatsLogMask "^(\S+).+\[(.*?)\] \"GET (\S*) HTTP.*\" (200|302) [^\"]+ \"([^\"]*)\" \"([^\"]*)\".*"
+
+# FIXME: we should have a separate directive that determines which status codes
+#        are considered for logging, which is read both by the script and by
+#        Apache
+#
+#StatsLogStatus 200
+#StatsLogStatus 302
+
+
+#
+# Before doing anything else, silently ignore all files matching any of the
+# following regular expressions.  (All others will be considered for counting.)
+# 
+
+# ignore all requests that come with query string
+StatsIgnoreMask ".*\?.*"
+# 
+# ignore files with these endings
+StatsIgnoreMask "^.*\.(md5|sha1|sha256|magnet|btih|mirrorlist|ico)$"
+# ignore the sources
+StatsIgnoreMask "^/libreoffice/src/.*$"
+
+
+# Ignore all requests from this host (string prefix match), and don't process
+# the log lines at all.
+#StatsIgnoreIP 140.211.167.212
+
+
+#
+# Drop recurring identical requests
+#
+
+# Define the size of a sliding window for remembering the most recent requests
+# while parsing the log. Each request, keyed by (ip, url, referer, user-agent),
+# is checked for whether it has been seen in identical form before.
+StatsDupWindow 200
+
+# FIXME: Apache will have to filter on time instead of number of requests,
+#        for practical reasons (memcached automates this nicely)
+
+
+#
+# Apply the following series of filters to the request URL
+#
+
+# strip prefixed protocol (normally only sent to proxies, but can occur in the wild)
+# (the filter is applied to the requested url.)
+StatsPrefilter "^http://[^/]+/" ""
+
+# remove duplicated slashes
+StatsPrefilter "/+" "/"
+
+# strip leading base path
+StatsPrefilter "^/libreoffice/" ""
+
+# strip .metalink and other suffixes
+StatsPrefilter "\.(metalink|meta4|torrent)$" ""
+
+
+#
+# StatsCount translates the remaining URL into the pieces to be logged
+#
+#testing/LO_3.3.0-beta1_Win_x86_install_en-US.exe
+#testing/LO_3.3.0-beta1_MacOS_x86_install_en-US.dmg
+#testing/LO_3.3.0-beta1_Linux_x86-64_install-deb_en-US.tar.gz
+#testing/LO_3.3.0-beta1_Linux_x86-64_install-rpm_en-US.tar.gz
+#testing/LO_3.3.0-beta1_Linux_x86_install-deb_en-US.tar.gz
+#testing/LO_3.3.0-beta1_Linux_x86_install-rpm_en-US.tar.gz
+StatsCount "^testing/LO_([^_/]+)_(Win|MacOS|Linux)_(x86|x86-64)_(install|install-deb|install-rpm)_(?P<lang>[a-zA-Z]{2}(-[a-zA-Z]{2})?)\.(?:exe|dmg|tar\.gz|tar\.bz2)$" "prod: LO  os: \2-\3  version: \1  lang: \g<lang>"
+
+
+
+#
+# Filters to be applied after parsing (but still before counting)
+#
+
+#StatsPostfilter "foo" "bar"
+StatsPostfilter "(prod|os|version|lang): " ""
+# FIXME: we didn't need to add those words in the first place; not needed at all.
+
+
+
+# vim: ft=apache ai ts=4 sw=4 smarttab expandtab
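
For readers who want to see how the directives in tdf.conf fit together, here
is a minimal Python sketch of the offline-parsing pipeline the comments
describe. It is not the actual mod_stats script. The function and constant
names (process, LOG_MASK, and so on) are hypothetical. It also assumes that
regexps are applied with search semantics, and that the steps run in the order
they appear in the config file. The regular expressions and the sample log
line are copied from the config above, with the backslash-quoted double quotes
unescaped.

```python
import re
from collections import deque

# StatsLogMask: splits a log line into seven groups
# (ip, timestamp, url, status, referer, ua, country).
LOG_MASK = re.compile(
    r'^(\S+).+\[(.*?)\] "GET (\S*) HTTP.*" (200|302) [^"]+ '
    r'"([^"]*)" "([^"]*)".* .* \w\w:(\w\w) ASN:'
)
# StatsIgnoreMask: URLs matching any of these are dropped silently.
IGNORE_MASKS = [re.compile(p) for p in (
    r'.*\?.*',
    r'^.*\.(md5|sha1|sha256|magnet|btih|mirrorlist|ico)$',
    r'^/libreoffice/src/.*$',
)]
# StatsPrefilter: (pattern, replacement) pairs, applied in order.
PREFILTERS = [
    (re.compile(r'^http://[^/]+/'), ''),
    (re.compile(r'/+'), '/'),
    (re.compile(r'^/libreoffice/'), ''),
    (re.compile(r'\.(metalink|meta4|torrent)$'), ''),
]
# StatsCount: pattern plus the template the match is expanded into.
COUNT_RE = re.compile(
    r'^testing/LO_([^_/]+)_(Win|MacOS|Linux)_(x86|x86-64)'
    r'_(install|install-deb|install-rpm)'
    r'_(?P<lang>[a-zA-Z]{2}(-[a-zA-Z]{2})?)\.(?:exe|dmg|tar\.gz|tar\.bz2)$'
)
COUNT_TEMPLATE = r'prod: LO  os: \2-\3  version: \1  lang: \g<lang>'
# StatsPostfilter: applied to the expanded template.
POSTFILTERS = [(re.compile(r'(prod|os|version|lang): '), '')]
# StatsDupWindow: number of recent requests remembered.
DUP_WINDOW = 200

# Second sample log line from the config comments.
SAMPLE = (
    '70.191.83.63 - - [04/Oct/2010:01:45:27 +0200] '
    '"GET /libreoffice/testing/LO_3.3.0-beta1_Win_x86_install_en-US.exe HTTP/1.1" '
    '302 235 "http://www.documentfoundation.org/download/" '
    '"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.8) '
    'Gecko/20100722 Firefox/3.6.8" want:file give:redirect r:country '
    'tdf.unixheads.org NA:US ASN:22773 P:70.191.80.0/20 499 649 size:144783373 -'
)


def process(lines):
    """Return a list of (country, record) tuples for the countable requests."""
    seen = deque(maxlen=DUP_WINDOW)  # sliding window of recent request keys
    results = []
    for line in lines:
        m = LOG_MASK.search(line)
        if m is None:
            continue  # line does not look like a relevant request
        ip, timestamp, url, status, referer, ua, country = m.groups()
        if any(mask.search(url) for mask in IGNORE_MASKS):
            continue
        key = (ip, url, referer, ua)
        if key in seen:
            continue  # identical request seen within the window
        seen.append(key)
        for pattern, repl in PREFILTERS:
            url = pattern.sub(repl, url)
        cm = COUNT_RE.search(url)
        if cm is None:
            continue  # URL is not one of the counted files
        record = cm.expand(COUNT_TEMPLATE)
        for pattern, repl in POSTFILTERS:
            record = pattern.sub(repl, record)
        results.append((country, record))
    return results


print(process([SAMPLE, SAMPLE]))
```

Feeding the sample line in twice yields a single entry, because the second
copy falls inside the duplicate window. The deque with maxlen is one simple
way to get a window bounded by request count. The FIXME above notes that
Apache itself would have to filter on time instead.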



