A Concept For Download Statistics
The ''What'' And ''Why''
Everybody wants download statistics. Whether for Mozilla, OpenOffice.org, openSUSE or any other large project, knowledge of download counts is instrumental for development coordination. Here's how it could be done.
Note: See ''Feedback'' at the end of this document for how to get in touch!
This is often approached by writing scripts that comb through log files. That has a few shortcomings, in that it
- doesn't happen in realtime
- needs a lot of time if an analysis has to be refined later (and run over all the logs again)
- typically results in flat reports, not in data that can be queried flexibly
- has scalability issues
- makes aggregating log files from multiple servers difficult
There must be better ways. Below, I'll sketch what I think could be the download statistics master plan. I'll first describe a specialized Apache logging module named mod_stats, an existing implementation. Then I'll show how the concept should be generalized, so it becomes useful for everybody wanting to count downloads.
The first sketch fitted on four postcards. Of course, the devil is in the details.
So, first some general considerations.
- Logging directly into a database is easy today. Apache has a very good API to connect to databases, and there are ready-to-use Apache modules for logging to databases. However, when it comes to high traffic, it isn't as easy as it might sound. The logs can be voluminous, mainly because all modules that I know of basically save flat lines of the access log to database rows (split into fields, of course). This is convenient for detailed analyses of the complete logs, but it'll still result in gigabytes of logs per day, just in a database. Lots of requests for the same file will cause the corresponding string to be inserted into the database lots of times, and a scheme that keeps an index of already-seen entries wouldn't scale in practice. So one would need a database with the ''raw'' data, and an additional database that aggregates it. My principal idea here is to log only those things that are interesting, and to use counters that are incremented. I'll expand on this below.
- Absolute download counts are always a problem to determine, unless clients never get redirected somewhere outside of your control. If you control all the infrastructure taking part in content delivery, you can determine them - otherwise not. The issue is twofold. First, there are aborted transfers: users press the stop button in their browser, reload, et cetera. Second, browsers (and other download clients in particular) do byterange requests, requesting only parts of the content. On the server side, you see the byterange header and could log it; however, byteranges are often used in the 12345- form (''send me bytes 12345 onwards, till the end''), and the client will typically stop receiving data arbitrarily when it thinks it has enough - and does so with several mirrors in parallel.
- Even though absolute counts are difficult or impossible to obtain, relative counts can be extremely helpful, and it is relatively easy to collect them.
- By operating one of the mirrors in your mirror network yourself, or by talking to one of the cooperating mirrors, you can get access to actual transfer statistics. It seems feasible to interpolate from a) the redirect counts on the MirrorBrain server and b) absolute transfer statistics from a few mirrors to a good estimate of the total content transferred in the delivery network.
- Parsing URLs in access logs poses some problems, like:
- duplicated slashes
- URLs having http://hostname/ prepended
- characters occurring in URL-escaped as well as unescaped form
- There is the odd client which goes wild and issues the same request over and over again, which can skew the numbers considerably. Here are two postings with details about the problem, and a proposed solution.
- Download statistics can be invaluable not only for marketing and development, but also for maintaining a mirror infrastructure, to govern decisions about what should be mirrored and what not.
- When it comes to counting existing installations of a software, things change a little. Then it becomes desirable to count by IP address, and it helps if the clients maintain some kind of cookie to distinguish them from each other (because the IP address by itself isn't very reliable). It is interesting to read how openSUSE statistics and Fedora statistics acquire these numbers.
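One of the considerations above is URL parsing: before counting, request URLs need to be normalized so that duplicated slashes, prepended http://hostname/ parts and URL-escaped characters don't split one download across several counters. A minimal Python sketch of such a normalization helper (a hypothetical function, not actual MirrorBrain code) could look like this:

```python
import re
from urllib.parse import unquote, urlsplit

def normalize_url(url):
    """Normalize a request URL for counting purposes."""
    # Strip a leading scheme://hostname, if a full URL was requested
    if "://" in url:
        url = urlsplit(url).path or "/"
    # Decode URL-escaped characters, so /foo%2Ebar and /foo.bar compare equal
    url = unquote(url)
    # Collapse duplicated slashes
    url = re.sub(r"/{2,}", "/", url)
    return url
```

All three pitfalls listed above are folded into a single canonical form, so the regular expressions that do the counting only ever see one spelling of each path.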
Existing logging facilities in MirrorBrain
MirrorBrain can log several details out of the box that complement Apache's access log:
- which mirror a client was sent to
- data about client origin: country, continent, autonomous system number and network prefix.
- the reason why a particular mirror was selected
- size of the file in question
- files for which no mirror was found
- (with Apache's native logging means) things like the number of bytes actually transferred, byterange headers requested by the client and more
- all of these can be logged conditionally, using Apache's SetEnvIf and CustomLog ... env= mechanisms.
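As an illustration of that conditional logging (with a made-up pattern and log path; the directive names are standard Apache ones):

```apache
# Mark requests for .iso files, and log only those to a separate file
SetEnvIf Request_URI "\.iso$" want_stats
CustomLog /var/log/apache2/iso-downloads.log combined env=want_stats
```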
These means provide a lot of detail for analysis, but don't make it easy for the data to be aggregated.
apachetop is very helpful to answer realtime, ad-hoc kind of questions.
mod_stats, an existing implementation
In 2007, I implemented a logging module for the openSUSE build service called mod_stats to overcome these difficulties. Since MirrorBrain sits in the middle of the infrastructure and gets all requests before they are farmed out to mirrors, it is a natural place to count downloads. (This was one of the design choices of MirrorBrain.) mod_stats counts downloaded packages in realtime into a database.
It hooks into the logging phase of Apache's request processing, when the request is already handled, so nothing else has to wait for it. (It is extremely lean anyway, and I repeatedly looked hard at the database process status and hardly ever managed to see an ongoing request.) Measurements confirmed that this scheme is highly scalable.
The underlying technology, Apache's DBD database connection pool API, is arguably the best foundation at this time to implement performant and specific logging. When used together with MirrorBrain, it allows sharing the same database connection pool that is already in use by mod_mirrorbrain and mod_asn.
Still, the existing implementation has the drawback that it groks only URLs of the openSUSE Build Service, for which it was originally written.
The idea is to make this module more general, which brings us to the:
The ''Download Statistics Master Plan''
The basic concept would be to map pathnames to ''products'' by regular expressions. The logging would count the downloads per product. Such a map could look like this (simplified, and not necessarily working) example:
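A made-up illustration (the paths and product names are invented, and the actual configuration syntax would be defined by the module):

```text
# regular expression                                 product
^/firefox/releases/3\.0/.*win32.*\.exe$             FF300_win
^/firefox/releases/3\.0/.*linux.*\.tar\.bz2$        FF300_linux
^/openoffice/stable/3\.2\.0/.*_de\.tar\.gz$         OOo320_de
```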
The matching rules could be simplified (and automated) considerably by using advanced regular expression syntax to backreference the matches. That is a little more complex, depending on how comfortable you are with regular expressions, but has the advantage that it will automatically match downloads added in the future:
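For example, a single rule with backreferences could derive the product name from the matched path components (again an invented illustration):

```text
# one rule covers all versions and languages, yielding e.g. FF3.0.1_de
^/firefox/releases/([0-9.]+)/.*\.([a-z]{2}(-[A-Z]{2})?)\.exe$    FF$1_$2
```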
Moreover, we could do the following:
Count every redirect (or direct delivery); count every generated metalink. Don't count HEAD or If-Modified-Since requests.
Automatically add the database rows that don't exist yet on the fly.
Automatically create database tables when new logging is configured. You'd only need to create a database and give the web server access, and it would do the rest. (Just for convenience.)
Allow multiple maps per web server virtual host, each corresponding to a separate table. This would allow for counting on a broad level, and finer level at the same time. (And it adds ways for testing.)
Some simple report generation should be implemented directly in the logging module.
summary report (maybe? possibly limited by client IP to prevent excessive use of resources?)
return a specific counter for a file (and the regexp that it matched, since it aggregates), when /foo/bar?download_stats is accessed. The little report could be broken up by timeframe and could look like this:
matches='regexp' count_today=... count_last_7_days=... count_last_30_days=... counts_per_day=...,...,..,...,
In addition to this primitive, built-in report generation (implemented ''on-board'' in the logging module), it would be straightforward to generate reports and graphs from the database with simple scripting. A query interface could easily be provided by a CGI script, for instance.
Provide a test mode / test script to see what is going to be counted:
cat logfile | regexp_map_parser
(For this purpose, it might make sense to put the product map into a separate file, outside of the Apache config, to make it easier to access from other programs. The maps could also be stored in the database.)
Save the regexps alongside the logged items, for reference and to be able to later generate rsync file lists from them (mirroring by popularity).
The resulting table that is logged would look about like this:
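A minimal sketch of such a table, with column names guessed to match the example query below (the real schema may differ, and the upsert shown uses PostgreSQL 9.5+ syntax; the original deployment ran on MySQL):

```sql
-- Hypothetical counter table: one row per product, year and day
CREATE TABLE counts (
    prodname VARCHAR(64) NOT NULL,   -- e.g. 'FF300_de'
    year     SMALLINT    NOT NULL,
    day      SMALLINT    NOT NULL,   -- day of the year, 001..366
    count    BIGINT      NOT NULL DEFAULT 0,
    PRIMARY KEY (prodname, year, day)
);

-- Counting then becomes a single statement per hit,
-- which also creates missing rows on the fly:
INSERT INTO counts (prodname, year, day, count)
VALUES ('FF300_de', 2010, 178, 1)
ON CONFLICT (prodname, year, day) DO UPDATE SET count = counts.count + 1;
```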
Here, the day field corresponds to the day of the year (001..366), the same as date +%j.
A possible query is:
SELECT SUM(count) FROM counts WHERE prodname='FF300_de' AND day BETWEEN 178 AND 180;
The table could contain further fields: language, mirror, client country. It could be compacted further by leaving out the year and logging only the day (as a number of days since 1970).
The essential things were implemented at the end of 2009. mod_stats, the Apache module to count in realtime, hasn't been updated yet. However, there is a script that can be used to parse the logs once per day. It seems to work nicely, and has powered http://marketing.openoffice.org/marketing_bouncer.html since February 2010.
Reports are generated with a Django web application: http://download.services.openoffice.org/stats/csv/20100102.csv (Note that this URL can be slow, because the server is a quite slow virtual machine.)
Update 2012: The Document Foundation has been using the same script for about 1.5 years now.
Update 2014: From the LibreOffice team, I hear that the a posteriori log parsing with the script takes too long when they run it after a longer time - for instance, over the logs of a whole month. With such an amount of log files, the script needs a day to complete, which is not really fun. I remember that with vast log files it was better to run the script daily (then it's not so annoying when it takes up to an hour or so). They tell me that attempts to improve performance have failed, so they considered implementing a download counter that counts ''live'' -- which is exactly what mod_stats originally did!
So the current status is:
- There's still a need for a good log parsing solution for download counting. And I'm not aware of alternative solutions.
- mod_stats, the Apache module, would fill this need perfectly. However, it's still specific to openSUSE's filename scheme, which I originally built it for. It has not been made generic yet.
- The interim script that parses log files after they have been generated works but needs to run often enough so that the work is not piling up. (The script was mainly made for testing purposes, in the first place.)
- The script has evolved (because it saw real usage) to have more functionality than the Apache module:
- it can run an arbitrary number of preprocessing and postprocessing operations to URLs
- it can run different regular expressions on the URLs to construct counters for different naming schemes
- it implements a ring buffer that stores recent requests and makes sure that recurrent requests by the same client on the same file do not increase the counters (because that'd skew the numbers)
- The Apache module
- needs to get the features that the script got (see above list)
- and maybe it needs adjustments to work with PostgreSQL, because at the time it ran with MySQL.
Other than that, I think it would be good to integrate the code into the MirrorBrain source tree, so it's easier accessible and easier to deploy once ready to use. [DONE]
- realtime counting (which would be kind of nice)
- maybe different reports to be generated
- maybe automatic table creation
- some database column names are specific to the one existing deployment, and should be changed to generic names
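The ring buffer of recent requests mentioned above, which makes sure that recurrent requests by the same client on the same file don't increase the counters, could be sketched like this (a simplified illustration, not the script's actual code):

```python
from collections import deque

class RecentRequests:
    """Remember the last N (client, url) pairs; duplicates don't count."""

    def __init__(self, size=1000):
        self.ring = deque(maxlen=size)  # oldest entries fall out automatically
        self.seen = set()               # for O(1) duplicate lookup

    def should_count(self, client_ip, url):
        key = (client_ip, url)
        if key in self.seen:
            return False  # recurrent request: don't increase the counter
        if len(self.ring) == self.ring.maxlen:
            self.seen.discard(self.ring[0])  # entry about to be evicted
        self.ring.append(key)
        self.seen.add(key)
        return True
```

Once an entry falls out of the ring, the same client downloading the same file again is counted as a fresh download - the buffer only suppresses rapid-fire repeats, not legitimate re-downloads later on.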
Interesting implications are (besides the general potential of a collaboration, which can spark and enable the development of better and more versatile solutions):
- realtime counting (and realtime querying)
- the query facilities of a database
- potential foundation for implementing popularity-based / location-based mirroring
- the space-saving nature of the counts stored in the database: a more trivial approach, which just stores a count for every file that is encountered, would quickly blow up the database to a large size when there is continuous turnover of new files in the file tree.
What is your feedback to this concept? Please get in touch and post to the mirrorbrain_at_mirrorbrain.org mailing list. (You don't need to be subscribed to be able to post.)
Awaiting your comments!