[mirrorbrain] CHASM (Cryptographic Hash Algorithm Secured Mirroring) + download stats

From: Anthony Bryan <anthonybryan_at_gmail.com>
Date: Sat, 3 Apr 2010 02:34:35 -0400

just heard about CHASM. there isn't much written about it:

"CHASM is the Cryptographic Hash Algorithm Secured Mirroring solution.

An ambitious project to replace rsync as the leading mirroring
solution for Linux distributions and other large projects with
multiple mirrors with a peer-to-peer system that also provides
assurances about the integrity of mirrored data."

kernel.org has a Google Summer of Code project for it, along with one
for centralized statistics gathering.

https://korg.wiki.kernel.org/index.php/Gsoc2010:ideas

chasmd improvements

Assisting: John "Warthog9" Hawley
Website: http://projects.robescriva.com/projects/show/chasm

CHASM, the Cryptographic Hash Algorithm Secured Mirroring solution, is
a project meant to alleviate many of the pains that mirrors have in
organizing and verifying their content. In some respects it can be
thought of as a stateful rsync daemon, and kernel.org and a number of
other large mirroring infrastructures have been looking into it for
several years now. Ultimately it will be used by a large portion of
the larger mirroring infrastructures, and as such it needs high
performance and good design.

This project is to help get CHASM to a usable, production-quality
state. It is currently in the middle of a rewrite into C++ for
performance reasons, and there are still several aspects that may need
to be fleshed out. Individuals will need a solid understanding of *NIX
systems programming in C or C++ (C++ is mainly used to provide things
like destructors and type safety). Familiarity with the git scm
storage model and with rsync internals are both positive traits.

Developers seeking to work on CHASM will be working primarily on
developing network code, including documenting the network protocols.
Students will be expected to be able to develop such code/protocols
independently, but will be provided every chance for feedback and
guidance from the current developers so as to maximize the impact of
their contributions.

Students looking to work on CHASM should contact the current
developers, and register on the bug tracker
(http://projects.robescriva.com/account/register).

Things to note about this project:
- There are several servers involved in this project, most of which
  communicate locally over Unix domain sockets.
- Each server will be a separate piece of functionality.
- All code written should be accompanied by test code to aid in
  automated testing (see http://cdash.chasmd.org/ for our dashboard).
- C++ is the language used by current developers. We chose C++ for its
  beneficial standard library and ability to link C libraries as well.
- Code written must be capable of running for extended periods of time
  without excess resource consumption or leakage.
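
As a rough illustration of the "verifying their content" idea above,
here is a minimal sketch of checking a mirrored tree against a list of
cryptographic hashes. This is not CHASM's protocol or on-disk format
(neither is described here); the manifest layout, the choice of
SHA-256, and the names sha256_of and verify_tree are all assumptions
made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tree(root, manifest):
    """Compare files under root against a {relative_path: hex_digest} manifest.

    The manifest shape is hypothetical. Returns the sorted relative paths
    that are missing or whose content hash differs from the expected one.
    """
    root = Path(root)
    bad = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel_path)
    return sorted(bad)
```

A mirror could run something like verify_tree("/srv/mirror", manifest)
after a sync and re-fetch only the paths it returns; the directory name
is just a placeholder.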

Centralized statistics gathering

Primary: John "Warthog9" Hawley

This is a multi-part project involving both the collection of the
statistics and the server-side aggregation of the statistics. The main
idea of this project is to create a universally usable download
statistics collection system. The Open Source community has a tendency
to rely on a far-flung array of servers and infrastructure to provide
its download distribution. This works wonderfully for the most part;
however, there is little insight into the mirrors themselves from the
position of the originator of the data. This lack of insight is due to
a multitude of problems, from privacy concerns and legal reasons to
system resources on the mirror itself.

This project is intended to help both the mirrors themselves and the
upstream providers of data get a better handle on how many downloads
of various things are actually occurring. It's intended to be an
all-encompassing solution, meaning that the project will work equally
well for projects such as kernel.org, Fedora, Ubuntu, Apache and
Mozilla, should they choose to use it. This project will involve a
frontend log parser capable of determining what downloads have
occurred, the type of download and how much data was transferred, as
well as the unique downloaders for that server. There will also be a
backend portion, which will initially be hosted on kernel.org. This
backend will be the collection point for the statistics provided by
frontend processes running on the mirrors. It will involve logging
statistics, parsing out duplicates from a single mirror, dealing with
mirror authenticity and aggregating the statistics. It will also
provide a website where individuals can quickly browse and discover
common downloads from a particular distribution or open source
project.
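
To make the frontend log parser described above a bit more concrete,
here is a minimal sketch that streams over web server access log lines
and tallies downloads, bytes transferred and distinct client addresses
per path. Everything in it is an assumption for illustration: it only
handles the Apache/NCSA common log layout (not ftp, rsync or git
logs), it treats HTTP 200 as a full download and 206 as a partial one,
and it approximates "unique downloaders" by client address.

```python
import re
from collections import defaultdict

# Apache/NCSA common log format; the combined format appends two quoted
# fields (referer, user agent) that this pattern simply ignores.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def summarize(lines):
    """Stream over log lines and tally successful GET downloads per path."""
    stats = defaultdict(
        lambda: {"full": 0, "partial": 0, "bytes": 0, "clients": set()}
    )
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or match["method"] != "GET":
            continue  # skip unparseable lines and non-download requests
        status = match["status"]
        if status not in ("200", "206"):
            continue  # skip errors, redirects, not-modified responses
        entry = stats[match["path"]]
        entry["full" if status == "200" else "partial"] += 1
        entry["bytes"] += 0 if match["size"] == "-" else int(match["size"])
        entry["clients"].add(match["host"])
    return dict(stats)
```

Because summarize() takes any iterable of lines, an open file object
can be passed directly, so a multi-gigabyte log is read line by line
rather than loaded into memory.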

Things to note about this project:
- There is both a client and a server aspect of this project; both
  pieces need to be created and interoperable, along with a
  client/server API.
- Clients:
  - Resource-constrained environment
  - Needs to be lightweight and as efficient as possible
  - Potential to be processing tens or hundreds of gigabytes of data
    in a single run for a single machine
  - Will be collecting data from a variety of different log types:
    http, ftp, rsync, git, etc.
- Server:
  - Mostly a web app, for reporting and data collection
  - Needs to be relatively efficient, but not to the same extent as
    the client
  - Has to be capable of running independently of the kernel.org
    infrastructure
- General todos:
  - Prototype client
  - Prototype server
  - Prototype API
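
One way the backend duties listed earlier (mirror authenticity,
dropping duplicates from a single mirror, aggregation) might fit
together is sketched below. No API has been defined for this project,
so the report layout, the per-mirror shared secret, the use of
HMAC-SHA256 and the names Aggregator, submit and sign are all
hypothetical.

```python
import hashlib
import hmac
import json
from collections import Counter


class Aggregator:
    """Toy in-memory collection point; a real backend would persist its state."""

    def __init__(self, mirror_keys):
        self.mirror_keys = mirror_keys  # {mirror_id: shared secret bytes}
        self.seen = set()               # (mirror_id, report_id) already counted
        self.totals = Counter()         # path -> downloads across all mirrors

    def submit(self, mirror_id, payload, signature):
        """Return True if the report was accepted and counted."""
        key = self.mirror_keys.get(mirror_id)
        if key is None:
            return False  # unknown mirror
        expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, signature):
            return False  # payload not signed with this mirror's key
        report = json.loads(payload)
        marker = (mirror_id, report["report_id"])
        if marker in self.seen:
            return False  # same report sent twice by the same mirror
        self.seen.add(marker)
        self.totals.update(report["downloads"])
        return True


def sign(key, payload):
    """Client-side helper: hex HMAC-SHA256 of the payload bytes."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()
```

A mirror-side client would serialize its per-path counts to JSON
bytes, call sign() with its key, and send both to the server; the
transport is left open here.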

-- 
(( Anthony Bryan ... Metalink [ http://www.metalinker.org ]
  )) Easier, More Reliable, Self Healing Downloads
