[mirrorbrain] Re: Repository delta download and Mirrorbrain 2.13

From: Peter Poeml <peter@poeml.de>
Date: Tue, 19 Oct 2010 04:26:58 +0200

Hi Micha,

thanks for the patches!

I'm including the MirrorBrain mailing list in my reply, which is the
best place for all communication about MirrorBrain development. Since
your mail is about code, I hope you pardon me for taking your private
mail where it belongs. It would be great if you could write directly to
the list next time, so the community is not excluded. (Or open a bug,
which also prevents things from collecting dust deep in my mailbox :)

Comments below, scattered inline.

On Mon, Oct 04, 2010 at 06:52:22 +0200, Michael Schroeder wrote:
> 
> Hello!
> 
> On Fri, Sep 24, 2010 at 11:34:23AM +0200, Michael Schroeder wrote:
> > > On Thu, Sep 23, 2010 at 07:44:41PM +0200, Peter Poeml wrote:
> > > MirrorBrain 2.13 has zsync support, which comes in the form of
> > > "downloadable" .zsync files.
> > > http://mirrorbrain.org/docs/configuration/#configuring-zsync-generation
> > 
> > I know that, but that's not what I need. I need the metalink file
> > with zsync block checksums; I don't want to download yet another
> > file.
> 
> I now have the attached patch running on our systems.
> 
> You should definitely pick up the following changes:
> 
> mb/mb/hashes.py dump_2_12_template:
> 
> </verification> should not be emitted inside the "if" block; when
> do_chunked_hashes is False, it is missing.
> 
> @@ -374,7 +388,8 @@ class HashBag:
>                  r.append('            <hash piece="%s">%s</hash>' % (n, piece))
>                  n += 1
>  
> -            r.append('        </pieces>\n      </verification>\n')
> +            r.append('        </pieces>')
> +        r.append('      </verification>\n')
>  
>          return '\n'.join(r)

Ah, good catch - I'd just suggest adding the missing newline after </pieces>.


> mod_mirrorbrain/mod_mirrorbrain.c mb_handler:
> 
> The calculation of the number of pieces is wrong. It needs to be
> rounded up, and the for loop must only run up to n:

> @@ -1048,18 +1051,26 @@ static hashbag_t *hashbag_fill(request_r
>  
>              /* split the string into an array of the actual pieces */
>  
> -            apr_off_t n = r->finfo.size / h->sha1piecesize;
> +            apr_off_t n = (r->finfo.size  + h->sha1piecesize - 1) / h->sha1piecesize;
>              // XXX ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, r, "[mod_mirrorbrain] dbd: %" APR_INT64_T_FMT " sha1 pieces", n);
>  
>              h->sha1pieceshex = apr_array_make(r->pool, n, sizeof(const char *));
>              int max = strlen(val);
>              int i;
> -            for (i = 0; (i <= n); i++) {
> +            for (i = 0; i < n; i++) {
>                  if (((i + 1) * SHA1_DIGESTSIZE*2) > max)
>                          break;
>                  APR_ARRAY_PUSH(h->sha1pieceshex, char *) = apr_pstrndup(r->pool, 
>                          val + (i * SHA1_DIGESTSIZE * 2), (SHA1_DIGESTSIZE * 2));

Cool, that makes sense. Thanks!
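For the record, the effect of the corrected ceiling division is easy to
check in isolation; this is a standalone sketch, not code from either tree:

```python
def npieces(size, piecesize):
    """Number of hash pieces for a file: ceiling division, as in the fix."""
    # Equivalent to the C expression (size + piecesize - 1) / piecesize.
    return (size + piecesize - 1) // piecesize

# A 100-byte file with 64-byte pieces has a trailing partial piece, so it
# needs 2 pieces -- plain integer division would yield only 1, and the old
# loop bound (i <= n) would then read one piece too many.
```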

> Also, as far as I can tell, the "hashbag" is never freed. Does
> Apache do that automatically when the request is finished?

Yes, that's fundamentally how Apache's request handling works. The
variable uses only "pool memory", allocated from a memory pool that is
released as a whole once the request is done. There are pools for
configuration, requests, connections, and other resources, each with a
different lifetime.

In the mod_mirrorbrain handler, memory is allocated from the request
pool, with small exceptions (like the call to realfile(), which needs
to be malloced and freed).

That's also why you see request_rec *r being passed around so
frequently: it is mainly needed to access r->pool, which points to the
respective memory pool.

Strictly following this rule reduces the risk of memory leaks
basically to zero (and instead lets you find funny bugs deep, deep
somewhere else ;)

It is worth knowing, though, that pool memory is actually reused, and
thus not completely freed after a request. Allocating large amounts
can therefore have a "lasting effect", so to speak.

> Regards,
>   Micha.
> 
> -- 
> Michael Schroeder                                   mls@suse.de
> SUSE LINUX Products GmbH, GF Markus Rex, HRB 16746 AG Nuernberg
> main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}


Further inline comments below:

> --- ./mb/mb.py.orig	2010-10-04 14:00:02.000000000 +0000
> +++ ./mb/mb.py	2010-10-04 16:25:22.000000000 +0000
> @@ -911,6 +911,8 @@ class MirrorDoctor(cmdln.Cmdln):
>                               'subdirectory -- see examples)')
>      @cmdln.option('-t', '--target-dir', metavar='PATH',
>                          help='set the target directory (required)')
> +    @cmdln.option('-z', '--zsync-mask', metavar='REGEX',
> +                        help='regular expression to select files to create zsync hashes for')
>      @cmdln.option('-v', '--verbose', action='store_true',
>                          help='show more information')
>      def do_makehashes(self, subcmd, opts, startdir):
> @@ -975,6 +977,8 @@ class MirrorDoctor(cmdln.Cmdln):
>              opts.ignore_mask = re.compile(opts.ignore_mask)
>          if opts.file_mask: 
>              opts.file_mask = re.compile(opts.file_mask)
> +        if opts.zsync_mask: 
> +            opts.zsync_mask = re.compile(opts.zsync_mask)
>  
>          unlinked_files = unlinked_dirs = 0
>  
> @@ -1055,6 +1059,12 @@ class MirrorDoctor(cmdln.Cmdln):
>                  if opts.ignore_mask and re.match(opts.ignore_mask, src):
>                      continue
>  
> +                do_chunked_with_zsync = False
> +                chunk_size = self.config.dbconfig.get('chunk_size')
> +                if opts.zsync_mask and self.config.dbconfig.get('chunked_hashes') and re.match(opts.zsync_mask, src):
> +                    do_chunked_with_zsync = True
> +                    chunk_size = 65536
> +
>                  # stat only once
>                  try:
>                      hasheable = mb.hashes.Hasheable(src_basename, 
> @@ -1063,7 +1073,8 @@ class MirrorDoctor(cmdln.Cmdln):
>                                                      base_dir=opts.base_dir,
>                                                      do_zsync_hashes=self.config.dbconfig.get('zsync_hashes'),
>                                                      do_chunked_hashes=self.config.dbconfig.get('chunked_hashes'),
> -                                                    chunk_size=self.config.dbconfig.get('chunk_size'))
> +                                                    chunk_size=chunk_size,
> +                                                    do_chunked_with_zsync=do_chunked_with_zsync)

This change (and the hunk above) breaks configuring the chunk size via
/etc/mirrorbrain.conf, where the value is normally read from (with
262144 substituted as a default); see mb/mb/conf.py.

If the same chunk size is not good for zsync (which I presume), it is
still good for torrents and important for local adjustment.

I suggest that either the same chunk size is used (if that makes sense),
or a separate variable (zsync_chunk_size) is introduced to the
configuration. You only need to add it to the DEFAULTS dictionary in
mb/mb/conf.py, I think. See 
http://svn.mirrorbrain.org/viewvc/mirrorbrain/trunk/mb/mb/conf.py?r1=8061&r2=8078
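To illustrate what I mean, a separate knob could look roughly like this.
This is only a sketch; the exact shape of the DEFAULTS dictionary and the
65536 default are my assumptions, not the real conf.py:

```python
# Hypothetical sketch of how a zsync_chunk_size default could sit next to
# the existing chunk_size in the DEFAULTS dictionary of mb/mb/conf.py.
DEFAULTS = {
    'chunked_hashes': True,
    'chunk_size': 262144,        # existing default for the sha1 pieces
    'zsync_chunk_size': 65536,   # proposed separate default for zsync pieces
}

def lookup(instance_config, key):
    """Return a configured value, substituting the default if unset."""
    value = instance_config.get(key)
    return value if value is not None else DEFAULTS.get(key)
```

That way /etc/mirrorbrain.conf keeps control over both sizes, and local
adjustment of the torrent/metalink piece size stays possible.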

>                  except OSError, e:
>                      if e.errno == errno.ENOENT:
>                          sys.stderr.write('File vanished: %r\n' % src)
> --- ./mb/mb/hashes.py.orig	2010-10-04 14:06:30.000000000 +0000
> +++ ./mb/mb/hashes.py	2010-10-04 16:02:38.000000000 +0000
> @@ -32,7 +32,7 @@ class Hasheable:
>      """represent a file and its metadata"""
>      def __init__(self, basename, src_dir=None, dst_dir=None,
>                   base_dir=None, do_zsync_hashes=False,
> -                 do_chunked_hashes=True, chunk_size=DEFAULT_PIECESIZE):
> +                 do_chunked_hashes=True, chunk_size=DEFAULT_PIECESIZE, do_chunked_with_zsync=False):
>          self.basename = basename
>          if src_dir:
>              self.src_dir = src_dir
> @@ -58,6 +58,7 @@ class Hasheable:
>          self.hb = HashBag(src=self.src, parent=self)
>          self.hb.do_zsync_hashes = do_zsync_hashes
>          self.hb.do_chunked_hashes = do_chunked_hashes
> +        self.hb.do_chunked_with_zsync = do_chunked_with_zsync
>          self.hb.chunk_size = chunk_size
>  
>      def islink(self):
> @@ -118,6 +119,8 @@ class Hasheable:
>              conn.mycursor = conn.Hash._connection.getConnection().cursor()
>          c = conn.mycursor
>  
> +        if self.hb.chunk_size == 0 or (self.size + self.hb.chunk_size - 1) / (self.hb.chunk_size - 1) != len(self.hb.pieceshex):
> +            self.hb.zsyncpieceshex = []
>  
>          c.execute("SELECT id FROM filearr WHERE path = %s LIMIT 1",
>                    [self.src_rel])
> @@ -166,7 +169,7 @@ class Hasheable:
>                         self.hb.sha1hex or '',
>                         self.hb.sha256hex or '',
>                         self.hb.chunk_size,
> -                       ''.join(self.hb.pieceshex),
> +                       ''.join(self.hb.pieceshex) + ''.join(self.hb.zsyncpieceshex),
>                         self.hb.btihhex or '',
>                         self.hb.pgp or '',
>                         self.hb.zblocksize,
> @@ -204,7 +207,7 @@ class Hasheable:
>                           WHERE file_id = %s""",
>                        [int(self.mtime), self.size,
>                         self.hb.md5hex or '', self.hb.sha1hex or '', self.hb.sha256hex or '',
> -                       self.hb.chunk_size, ''.join(self.hb.pieceshex),
> +                       self.hb.chunk_size, ''.join(self.hb.pieceshex) + ''.join(self.hb.zsyncpieceshex),
>                         self.hb.btihhex or '',
>                         self.hb.pgp or '', 
>                         self.hb.zblocksize,

I have difficulty understanding the above two hunks. You append the
zsync hashes to the piece-wise hashes that are used in torrents and
metalinks? Really?

There already is a database field for zsync hashes, zsums, which is
filled with self.hb.zsums. What prevented you from using that?

> @@ -241,6 +244,7 @@ class HashBag:
>          self.npieces = 0
>          self.pieces = []
>          self.pieceshex = []
> +        self.zsyncpieceshex = []
>          self.btih = None
>          self.btihhex = None
>  
> @@ -293,6 +297,8 @@ class HashBag:
>              if self.do_chunked_hashes:
>                  self.pieces.append(sha1.sha1(buf).digest())
>                  self.pieceshex.append(sha1.sha1(buf).hexdigest())
> +                if self.do_chunked_with_zsync:
> +                    self.zsyncpieceshex.append(self.get_zsync_digest(buf, self.chunk_size));
>  
>              if self.do_zsync_hashes:
>                  self.zs_get_block_sums(buf)
> @@ -367,6 +373,14 @@ class HashBag:
>              r.append('        </signature>')
>  
>          if self.do_chunked_hashes:
> +            if self.do_chunked_with_zsync:
> +                r.append('        <pieces length="%s" type="zsync">' % (self.chunk_size))
> +                n = 0
> +                for piece in self.zsyncpieceshex:
> +                    r.append('            <hash piece="%s">%s</hash>' % (n, piece))
> +                    n += 1
> +                r.append('        </pieces>')
> +
>              r.append('        <pieces length="%s" type="sha1">' % (self.chunk_size))

I should maybe remark that this part of the code has been left in
place for backwards compatibility only, for upgrading from 2.12.x to
2.13.x. Since 2.13.0, mod_mirrorbrain no longer reads the "hashes on
disk" that are generated here.

So far, they have still been generated, in order not to break a setup
where a mod_mirrorbrain module that relies on them is running while a
newer mb makehashes tool has already been run (so you could have a
safe upgrade).

But now the code is obsolete. The presence of the generated files is
still used to track changes in the database (I had difficulty removing
stale hashes from the database for deleted files, and worked around
that by using these hash files).

>  
>              n = 0
> @@ -374,7 +388,8 @@ class HashBag:
>                  r.append('            <hash piece="%s">%s</hash>' % (n, piece))
>                  n += 1
>  
> -            r.append('        </pieces>\n      </verification>\n')
> +            r.append('        </pieces>')
> +        r.append('      </verification>\n')
>  
>          return '\n'.join(r)

... due to the above, I would suppose that this bug has never surfaced.

>  
> @@ -445,6 +460,12 @@ class HashBag:
>                  self.zsums.append( c[0:self.zchecksum_len] )  # save only some leading bytes
>  
>  
> +    def get_zsync_digest(self, buf, blocksize):
> +        if len(buf) < blocksize:
> +            buf = buf + ( '\x00' * ( blocksize - len(buf) ) )
> +        r = zsync.rsum06(buf)
> +        return "%02x%02x%02x%02x" % (ord(r[3]), ord(r[2]), ord(r[1]), ord(r[0]))
> +

This looks to me as if zsync checksumming is now done twice. That's a
waste.
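To make the duplication concrete: get_zsync_digest() computes a per-block
weak checksum of the kind that zs_get_block_sums() already produces. A
rough pure-Python approximation of such an rsync-style checksum follows;
the real zsync.rsum06 is a C helper, so the exact byte layout here is an
assumption:

```python
def rsum(buf):
    """Rsync-style weak checksum: a = byte sum, b = position-weighted sum."""
    a = b = 0
    n = len(buf)
    for i, c in enumerate(buf):  # bytes iterate as ints in Python 3
        a = (a + c) & 0xFFFF
        b = (b + (n - i) * c) & 0xFFFF
    return a, b

def zsync_digest_hex(buf, blocksize):
    # Zero-pad a short trailing block, as get_zsync_digest() in the patch does.
    if len(buf) < blocksize:
        buf += b"\x00" * (blocksize - len(buf))
    a, b = rsum(buf)
    return "%04x%04x" % (a, b)  # ordering relative to rsum06 is an assumption
```

Computing this once per block and sharing the result between the metalink
and zsync paths would avoid hashing every block twice.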

>      def calc_btih(self):
>          """ calculate a bittorrent information hash (btih) """
>  


Before I read on, I'd like to raise a general question regarding your
patch. Do you think it is a good idea to extend the Metalink with a
further <pieces> section?

Which clients support this? I fear that there is only one: the
openSUSE download client. I don't have anything against it, but before
enlarging the database and adding code that has to be maintained for a
long time (which costs work), I would like to see how this could best
be done.
For the moment, I wonder why the existing zsync support is not good
enough. What's wrong with it? Should it not be fixed/improved, instead
of adding a second database field with further zsync hashes?

The existing zsync hash support works with the standard zsync client,
while your proposed patch does not, since its hashes are contained in
a Metalink, which only Metalink clients can read (and the zsync client
cannot).

Instead of including the zsync hashes inside a Metalink, I suggest
keeping them separate and directly accessible, as is currently the
case. That seems more generally useful to me. (There is no strong
reason not to also include them in the Metalink, but that is
secondary.)

May I suggest a different way to get the zsync hashes across? First, we 
could add a reference to them to the Metalink:
<url type="zsync" preference="...">http://example.com/foo.zsync</url>

Second, a very "modern" way: one could add an HTTP Link header to the
server reply, which is the best way to add such a reference so that it
can be piggybacked onto any reply, easily accessible and just as
easily ignored. This is the emerging standard.
(We are also planning to use these headers more in the future, see
http://mirrorbrain.org/issues/issue15 )
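As a sketch of what such a header could look like; the rel and type
values follow the Metalink/HTTP draft style and are my assumptions, not
something mod_mirrorbrain emits today:

```python
def zsync_link_header(url):
    """Build a hypothetical Link header value pointing at the .zsync file."""
    return '<%s>; rel="describedby"; type="application/x-zsync"' % url

# The server would piggyback this onto a normal reply, e.g.:
#   Link: <http://example.com/foo.zsync>; rel="describedby"; type="application/x-zsync"
```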

Either way, your download client could use such a reference to download
the zsync data, which would then be available also in its standard form.

As mentioned above, there is no problem with also including that in
the Metalink (probably configurable, to save on size), but for now my
problem with your patch is that it duplicates the zsync support. It
would be great if your implementation could be merged into the
existing code.
What do you think?

Thanks for your contribution! And extra thanks for pointing out the two
bugs.

Peter


> --- ./mod_mirrorbrain/mod_mirrorbrain.c.orig	2010-09-24 09:53:56.000000000 +0000
> +++ ./mod_mirrorbrain/mod_mirrorbrain.c	2010-10-04 12:44:18.000000000 +0000
> @@ -119,6 +119,7 @@
>  #define MD5_DIGESTSIZE 16
>  #define SHA1_DIGESTSIZE 20
>  #define SHA256_DIGESTSIZE 32
> +#define ZSYNC_DIGESTSIZE 4
>  
>  #ifdef WITH_MEMCACHE
>  #define DEFAULT_MEMCACHED_LIFETIME 600
> @@ -210,6 +211,7 @@ struct hashbag {
>      const char *sha256hex;
>      int sha1piecesize;
>      apr_array_header_t *sha1pieceshex;
> +    apr_array_header_t *zsyncpieceshex;
>      const char *btihhex;
>      const char *pgp;
>      int zblocksize;
> @@ -970,6 +972,7 @@ static hashbag_t *hashbag_fill(request_r
>      h->sha256hex = NULL;
>      h->sha1piecesize = 0;
>      h->sha1pieceshex = NULL;
> +    h->zsyncpieceshex = NULL;
>      h->btihhex = NULL;
>      h->pgp = NULL;
>      h->zblocksize = 0;
> @@ -1048,18 +1051,26 @@ static hashbag_t *hashbag_fill(request_r
>  
>              /* split the string into an array of the actual pieces */
>  
> -            apr_off_t n = r->finfo.size / h->sha1piecesize;
> +            apr_off_t n = (r->finfo.size  + h->sha1piecesize - 1) / h->sha1piecesize;
>              // XXX ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, r, "[mod_mirrorbrain] dbd: %" APR_INT64_T_FMT " sha1 pieces", n);
>  
>              h->sha1pieceshex = apr_array_make(r->pool, n, sizeof(const char *));
>              int max = strlen(val);
>              int i;
> -            for (i = 0; (i <= n); i++) {
> +            for (i = 0; i < n; i++) {
>                  if (((i + 1) * SHA1_DIGESTSIZE*2) > max)
>                          break;
>                  APR_ARRAY_PUSH(h->sha1pieceshex, char *) = apr_pstrndup(r->pool, 
>                          val + (i * SHA1_DIGESTSIZE * 2), (SHA1_DIGESTSIZE * 2));
>              }
> +	    // check if we have extra zsync data appended
> +	    if (max == (SHA1_DIGESTSIZE + ZSYNC_DIGESTSIZE) * 2 * n) {
> +		h->zsyncpieceshex = apr_array_make(r->pool, n, sizeof(const char *));
> +		for (i = 0; i < n; i++) {
> +		    APR_ARRAY_PUSH(h->zsyncpieceshex, char *) = apr_pstrndup(r->pool, 
> +			    val + (n * SHA1_DIGESTSIZE * 2) + (i * ZSYNC_DIGESTSIZE * 2), (ZSYNC_DIGESTSIZE * 2));
> +		}
> +	    }
>          }
>      }
>  
> @@ -2293,6 +2304,18 @@ static int mb_handler(request_rec *r)
>                      if (hashbag->sha256hex)
>                          ap_rprintf(r, "    <hash type=\"sha-256\">%s</hash>\n", hashbag->sha256hex);
>  
> +                    if (hashbag->zsyncpieceshex 
> +                        && (hashbag->sha1piecesize > 0) 
> +                        && !apr_is_empty_array(hashbag->zsyncpieceshex)) {
> +                        ap_rprintf(r, "    <pieces length=\"%d\" type=\"zsync\">\n", 
> +                                   hashbag->sha1piecesize);
> +
> +                        char **p = (char **)hashbag->zsyncpieceshex->elts;
> +                        for (i = 0; i < hashbag->zsyncpieceshex->nelts; i++) {
> +                            ap_rprintf(r, "      <hash>%s</hash>\n", p[i]);
> +                        }
> +                        ap_rputs("    </pieces>\n", r);
> +                    }
>                      if (hashbag->sha1pieceshex 
>                          && (hashbag->sha1piecesize > 0) 
>                          && !apr_is_empty_array(hashbag->sha1pieceshex)) {
> @@ -2324,6 +2347,18 @@ static int mb_handler(request_rec *r)
>                      if (hashbag->sha256hex)
>                          ap_rprintf(r, "        <hash type=\"sha256\">%s</hash>\n", hashbag->sha256hex);
>  
> +                    if (hashbag->zsyncpieceshex 
> +                        && (hashbag->sha1piecesize > 0) 
> +                        && !apr_is_empty_array(hashbag->zsyncpieceshex)) {
> +                        ap_rprintf(r, "        <pieces length=\"%d\" type=\"zsync\">\n", 
> +                                   hashbag->sha1piecesize);
> +
> +                        char **p = (char **)hashbag->zsyncpieceshex->elts;
> +                        for (i = 0; i < hashbag->zsyncpieceshex->nelts; i++) {
> +                            ap_rprintf(r, "          <hash piece=\"%d\">%s</hash>\n", i, p[i]);
> +                        }
> +                        ap_rputs("        </pieces>\n", r);
> +                    }
>                      if (hashbag->sha1pieceshex 
>                          && (hashbag->sha1piecesize > 0) 
>                          && !apr_is_empty_array(hashbag->sha1pieceshex)) {



_______________________________________________
mirrorbrain mailing list
Archive: http://mirrorbrain.org/archive/mirrorbrain/

Received on Tue Oct 19 2010 - 02:27:06 GMT

This archive was generated by hypermail 2.3.0 : Tue Oct 19 2010 - 11:17:06 GMT