Hi Micha,

thanks for the patches! I'm including the MirrorBrain mailing list in my
reply, which is the best place for all communication about MirrorBrain
development. Since your mail is about code, I hope you'll pardon me for
taking your private mail where it belongs. It would be great if you could
write directly to the list next time, so the community is not excluded.
(Or open a bug, which also prevents things from collecting dust deep in
my mailbox :)

Comments below, scattered inline.

On Mon, Oct 04, 2010 at 06:52:22 +0200, Michael Schroeder wrote:
> 
> Hello!
> 
> On Fri, Sep 24, 2010 at 11:34:23AM +0200, Michael Schroeder wrote:
> > On Thu, Sep 23, 2010 at 07:44:41PM +0200, Peter Pöml wrote:
> > > MirrorBrain 2.13 has zsync support, which comes in the form of
> > > "downloadable" .zsync files.
> > > http://mirrorbrain.org/docs/configuration/#configuring-zsync-generation
> > 
> > I know that, but that's not what I need. I need the Metalink file
> > with zsync block checksums; I don't want to download a further file.
> 
> I now have the attached patch running on our systems.
> 
> You should definitely take over the following things:
> 
> mb/mb/hashes.py dump_2_12_template:
> 
> </verification> should not be printed inside the "if" block; when
> do_chunked_hashes is False, it is missing.
> 
> @@ -374,7 +388,8 @@ class HashBag:
>              r.append('   <hash piece="%s">%s</hash>' % (n, piece))
>              n += 1
> 
> -        r.append('  </pieces>\n </verification>\n')
> +        r.append('  </pieces>')
> +        r.append(' </verification>\n')
> 
>         return '\n'.join(r)

Ah, good catch! I'd just suggest adding the missing newline after
</pieces>.

> mod_mirrorbrain/mod_mirrorbrain.c mb_handler:
> 
> The calculation of the number of pieces is wrong. It needs to be
> rounded up, and the for loop must only run up to n:
> 
> @@ -1048,18 +1051,26 @@ static hashbag_t *hashbag_fill(request_r
> 
>          /* split the string into an array of the actual pieces */
> 
> -        apr_off_t n = r->finfo.size / h->sha1piecesize;
> +        apr_off_t n = (r->finfo.size + h->sha1piecesize - 1) / h->sha1piecesize;
>          // XXX ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, r, "[mod_mirrorbrain] dbd: %" APR_INT64_T_FMT " sha1 pieces", n);
> 
>          h->sha1pieceshex = apr_array_make(r->pool, n, sizeof(const char *));
>          int max = strlen(val);
>          int i;
> -        for (i = 0; (i <= n); i++) {
> +        for (i = 0; i < n; i++) {
>              if (((i + 1) * SHA1_DIGESTSIZE*2) > max)
>                  break;
>              APR_ARRAY_PUSH(h->sha1pieceshex, char *) = apr_pstrndup(r->pool,
>                  val + (i * SHA1_DIGESTSIZE * 2), (SHA1_DIGESTSIZE * 2));

Cool, that makes sense. Thanks! (Without rounding up, a file whose size
is not an exact multiple of the piece size, say 150 KiB with 64 KiB
pieces, would be counted as two pieces instead of three.)

> Also, as far as I can see, the "hashbag" is never freed. Does Apache
> do that automatically when it has finished handling the request?

Yes, it's fundamentally always like that in Apache's request handling.
That variable uses only "pool memory", allocated from a pool that is
released as a whole once the request is done. There are pools for
configuration, requests, connections, and other resources, each with a
different lifetime. In the mod_mirrorbrain handler, memory is allocated
from the request pool, with small exceptions (like the call to
realfile(), which needs to be malloc'ed and freed). That's also why you
see request_rec *r being passed around so frequently: it is largely there
to give access to r->pool, which points to the respective memory pool.
Strictly following this rule reduces the risk of memory leaks basically
to zero (and instead lets you find funny bugs deep, deep somewhere
else ;) It is worth knowing, though, that pool memory is actually reused,
and thus not completely freed after a request. Allocating large amounts
can therefore have a "lasting effect", so to speak.
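To illustrate the pattern, here is a minimal sketch of a handler that
allocates everything from the request pool (hypothetical example code,
not taken from mod_mirrorbrain):

    #include "httpd.h"
    #include "http_protocol.h"
    #include "apr_strings.h"
    #include "apr_tables.h"

    static int example_handler(request_rec *r)
    {
        /* allocated from the request pool; recycled automatically
         * when Apache is done with the request, no free() anywhere */
        char *greeting = apr_psprintf(r->pool, "Hello, %s\n", r->uri);

        /* the container functions take the pool as well */
        apr_array_header_t *arr = apr_array_make(r->pool, 4,
                                                 sizeof(const char *));
        APR_ARRAY_PUSH(arr, const char *) = apr_pstrdup(r->pool, greeting);

        ap_rputs(APR_ARRAY_IDX(arr, 0, const char *), r);
        return OK;
    }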
> Regards,
>   Micha.
> 
> -- 
> Michael Schroeder                                      mls@suse.de
> SUSE LINUX Products GmbH, GF Markus Rex, HRB 16746 AG Nuernberg
> main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}

Further inline comments below:

> --- ./mb/mb.py.orig     2010-10-04 14:00:02.000000000 +0000
> +++ ./mb/mb.py          2010-10-04 16:25:22.000000000 +0000
> @@ -911,6 +911,8 @@ class MirrorDoctor(cmdln.Cmdln):
>                              'subdirectory -- see examples)')
>     @cmdln.option('-t', '--target-dir', metavar='PATH',
>                   help='set the target directory (required)')
> +   @cmdln.option('-z', '--zsync-mask', metavar='REGEX',
> +                 help='regular expression to select files to create zsync hashes for')
>     @cmdln.option('-v', '--verbose', action='store_true',
>                   help='show more information')
>     def do_makehashes(self, subcmd, opts, startdir):
> @@ -975,6 +977,8 @@ class MirrorDoctor(cmdln.Cmdln):
>             opts.ignore_mask = re.compile(opts.ignore_mask)
>         if opts.file_mask:
>             opts.file_mask = re.compile(opts.file_mask)
> +       if opts.zsync_mask:
> +           opts.zsync_mask = re.compile(opts.zsync_mask)
> 
>         unlinked_files = unlinked_dirs = 0
> 
> @@ -1055,6 +1059,12 @@ class MirrorDoctor(cmdln.Cmdln):
>             if opts.ignore_mask and re.match(opts.ignore_mask, src):
>                 continue
> 
> +           do_chunked_with_zsync = False
> +           chunk_size = self.config.dbconfig.get('chunk_size')
> +           if opts.zsync_mask and self.config.dbconfig.get('chunked_hashes') and re.match(opts.zsync_mask, src):
> +               do_chunked_with_zsync = True
> +               chunk_size = 65536
> +
>             # stat only once
>             try:
>                 hasheable = mb.hashes.Hasheable(src_basename,
> @@ -1063,7 +1073,8 @@ class MirrorDoctor(cmdln.Cmdln):
>                                 base_dir=opts.base_dir,
>                                 do_zsync_hashes=self.config.dbconfig.get('zsync_hashes'),
>                                 do_chunked_hashes=self.config.dbconfig.get('chunked_hashes'),
> -                               chunk_size=self.config.dbconfig.get('chunk_size'))
> +                               chunk_size=chunk_size,
> +                               do_chunked_with_zsync=do_chunked_with_zsync)

This change (and the hunk above) breaks the configuration of the chunk
size via /etc/mirrorbrain.conf, which is where the value is read from (or
substituted by 262144); see mb/mb/conf.py. Even if the same chunk size is
not good for zsync (which I presume), it is still good for torrents, and
being able to adjust it locally is important. I suggest that either the
same chunk size is used (if that makes sense), or a separate variable
(zsync_chunk_size) is introduced into the configuration. You only need to
add it to the DEFAULTS dictionary in mb/mb/conf.py, I think. See
http://svn.mirrorbrain.org/viewvc/mirrorbrain/trunk/mb/mb/conf.py?r1=8061&r2=8078
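For illustration, this is roughly how it could look in
/etc/mirrorbrain.conf; the zsync_chunk_size option is hypothetical (it
does not exist yet), the section name is assumed to be the usual instance
section, and the fallback default would live in the DEFAULTS dictionary
in mb/mb/conf.py:

    [main]
    # existing, locally adjustable piece size used for torrents and
    # metalink <pieces> sections:
    chunk_size = 262144
    # hypothetical new option for the zsync piece size, falling back to
    # a default from mb/mb/conf.py when not set:
    zsync_chunk_size = 65536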
>     except OSError, e:
>         if e.errno == errno.ENOENT:
>             sys.stderr.write('File vanished: %r\n' % src)

> --- ./mb/mb/hashes.py.orig      2010-10-04 14:06:30.000000000 +0000
> +++ ./mb/mb/hashes.py           2010-10-04 16:02:38.000000000 +0000
> @@ -32,7 +32,7 @@ class Hasheable:
>     """represent a file and its metadata"""
>     def __init__(self, basename, src_dir=None, dst_dir=None,
>                  base_dir=None, do_zsync_hashes=False,
> -                do_chunked_hashes=True, chunk_size=DEFAULT_PIECESIZE):
> +                do_chunked_hashes=True, chunk_size=DEFAULT_PIECESIZE, do_chunked_with_zsync=False):
>         self.basename = basename
>         if src_dir:
>             self.src_dir = src_dir
> @@ -58,6 +58,7 @@ class Hasheable:
>         self.hb = HashBag(src=self.src, parent=self)
>         self.hb.do_zsync_hashes = do_zsync_hashes
>         self.hb.do_chunked_hashes = do_chunked_hashes
> +       self.hb.do_chunked_with_zsync = do_chunked_with_zsync
>         self.hb.chunk_size = chunk_size
> 
>     def islink(self):
> @@ -118,6 +119,8 @@ class Hasheable:
>             conn.mycursor = conn.Hash._connection.getConnection().cursor()
>         c = conn.mycursor
> 
> +       if self.hb.chunk_size == 0 or (self.size + self.hb.chunk_size - 1) / (self.hb.chunk_size - 1) != len(self.hb.pieceshex):
> +           self.hb.zsyncpieceshex = []
> 
>         c.execute("SELECT id FROM filearr WHERE path = %s LIMIT 1",
>                   [self.src_rel])
> @@ -166,7 +169,7 @@ class Hasheable:
>                    self.hb.sha1hex or '',
>                    self.hb.sha256hex or '',
>                    self.hb.chunk_size,
> -                  ''.join(self.hb.pieceshex),
> +                  ''.join(self.hb.pieceshex) + ''.join(self.hb.zsyncpieceshex),
>                    self.hb.btihhex or '',
>                    self.hb.pgp or '',
>                    self.hb.zblocksize,
> @@ -204,7 +207,7 @@ class Hasheable:
>                    WHERE file_id = %s""",
>                   [int(self.mtime), self.size,
>                    self.hb.md5hex or '', self.hb.sha1hex or '', self.hb.sha256hex or '',
> -                  self.hb.chunk_size, ''.join(self.hb.pieceshex),
> +                  self.hb.chunk_size, ''.join(self.hb.pieceshex) + ''.join(self.hb.zsyncpieceshex),
>                    self.hb.btihhex or '',
>                    self.hb.pgp or '',
>                    self.hb.zblocksize,

I have difficulty understanding the above two hunks. You append the zsync
hashes to the piece-wise hashes that are used for torrents and metalinks?
Really? There already is a database field for zsync hashes, zsums, which
is filled with self.hb.zsums. What prevented you from using that?

> @@ -241,6 +244,7 @@ class HashBag:
>         self.npieces = 0
>         self.pieces = []
>         self.pieceshex = []
> +       self.zsyncpieceshex = []
>         self.btih = None
>         self.btihhex = None
> 
> @@ -293,6 +297,8 @@ class HashBag:
>             if self.do_chunked_hashes:
>                 self.pieces.append(sha1.sha1(buf).digest())
>                 self.pieceshex.append(sha1.sha1(buf).hexdigest())
> +               if self.do_chunked_with_zsync:
> +                   self.zsyncpieceshex.append(self.get_zsync_digest(buf, self.chunk_size));
> 
>             if self.do_zsync_hashes:
>                 self.zs_get_block_sums(buf)
> @@ -367,6 +373,14 @@ class HashBag:
>             r.append('   </signature>')
> 
>         if self.do_chunked_hashes:
> +           if self.do_chunked_with_zsync:
> +               r.append('  <pieces length="%s" type="zsync">' % (self.chunk_size))
> +               n = 0
> +               for piece in self.zsyncpieceshex:
> +                   r.append('   <hash piece="%s">%s</hash>' % (n, piece))
> +                   n += 1
> +               r.append('  </pieces>')
> +
>             r.append('  <pieces length="%s" type="sha1">' % (self.chunk_size))

I should perhaps remark that this part of the code has been kept only for
backwards compatibility, for upgrading from 2.12.x to 2.13.x. Since
2.13.0, mod_mirrorbrain no longer reads the "hashes on disk" that are
generated here.
So far, they were still generated in order not to break a setup where a
mod_mirrorbrain module that relies on them is still running while a newer
mb makehashes tool has already been run (so you could have a safe
upgrade). By now, though, the code is obsolete. The presence of the
generated files is still used to track changes in the database (I had
difficulty removing hashes from the database for deleted files, and
worked around that by using these hash files).

> 
>             n = 0
> @@ -374,7 +388,8 @@ class HashBag:
>                 r.append('   <hash piece="%s">%s</hash>' % (n, piece))
>                 n += 1
> 
> -           r.append('  </pieces>\n </verification>\n')
> +           r.append('  </pieces>')
> +           r.append(' </verification>\n')
> 
>         return '\n'.join(r)

... due to the above, I would suppose that this bug has never surfaced.

> @@ -445,6 +460,12 @@ class HashBag:
>             self.zsums.append( c[0:self.zchecksum_len] ) # save only some leading bytes
> 
> 
> +   def get_zsync_digest(self, buf, blocksize):
> +       if len(buf) < blocksize:
> +           buf = buf + ( '\x00' * ( blocksize - len(buf) ) )
> +       r = zsync.rsum06(buf)
> +       return "%02x%02x%02x%02x" % (ord(r[3]), ord(r[2]), ord(r[1]), ord(r[0]))
> +

This looks to me as if zsync checksumming is now done twice. That's a
waste.

>     def calc_btih(self):
>         """ calculate a bittorrent information hash (btih) """
> 

Before I read on, I'd like to raise a general question regarding your
patch. Do you think it is a good idea to extend the Metalink with a
further <pieces> section? Which clients support this? I fear that there
is only one, namely the openSUSE download client. I don't have anything
against it, but before enlarging the database and adding code that has to
be maintained for a long time, which costs work, I would like to see how
this could best be done.

For the moment, I wonder why the existing zsync support is not good
enough. What's wrong with it? Shouldn't it be fixed/improved, instead of
adding a second database field with further zsync hashes? The existing
zsync hash support works with the standard zsync client, while your
proposed patch does not, since the hashes are contained in a Metalink
(which only Metalink clients can read, and the zsync client is not one of
them). Instead of including the zsync hashes inside a Metalink, I suggest
keeping them separate and directly accessible (as is currently the case).
That seems more generally useful to me. (There is no big reason not to
also include them in the Metalink, but that comes second.)

May I suggest a different way to get the zsync hashes across? First, we
could add a reference to them to the Metalink:

  <url type="zsync" preference="...">http://example.com/foo.zsync</url>

Second, a very "modern" way: one could add an HTTP Link header to the
server reply, which is the best way to add such a reference so that it
can be piggybacked onto any reply, easily accessible and just as easily
ignorable. This is the emerging standard. (We are also planning to use
these more in the future; see http://mirrorbrain.org/issues/issue15 )
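To make the Link header idea concrete, here is a minimal sketch of how a
handler could emit such a reference. The URL, the rel value
("describedby") and the media type are assumptions for illustration,
loosely following the Link header syntax of RFC 5988; this is not an
existing mod_mirrorbrain feature:

    #include "httpd.h"
    #include "apr_strings.h"
    #include "apr_tables.h"

    /* hypothetical helper: announce the .zsync file belonging to the
     * requested file, so any client can discover it without having to
     * parse a Metalink */
    static void announce_zsync(request_rec *r)
    {
        const char *link = apr_psprintf(r->pool,
                "<http://example.com%s.zsync>; rel=\"describedby\"; "
                "type=\"application/x-zsync\"", r->uri);
        apr_table_addn(r->headers_out, "Link", link);
    }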
Either way, your download client could use such a reference to download
the zsync data, which would then also be available in its standard form.

As mentioned above, there is no problem with also including the hashes in
the Metalink, probably configurable to save on size; but for now, my
problem with your patch is that it duplicates the zsync support. It would
be great if your implementation could be merged into the existing code.
What do you think?

Thanks for your contribution! And extra thanks for pointing out the two
bugs.

Peter

> --- ./mod_mirrorbrain/mod_mirrorbrain.c.orig    2010-09-24 09:53:56.000000000 +0000
> +++ ./mod_mirrorbrain/mod_mirrorbrain.c         2010-10-04 12:44:18.000000000 +0000
> @@ -119,6 +119,7 @@
>  #define MD5_DIGESTSIZE 16
>  #define SHA1_DIGESTSIZE 20
>  #define SHA256_DIGESTSIZE 32
> +#define ZSYNC_DIGESTSIZE 4
> 
>  #ifdef WITH_MEMCACHE
>  #define DEFAULT_MEMCACHED_LIFETIME 600
> @@ -210,6 +211,7 @@ struct hashbag {
>      const char *sha256hex;
>      int sha1piecesize;
>      apr_array_header_t *sha1pieceshex;
> +    apr_array_header_t *zsyncpieceshex;
>      const char *btihhex;
>      const char *pgp;
>      int zblocksize;
> @@ -970,6 +972,7 @@ static hashbag_t *hashbag_fill(request_r
>      h->sha256hex = NULL;
>      h->sha1piecesize = 0;
>      h->sha1pieceshex = NULL;
> +    h->zsyncpieceshex = NULL;
>      h->btihhex = NULL;
>      h->pgp = NULL;
>      h->zblocksize = 0;
> @@ -1048,18 +1051,26 @@ static hashbag_t *hashbag_fill(request_r
> 
>          /* split the string into an array of the actual pieces */
> 
> -        apr_off_t n = r->finfo.size / h->sha1piecesize;
> +        apr_off_t n = (r->finfo.size + h->sha1piecesize - 1) / h->sha1piecesize;
>          // XXX ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, r, "[mod_mirrorbrain] dbd: %" APR_INT64_T_FMT " sha1 pieces", n);
> 
>          h->sha1pieceshex = apr_array_make(r->pool, n, sizeof(const char *));
>          int max = strlen(val);
>          int i;
> -        for (i = 0; (i <= n); i++) {
> +        for (i = 0; i < n; i++) {
>              if (((i + 1) * SHA1_DIGESTSIZE*2) > max)
>                  break;
>              APR_ARRAY_PUSH(h->sha1pieceshex, char *) = apr_pstrndup(r->pool,
>                  val + (i * SHA1_DIGESTSIZE * 2), (SHA1_DIGESTSIZE * 2));
>          }
> +        // check if we have extra zsync data appended
> +        if (max == (SHA1_DIGESTSIZE + ZSYNC_DIGESTSIZE) * 2 * n) {
> +            h->zsyncpieceshex = apr_array_make(r->pool, n, sizeof(const char *));
> +            for (i = 0; i < n; i++) {
> +                APR_ARRAY_PUSH(h->zsyncpieceshex, char *) = apr_pstrndup(r->pool,
> +                    val + (n * SHA1_DIGESTSIZE * 2) + (i * ZSYNC_DIGESTSIZE * 2), (ZSYNC_DIGESTSIZE * 2));
> +            }
> +        }
>      }
>  }
> 
> @@ -2293,6 +2304,18 @@ static int mb_handler(request_rec *r)
>      if (hashbag->sha256hex)
>          ap_rprintf(r, "      <hash type=\"sha-256\">%s</hash>\n", hashbag->sha256hex);
> 
> +     if (hashbag->zsyncpieceshex
> +         && (hashbag->sha1piecesize > 0)
> +         && !apr_is_empty_array(hashbag->zsyncpieceshex)) {
> +         ap_rprintf(r, "      <pieces length=\"%d\" type=\"zsync\">\n",
> +                    hashbag->sha1piecesize);
> 
> +         char **p = (char **)hashbag->zsyncpieceshex->elts;
> +         for (i = 0; i < hashbag->zsyncpieceshex->nelts; i++) {
> +             ap_rprintf(r, "        <hash>%s</hash>\n", p[i]);
> +         }
> +         ap_rputs("      </pieces>\n", r);
> +     }
>      if (hashbag->sha1pieceshex
>          && (hashbag->sha1piecesize > 0)
>          && !apr_is_empty_array(hashbag->sha1pieceshex)) {
> @@ -2324,6 +2347,18 @@ static int mb_handler(request_rec *r)
>      if (hashbag->sha256hex)
>          ap_rprintf(r, "      <hash type=\"sha256\">%s</hash>\n", hashbag->sha256hex);
> 
> +     if (hashbag->zsyncpieceshex
> +         && (hashbag->sha1piecesize > 0)
> +         && !apr_is_empty_array(hashbag->zsyncpieceshex)) {
> +         ap_rprintf(r, "      <pieces length=\"%d\" type=\"zsync\">\n",
> +                    hashbag->sha1piecesize);
> 
> +         char **p = (char **)hashbag->zsyncpieceshex->elts;
> +         for (i = 0; i < hashbag->zsyncpieceshex->nelts; i++) {
> +             ap_rprintf(r, "        <hash piece=\"%d\">%s</hash>\n", i, p[i]);
> +         }
> +         ap_rputs("      </pieces>\n", r);
> +     }
>      if (hashbag->sha1pieceshex
>          && (hashbag->sha1piecesize > 0)
>          && !apr_is_empty_array(hashbag->sha1pieceshex)) {