Re: [mirrorbrain] Re: Repository delta download und Mirrorbrain 2.13

From: Peter Pöml <peter_at_poeml.de>
Date: Tue, 19 Oct 2010 17:54:17 +0200
Hi Micha,

On Tue, Oct 19, 2010 at 12:09:35PM +0200, Michael Schroeder wrote:
> On Tue, Oct 19, 2010 at 04:26:58AM +0200, Peter Pöml wrote:
> Oh yes, setting the chunk size to 65536 is just a quick hack I did.

Yup, I understood that meanwhile.

> > I have difficulty understanding the above two hunks. You append the
> > zsync hashes to the piece-wise hashes that are used in torrents and
> > metalinks? Really?
> 
> Absolutely. ;-) I needed the patch to be compatible to older
> mirrorbrain versions so I just appended the zsync hashes.
> As I said in my mail, this is just a hack to get it working again.
> I didn't want to extend the database with yet another field.

Oh, there is no reason not to touch the current zsync support - it is
experimental, documented as such, and not configured by default. IMO it
is better to improve it and make it more flexible than to add a second
implementation which is even less flexible and duplicates code. It also
duplicates data in the database, and CPU time spent calculating the
hashes. Especially the code maintenance aspect would give me a headache.
It's not as if this project had too much manpower ;-)
 
> Btw, you really should put the hash type of the pieces in the database
> and not assume that sha-1 is always used.

I disagree here. The database columns for piecewise hashes are clearly
named "sha1pieces" and "sha1piecesize" which makes sure that there is no
misunderstanding. One doesn't need to fill them or use them, of course.

Since the piecewise hashes are not used for final verification (that is
what the top hash is for), there is no big incentive to offer many types
of hashes here - and it could cost a lot of space anyway.
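To illustrate the distinction, here is a minimal sketch of how piecewise
hashes relate to the top hash. The function name and the 256 KiB piece
size are my own choices for the example, not taken from the mirrorbrain
code:

```python
import hashlib

def piecewise_hashes(data: bytes, piece_size: int = 262144):
    """Hash each fixed-size piece with SHA-1 (as would go into a column
    like sha1pieces, with piece_size in sha1piecesize), plus a whole-file
    top hash used for the final verification."""
    pieces = [hashlib.sha1(data[i:i + piece_size]).hexdigest()
              for i in range(0, len(data), piece_size)]
    top = hashlib.sha1(data).hexdigest()  # final check happens against this
    return pieces, top
```

A client uses the piece hashes only to locate and re-fetch a bad piece;
the download as a whole still stands or falls with the top hash, which is
why one type of piecewise hash goes a long way.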

If another type of piecewise hash is added, it should obviously not go
into the wrong column but get its own.

Overloading that one column is not a good idea; there are already
several separate columns for the different types of data, precisely so
they can be accessed efficiently. If you look at the database schema,
you'll notice that there's more than a single column named "blob". ;-)

There is a defined layer of abstraction for storing this kind of data,
and I don't think there is a need to leave that and implement a
"microformat" on a different level.

> > There already is a database field for zsync hashes, zsums, which is
> > filled with self.hb.zsums. What did prevent you from using that?
> 
> Cause I didn't want to break mirrorbrain's own zsync support.
> Compatibility...

Don't worry -- that part is very experimental still. And what would
break anyway? So scared? ;-)

> > This looks to me as if zsync checksumming is now done twice. That's a
> > waste.
> 
> Only if you use mirrorbrain's zsync support, which I don't.

If there are other reasons not to use it, apart from fear of breaking
it, could you explain them? Maybe there is something I'm missing.

> Oh sure, as it's supported by the standard.
> 
> A cleaner implementation would be to
> - put the checksum type for the pieces in the database
> - use "sha-1+zsync" as type with a hash size of 20+4 bytes instead
>   of appending the zsync part.

Hm, sha-1+zsync (i.e., combining two hashes into one) doesn't sound like
a very clean thing to me, and it seems unlikely to be supported by
anyone else. Do you think this could become a standard?

If you suggest combining hashes (with the same block size) because that
may seem cleaner to you, I think that makes sense only for clients that
support all of the combined hashes, and the way you combine them (which
is not in any standard yet). My gut feeling, however, is that there will
be clients that do not support all of the hashes, which would make it
difficult for them to cope with the metadata. Therefore, I think it is
cleaner to keep the hashes separate (and also less error-prone, and less
work for programmers) -- just like you did, adding them alongside the
other hashes.
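As a sketch of what "separate" means per block: a strong 20-byte SHA-1
digest and a 4-byte weak checksum can be computed and stored side by
side. The weak checksum below is an rsync-style rolling sum used as an
illustrative stand-in; zsync's actual rsum differs in its details:

```python
import hashlib

def block_checksums(block: bytes):
    """Return two independent per-block hashes: a 20-byte SHA-1 strong
    hash and a 4-byte rsync-style weak checksum (stand-in for zsync's
    rsum). Kept separate rather than concatenated into one value."""
    a = sum(block) % 65536
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % 65536
    weak = (b << 16) | a                    # packs into 4 bytes
    strong = hashlib.sha1(block).digest()   # 20 bytes
    return strong, weak.to_bytes(4, "big")
```

A client that only understands SHA-1 can simply ignore the weak
checksums; with a concatenated 24-byte "sha-1+zsync" value it would
first have to know how to split them apart.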

We also don't put a hash of type "sha-1+sha-256+md5+pgp" into the
Metalink, consisting of 20+32+16+x bytes, or varying from day to day,
but instead we list them separately -- because in most people's book
that is easier to handle and cleaner.

Or to rephrase this: there is a defined level at which to abstract the
metadata (the IETF Metalink), and while introducing a microformat on the
side may be creative, it breaks out of that scheme. It would require a
different way of parsing, and it would break backwards compatibility,
which the current way of transmitting the metadata doesn't.
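For reference, this is roughly how a Metalink 4 file (RFC 5854) lists
hashes: each hash type gets its own element, and piecewise hashes sit in
their own <pieces> container with an explicit type. The file name and
the "..." digests below are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<metalink xmlns="urn:ietf:params:xml:ns:metalink">
  <file name="example.iso">
    <hash type="sha-1">...</hash>
    <hash type="sha-256">...</hash>
    <pieces type="sha-1" length="262144">
      <hash>...</hash>
      <hash>...</hash>
    </pieces>
  </file>
</metalink>
```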

> > For the moment, I wonder why the existing zsync support is not good
> > enough. What's wrong with it? Should it not be fixed/improved, instead
> > of adding a second database field with further zsync hashes?
> 
> No, I don't want to do another request for a zsync file (which also
> contains the block checksums again, thus is wasteful). For our
> use case (minimizing transfer and latency) it's best to have them
> in the metalink file.

Fair enough, but then I propose to make this configurable. Not everybody
would want to have the additional data sent out with every metalink, I
suppose. (But that's the smallest & easiest part :)

> > The existing zsync hash support works with the standard zsync client,
> > while your proposed patch does not, since the hashes are contained in
> > a Metalink (which only Metalink clients can read, which is not the case
> > for the zsync client). 
> 
> I don't mind, as I don't use a standard zsync client.

You might not mind, but I do. I *do* care :-)

So from what I see, it should be possible for you to rework your patch
to use the existing data column, and to calculate the hashes once
instead of twice. Or are there further obstacles with that approach?

Peter


_______________________________________________
mirrorbrain mailing list
Archive: http://mirrorbrain.org/archive/mirrorbrain/

Received on Tue Oct 19 2010 - 15:54:25 GMT

This archive was generated by hypermail 2.3.0 : Thu Oct 21 2010 - 11:47:06 GMT