Discussion:
[fossil-users] Compare by hash
Kelly Dean
2015-01-12 11:24:58 UTC
Permalink
Git does compare-by-hash. This is a mistake, because Valerie Hansen said so. It seems Fossil makes the same mistake. If the hash function is broken, you're hosed, and you don't know until it's too late. The checks described at selfcheck.wiki don't detect it.

ZFS provides a dedup=verify option in order to prevent this failure mode.
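For reference, here's roughly how that option is enabled (a minimal sketch; the pool/dataset name "tank/repos" is made up):

#!/bin/sh
# Sketch: enable ZFS dedup with byte-for-byte verification on a
# hypothetical dataset "tank/repos". With dedup=verify, a hash match
# alone is not trusted; candidate blocks are compared by content before
# the write is deduplicated.
zfs set dedup=verify tank/repos
zfs get dedup tank/repos   # confirm the property took effect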
Stephan Beal
2015-01-12 11:44:28 UTC
Permalink
Post by Kelly Dean
Git does compare-by-hash. This is a mistake, because Valerie Hansen said
so. It seems Fossil makes the same mistake. If the hash function is broken,
you're hosed, and you don't know until it's too late. The checks described
at selfcheck.wiki don't detect it.
"It's not a problem until you make it a problem."

George Clooney in "Dusk til Dawn"
--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
"Freedom is sloppy. But since tyranny's the only guaranteed byproduct of
those who insist on a perfect world, freedom will have to do." -- Bigby Wolf
Joerg Sonnenberger
2015-01-12 12:28:41 UTC
Permalink
Post by Kelly Dean
Git does compare-by-hash. This is a mistake, because Valerie Hansen
said so. It seems Fossil makes the same mistake.
I have no idea who Valerie Hansen is or why I should care about him.
Comparing by content is generally not an option because the other
content is not available.

Joerg
Richard Hipp
2015-01-12 12:43:47 UTC
Permalink
Post by Joerg Sonnenberger
Post by Kelly Dean
Git does compare-by-hash. This is a mistake, because Valerie Hansen
said so. It seems Fossil makes the same mistake.
I have no idea who Valerie Hansen is or why I should care about him.
Comparing by content is generally not an option because the other
content is not available.
Perhaps Kelly refers to this paper:
https://www.usenix.org/legacy/events/hotos03/tech/full_papers/henson/henson_html/

The monotone people addressed this concern over a decade ago:
http://www.monotone.ca/docs/Hash-Integrity.html
--
D. Richard Hipp
***@sqlite.org
Kelly Dean
2015-01-13 01:49:27 UTC
Permalink
Post by Richard Hipp
http://www.monotone.ca/docs/Hash-Integrity.html
The monotone page says:
⌜the vulnerability is not that the hash might suddenly cease to address benign blocks well, but merely that additional security precautions might become a requirement to ensure that blocks are benign, rather than malicious.⌝

If you allow me to push anything into your repository (even just pushing bug reports; you don't have to trust me), and I can find a hash preimage for an artifact that I know somebody else is planning to push to you, then I can push my artifact first, and when the other person tries to push, you won't get his artifact.

Or if somebody is going to send you a file I don't want you to see, and I can find a preimage, I can send you my file first, with some excuse for you to commit it somewhere (anywhere in your repository). Then when the other person sends you his file, and you try to commit, Fossil will silently fail to commit it. If you then delete the file without looking at it, and later rely on Fossil to show it to you, you'll get mine instead.

Not only is there no published preimage attack on sha1, there isn't even one on md5 (only collision attacks for md5). But after sha1 gets practical collisions, I can generate a collision, send one of the artifacts to you, and send the other one to somebody else. Then when you two synchronize, you'll think you both have the same artifact, but you don't.

This is peanut-gallerizing because there's no practical collision attack for sha1 yet, and especially because even when there is one, in Fossil's case I'd have to collide both sha1 and md5 at the same time, which would be really hard (but I don't think the cryptographers will consider sha1+md5 to be a secure hash function after sha1 is broken).
Stephan Beal
2015-01-13 02:10:00 UTC
Permalink
Post by Kelly Dean
If you allow me to push anything into your repository (even just pushing
bug reports; you don't have to trust me), and I can find a hash preimage
Prove it can happen first ;). It's a theoretical problem with (AFAIK) no
proven attack (at least not in the context of fossil).
--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
"Freedom is sloppy. But since tyranny's the only guaranteed byproduct of
those who insist on a perfect world, freedom will have to do." -- Bigby Wolf
Kelly Dean
2015-01-13 02:46:41 UTC
Permalink
Post by Stephan Beal
Post by Kelly Dean
If you allow me to push anything into your repository (even just pushing
bug reports; you don't have to trust me), and I can find a hash preimage
Prove it can happen first ;). It's a theoretical problem with (AFAIK) no
proven attack
Are you referring to finding a preimage, or using a preimage to attack Fossil? If the former, then it's true by the pigeonhole principle, though you can't do it for sha1 yet.
Post by Stephan Beal
(at least not in the context of fossil).
If you can find a preimage, then you certainly can do the attack I described. If you don't believe it, then build a copy of Fossil that uses a noncryptographic hash (e.g. a simple checksum) instead of sha1 and md5, compute a preimage (which is trivial in this case), and try the attack. Fossil dedupes by hash, and compares only by hash (i.e. not by content), so the attack will work.
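If you'd rather not patch Fossil, here's a standalone sketch of the same failure mode (my own toy illustration, not Fossil code): a content-addressed store keyed by a deliberately weak hash, the first four hex digits of MD5, so a colliding pair can be brute-forced in seconds. Because the store dedupes by hash alone, the second blob is silently dropped and a later "checkout" returns the wrong content.

#!/bin/bash
# Toy content-addressed store that dedupes purely by hash, using a
# deliberately weak "hash" (first 4 hex digits of MD5) so collisions are
# cheap to find. Illustration only; not how Fossil is implemented.
set -e
store=$(mktemp -d)
blobdir=$(mktemp -d)

weak_hash() { md5sum "$1" | cut -c1-4; }

# commit_blob LABEL FILE: store FILE under its weak hash, dedup by hash only.
commit_blob() {
  local key; key=$(weak_hash "$2")
  if [ -e "$store/$key" ]; then
    echo "dedup: $1 treated as already present under $key (content NOT compared)"
  else
    cp "$2" "$store/$key"
    echo "stored: $1 under $key"
  fi
}

# Brute-force two different blobs with the same weak hash (birthday search).
declare -A seen
for i in $(seq 1 2000); do
  echo "blob number $i $RANDOM" > "$blobdir/$i"
  k=$(weak_hash "$blobdir/$i")
  if [ -n "${seen[$k]}" ]; then a=${seen[$k]}; b=$i; break; fi
  seen[$k]=$i
done

commit_blob attacker "$blobdir/$a"   # attacker wins the race
commit_blob victim   "$blobdir/$b"   # silently deduped against the attacker's blob

# "Checkout" of the victim's artifact returns the attacker's content:
cmp "$blobdir/$b" "$store/$(weak_hash "$blobdir/$b")" \
  || echo "checkout differs from what was committed"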
Stephan Beal
2015-01-13 10:44:47 UTC
Permalink
Post by Kelly Dean
Post by Stephan Beal
(at least not in the context of fossil).
If you can find a preimage, then you certainly can do the attack I
described.
As my little brother once told me: 'if' is a pretty big word, considering
how small it is.
--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
"Freedom is sloppy. But since tyranny's the only guaranteed byproduct of
those who insist on a perfect world, freedom will have to do." -- Bigby Wolf
Kelly Dean
2015-01-13 01:46:05 UTC
Permalink
Post by Joerg Sonnenberger
I have no idea who Valerie Hansen is or why I should care about him.
As Richard Hipp pointed out, I misspelled her name. It was ‟Henson” (now ‟Aurora”).
(Head hung in shame.)

Anyway I was being facetious with the ‟because Valerie said so” remark.
Post by Joerg Sonnenberger
Comparing by content is generally not an option because the other
content is not available.
When you do a check-in, the content is available.

If you have files foo and bar with different content but the same hash (i.e. there's a hash collision), and you commit foo, then bar, and dedup by hash (which Fossil does, just like Git), then do a checkout and try to verify by checking the hashes, you'll be fooled into thinking nothing is wrong, even though you've accidentally recorded the content of foo for bar.

But if you verify by comparing to the original data (which you just tried to commit), you'll discover that the checkout has the wrong content for bar; you won't be fooled.
Richard Hipp
2015-01-13 02:23:39 UTC
Permalink
Post by Kelly Dean
If you have files foo and bar with different content but the same hash (i.e.
there's a hash collision), and you commit foo, then bar, and dedup by hash
(which Fossil does, just like Git), then do a checkout and try to verify by
checking the hashes, you'll be fooled into thinking nothing is wrong, even
though you've accidentally recorded the content of foo for bar.
Homework problem: Suppose 100 developers each make 100 file changes
every day on the same fossil repository. So 10,000 new hashes enter
the repository every day. (A) Compute the expected time until the
first SHA1 hash collision is encountered. (B) Rewrite the answer to
the previous question as a fraction or multiple of the estimated
current age of the universe. (C) Assuming each file is 100 bytes in
size (after delta encoding and compression), what is the expected size of
the repository database when the first SHA1 hash collision is
encountered.
--
D. Richard Hipp
***@sqlite.org
Kelly Dean
2015-01-13 03:37:28 UTC
Permalink
Post by Richard Hipp
Homework problem: Suppose 100 developers each make 100 file changes
every day on the same fossil repository. So 10,000 new hashes enter
the repository every day. (A) Compute the expected time until the
first SHA1 hash collision is encountered. (B) Rewrite the answer to
the previous question as a fraction or multiple of the estimated
current age of the universe. (C) Assuming each file is 100 bytes in
size (after delta encoding and compression), what is the expected size of
the repository database when the first SHA1 hash collision is
encountered.
A. (2^(160/2) hashes) / (2^13.3 hashes/day) → 2^66.7 days.
B. (13.3·10^9 years per eon) / (2^66.7 days) → 2^33 eons.
C. (2^(160/2) hashes) * (100 bytes/hash) → 2^86.6 bytes (i.e. roughly 100 yottabytes).

All of which is irrelevant. I'm talking about accidentally being fooled by an attacker, not accidentally colliding with a benign file.
Kelly Dean
2015-01-13 22:00:06 UTC
Permalink
Post by Kelly Dean
B. (13.3·10^9 years per eon) / (2^66.7 days) → 2^33 eons.
Er, that's 2^24.6 eons. (Forgot to divide by days/year.)

Unless you're a young-Earth creationist, in which case it's 2^45.6 eons.
Joerg Sonnenberger
2015-01-13 10:21:17 UTC
Permalink
Post by Kelly Dean
If you have files foo and bar with different content but the same hash
(i.e. there's a hash collision), and you commit foo, then bar, and
dedup by hash (which Fossil does, just like Git), then do a checkout
and try to verify by checking the hashes, you'll be fooled into
thinking nothing is wrong, even though you've accidentally recorded the
content of foo for bar.
What is the threat model again? That I pull from random malicious
people? I don't do that. I would strongly recommend against doing that.

Your scenario just doesn't make sense. The much more interesting case is
getting wrong data sent from a 3rd party matching an existing (signed)
manifest. Comparing content is not an option for that as you don't have
the original content.

Joerg
Ron W
2015-01-13 16:32:21 UTC
Permalink
Post by Joerg Sonnenberger
The much more interesting case is
getting wrong data sent from a 3rd party matching an existing (signed)
manifest. Comparing content is not an option for that as you don't have
the original content.
As I understand the Fossil sync protocol, if the remote side offers
something whose ID matches the ID of something the local Fossil already
has, the local Fossil will decline it.
Joerg Sonnenberger
2015-01-13 17:24:36 UTC
Permalink
Post by Ron W
Post by Joerg Sonnenberger
The much more interesting case is
getting wrong data sent from a 3rd party matching an existing (signed)
manifest. Comparing content is not an option for that as you don't have
the original content.
As I understand the Fossil sync protocol, if the remote side offers
something whose ID matches the ID of something the local Fossil already
has, the local Fossil will decline it.
Not exactly what I meant. Consider I run a mirror of sqlite.org and can
create a second preimage for one of the blobs. If you have signed
manifests, it should not be a problem if you pulled the data from me.

Joerg
Kelly Dean
2015-01-13 21:09:54 UTC
Permalink
Post by Ron W
As I understand the Fossil sync protocol, if the remote side offers
something whose ID matches the ID of something the local Fossil already
has, the local Fossil will decline it.
Which means if the attacker wins the race to offer you an artifact, then your local Fossil will decline the legitimate offerer's artifact.

Of course, over the network, transmitting the content of the entire repository during every sync isn't feasible, so all you can do is extend UUIDs with a new secure hash function when the current function is broken.

But when committing a new file to a local repository, compare-by-content _is_ practical. I'm throwing peanuts at Fossil's failure to even provide an option to do that. The user can do it manually: commit, then checkout into another working directory whatever he just committed, then compare with the original, then delete the temporary checkout (or raise an alarm if there's a mismatch). But this isn't something the user should have to do manually.
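For what it's worth, that manual procedure is easy to wrap in a script today. A minimal sketch (it assumes an open checkout in the current directory and the usual behavior of "fossil info", "fossil open" and "fossil ls"):

#!/bin/bash
# Sketch of the manual check described above: commit, check the same
# revision out into a scratch directory, and compare every managed file
# by content against the working copy it was committed from.
set -e
fossil commit -m "commit to be verified by content"

repo=$(fossil info | awk '/^repository:/ {print $2}')
rev=$(fossil info | awk '/^checkout:/ {print $2}')

tmp=$(mktemp -d)
( cd "$tmp" && fossil open "$repo" "$rev" )

status=0
while IFS= read -r f; do
  cmp -s "$f" "$tmp/$f" || { echo "ALARM: content mismatch: $f" >&2; status=1; }
done < <(fossil ls)

rm -rf "$tmp"
exit $status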
Ron W
2015-01-13 21:43:31 UTC
Permalink
Post by Kelly Dean
But when committing a new file to a local repository, compare-by-content
_is_ practical. I'm throwing peanuts at Fossil's failure to even provide an
option to do that. The user can do it manually: commit, then checkout into
another working directory whatever he just committed, then compare with the
original, then delete the temporary checkout (or raise an alarm if there's
a mismatch). But this isn't something the user should have to do manually.
So, you are asking for an option to enable an automatic, post-commit
"fossil diff $@" (where $@ is the list of files just committed)?
Kelly Dean
2015-01-13 23:24:49 UTC
Permalink
Post by Ron W
So, you are asking for an option to enable an automatic, post-commit
Yes. And it needs to compare the contents, not just compare hashes as a shortcut.

For a simple implementation, it can just unconditionally compare that whole list of committed files. But more efficiently, it can just compare the ones for which a hash match was found during the commit; ones for which no match was found don't need to be compared, since Fossil wrote new data to the repository rather than deduping against existing data.

That means the amount of data written to the repository, plus the amount read from the repository, is always equal to the amount of data in the commit, regardless of how much of that commit data is already in the repository. In contrast, the simple implementation would unnecessarily double the total amount of data transferred, if it was all-new data (none of it already in the repository).

BTW, for a separate issue, you might want to use the simple implementation anyway, if you're concerned that your storage device might not have written the data properly. In this case, you have to commit, then purge the files from your RAM cache, and purge the data from the storage device's cache (and hope it's honest about this), then read the data back to do the comparison.
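On Linux, that simple read-back variant might look something like this (a sketch; it needs root for the cache drop, and nothing in it can defeat a drive that lies about flushing its own cache):

#!/bin/bash
# Sketch: after committing, flush and drop the RAM cache, then re-read
# every committed file's content from the repository on disk and compare
# it byte-for-byte against the original working file.
set -e
sync                                                   # push dirty pages to the device
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null  # purge the page cache

while IFS= read -r f; do
  fossil cat "$f" | cmp -s - "$f" || echo "ALARM: read-back mismatch: $f" >&2
done < <(fossil ls)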

And if the storage device might lie about the cache flushing (in order to fraudulently win benchmarks), then you would have to power cycle it before reading back the data, and you'd have to short its Vcc to ground to discharge its overly-large bypass capacitors so it couldn't avoid your power-cycling, and you'd have to check to make sure it had no diodes or other isolation features to avoid your shorting, and... ;-)

This isn't just paranoia. Some devices do lie, as Jeff Bonwick (ZFS author) found.

Well ok, the part about resistance to power-cycling is just paranoia.

But the efficient comparison method is ok if you trust your storage device.
Ron W
2015-01-14 17:33:26 UTC
Permalink
Post by Kelly Dean
Post by Ron W
So, you are asking for an option to enable an automatic, post-commit
Yes. And it needs to compare the contents, not just compare hashes as a shortcut.
"fossil diff" does content compare.

Right now, what you want can be accomplished via the following shell script:

#!/bin/bash
# Keep only the file names; "fossil changes" prefixes each line with its
# status word (EDITED, ADDED, ...), which must not be passed to commit/diff.
files=`fossil changes | egrep 'EDITED|ADDED' | awk '{print $2}'`
fossil commit $files
fossil diff $files

Obviously, this is not as efficient as being built-in to Fossil, but does
serve as an example for your requested functionality.

Note that the fossil, egrep and awk commands are enclosed in ` (back tick)
while the egrep pattern and awk program are enclosed in ' (single quote).

For more usage info on "fossil diff", type "fossil help diff" (there is
also "fossil gdiff" if you prefer to use a GUI diff tool).
Andy Bradford
2015-01-15 20:37:24 UTC
Permalink
Yes. And it needs to compare the contents, not just compare hashes as
a shortcut.
Given Fossil's distributed design, I don't think it is always possible
to compare contents, at least not on the client. For example:

A clones from S
B clones from S
B begins a commit, autosync pull kicks in and pulls all new content from S
current behavior: merge conflict checking happens
new behavior: compare-by-content check happens
all looks good, allow commit

However, just after B's autosync pull completed:

A commits a nefarious collision with autosync disabled
A immediately pushes to S

Now S has the bogus data that just happens to hash to the same value as the
content that B intended to commit. But B has already committed, and is now
moving on to the push:

B autosync push to S begins
S receives content and discards because it already has hash
S might fork if the commits were made against the same node

It seems to me that checking for the collision in the client on B won't
really be reliable because there is always a race condition between
pushers. But that would mean the server S would have to do the checking, and
what would it do if it detects a single collision in one of the artifacts
after the client has already successfully synchronized 99% of the check-in,
which is now committed to the database on S? Should it store the data in a
special collisions table and
send back a message to the client notifying it that there is a
collision? Or just discard the data and send back a message to the
client?

Or would the sync protocol have to be modified so that the client, on
push, receives a copy of any existing artifacts that it is intending to
push so that it can compare the contents prior to pushing?

Andy
--
TAI64 timestamp: 4000000054b82526
Andy Bradford
2015-01-15 20:46:30 UTC
Permalink
Or would the sync protocol have to be modified so that the client, on
push, receives a copy of any existing artifacts that it is intending
to push so that it can compare the contents prior to pushing?
Not this. This is already what autosync does, so we're just starting
down the long winding road of infinite regress.

Andy
--
TAI64 timestamp: 4000000054b82749
Ron W
2015-01-15 21:00:43 UTC
Permalink
Post by Andy Bradford
Or would the sync protocol have to be modified so that the client, on
push, receives a copy of any existing artifacts that it is intending
to push so that it can compare the contents prior to pushing?
Not this. This is already what autosync does, so we're just starting
down the long winding road of infinite regress.
Actually, the client would receive any siblings of the new artifacts it
is intending to push. Of course, if any of the hashes of the siblings
matched any of the hashes of the new artifacts, the client could detect the
collision.

However, there would still be a window for a malicious push between the
auto-sync pull and the auto-sync push. (Perhaps this window could be
narrowed by an option to send igot cards for all the new artifacts, then if
the client doesn't receive gimme cards for all the new artifacts, it would
know there was a collision.)

Then, after the auto-sync push, the client could do another pull,
requesting the artifacts it just pushed, then comparing them.

Kelly Dean
2015-01-13 21:08:58 UTC
Permalink
Post by Joerg Sonnenberger
Post by Kelly Dean
If you have files foo and bar with different content but the same hash
(i.e. there's a hash collision), and you commit foo, then bar, and
dedup by hash (which Fossil does, just like Git), then do a checkout
and try to verify by checking the hashes, you'll be fooled into
thinking nothing is wrong, even though you've accidentally recorded the
content of foo for bar.
What is the threat model again? That I pull from random malicious
people? I don't do that. I would strongly recommend against doing that.
The threat model is that you're a forensic investigator, your job is to deal with digital evidence from white-collar criminal suspects, and you have to record that evidence securely and prove its integrity in court. If it's possible for the attack I described to happen (even if it hasn't happened), then there's doubt about the integrity of the evidence, so the evidence gets thrown out, and you're responsible for letting suspects go free on a technicality.

But if the system you use for secure storage compares by content, not by hash, when you commit new evidence, then the attack can't happen, even if the hash function is broken.

You could say, well, don't use Fossil for storing evidence. But why not? An option like ZFS's dedup=verify would solve the problem. Without it, Fossil unnecessarily relies on the hash function being secure.
Joerg Sonnenberger
2015-01-13 21:35:29 UTC
Permalink
Post by Kelly Dean
Post by Joerg Sonnenberger
Post by Kelly Dean
If you have files foo and bar with different content but the same hash
(i.e. there's a hash collision), and you commit foo, then bar, and
dedup by hash (which Fossil does, just like Git), then do a checkout
and try to verify by checking the hashes, you'll be fooled into
thinking nothing is wrong, even though you've accidentally recorded the
content of foo for bar.
What is the thread model again? That I pull from random malocious
people? I don't do that. I would strongly recomment against doing that.
The threat model is that you're a forensic investigator, your job is to
deal with digital evidence from white-collar criminal suspects, and you
have to record that evidence securely and prove its integrity in court.
If it's possible for the attack I described to happen (even if it
hasn't happened), then there's doubt about the integrity of the
evidence, so the evidence gets thrown out, and you're responsible for
letting suspects go free on a technicality.
Bullshit. Cryptographic hash functions are designed to be used as a
replacement for identity checks. Either the hash is broken AND you have
allowed untrusted third parties to write to your repository or not.
In your scenario, you have already lost at the point where you allow
untrusted third parties near the repository, so the strength of the
hash function is moot.
Post by Kelly Dean
But if the system you use for secure storage compares by content, not
by hash, when you commit new evidence, then the attack can't happen,
even if the hash function is broken.
Have you even tried to read what I said? Comparing by content is
generally impossible because you do not have two copies of the content
at hand. No one wants to transfer every blob all the time to ensure that
the other side really has the content it is talking about. That makes no
sense. Digital signatures, even those considered legally binding, are
nothing more than a cryptographic hash over the data, signed using some
additional magic. No one applies RSA or ECDSA to documents larger than a
couple of bytes. It doesn't make things more secure.

The paper you referenced is misguided FUD. It shows a fundamental
misunderstanding of the components involved.

Joerg
Kelly Dean
2015-01-13 22:46:27 UTC
Permalink
Post by Joerg Sonnenberger
Either the hash is broken AND you have
allowed untrusted third parties to write to your repository or not.
In your scenario, you have already lost at the point where you allow
untrusted third parties near the repository, so the strength of the
hash function is moot.
I'm not sure I follow your argument here. By ‟the point where you allow untrusted third parties near the repository”, do you mean the point where you commit suspects' (untrusted, obviously) evidence?

Even if the hash is broken and you commit evidence, you still have not lost, _if_ you compare by content. If you only compare by hash, then you have lost.
Post by Joerg Sonnenberger
Comparing by content is
generally impossible because you do not have two copies of the content
at hand.
When you're committing new files to a local repository, obviously you do have both the current data (in the repository) and the new files that you need to ensure don't collide with the current data. In this case, comparing by content to guarantee there's no collision is practical.
Post by Joerg Sonnenberger
No one wants to transfer every blob all the time to ensure that
the other side really has the content it is talking about.
I already wrote in this thread, ⌜Of course, over the network, transmitting the content of the entire repository during every sync isn't feasible, so all you can do is extend UUIDs with a new secure hash function when the current function is broken.⌝
Post by Joerg Sonnenberger
Digital signatures, even those considered legally binding, are
nothing more than a cryptographic hash over the data, signed using some
additional magic.
I realize this. It's why when the hash is broken, signatures using it will no longer be secure. But I'm not talking about using compare-by-content when the hash is known to be secure. I'm talking about using compare-by-content in cases where the hash is broken, or too close to being broken to warrant trust. Sha1, which Fossil uses, is in that category for many people.
Post by Joerg Sonnenberger
The paper you referenced is misguided FUD. It shows a fundamental
misunderstanding of the components involved.
I referenced a person, not a paper. Richard Hipp referenced her old 2003 paper, which she has since corrected (in 2009, IIRC) with a new paper, acknowledging and fixing the problems that everybody pointed out in her original paper.
Joerg Sonnenberger
2015-01-14 10:17:37 UTC
Permalink
Post by Kelly Dean
When you're committing new files to a local repository, obviously you
do have both the current data (in the repository) and the new files
that you need to ensure don't collide with the current data. In this
case, comparing by content to guarantee there's no collision is practical.
If you don't pull from malicious users, it doesn't happen. If you do,
you already have bigger problems. What you want just makes things slower
for no good reason.

Joerg
Kelly Dean
2015-01-14 23:57:57 UTC
Permalink
Post by Joerg Sonnenberger
If you don't pull from malicious users, it doesn't happen. If you do,
you already have bigger problems. What you want just makes things slower
for no good reason.
You're right, if you don't commit malicious people's data, then there's no problem. But in that case, the use of sha1 (or any other cryptographic hash) makes things _far_ slower for no good reason. The performance penalty of sha1 dwarfs the penalty of compare-by-content, so it doesn't make sense to object to the latter when the former is the real performance problem.

Instead of sha1, use something like a 160-bit version of xxhash, which is 10-20 times faster than a secure hash, and has no more risk of collision than does the latter, assuming you don't commit malicious people's data.
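The speed gap is easy to measure. A quick sketch, assuming the xxhsum utility from the xxHash project is installed (there is no 160-bit xxhash today, so the 64-bit variant stands in purely to show the throughput difference):

#!/bin/sh
# Rough throughput comparison: SHA-1 vs XXH64 over a 512 MB random file.
dd if=/dev/urandom of=sample.bin bs=1M count=512 2>/dev/null
time sha1sum sample.bin
time xxhsum -H1 sample.bin   # -H1 selects the 64-bit XXH64 variant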

Fossil's use of sha1 gives you the worst of both cases:
An extremely slow hash, far slower than necessary, if you trust the data.
A hash that's too close to being broken to warrant trust (in the judgement of cryptographers and of people who rely on them for guidance), if you don't trust the data.

I understand that Fossil, like Git, standardized on sha1 when it was still secure. But that was a decade ago (give or take). If the choice of secure hash function is supposed to be part of Fossil's enduring format (rather than be a parameter of that format), then the intended endurance is quite likely impossible, and fails at least in the case of sha1. But if only a _fast_ hash is supposed to be part of the enduring format, then endurance is possible. If Fossil is only intended for storing data you trust, then you might as well use a fast hash.
Andy Bradford
2015-01-15 00:40:25 UTC
Permalink
Post by Kelly Dean
Instead of sha1, use something like a 160-bit version of xxhash, which
is 10-20 times faster than a secure hash, and has no more risk of
collision than does the latter, assuming you don't commit malicious
people's data.
The use of SHA1 is detailed here:

http://www.fossil-scm.org/index.html/doc/trunk/www/concepts.wiki

A particular version of a particular file is called an
"artifact". Each artifact has a universally unique name which
is the SHA1 hash of the content of that file expressed as 40
characters of lower-case hexadecimal. Such a hash is referred
to as the Artifact Identifier or Artifact ID for the artifact.
The SHA1 algorithm is created with the purpose of providing a
highly forgery-resistant identifier for a file. Given any file
it is simple to find the artifact ID for that file. But given a
artifact ID it is computationally intractable to generate a
file that will have that Artifact ID.

So, a highly forgery-resistant identifier for a file is one of the goals and
reasons for using SHA1.

There are other uses listed here:

http://www.fossil-scm.org/index.html/doc/trunk/www/tech_overview.wiki

Other references include:

Each artifact is named by its SHA1 hash and is thus immutable.

Each artifact in the repository is named by its SHA1 hash. No
prefixes or meta information is added to an artifact before its
hash is computed. The name of an artifact in the repository is
exactly the same SHA1 hash that is computed by sha1sum on the
file as it exists in your source tree.

It is theoretically possible for two artifacts with different
content to share the same hash. But finding two such artifacts
is so incredibly difficult and unlikely that we consider it to
be an impossibility.
Post by Kelly Dean
...
A hash that's too close to being broken to warrant trust (in the
judgement of cryptographers and of people who rely on them for
guidance), if you don't trust the data.
Is the following relevant:

What are the odds that in any group of developers, that the one who is
malicious will be able to generate content that matches the hash of
unwritten content and is able to commit it to the repository before the
second can commit the as yet undiscovered legitimate content, thus
rendering it impossible to add the legitimate content due to the
immutability of the SHA1 identifier for said content? And not only that,
but that there exists only one arrangement that can ever make up the
legitimate content (e.g. it is impossible to add another space to the
content to make it have a different SHA1 hash) and it has the same hash
as the now immutable artifact?

Security is often about risk evaluation.

What are the risks of a collision? Perhaps one developer, after spending
hours on a solution, will have made a commit without realizing that the
SHA1 for the content he is committing already exists and that Fossil
will therefore not really write his changes to the repository and they
will disappear?

Without looking at the source, I'm not even sure what Fossil would
actually do in this case (hard to verify without any known SHA1
collision examples to work with, but can be done by analyzing the code).
Would the original file still exist in the work area of the Fossil
repository where it was committed? Presumably the developer would
discover this in communication with others who expected to see the
content and still be able to recover the content.

Interesting discussion, thanks.

Andy
--
TAI64 timestamp: 4000000054b70c9b
Richard Hipp
2015-01-15 00:50:20 UTC
Permalink
Post by Andy Bradford
What are the risks of a collision? Perhaps one developer, after spending
hours on a solution, will have made a commit without realizing that the
SHA1 for the content he is committing already exists and that Fossil
will therefore not really write his changes to the repository and they
will disappear?
I believe that is exactly what happens. The first artifact with a
particular hash to get into the repo wins, and all other artifacts
with the same hash are ignored.

Of course, the probability of hitting a collision accidentally is too
small to comprehend. For all practical purposes, a hash collision is
impossible and the scenario you describe will never happen. So you do
not need to fear losing work. You are much more likely to be struck
by a giant meteor in the millisecond before you press the Enter key
than you are to encounter a hash collision.
--
D. Richard Hipp
***@sqlite.org
Kelly Dean
2015-01-15 10:59:16 UTC
Permalink
Post by Andy Bradford
Post by Kelly Dean
Instead of sha1, use something like a 160-bit version of xxhash, which
is 10-20 times faster than a secure hash, and has no more risk of
collision than does the latter, assuming you don't commit malicious
people's data.
[snip]
Post by Andy Bradford
So, a highly forgery-resistant identifier for a file is one of the goals and
reasons for using SHA1.
Yes, I understand that. I was responding to Joerg's comment:
‟If you don't pull from malicious users, it doesn't happen. If you do, you already have bigger problems.”

My point remains: Fossil unnecessarily uses an extremely slow hash if you're expected to trust the data, and uses one that's too close to being broken to warrant trust if you don't trust the data. Joerg seemed to be suggesting not to use Fossil if you don't trust the data, but you're suggesting (and I agree) Fossil should be usable even if you don't trust the data.

If Fossil is to be usable for untrusted data, and therefore must use a secure hash function, it makes sense to use that function even for trusted data, for the sake of simplicity. If Fossil is only to be usable for trusted data, then a cryptographic hash is wasteful.
Post by Andy Bradford
What are the odds that in any group of developers, that the one who is
malicious will be able to generate content that matches the hash of
unwritten content and is able to commit it to the repository before the
second can commit the as yet undiscovered legitimate content
The scenario I gave is you're committing evidence that you already have (not ‟undiscovered/unwritten” content). And the risk isn't that some particular person has the capability; the risk is that the capability being within the state of the art, and you failing to defend against it, means you let criminal suspects go free on a technicality. To defend against it, you have to both compare by content, and secure the repository against tampering (because just securely recording the hashes doesn't suffice in this case). If anybody does have the capability, and you fail to defend against it, then he can tamper with it (if you don't secure the repository) or cause you to unwittingly fail to record it (if you don't compare by content).

And I'm talking about the technical issue, not about certifications.
Post by Andy Bradford
And not only that,
but that there exists only one arrangement that can ever make up the
legitimate content (e.g. it is impossible to add another space to the
content to make it have a different SHA1 hash) and it has the same hash
as the now immutable artifact?
Irrelevant. In order to know to change the content to avoid the collision, you have to know that there was a collision in the first place. Being able or unable to make that change doesn't give you that knowledge.
Post by Andy Bradford
Security is often about risk evaluation.
Of course. When Git and Fossil were created, sha1 was secure enough for cryptographers to recommend it. It no longer is. If Fossil's enduring format specifies a particular hash function for security, then that endurance is liable to fail, and already fails if that function is sha1. By failing to make the hash function a parameter, Fossil imposes a decade-old risk evaluation on its users, rather than enabling them to do their own evaluation and switch to sha2 or sha3 or some other hash that's still secure enough for cryptographers to recommend it.

The only options Fossil gives in this case are to not use Fossil, or use a custom, incompatible derivative of Fossil, or do compare-by-content. The last option is most practical for now.

Of course, making the hash function a parameter makes things more complicated, because your UUIDs have to change (or at least be extended), and in general all historical uses of them (e.g. in manifests) have to be securely timestamped (using a current still-secure hash) to enable detection of forgery of historical records. People who already had (from before the hash function was broken) secure repositories of those records don't need those secure timestamps, but everybody else does.
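Purely to illustrate what "extended" identifiers might look like (this is not anything Fossil does today), an artifact could be named by more than one algorithm, with the algorithm spelled out, so that a break in one function can't silently alias two different artifacts:

#!/bin/sh
# Illustration only, not Fossil behavior: an algorithm-prefixed, extended
# artifact name built from two independent hash functions.
f=artifact.bin   # hypothetical file name
old=$(sha1sum   "$f" | awk '{print $1}')
new=$(sha256sum "$f" | awk '{print $1}')
echo "extended id: sha1:$old+sha256:$new"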

It would be nice to have a hash function that will be secure forever. Or at least for a few decades. It would make things as simple as Fossil pretends they are. But our history casts doubt that we have such a function, and even if we do have one, sha1 isn't it.
Post by Andy Bradford
What are the risks of a collision?
I'm talking about attacks, not colliding with a benign file.
Post by Andy Bradford
the probability of hitting a collision accidentally is too
small to comprehend. For all practical purposes, a hash collision is
impossible and the scenario you describe will never happen. So you do
not need to fear losing work.
Again, I'm talking about attacks, not colliding with a benign file, as I explained when I did the unnecessary homework problem.
Ron W
2015-01-15 17:59:03 UTC
Permalink
Post by Kelly Dean
My point remains: Fossil unnecessarily uses an extremely slow hash if
you're expected to trust the data, and uses one that's too close to being
broken to warrant trust if you don't trust the data.
Even on 10-year-old hardware, Fossil's performance is tolerable. On 5-year-old
hardware, it is acceptable. (Of course, if you, for whatever reason, switch
from very recent hardware to, even, 5 year old hardware, it's going to seem
very slow, but that is the case for many applications.)
Post by Kelly Dean
Joerg seemed to be suggesting not to use Fossil if you don't trust the
data, but you're suggesting (and I agree) Fossil should be usable even if
you don't trust the data.
I don't know what "threat model" Fossil might have been designed to resist.
"very forgery resistant" doesn't tell me much.

I think Joerg's (and others') point about untrusted was more about the
trustworthiness of contributors with push privilege to a project's
repository. No matter how secure the hash function is, the trustworthiness
of contributors has to be adequately evaluated before granting them push
privilege.

As for resistance against attackers, there is a balance dependent upon the
value of a repository's content. If that is forensic evidence from a
criminal investigation, then it might be highly valuable. Likewise,
classified documents.

I don't think that kind of content was in mind when Fossil was designed.

[...]
Post by Kelly Dean
The only options Fossil gives in this case are to not use Fossil, or use a
custom, incompatible derivative of Fossil, or do compare-by-content. The
last option is most practical for now.
It is possible to enhance Fossil to do an automatic, post-commit content
compare between the local repository and the local file system. I have also
shown a very basic "wrapper" for Fossil that achieves this (in one of my
previous posts).

Comparing against a remote server is more involved: Request a tar (or zip)
file of the commit just made, unpackage it, then diff it against the
originating workspace. Alternately, individual files can be requested, the
compared. This, too, could be automated by a wrapper around Fossil. An
enhanced Fossil could, of course, do this more efficiently.
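A rough sketch of the zip-based variant, as a wrapper script (the /zip URL syntax has varied between Fossil versions, and "https://example.org/myproject" stands in for the real server, so treat the details as assumptions):

#!/bin/bash
# Sketch: fetch the just-committed revision from the remote server as a
# zip archive, unpack it, and diff it against the local workspace.
set -e
remote=https://example.org/myproject             # hypothetical server URL
rev=$(fossil info | awk '/^checkout:/ {print $2}')

tmp=$(mktemp -d)
curl -sf -o "$tmp/ci.zip" "$remote/zip/ci.zip?uuid=$rev"
unzip -q "$tmp/ci.zip" -d "$tmp"

# The archive unpacks into a single top-level directory; diff it against
# the workspace (untracked local files will show up as extra entries).
top=$(find "$tmp" -mindepth 1 -maxdepth 1 -type d | head -n1)
if diff -r --exclude=_FOSSIL_ --exclude=.fslckout "$top" . ; then
  echo "OK: remote copy matches the local workspace"
else
  echo "MISMATCH between remote copy and local workspace" >&2
fi
rm -rf "$tmp"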
Post by Kelly Dean
Of course, making the hash function a parameter makes things more
complicated, because your UUIDs have to change (or at least be extended),
and in general all historical uses of them (e.g. in manifests) have to be
securely timestamped (using a current still-secure hash) to enable
detection of forgery of historical records.
The least impractical way to change existing hashes would be to dump the
repository, then replay the commits into a new repository using the old
hashes as tags on the new commits.

Applying the new hash as a tag to the existing commits would be less work,
but the details would have to be carefully evaluated and planned. But...

Switching hash algorithms within an existing repository would introduce a
lot of extra logic to distinguish and process "old" and "new" commits.

From a practical standpoint, existing repositories would probably continue
to use the old hash.

In theory, supporting new hash algorithms could be handled by incrementing
the "clone protocol" version to allow support of a new attribute that
specifies the hash algorithm used. New "clients" attempting to talk to old
servers could fall back to an earlier protocol version and select SHA1 to
use with that repository.

Ultimately, for any given tool, there is a risk vs cost analysis. For the
projects I work on, Fossil meets my needs. If I were going to store
documents or data that could incur a significant liability, I would use a
tool from a respected vendor so I could demonstrate due diligence in a
legal proceeding. (Note: For all I know, Fossil could be better than the
best tool from the most respected vendor, but it's far easier for opposing
counsel to shoot down a "free" tool than one purchased from a respected
vendor.)
Ron W
2015-01-13 22:03:30 UTC
Permalink
Post by Kelly Dean
You could say, well, don't use Fossil for storing evidence. But why not?
An option like ZFS's dedup=verify would solve the problem. Without it,
Fossil unnecessarily relies on the hash function being secure.
Even if a post-commit "fossil diff $@" is done (on the "local" Fossil
repo), updates to the "remote" Fossil repo would still only have a hash for
integrity checking as it is not practical to send copies of the original
files (or even "reconstituted" copies).

Also, FWIW, Fossil is not FIPS certified for storing forensic (or other
classified) data. Nor would it be practical to get Fossil FIPS certified.
We would need a large donation to pay the cost of the audit. Also, I
suspect Fossil's source code is not compliant with FIPS coding standards,
so might have to be rewritten. (And not just Fossil, but SQLite, as well.)
Baruch Burstein
2015-01-14 09:03:07 UTC
Permalink
Post by Kelly Dean
You could say, well, don't use Fossil for storing evidence. But why not?
Why not? As Richard recently stated in the floss-weekly interview (and
elsewhere): Fossil was meant for controlling sqlite source. If anyone finds
it useful for any other use that is just a bonus. So while many people find
it useful for managing source code of other projects, the likelihood of it
being the right tool for the job for something like storing evidence is
small.
--
˙uʍop-ǝpısdn sı ɹoʇıuoɯ ɹnoʎ 'sıɥʇ pɐǝɹ uɐɔ noʎ ɟı