Discussion:
Hashing
Travis Griggs
2012-01-28 09:59:43 UTC
Permalink
All of this talk about hashing, again. :) A while ago a posted what I
thought was a nice solution to this whole problem. I got no response on
it, so I was never sure whether it was deemed a good thing, a weird
thing (green plane idea), or a bad thing.

The idea was to use the same pattern that SortedCollection uses by
giving Set a two block instance variables, one a two argument block to
compare two objects by (equivalenceBlock) and the other a single
argument block (hashBlock) used to compute the hash value of an object.
I never toyed with using symbol to send via perform: and perform:with:,
though that might be faster?

I'm tired of seeing the classical "I forgot to implement hash when I
implemented equality" bug. The Smalltalk library for some reason or
another has interned this notion of hash into Object itself. I don't
understand why that is. I can think of plenty of applications where
hashing an object would never be necessary. IOW, IMO this is not an
essential Object behavior. What really bugs me about the current state
of affairs is the violation of encapsulation. Hashing is a function used
by Sets (and their various subclasses). It has nothing to do with what
the object does itself. How to hash an Object should be bound to a Set.
And not necessarily a subclass type of Set, but the instance itself. If
we used blocks, then we wouldn't have to have IdentitySet and Set (along
with the various subclasses of each). You would just create a Set
initialized for either equality or identity. And you could do other
things too. When I did the implementation, I ran a number of tests to
see what the performance hit was. For a set of random integers, the
speed hit was between 10-20%. For a random set of floats between 0.0 and
1.0 though, I could get as much as 40% speed increase by using a hash
function that multiplied by 1000 and truncated.

Over the last couple of months there have been discussions on how to
hash Floats, Points, and now Collections. Every time we argue the same
sorts of things. Somebody comes up with a nifty way of hashing for their
domain usage of said type of object and proposes that we impelement hash
as such. Someone responds with a domain where proposed hash is not very
good or just plain falls apart. Why does every object type have to have
just one way of hashing. Which object type will it be next?

I have no problem with implementing a hash method in an object for
convenience's sake, but I think we should generalize the Set's
themselves a bit further and put this thing to rest once and for all.
Then you could document the invariant between hash and equality in the
Set's documentation, which is where it belongs since that's who it
affects.

Another thought along this line is that quite often, one's Sets are
typed. In other words I create a Set of Integers, or Numbers, or
Collections, or even Objects. One could add class side accessors that
would return various hash/equality algorithms for that kind of object.

Thoughts?

--
Travis Griggs
Key Technology
***@keyww.com
To Smalltalk! - and Beyond!
Jecel Mattos de Assumpcao Jr.
2012-01-28 10:13:23 UTC
Permalink
If saving memory isn't a priority, there is always the Self way of
hashing. Dedicate some 12 to 16 bits in the header of each object
and fill it in from some counter or random number generator when
the object is created and never change it after that.

Of course, given Squeak's compact headers I don't think this is
the way to go here.

-- Jecel
Dan Ingalls
2012-01-28 10:15:16 UTC
Permalink
Jecel -
Post by Jecel Mattos de Assumpcao Jr.
If saving memory isn't a priority, there is always the Self way of
hashing. Dedicate some 12 to 16 bits in the header of each object
and fill it in from some counter or random number generator when
the object is created and never change it after that.
Of course, given Squeak's compact headers I don't think this is
the way to go here.
FYI, this is EXACTLY what Squeak does. Every object has 12 bits of hash, and we manage to squeeze them into the 1-word headers along with everything else!

- Dan
Florin Mateoc
2012-01-28 10:17:54 UTC
Permalink
Post by Travis Griggs
The idea was to use the same pattern that
<underline><color><param>ffff,0000,0000</param>SortedCollection</color></underline>
uses by
Post by Travis Griggs
giving Set a two block instance variables, one a two argument block to
compare two objects by (<underline><color><param>ffff,0000,0000</param>equivalenceBlock</color></underline>) and the other a single
argument block (<underline><color><param>ffff,0000,0000</param>hashBlock</color></underline>) used to compute the hash value of an object.
.......................................... Hashing is a function used
by Sets (and their various subclasses). It has nothing to do with what
the object does itself. How to hash an Object should be bound to a Set.
And not necessarily a subclass type of Set, but the instance itself. If
we used blocks, then we wouldn't have to have <underline><color><param>ffff,0000,0000</param>IdentitySet</color></underline> and Set (along
with the various subclasses of each). You would just create a Set
initialized for either equality or identity. And you could do other
things too.
For example I wanted in an application to redefine both #= for my persistent

objects (meaning their non-key variables were equal) and #== (meaning

their key variables were equal). I could not actually re-implement #==

(<underline><color><param>ffff,0000,0000</param>Enfin</color></underline> does not even let you try), and the reason why a new method #equals:

was not enough was precisely that I wanted to use the existent Set subclasses

as containers, without re-implementing them all as <underline><color><param>ffff,0000,0000</param>EqualsSet</color></underline>...
Post by Travis Griggs
I have no problem with implementing a hash method in an object for
convenience's sake, but I think we should generalize the Set's
themselves a bit further and put this thing to rest once and for all.
I think it's a great idea.


Cheers,


Florin


</x-rich>
Phil Weichert
2012-01-28 10:24:36 UTC
Permalink
Travis,
Sigh! I argee that continue to talk about hashing or lack thereof get
tiresome. In most cases, that I have sent all that one needs to do is use
the IdentityDictionary. If what ones really means by equals(=) is the same
contents, then subclass the desired type of collection and reimplement equal
to test the contents.
Phil
Dana Anthony
2012-01-28 10:26:53 UTC
Permalink
Post by Travis Griggs
All of this talk about hashing, again. :) A while ago a posted what I
thought was a nice solution to this whole problem. I got no response on
it, so I was never sure whether it was deemed a good thing, a weird
thing (green plane idea), or a bad thing.
I'm still so unclear on the plane color issue that I have no clue.
Post by Travis Griggs
The idea was to use the same pattern that SortedCollection uses by
giving Set a two block instance variables, one a two argument block to
compare two objects by (equivalenceBlock) and the other a single
argument block (hashBlock) used to compute the hash value of an object.
I never toyed with using symbol to send via perform: and perform:with:,
though that might be faster?
This is the first I have seen of this idea but your explanation makes
sense now that I have read it. It's true, the Set is like a
Sorted Collection, and these blocks are exactly parallel to the
sort block. The only reason I can see of not doing it this way
is that sets are used in Smalltalk implementation itself. But
for that reason they could easily have a distinct Set implementation
that was different with a common superclass. (the way they have
Symbol different from String basically, or CompiledMethod as a
kind of Collection...)

Like SortedCollection, the default should be doing it the way the
object does in its = and hash methods. So the default blocks
should be: [ :x :y | x = y ] and [ :x | x hash ].
But you could supply any block you wanted. Well, I'm all for it.
Post by Travis Griggs
--
Travis Griggs
Key Technology
--
Dana LG Anthony
Internet Publications Technology at SAS Institute
***@unx.sas.com
Tim Olson
2012-01-28 10:28:10 UTC
Permalink
Post by Travis Griggs
All of this talk about hashing, again. :) A while ago a posted what I
thought was a nice solution to this whole problem. I got no response on
it, so I was never sure whether it was deemed a good thing, a weird
thing (green plane idea), or a bad thing.
Hi, Travis: I remember your original post -- I saved it away and went
back to it when the hashing stuff came up again. I agree with it; I
don't know why the hashing stuff (an implementation detail of Set keys)
has become so ingrained in the image such that every object must always
have a hash value. I like the idea of computing and attaching a hash
value to an object only when it is required, and like your idea of
pushing the hash computation out to the application itself.



-- tim
jecel at lsi.usp.br ()
2012-01-28 10:32:28 UTC
Permalink
Post by Dan Ingalls
FYI, this is EXACTLY what Squeak does. Every object has
12 bits of hash, and we manage to squeeze
them into the 1-word headers along with everything else!
Right, I should have checked. The 12 bits are taken from
a 16 bit pseudo random number generator. Self uses 22 bits
for the same thing.

So sets already have a good hash. The problem is that it
doesn't match the #= method. I guess I wasn't paying
enough attention to what the real issue was.

-- Jecel
Florin Mateoc
2012-01-28 10:48:41 UTC
Permalink
<x-rich>I think the idea can be generalized:


Testing for inclusion is part of the basic collection semantics that I

think was not completely exploited (i.e. was not made general enough),

leading to complications in the Collection class hierarchy.


We could add an instance variable called comparator to the base
Collection

class. This would be an instance of a Comparator class.


Comparator is an abstract class defining two methods: #is:eq: and

#is:notEq:, but not providing any implementation.


Comparator has three system defined subclasses:


<underline><color><param>ffff,0000,0000</param>EqualComparator</color></underline>
providing the implementations:

is: <underline><color><param>ffff,0000,0000</param>leftObject</color></underline> eq: <underline><color><param>ffff,0000,0000</param>rightObject

</color></underline> <underline><color><param>ffff,0000,0000</param>^leftObject</color></underline> = <underline><color><param>ffff,0000,0000</param>rightObject

</color></underline> and

is: <underline><color><param>ffff,0000,0000</param>leftObject</color></underline> notEq: <underline><color><param>ffff,0000,0000</param>rightObject

</color></underline> <underline><color><param>ffff,0000,0000</param>^leftObject</color></underline> ~= <underline><color><param>ffff,0000,0000</param>rightObject

</color></underline>

<underline><color><param>ffff,0000,0000</param>IdentityComparator</color></underline>...

and

<underline><color><param>ffff,0000,0000</param>ArithmeticComparator</color></underline> which is an abstract class again, adding two

method definitions: #is:lte: and #is:lt:


Two concrete subclasses would be:

<underline><color><param>ffff,0000,0000</param>HashComparator</color></underline> with:

is: <underline><color><param>ffff,0000,0000</param>leftObject</color></underline> eq: <underline><color><param>ffff,0000,0000</param>rightObject

</color></underline> <underline><color><param>ffff,0000,0000</param>^leftObject</color></underline> hash == <underline><color><param>ffff,0000,0000</param>rightObject</color></underline> hash

...

<underline><color><param>ffff,0000,0000</param>BasicHashComparator

</color></underline> .<underline><color><param>ffff,0000,0000</param>.with</color></underline> the obvious implementations



While this adds a level of indirection, it also adds more flexibility.

The above mentioned Comparator classes would provide convenient defaults,

while letting the developer "<underline><color><param>ffff,0000,0000</param>parametrize</color></underline>" the Collection classes with her

own user-defined Comparator classes.

Also, by putting under the same roof the implementations of #is:lt: and

#is:eq: (which need to be consistent) one can efficiently test for inclusion

in sorted collections by taking advantage of the sorting, and while keeping

things general (You cannot do this with a single <underline><color><param>ffff,0000,0000</param>compareBlock</color></underline>).


Any thoughts ?



Florin


</x-rich>
Pat Caudill
2012-01-28 10:58:35 UTC
Permalink
Post by Jecel Mattos de Assumpcao Jr.
If saving memory isn't a priority, there is always the Self way of
hashing. Dedicate some 12 to 16 bits in the header of each object
and fill it in from some counter or random number generator when
the object is created and never change it after that.
Of course, given Squeak's compact headers I don't think this is
the way to go here.
You don't have to do it that way. You have only 2 bits for a hash. They
start off in state "no hash". As soon as you take the (larger hash) of
the object you put the hash in a look aside table and mark the object as
"hash taken". As soon as the grabage collector gets room (moves the
object, or deallocates the object behind it) the has is moved to the end
of the object and make the tags "hash attached".

Alas this wont keep the invariant that objects that compare equal hash
equal if the object hash depends on the contents and the object is
modified after the hash is first taken.


Pat Caudill
Maurice Rabb
2012-01-28 10:55:12 UTC
Permalink
David--

I am out of town until Wed. I have done a lot of work on hashing
(including 30bit) since I wrote you. I would like to share it with you
when I get back. How was the BOF at OOPSLA?

--Maurice

---------------------------------------------------------------------------
Maurice Rabb 773.281.6003 Stono Technologies, LLC Chicago, USA
Loading...