Joseph Malkevitch: Hashing Tidbit

Hashing (1/02/2002)

Prepared by:

Joseph Malkevitch
Mathematics and Computing Department
York College (CUNY)
Jamaica, NY 11451

Email: malkevitch@york.cuny.edu (for additions, suggestions, and corrections)

The idea behind check digit systems is related to a fundamental idea in computer science, hashing. The idea in hashing is to take input information, say a string, and based on this information create an "identification string" for the original in such a way that if possible two input strings which are different have two different "identification strings." Ideally, the identification string should be "smaller" than original.

Example:

Suppose one has the group of students whose names are:

Alexander Goldovsky

Joseph Roe

Roger Rabbit

Robert Edge

Alan England

The number of characters in these names respectively, 18, 9, 11, 10, 12.

If we need to store names, we can do it using the alphabetic characters or under the code which gives the number of letters in the name. The names requires 60 storage cells but the lengths of the names can be stored in 9 cells. The problem is that as the list of names grows, we will have "collisions" in that people with different names will have the same lengths.

A slightly better system would be to represent names by pairs of digits, which are the length of their first and last names; here the representation would be:

(9,9) (6,3) (6,6) (6,4) (4,7)

This system will distinguish between John Doe and Don Hurt where the first one did not. However, it pays the price that it uses more symbols than the original.

Another hashing idea would be to store the name under the letter pair corresponding to the first and last letters of the persons name.

Thus, Roger Rabbit becomes RR and Roger Edge becomes RE. A collision would occur if the name Rina Rhino occurred since she has the same initials as Roger Rabbit.

In a hashing system, as opposed to code number systems, when a collision occurs there is a "costly" way to deal with the situation but since collisions are rare, the system is "cheaper" to implement even with a few costly collisions than it would be to not use the hashing approach.

In a check digit system, many different strings will have the same check digit but we know that when two strings have different check digits that the strings can not be the same. However, no attempt is made to avoid collisions.

References:

Cormen, T. and C. Leiserson, R. Rivest, The Design and Analysis of Computer Algorithms, 2nd edition, MIT Press, Cambridge, 2000.

Gonnet, G., Expected length of the longest probe sequence in hash code searching, J. ACM, 31 (1984) 289-304.

Knuth, D. and O. Amble, Ordered hash tables, The Computer Journal 17 (1974) 135-142.

Knuth, D., The Art of Computer Programming, Sorting and Seaching, Addison-Wesley, Reading, 2nd. ed., 1975.

Rosen, K., (ed.), Handbook of Discrete and Combinatorial Mathematis, CRC, Boca Raton, 2000.

Back to list of tidbits