Generating variations of checksum functions to minimize collisions
My goal is to efficiently perform an exhaustive diff of a directory tree.
Given a set of files F, I must compute the equivalence classes EQV = [F©û,
F©ü, F©ý... Fₙ] such that fⱼ, fₖ ¡ô EQV[i] iff
(fⱼ is identical to fₖ) for all i, j, k.
My general approach is to start with just one big class containing all
initial files, EQV_0=[[f1,f2,...f_n]], and repeatedly split it into more
refined equivalence classes EQV©û, EQV©ü... EQV_{m-1}, based on some
heuristics, for example, file size, checksum value. After all m heuristics
have been applied (EQV_{m-1}), a pairwise diff of all the files within
each class in EQV_{m-1} must be made. Because this last step is quadratic
for each of the classes in EQV_{m-1}, ie
O(sum(n©÷ for n in map(len, EQV_{m-1})) )
and will probably be the bottleneck of my algorithm if each the m splits
are done in linear time, my goal is to make EQV_{m-1} as flat as possible.
I would like to have access to a variety of good hash functions that I can
apply to minimize collisions on EQV_{m-1}. My current idea is to use some
library provided checksum function, such as adler, and to generate
variations of it by simply applying it to different starting bytes within
the file. Another one is to first apply fast hash functions, such as
adler, and then more expensive ones such as md5 on only the classes which
are still too large.
Considering that I can compute all the hashes for a given file in just one
read of that file, how could I compute a variety of hashes that will help
me discriminate among non-identical files?
Alternatively, what is a good list of hash functions available in python
that aren't cryptographically secure?
No comments:
Post a Comment