Multimedia:File fingerprint

From Wikimedia Usability Initiative

MediaWiki currently uses a SHA-1 hash to characterize the content of an uploaded file. Such a hash is, for practical purposes, unique to each file, which makes it possible to identify duplicates. However, this feature only catches exact, byte-for-byte duplicates; it cannot identify similar or derivative files.
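The limitation can be illustrated with a minimal Python sketch. It is not MediaWiki's own code: the helper name `sha1_hex` and the sample byte strings are assumptions made for illustration, and only the standard `hashlib` module is used.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of the given bytes as a hex string."""
    return hashlib.sha1(data).hexdigest()


# Stand-in for the bytes of an uploaded file (illustrative data only).
original = bytes(range(256)) * 4

# An exact copy has the same bytes, so it gets the same hash.
exact_copy = bytes(original)

# A "derivative" that differs in a single bit gets an unrelated hash.
altered = bytearray(original)
altered[0] ^= 1
altered = bytes(altered)

same_hash = sha1_hex(original) == sha1_hex(exact_copy)
different_hash = sha1_hex(original) != sha1_hex(altered)
```

Because the two digests of near-identical inputs share nothing in common, comparing hashes says nothing about how similar two files are; it only says whether they are identical.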

Some applications use identifiers such as "fingerprints" or "signatures", built on image identification techniques such as the Haar wavelet transform, to find and track similar pictures. For example, digiKam uses a "lengthy number using a special technique (Haar algorithm) that make it possible to compare images by comparing this calculated signature. The less numerical difference there is between any two image signatures, the more they resemble each other."[1]
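To make the idea of a comparable signature concrete, here is a simplified Python sketch of a wavelet-based signature. It is not digiKam's implementation: it assumes a square grayscale image whose side is a power of two, uses an unnormalized Haar decomposition, and scores two images by counting shared coefficients without any weighting. All function names and the sample images are hypothetical.

```python
def haar_1d(values):
    """Full 1D Haar decomposition by repeated averaging and differencing.

    The length of `values` must be a power of two.
    """
    out = list(values)
    n = len(out)
    while n > 1:
        half = n // 2
        avgs = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        diffs = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = avgs + diffs
        n = half
    return out


def haar_2d(image):
    """Standard 2D decomposition: transform every row, then every column."""
    rows = [haar_1d(r) for r in image]
    cols = [haar_1d(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def signature(image, keep=8):
    """Return (average intensity, set of (y, x, is_positive) entries).

    Only the positions and signs of the `keep` largest-magnitude
    non-zero coefficients are kept; their exact values are discarded.
    """
    coeffs = haar_2d(image)
    size = len(coeffs)
    ranked = [
        (abs(coeffs[y][x]), y, x, coeffs[y][x] > 0)
        for y in range(size)
        for x in range(size)
        if (y, x) != (0, 0) and coeffs[y][x] != 0
    ]
    ranked.sort(key=lambda t: (-t[0], t[1], t[2]))
    return coeffs[0][0], {(y, x, positive) for _, y, x, positive in ranked[:keep]}


def shared_coefficients(sig_a, sig_b):
    """Count signature entries that match in both position and sign."""
    return len(sig_a[1] & sig_b[1])


# Illustrative 8x8 images: a horizontal gradient, a brighter copy of it,
# and a vertical gradient.
horizontal = [[x * 16 for x in range(8)] for _ in range(8)]
brighter = [[v + 10 for v in row] for row in horizontal]
vertical = [[y * 16 for _ in range(8)] for y in range(8)]

score_brighter = shared_coefficients(signature(horizontal), signature(brighter))
score_vertical = shared_coefficients(signature(horizontal), signature(vertical))
```

In this sketch the brighter copy shares every retained coefficient with the original (only the average intensity changes), while the vertical gradient shares none, so a higher count indicates a closer match.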

A similar feature for MediaWiki (probably as an extension) would greatly benefit Wikimedia Commons by providing the ability to:

  • identify derivative works (e.g. an original image and a cropped version)
  • identify similar pictures (e.g. different pictures of the same object, specifying a threshold of similarity)
  • provide the basis for powerful search features, including "fuzzy search" (e.g. searching from a freehand color sketch)
  • facilitate the identification of objects and thus help with the classification of multimedia assets
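As a rough illustration of a similarity threshold, the following Python sketch looks up files whose signature is close enough to a query. It assumes each file's signature has already been reduced to a set of features and uses Jaccard similarity as the measure; both choices are assumptions for the example, and the file names, integer features and function names are made-up placeholders.

```python
def jaccard(a, b):
    """Similarity of two feature sets: size of overlap over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_similar(query, index, threshold=0.5):
    """Return (score, name) pairs at or above the threshold, best first."""
    scored = [(jaccard(query, sig), name) for name, sig in index.items()]
    return sorted(((s, n) for s, n in scored if s >= threshold), reverse=True)


# Hypothetical signature index: file name -> set of signature features.
index = {
    "Original.jpg": frozenset({1, 2, 3, 4, 5, 6}),
    "Cropped.jpg": frozenset({1, 2, 3, 4, 9}),
    "Unrelated.jpg": frozenset({20, 21, 22}),
}

matches = find_similar(index["Original.jpg"], index, threshold=0.5)
matched_names = [name for _, name in matches]
```

Raising the threshold narrows the results toward near-duplicates, while lowering it admits looser matches such as different pictures of the same object.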

digiKam is released under the GPL. It offers a working implementation of image fingerprints in C++/Qt[2] based on a research article by Jacobs et al.[3]. It also provides a fuzzy search feature[4].

Notes and references

  1. Using digiKam − Fuzzy Searches/Duplicates, digiKam documentation.
  2. Digikam::Haar Namespace Reference, KDE 4.5 API Reference.
  3. Charles E. Jacobs, Adam Finkelstein, David H. Salesin. Fast Multiresolution Image Querying. Proceedings of SIGGRAPH'95, in Computer Graphics Proceedings, Annual Conference Series, pages 277-286, August 1995.
  4. digiKam Fuzzy Search Tools Under Construction, digiKam blog, June 2008.