Wikidata:Property proposal/Imagehash perceptual hash


Imagehash perceptual hash

Originally proposed at Wikidata:Property proposal/Commons

   Done: pHash checksum (P9310) (Talk and documentation)
Description: Imagehash perceptual hash is a perceptual hash that indicates whether two images look nearly identical.
Represents: ImageHash perceptual hash (Q104884110)
Data type: String
Domain: mediainfo (Commons only)
Allowed values: [a-z\d]{16}
Example 1: M68454019 → 878ed95a53a065e5
Example 2: M68456558 → 8589da64f0b9599e
Example 3: M68455617 → 86c2b43c5f956e32
Example 4: M68456184 → c0d09f3524e5ef19
Source:
Planned use: First I would populate hash values for photos uploaded by user:FinnaUploadBot, but in general the hash could be added to all Commons files.
Number of IDs in source: There are currently 68M files on Commons, and a checksum can be calculated for all photos.
Expected completeness: eventually complete (Q21873974)
Robot and gadget jobs: The checksum should be generated by a bot.
See also
  • checksum (P4092): small-sized datum derived from a block of digital data for the purpose of detecting errors. Use qualifier "determination method" (P459) to indicate how it's calculated, e.g. MD5.
  • Wikidata:Property proposal/fingerprint
Motivation

    I am using the pHash checksums for detecting duplicate photos on Commons. I am also using pHashes to confirm whether a photo on Commons and the one in the Finna repository are the same. However, it would be useful if the hashes could be shared so that any user could query them. Pre-generated perceptual hashes of files could also be fetched from SDC as lists, without needing to download the actual files. Zache (talk) 20:15, 22 January 2021 (UTC)

    Also note that there are different implementations of the pHash algorithm which generate different hashes (examples: https://phash.org or https://github.com/KilianB/JImageHash). I renamed the proposal so that it refers to the Imagehash version. --Zache (talk) 04:22, 23 January 2021 (UTC)
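
The duplicate check described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposer's actual code: it assumes hashes are stored as 16-character lowercase hexadecimal strings (64 bits), as in the examples, and compares them by Hamming distance. The function names and the 4-bit threshold are hypothetical choices made for this sketch.

```python
import re

# The proposal's allowed-values pattern is [a-z\d]{16}. All listed examples
# are hexadecimal, so this sketch assumes the stricter hex form.
HEX_HASH = re.compile(r"[0-9a-f]{16}")


def is_valid_hash(value: str) -> bool:
    """Return True if value is a 16-character lowercase hex string."""
    return HEX_HASH.fullmatch(value) is not None


def hamming_distance(a: str, b: str) -> int:
    """Count the bits that differ between two 64-bit hex hashes."""
    if not (is_valid_hash(a) and is_valid_hash(b)):
        raise ValueError("expected two 16-character lowercase hex strings")
    return bin(int(a, 16) ^ int(b, 16)).count("1")


def looks_similar(a: str, b: str, max_distance: int = 4) -> bool:
    """Treat two hashes as near-duplicates if few bits differ.

    The default threshold is an assumption, not a value from the proposal.
    """
    return hamming_distance(a, b) <= max_distance
```

A distance of 0 means the two hash values are identical; how large a distance still counts as "nearly identical" would need to be tuned against real files.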

Discussion

    It is likely that there are multiple files on Commons with the same hash value, since the hash is designed to match on similarity of content. After that, it is up to the Commons community to decide what to do with the duplicate images, but afaik there can be legitimate reasons to have multiple versions of the same image. --Zache (talk) 16:32, 5 February 2021 (UTC)
    Interesting. Let's stay with string datatype then. I'm curious how many there will be. In any case, using P4092 (as suggested above) wouldn't simplify it, as one would need to check the determination method in addition. --- Jura 07:13, 6 February 2021 (UTC)
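
The point that several files may share one hash value can be illustrated by grouping file IDs by hash. This is a minimal sketch: `group_by_hash` is a hypothetical helper, and the entry `M00000000` is an invented ID added only so that the sample data contains a collision. The other two entries are taken from the examples in the proposal.

```python
from collections import defaultdict


def group_by_hash(file_hashes):
    """Map each hash value to the file IDs that share it.

    Only hashes carried by more than one file are returned.
    """
    groups = defaultdict(list)
    for media_id, hash_value in file_hashes.items():
        groups[hash_value].append(media_id)
    return {h: sorted(ids) for h, ids in groups.items() if len(ids) > 1}


file_hashes = {
    "M68454019": "878ed95a53a065e5",  # Example 1 from the proposal
    "M68456558": "8589da64f0b9599e",  # Example 2 from the proposal
    "M00000000": "878ed95a53a065e5",  # hypothetical ID, for illustration only
}

duplicates = group_by_hash(file_hashes)
for hash_value, media_ids in duplicates.items():
    print(hash_value, media_ids)
```

Each resulting group is only a list of candidates for review, since files with the same hash may still be legitimate separate versions of an image.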