Class SimilarityIndex

  • public class SimilarityIndex
    extends Object
    Index structure of lines/blocks in one file.

    This structure can be used to compute an approximation of the similarity between two files. The index is used by SimilarityRenameDetector to compute scores between files.

    To save memory, this index uses a space-efficient encoding that will not exceed 1 MiB per instance. The index starts out at a smaller size (closer to 2 KiB), but may grow as more distinct blocks within the scanned file are discovered.

    • Method Detail

      • score

        public int score(SimilarityIndex dst,
                         int maxScore)
        Compute the similarity score between this index and another.

        A region of a file is defined as a line in a text file or a fixed-size block in a binary file. To prepare an index, each region in the file is hashed; the values and counts of hashes are retained in a sorted table. Define the similarity fraction F as the count of matching regions between the two files divided by the maximum count of regions in either file. The similarity score is F multiplied by maxScore, yielding a value in the range [0, maxScore]. It is defined as maxScore for the degenerate case of two empty files.

        The similarity score is symmetrical; i.e. a.score(b, maxScore) == b.score(a, maxScore).
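
        The definition above can be illustrated with a minimal, self-contained sketch. It is not the SimilarityIndex implementation: the class RegionScoreSketch and its methods are hypothetical names, regions are modeled as plain strings held in a HashMap rather than as hashes in a sorted table, and "matching regions" is assumed to mean the per-region minimum of the two occurrence counts.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the scoring formula; not part of the documented API.
final class RegionScoreSketch {

    // Count how often each region (for example, a line of text) occurs.
    static Map<String, Integer> countRegions(List<String> regions) {
        Map<String, Integer> counts = new HashMap<>();
        for (String region : regions) {
            counts.merge(region, 1, Integer::sum);
        }
        return counts;
    }

    // Score = (matching regions / max region count of either file) * maxScore.
    static int score(List<String> src, List<String> dst, int maxScore) {
        int max = Math.max(src.size(), dst.size());
        if (max == 0) {
            // Degenerate case: two empty files are treated as a full match.
            return maxScore;
        }
        Map<String, Integer> a = countRegions(src);
        Map<String, Integer> b = countRegions(dst);
        long common = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                // Assumption: a region matches as many times as it occurs in both files.
                common += Math.min(e.getValue(), other);
            }
        }
        return (int) (common * maxScore / max);
    }

    public static void main(String[] args) {
        List<String> a = List.of("x", "y", "z", "w");
        List<String> b = List.of("x", "y", "q");
        System.out.println(score(a, b, 100));
        System.out.println(score(b, a, 100));
    }
}
```

        In this sketch the two inputs share two regions and the larger input has four, so F is 2/4 and both calls yield 50 for a maxScore of 100. Swapping the arguments changes neither the matching count nor the maximum, which is why the result is symmetrical.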

        Parameters:
        dst - the other index
        maxScore - the score representing a 100% match
        Returns:
        the similarity score