solonovamax
09/11/2023, 5:22 PM| Name | Distance | Similarity | Normalized | Metric | Memory cost | Execution cost | Typical usage |
|-------------------------------------------------|:--------:|:----------:|:----------:|:------:|----------------------|-------------------------------------|-----------------|
| Levenshtein | ☒ | ☐ | ☐ | ☒ | \\(O(m \\times n)\\) | \\(O(m \\times n)\\) <sup>1</sup> | |
| Damerau-Levenshtein<sup>3</sup> | ☒ | ☐ | ☐ | ☒ | \\(O(m \\times n)\\) | \\(O(m \\times n)\\) <sup>1</sup> | |
| Optimal String Alignment<sup>3</sup> | ☒ | ☐ | ☐ | ☒ | \\(O(m \\times n)\\) | \\(O(m \\times n)\\) <sup>1</sup> | |
| Longest Common Subsequence | ☒ | ☐ | ☐ | ☒ | \\(O(m \\times n)\\) | \\(O(m \\times n)\\) <sup>1,2</sup> | diff, git |
| Normalized Levenshtein | ☒ | ☒ | ☒ | ☒ | \\(O(m \\times n)\\) | \\(O(m \\times n)\\) <sup>1</sup> | |
| Normalized Damerau-Levenshtein<sup>3</sup> | ☒ | ☐ | ☒ | ☒ | \\(O(m \\times n)\\) | \\(O(m \\times n)\\) <sup>1</sup> | |
| Normalized Optimal String Alignment<sup>3</sup> | ☒ | ☐ | ☒ | ☒ | \\(O(m \\times n)\\) | \\(O(m \\times n)\\) <sup>1</sup> | |
| Normalized Longest Common Subsequence | ☒ | ☐ | ☒ | ☒ | \\(O(m \\times n)\\) | \\(O(m \\times n)\\) <sup>1,2</sup> | |
| Cosine similarity | ☒ | ☒ | ☒ | ☐ | \\(O(m + n)\\) | \\(O(m + n)\\) | |
| Jaccard index | ☒ | ☒ | ☒ | ☒ | \\(O(m + n)\\) | \\(O(m + n)\\) | |
| Jaro-Winkler | ☒ | ☒ | ☒ | ☐ | \\(O(m + n)\\) | \\(O(m \\times n)\\) | typo correction |
| N-Gram | ☒ | ☐ | ☒ | ☐ | | \\(O(m \\times n)\\) | |
| Q-Gram | ☒ | ☐ | ☐ | ☐ | | \\(O(m + n)\\) | |
| Ratcliff-Obershelp | ☒ | ☒ | ☒ | ☐ | \\(O(m + n)\\) | \\(O(n^3)\\) | |
| Sorensen-Dice coefficient | ☒ | ☒ | ☒ | ☐ | | \\(O(m + n)\\) | |
| Sift 4 | ☒ | ☐ | ☐ | ☐ | \\(O(m + n)\\) | \\(O(m + n)\\) | |
this has to be updated by-hand whenever a new algorithm is added.
I was wondering if there would be a good way that I could have a plugin that goes through all the classes that implement a specific interface/interfaces (in this case StringDistance
and StringSimilarity
and record them, then aggregate all the information in the index html file.
If yes, what would be the best places to hook into for this? (I'm assuming that things like the memory cost/execution cost can be added with a custom tag, eg. @memoryComplexity
and @timeComplexity
, for example:
/**
* Sift4 distance, as defined by
* Manda, Costin, [Siderite]. "Super Fast and Accurate String Distance Algorithm: Sift4." Siderite’s Blog,
* 10 Nov. 2014, <https://siderite.dev/blog/super-fast-and-accurate-string-distance.html>.
*
* Note: this algorithm is asymmetric. This means that
* \(distance(X, Y) \not\equiv distance(Y, X)\).
* This is one of the artifacts of the linear nature of the algorithm.
*
* @memoryComplexity O(m + n)
* @timeComplexity O(m + n)
* @author Thibault Debatty, solonovamax
*/
@ExperimentalStringMeasure
public class Sift4(
/**
* Set the maximum distance to search for character transposition.
* Compute cost of algorithm is `O(n * maxOffset)`
*/
private val maxOffset: Int = DEFAULT_MAX_OFFSET,
) : StringDistance {
glureau
09/12/2023, 7:22 AM