Finding duplicates.

The final part of the second phase of my GSoC project deals with using NLP to find duplicate issues within a repository. To be honest, I’m just a learner and not very experienced with this stuff. The algorithm I came up with is far from perfect, but it might be a good basis for a better GitMate in the near future.

Our requirements

  • Clustering issues based on duplicates (a rough sketch of the intended output shape follows this list).
  • Uses the issue title, description and labels.
  • The algorithm has to scale to thousands of issues; comparing pairwise is not an option.
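
To make the goal concrete, here is a minimal sketch of the shape the output could take: a mapping from each issue number to the set of issues it likely duplicates. The `Issue` structure and the `find_duplicates` signature below are illustrative assumptions, not GitMate’s actual data model or API.

# Illustrative only: the Issue structure and find_duplicates signature are
# assumptions for this post, not GitMate's actual data model or API.
from typing import Dict, List, NamedTuple, Set

class Issue(NamedTuple):
    number: int
    title: str
    description: str
    labels: List[str]

def find_duplicates(issues: List[Issue]) -> Dict[int, Set[int]]:
    """Map every issue number to the numbers of its likely duplicates,
    e.g. {123: {456}, 456: {123}, 789: set()}."""
    raise NotImplementedError  # filled in by one of the approaches below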


Tried approaches

1) Using word vectors

Construct word vectors over an n-dimensional vector space and group issues based on a similarity threshold.

>>> import spacy
>>> from time import time
>>> nlp = spacy.load('en_core_web_md')  # any English model that ships word vectors
>>> def similarity(s1, s2):
...     start = time()
...     x, y = nlp(s1), nlp(s2)
...     r = x.similarity(y)
...     end = time()
...     print("time elapsed = " + str(end - start))
...     print("similarity ratio: " + str(r))

>>> similarity('Disable cache for tests', 'Use caching in production')

time elapsed = 0.002780914306640625
similarity ratio: 0.490852554719

>>> similarity('Disable cache for tests', 'Use cache in production')

time elapsed = 0.0010590553283691406
similarity ratio: 0.724472864444

Drawbacks
As you can see above, merely changing caching to cache shifts the score considerably. Also, some repositories use an issue body template; in those cases the shared template text dominates the comparison and the title has little effect on the score.

Besides, calculating the similarity for every pair of issues is not feasible performance-wise, as the rough estimate below shows. What else could work?
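
A quick back-of-the-envelope estimate, using the per-pair timings above and the issue count from the benchmark table below:

# Rough estimate of an all-pairs pass with the spaCy similarity above.
n = 2323                       # issue count from the TF-IDF benchmark below
pairs = n * (n - 1) // 2       # 2,697,003 unique pairs
per_pair = 0.001               # ~1 ms per comparison, as timed above
print(pairs * per_pair / 60)   # ≈ 45 minutes for a single repository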

2) TF-IDF Vectorising


Construct a TF-IDF (Term Frequency–Inverse Document Frequency) matrix from the raw issue text and learn the vocabulary from the training set with a **fit** followed by a **transform** on a `TfidfVectorizer`.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_similarity(tfidf):
    # Pairwise similarity between all issue vectors; cosine similarity is
    # assumed here, the actual helper may differ.
    return cosine_similarity(tfidf)

threshold = 0.8  # chosen after several observations
vect = TfidfVectorizer(min_df=0, ngram_range=(1, 5))
tfidf = vect.fit_transform(issue_texts)
matrix = tfidf_similarity(tfidf)
# Map issue_numbers with their respective scores
results = [zip(issue_numbers, scores) for scores in matrix]
# Keep only the scores above the threshold
results = [dict(filter(lambda r: r[1] >= threshold, row))
           for row in results]
# Map (key=issue_number, value=result)
results = dict(zip(issue_numbers, results))
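
The snippet above assumes `issue_texts` and `issue_numbers` already exist. Here is a minimal sketch of how they could be assembled from the fields listed in the requirements (title, description and labels); the `issues` iterable and its attribute names are assumptions, not GitMate’s actual data model:

# Hypothetical preprocessing: concatenate the textual fields of every issue
# into one document per issue. Attribute names are illustrative only.
issue_numbers = [issue.number for issue in issues]
issue_texts = [
    ' '.join([issue.title, issue.description] + list(issue.labels))
    for issue in issues
]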

| Threshold | Time (in sec) | Issue Count | Duplicates |
| --- | --- | --- | --- |
| 0.80 | 5.794461250305176 | 2323 | 44 |
| 0.85 | 5.536595821380615 | 2323 | 44 |
| 0.90 | 6.009936809539795 | 2323 | 6 |

Drawbacks

This approach, while being a rock-solid base for duplicate identification, has a critical drawback in repositories like [coala/coala](https://github.com/coala/coala) where issues follow a body template: the shared template text inflates the description similarity and issues get falsely marked as duplicates, producing unnecessary noise. Raising the threshold to compensate is more likely to spoil the purpose than to help.

3) Training a model

Using an ML-based algorithm, e.g. https://github.com/Glavin001/IssueBot, would be the most ideal solution to the problem. But it is a long way off and isn’t achievable in a short timeframe. In the long run, a good ML algorithm should be used for identifying duplicate issues.