Posted by David Barts
- Results will come in faster (up to an hour faster on small crawls and literally days faster on larger crawls)
- More accurate duplicate removal, resulting in fewer duplicates in your crawl results
This post provides a high-level look at the motivations behind our decision to change the way our custom crawl detects duplicate and near-duplicate web pages. Enjoy!
Improving our page similarity measurement
The problem: avoiding false duplicates
- The two pages are not actually duplicates or near-duplicates,
- The current fingerprints heuristic correctly views them as different, but
- The simhash heuristic incorrectly views them as similar.
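To make that failure mode concrete, here is a minimal sketch of how a simhash-style similarity check works. The function names and the threshold value are illustrative assumptions, not our production code; the point is that any pair whose simhashes differ by no more than some threshold gets flagged as similar, whether or not the pages really are near-duplicates.

```python
# Illustrative sketch only -- names and threshold are assumptions, not the production code.

def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two 64-bit simhash values."""
    return bin(a ^ b).count("1")

def simhash_says_similar(simhash_a: int, simhash_b: int, threshold: int = 3) -> bool:
    """Flag a pair as near-duplicates when their simhashes differ by at
    most `threshold` bits (a value between 0 and 64)."""
    return hamming_distance(simhash_a, simhash_b) <= threshold
```

A false duplicate is simply a pair of genuinely different pages whose simhashes happen to land within that threshold.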
The solution: visualizing the data
- Sample about 10 million pairs of pages from about 25 crawls selected at random.
- For each pair of pages sampled, plot their difference as measured by the legacy fingerprints heuristic on the horizontal axis (0 to 128), and their difference as measured by simhash on the vertical axis (0 to 64).
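Here is a rough sketch of that plotting step, assuming the per-pair differences have already been computed. With roughly 10 million points, a 2D histogram shows the density structure more clearly than individual dots; the matplotlib usage below is for illustration only, not the exact tooling behind the original plot.

```python
# Sketch of the visualization, assuming the per-pair differences are precomputed.
import matplotlib.pyplot as plt

def plot_pair_differences(fingerprint_diffs, simhash_diffs):
    """fingerprint_diffs: legacy fingerprints difference per pair (0..128)
    simhash_diffs: simhash Hamming distance per pair (0..64)"""
    plt.hist2d(fingerprint_diffs, simhash_diffs,
               bins=(129, 65), range=[[0, 128], [0, 64]], cmap="viridis")
    plt.xlabel("Fingerprints difference (0-128)")
    plt.ylabel("Simhash difference (0-64)")
    plt.colorbar(label="Number of page pairs")
    plt.show()
```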
Picking a threshold
The visible results
- We may still miss some near-duplicates. As with the current heuristic, only a subset of the near-duplicate pages is reported.
- Completely identical pages will still be reported. Two pages that are completely identical have the same simhash value, and thus a difference of zero as measured by the simhash heuristic, so they will always fall under whatever threshold we pick.
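That last point follows directly from how simhash is computed. Below is a toy simhash implementation (the tokenization, weighting, and hash function are simplified assumptions, not the production version) showing that identical text always yields identical simhash values, and therefore a difference of zero.

```python
# Toy simhash -- tokenization, weighting, and hashing are simplified assumptions.
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Fold per-token hashes into a single `bits`-wide fingerprint."""
    counts = [0] * bits
    for token in text.split():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def simhash_difference(a: int, b: int) -> int:
    """Hamming distance between two simhash values (0..64)."""
    return bin(a ^ b).count("1")

page = "two byte-for-byte identical pages"
assert simhash_difference(simhash(page), simhash(page)) == 0
```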