I recently needed to find all the broken links on a website. Using a little
Scrapy, I was able to crawl the whole site quickly.
You'll need to have Scrapy installed in order to run the following code. It was tested on
Scrapy 1.1.2 and Python 2.7.10, but newer versions should work as well.
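If you don't already have Scrapy, it typically installs with pip (assuming a working Python environment):

pip install scrapy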
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item, Field


class LinkItem(Item):
    url = Field()
    referer = Field()
    status = Field()


class LinkSpider(CrawlSpider):
    name = "linkSpider"

    # Filter out other sites. No need to dig into outside websites and check their links.
    allowed_domains = ["matthewhoelter.com"]

    # String together multiple domains if needed with a comma (,)
    # i.e. ['https://www.matthewhoelter.com', 'https://blog.matthewhoelter.com']
    start_urls = ['https://www.matthewhoelter.com']

    # Let 404 responses reach the spider instead of being filtered out.
    handle_httpstatus_list = [404]

    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        # Only record pages that came back as broken links.
        if response.status == 404:
            link = LinkItem()
            link['url'] = response.url
            link['status'] = response.status
            # The Referer header tells us which page linked to the broken URL.
            link['referer'] = response.request.headers.get('Referer')
            return link
Then run it via:
scrapy runspider 404_scraper.py -o output.json
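The crawl writes one JSON entry per broken link, using the url, status, and referer fields defined on LinkItem above. If you want a quick summary without digging through the raw JSON, a few lines of Python will do. This is just a sketch: the field names match the LinkItem above and the file name matches the -o flag.

import json

# Load the items the spider exported.
with open('output.json') as f:
    broken_links = json.load(f)

# Print each broken URL along with the page that linked to it.
for link in broken_links:
    print('{status}  {url}  (linked from {referer})'.format(**link))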
This was adapted from the code found on alecxe's GitHub.