November 27th, 2018

Find all broken (404) links on a website with Python and Scrapy


I recently needed to find all the broken links on a website. Using a little Python and Scrapy, I was able to crawl the whole site quickly.

Prerequisites

You'll need Scrapy installed in order to run the following code. It was tested on Scrapy 1.1.2 and Python 2.7.10, but newer versions should work as well.
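
If you don't have Scrapy yet, it can usually be installed with pip (a plain install is shown below; adjust for a virtualenv or other environment setup as needed):

Terminal
pip install scrapy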

404_scraper.py
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item, Field

class LinkItem(Item):
    url = Field()
    referer = Field()
    status = Field()

class LinkSpider(CrawlSpider):
    name = "linkSpider"

    # Filter out other sites. No need to dig into outside websites and check their links.
    allowed_domains = ["matthewhoelter.com"]

    # List multiple start URLs if needed, separated by commas (,)
    # e.g. ['https://www.matthewhoelter.com', 'https://blog.matthewhoelter.com']
    start_urls = ['https://www.matthewhoelter.com']

    # Tell Scrapy not to filter out 404 responses so parse_item gets to see them
    handle_httpstatus_list = [404]
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        if response.status == 404:
            link = LinkItem()

            link['url'] = response.url
            link['status'] = response.status
            # The Referer header records which page linked to the broken URL
            link['referer'] = response.request.headers.get('Referer')

            return link

Then run it via:

Terminal
scrapy runspider 404_scraper.py -o output.json
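
If the crawl turns up any dead links, output.json will contain one entry per broken link. The exact contents depend on your site, but the file looks something like this (the entry below is made up for illustration):

output.json
[
    {"url": "https://www.matthewhoelter.com/some-old-post", "status": 404, "referer": "https://www.matthewhoelter.com/"}
]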

Special Thanks

This was adapted from the code found on alecxe's GitHub.