Checking your Sitemap for Broken Links with Python

When we redid our website in 2016, I wrote a Python script that checked our old sitemap to make sure that we had all our 301 redirects properly in place. Five years later, we are again launching a new site, so I revisited that script.

Originally, I used BeautifulSoup, but lately I prefer lxml for parsing HTML and XML documents.

Here’s my new code:

import os
from pathlib import Path
import requests
from lxml import etree

sitemaps = [
    "https://www.nasa.gov/sitemap-1.xml",
    "https://www.nasa.gov/sitemap-2.xml",
    "https://www.nasa.gov/sitemap-3.xml",
    "https://www.nasa.gov/sitemap-4.xml",
]


def check_sitemap_urls(sitemap, limit=50):
    """Attempts to resolve all urls in a sitemap and returns the results

    Args:
        sitemap (str): A URL
        limit (int, optional): The maximum number of URLs to check. Defaults to 50.
            Pass None for no limit.

    Returns:
        list of tuples: [(status_code, history, url, msg)].
            The history contains a list of redirects.
    """
    results = []
    # Without a timeout, an unresponsive server could hang the script
    res = requests.get(sitemap, timeout=30)
    doc = etree.XML(res.content)

    # xpath query for selecting all element nodes in namespace
    query = "descendant-or-self::*[namespace-uri()!='']"
    # for each element returned by the above xpath query...
    for element in doc.xpath(query):
        # replace element name with its local name
        element.tag = etree.QName(element).localname

    # get all the loc elements
    links = doc.xpath(".//loc")
    for i, link in enumerate(links, 1):
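        # Read the URL before the try block so the except clause can always report it
        url = (link.text or "").strip()
        try:
            print(f"{i}. Checking {url}")
            # A timeout keeps one unresponsive URL from hanging the whole run
            r = requests.get(url, timeout=30)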

            if r.history:
                result = (
                    r.status_code,
                    r.history,
                    url,
                    "No error. Redirect to " + r.url,
                )
            elif r.status_code == 200:
                result = (r.status_code, r.history, url, "No error. No redirect.")
            else:
                result = (r.status_code, r.history, url, "Error?")
        except Exception as e:
            result = (0, [], url, e)

        results.append(result)

        if limit and i >= limit:
            break

    # Sort by status and then by history length
    results.sort(key=lambda result: (result[0], len(result[1])))

    return results


def main():
    for sitemap in sitemaps:
        results = check_sitemap_urls(sitemap)

        name = os.path.basename(sitemap).split(".")[0]
        report_path = Path(f"{name}.txt")
        report = f"{sitemap}\n\n"

        # 301s - may want to clean up 301s if you have multiple redirects
        report += "301s\n"
        i = 0
        for result in results:
            if len(result[1]):  # history
                i += 1
                report += f"{i}. "
                for response in result[1]:
                    report += f">> {response.url}\n\t"
                report += f">>>> {result[3]}\n"

        # non-200s
        report += "\n\n==========\nERRORS\n"
        for result in results:
            if result[0] != 200:
                report += f"{result[0]} - {result[2]}\n"

        report_path.write_text(report)

if __name__ == "__main__":
    main()

A few explanatory notes:

  1. The list of sitemaps comes from https://www.nasa.gov/sitemap.xml. This is a sitemap index file, which contains links to the sitemaps that list the page URLs. For more information on this, see Google’s “Split up your large sitemaps”.
  2. If you’re new (or even not so new) to XPath, you may find this code a bit confusing:
    # xpath query for selecting all element nodes in namespace
    query = "descendant-or-self::*[namespace-uri()!='']"
    # for each element returned by the above xpath query...
    for element in doc.xpath(query):
        # replace element name with its local name
        element.tag = etree.QName(element).localname
    The sitemap XML files use a namespace:
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    That namespace makes writing XPaths a little trickier. There are various ways to deal with this, but I find a simple way is just to remove the namespace, which is what the code above does.
    For more information on working with XPath and Namespaces using lxml, see the lxml documentation.
  3. The report on 301 redirects will look like this:
    301s
    1. >> http://www.nasa.gov/
        >>>> No error. Redirect to https://www.nasa.gov/
    2. >> http://www.nasa.gov/connect/apps.html
        >>>> No error. Redirect to https://www.nasa.gov/connect/apps.html
    3. >> http://www.nasa.gov/multimedia/imagegallery/iotd.html
        >>>> No error. Redirect to https://www.nasa.gov/multimedia/imagegallery/iotd.html
    4. >> http://www.nasa.gov/archive/archive/about/career/index.html
        >>>> No error. Redirect to https://www.nasa.gov/archive/archive/about/career/index.html
    5. >> http://www.nasa.gov/multimedia/videogallery/index.html
        >>>> No error. Redirect to https://www.nasa.gov/multimedia/videogallery/index.html
    This indicates that the URLs using “http” will redirect to the same URLs using “https”. Sometimes, a URL will get redirected multiple times. That will look like this:
    1. >> http://www.nasa.gov/index.html
        >> http://www.nasa.gov/
        >>>> No error. Redirect to https://www.nasa.gov/
    If you find a page redirects more than once, you may want to fix that so that the URL redirects to its final destination in one step. How you do that will depend on how you are handling redirects. If you’re using Django, I recommend you check out django-redirects.
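
To pick out the multi-step redirects without reading the whole report, the results list can be filtered by the length of each history. The following is a minimal sketch, not part of the original script: `multi_hop_redirects`, its `max_hops` parameter, and the `hop` helper are hypothetical names, and the sample data uses simple stand-in objects in place of real `requests` responses so that it runs without any network access.

```python
from types import SimpleNamespace


def multi_hop_redirects(results, max_hops=1):
    """Return (url, hops, msg) for every result redirected more than max_hops times.

    Expects tuples shaped like the ones check_sitemap_urls() returns:
    (status_code, history, url, msg).
    """
    chains = []
    for status_code, history, url, msg in results:
        if len(history) > max_hops:
            # Each history entry is one intermediate response in the chain
            hops = [response.url for response in history]
            chains.append((url, hops, msg))
    return chains


def hop(url):
    """Hypothetical stand-in for a requests.Response; only .url is used here."""
    return SimpleNamespace(url=url)


sample_results = [
    (200, [], "https://www.nasa.gov/", "No error. No redirect."),
    (
        200,
        [hop("http://www.nasa.gov/")],
        "http://www.nasa.gov/",
        "No error. Redirect to https://www.nasa.gov/",
    ),
    (
        200,
        [hop("http://www.nasa.gov/index.html"), hop("http://www.nasa.gov/")],
        "http://www.nasa.gov/index.html",
        "No error. Redirect to https://www.nasa.gov/",
    ),
]

chains = multi_hop_redirects(sample_results)
for url, hops, msg in chains:
    print(f"{url} takes {len(hops)} redirects. {msg}")
```

With real data, you would pass in the list returned by `check_sitemap_urls()` instead of `sample_results`. Raising `max_hops` loosens the filter if a two-step chain is acceptable on your site.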

You can find the original script here.
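
The hard-coded `sitemaps` list could also be built from the sitemap index file mentioned in note 1. The sketch below is one possible approach under a few assumptions: `sitemap_urls_from_index` and `SITEMAP_NS` are hypothetical names, the inline XML is a made-up two-entry sample rather than NASA's real index, and it uses the standard library's `xml.etree.ElementTree` so that it runs without extra installs. It also shows the other common way to deal with the namespace from note 2: instead of stripping it, map a prefix to it and use that prefix in the path. lxml elements accept the same kind of prefix mapping in `findall()` and in the `namespaces` argument of `xpath()`.

```python
import xml.etree.ElementTree as ET

# The namespace declared at the top of sitemap and sitemap index files.
# The "sm" prefix is arbitrary; it only has to map to this URI.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls_from_index(index_xml):
    """Return the sitemap URLs listed in a sitemap index document.

    Args:
        index_xml (bytes): The raw XML of a sitemap index file.
    """
    root = ET.fromstring(index_xml)
    # Select every <loc> inside a <sitemap>, both in the sitemap namespace
    locs = root.findall("sm:sitemap/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


# Made-up sample standing in for the content of a real sitemap index
sample_index = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.nasa.gov/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>https://www.nasa.gov/sitemap-2.xml</loc></sitemap>
</sitemapindex>"""

sitemaps = sitemap_urls_from_index(sample_index)
print(sitemaps)
```

In a real run, the bytes would come from something like `requests.get("https://www.nasa.gov/sitemap.xml").content` rather than from an inline sample.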


Written by Nat Dunn.

