Checking your Sitemap for Broken Links with Python

When we redid our website in 2016, I wrote a Python script that checked our old sitemap to make sure that we had all our 301 redirects properly in place. Five years later, we are again launching a new site, so I revisited that script.

Originally, I used BeautifulSoup, but lately I prefer lxml for parsing HTML and XML documents.

Here’s my new code:

import os
from pathlib import Path
import requests
from lxml import etree

sitemaps = [
    "https://www.nasa.gov/sitemap-1.xml",
    "https://www.nasa.gov/sitemap-2.xml",
    "https://www.nasa.gov/sitemap-3.xml",
    "https://www.nasa.gov/sitemap-4.xml",
]


def check_sitemap_urls(sitemap, limit=50):
    """Attempts to resolve all urls in a sitemap and returns the results

    Args:
        sitemap (str): A URL
        limit (int, optional): The maximum number of URLs to check. Defaults to 50.
            Pass None for no limit.

    Returns:
        list of tuples: [(status_code, history, url, msg)].
            The history contains a list of redirects.
    """
    results = []
    # a timeout keeps the script from hanging on an unresponsive server
    res = requests.get(sitemap, timeout=30)
    doc = etree.XML(res.content)

    # xpath query for selecting all element nodes in namespace
    query = "descendant-or-self::*[namespace-uri()!='']"
    # for each element returned by the above xpath query...
    for element in doc.xpath(query):
        # replace element name with its local name
        element.tag = etree.QName(element).localname

    # get all the loc elements
    links = doc.xpath(".//loc")
    for i, link in enumerate(links, 1):
        try:
            url = link.text
            print(f"{i}. Checking {url}")
            # a timeout error is caught below and reported with status 0
            r = requests.get(url, timeout=30)

            if r.history:
                result = (
                    r.status_code,
                    r.history,
                    url,
                    "No error. Redirect to " + r.url,
                )
            elif r.status_code == 200:
                result = (r.status_code, r.history, url, "No error. No redirect.")
            else:
                result = (r.status_code, r.history, url, "Error?")
        except Exception as e:
            result = (0, [], url, e)

        results.append(result)

        if limit and i >= limit:
            break

    # Sort by status and then by history length
    results.sort(key=lambda result: (result[0], len(result[1])))

    return results


def main():
    for sitemap in sitemaps:
        results = check_sitemap_urls(sitemap)

        name = os.path.basename(sitemap).split(".")[0]
        report_path = Path(f"{name}.txt")
        report = f"{sitemap}\n\n"

        # 301s - may want to clean up 301s if you have multiple redirects
        report += "301s\n"
        i = 0
        for result in results:
            if len(result[1]):  # history
                i += 1
                report += f"{i}. "
                for response in result[1]:
                    report += f">> {response.url}\n\t"
                report += f">>>> {result[3]}\n"

        # non-200s
        report += "\n\n==========\nERRORS\n"
        for result in results:
            if result[0] != 200:
                report += f"{result[0]} - {result[2]}\n"

        report_path.write_text(report)

if __name__ == "__main__":
    main()
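
The hard-coded `sitemaps` list could also be built from the sitemap index itself. The following is a minimal sketch, not part of the original script: the helper name `extract_locs` and the sample XML are made up for illustration, and it uses the standard library's `xml.etree.ElementTree` with a namespace map rather than lxml so that it runs without extra installs. lxml's `xpath()` method accepts a similar `namespaces` argument.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemap and sitemap index files
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_bytes):
    """Return the text of every <loc> element in a sitemap or sitemap index.

    Hypothetical helper for illustration; not part of the original script.
    """
    root = ET.fromstring(xml_bytes)
    # Query with a prefix mapped to the namespace instead of stripping it
    locs = root.findall(".//sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


# Made-up sample index; in practice you might pass requests.get(index_url).content
sample_index = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-2.xml</loc></sitemap>
</sitemapindex>"""

sitemaps_from_index = extract_locs(sample_index)
print(sitemaps_from_index)
```

Because a sitemap and a sitemap index both put their URLs in `loc` elements, the same helper should work for either kind of file.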

A few explanatory notes:

  1. The list of sitemaps comes from https://www.nasa.gov/sitemap.xml. This is a sitemap index file, which contains links to the sitemaps that list the page URLs. For more information on this, see Google’s Split up your large sitemaps.
  2. If you’re new (or even not so new) to XPath, you may find this code a bit confusing:
    # xpath query for selecting all element nodes in namespace
    query = "descendant-or-self::*[namespace-uri()!='']"
    # for each element returned by the above xpath query...
    for element in doc.xpath(query):
        # replace element name with its local name
        element.tag = etree.QName(element).localname
    The sitemap XML files use a namespace:
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
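    That namespace makes writing XPaths a little trickier: an unprefixed query such as .//loc only matches elements that are in no namespace, so it would return nothing for these files. There are various ways to deal with this, but I find the simplest is just to remove the namespace, which is what the code above does.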
    For more information on working with XPath and Namespaces using lxml, see the lxml documentation.
  3. The report on 301 redirects will look like this:
    301s
    1. >> http://www.nasa.gov/
        >>>> No error. Redirect to https://www.nasa.gov/
    2. >> http://www.nasa.gov/connect/apps.html
        >>>> No error. Redirect to https://www.nasa.gov/connect/apps.html
    3. >> http://www.nasa.gov/multimedia/imagegallery/iotd.html
        >>>> No error. Redirect to https://www.nasa.gov/multimedia/imagegallery/iotd.html
    4. >> http://www.nasa.gov/archive/archive/about/career/index.html
        >>>> No error. Redirect to https://www.nasa.gov/archive/archive/about/career/index.html
    5. >> http://www.nasa.gov/multimedia/videogallery/index.html
        >>>> No error. Redirect to https://www.nasa.gov/multimedia/videogallery/index.html
    This indicates that the URLs using “http” will redirect to the same URLs using “https”. Sometimes, a URL will get redirected multiple times. That will look like this:
    1. >> http://www.nasa.gov/index.html
        >> http://www.nasa.gov/
        >>>> No error. Redirect to https://www.nasa.gov/
    If you find a page redirects more than once, you may want to fix that so that the URL redirects to its final destination in one step. How you do that will depend on how you are handling redirects. If you’re using Django, I recommend you check out django-redirects.
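
To pick out only the URLs that redirect more than once, the `results` list returned by `check_sitemap_urls()` can be filtered on the length of each history. This is a minimal sketch, assuming the `(status_code, history, url, msg)` tuples described in the docstring above. The helper name `multi_hop_redirects` is hypothetical, the sample data is made up, and the stand-in objects only mimic the `.url` attribute of the responses that `requests` stores in `r.history`.

```python
from types import SimpleNamespace


def multi_hop_redirects(results):
    """Return (start_url, hops, msg) for results with more than one redirect.

    Hypothetical helper; expects tuples shaped like
    (status_code, history, url, msg) from check_sitemap_urls().
    """
    chains = []
    for status_code, history, url, msg in results:
        if len(history) > 1:
            # Each entry in the history is one redirect response
            hops = [response.url for response in history]
            chains.append((url, hops, msg))
    return chains


# Stand-ins for the Response objects that requests puts in r.history
hop1 = SimpleNamespace(url="http://www.nasa.gov/index.html")
hop2 = SimpleNamespace(url="http://www.nasa.gov/")

# Made-up results in the same shape the script produces
sample_results = [
    (200, [], "https://www.nasa.gov/", "No error. No redirect."),
    (200, [hop2], "http://www.nasa.gov/", "No error. Redirect to https://www.nasa.gov/"),
    (
        200,
        [hop1, hop2],
        "http://www.nasa.gov/index.html",
        "No error. Redirect to https://www.nasa.gov/",
    ),
]

for url, hops, msg in multi_hop_redirects(sample_results):
    print(f"{url}: {len(hops)} redirects")
```

With the sample data, only the last entry is reported, since it is the only one whose history holds two responses.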

You can find the original script here.


Written by Nat Dunn.

