Checking your Sitemap for Broken Links with Python

Written by Nat Dunn.

When we redid our website in 2016, I wrote a Python script that checked our old sitemap to make sure that we had all our 301 redirects properly in place. Five years later, we are again launching a new site, so I revisited that script.

Originally, I used BeautifulSoup, but lately I prefer lxml for parsing HTML and XML documents.

Here’s my new code:

import os
from pathlib import Path
import requests
from lxml import etree

sitemaps = [
    "https://www.nasa.gov/sitemap-1.xml",
    "https://www.nasa.gov/sitemap-2.xml",
    "https://www.nasa.gov/sitemap-3.xml",
    "https://www.nasa.gov/sitemap-4.xml",
]


def check_sitemap_urls(sitemap, limit=50):
    """Attempts to resolve all urls in a sitemap and returns the results

    Args:
        sitemap (str): A URL
        limit (int, optional): The maximum number of URLs to check. Defaults to 50.
            Pass None for no limit.

    Returns:
        list of tuples: [(status_code, history, url, msg)].
            The history contains a list of redirects.
    """
    results = []
    # Time out rather than hang if the sitemap server does not respond
    res = requests.get(sitemap, timeout=30)
    doc = etree.XML(res.content)

    # xpath query for selecting all element nodes in namespace
    query = "descendant-or-self::*[namespace-uri()!='']"
    # for each element returned by the above xpath query...
    for element in doc.xpath(query):
        # replace element name with its local name
        element.tag = etree.QName(element).localname

    # get all the loc elements
    links = doc.xpath(".//loc")
    for i, link in enumerate(links, 1):
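        # Read the URL outside the try block so it is always defined in the except clause
        url = (link.text or "").strip()
        try:
            print(f"{i}. Checking {url}")
            # Time out rather than hang on an unresponsive server
            r = requests.get(url, timeout=30)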

            if r.history:
                result = (
                    r.status_code,
                    r.history,
                    url,
                    "No error. Redirect to " + r.url,
                )
            elif r.status_code == 200:
                result = (r.status_code, r.history, url, "No error. No redirect.")
            else:
                result = (r.status_code, r.history, url, "Error?")
        except Exception as e:
            # Use 0 as the status code for requests that failed outright
            result = (0, [], url, str(e))

        results.append(result)

        if limit and i >= limit:
            break

    # Sort by status and then by history length
    results.sort(key=lambda result: (result[0], len(result[1])))

    return results


def main():
    for sitemap in sitemaps:
        results = check_sitemap_urls(sitemap)

        name = os.path.basename(sitemap).split(".")[0]
        report_path = Path(f"{name}.txt")
        report = f"{sitemap}\n\n"

        # 301s - may want to clean up 301s if you have multiple redirects
        report += "301s\n"
        i = 0
        for result in results:
            if len(result[1]):  # history
                i += 1
                report += f"{i}. "
                for response in result[1]:
                    report += f">> {response.url}\n\t"
                report += f">>>> {result[3]}\n"

        # non-200s
        report += "\n\n==========\nERRORS\n"
        for result in results:
            if result[0] != 200:
                report += f"{result[0]} - {result[2]}\n"

        report_path.write_text(report)

if __name__ == "__main__":
    main()

A few explanatory notes:

The list of sitemaps comes from https://www.nasa.gov/sitemap.xml. This is a sitemap index file: rather than listing pages itself, it links to the individual sitemaps that list the page URLs. For more information, see Google’s “Split up your large sitemaps” documentation.

If you’re new (or even not so new) to XPath, you may find this code a bit confusing:
```
# xpath query for selecting all element nodes in namespace
query = "descendant-or-self::*[namespace-uri()!='']"
# for each element returned by the above xpath query...
for element in doc.xpath(query):
    # replace element name with its local name
    element.tag = etree.QName(element).localname
```
The sitemap XML files use a namespace:
```
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
```
That namespace makes writing XPath queries a little trickier: a plain query such as .//loc will not match the namespaced loc elements. There are various ways to deal with this, but I find the simplest is to strip the namespace from every element, which is what the code above does.
For more information on working with XPath and Namespaces using lxml, see the lxml documentation.
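
As an alternative to stripping the namespace, you can leave it in place and bind a prefix to it when you query. The sketch below does that with the standard library's xml.etree.ElementTree rather than lxml (a swap made only so the example has no third-party dependencies); lxml's xpath() method accepts a similar namespaces argument. The inline XML, the sm prefix, and the extract_locs name are illustrative assumptions, not part of the script above.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemap and sitemap index files
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical, trimmed-down sitemap index used only for illustration
SAMPLE_INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.nasa.gov/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>https://www.nasa.gov/sitemap-2.xml</loc></sitemap>
</sitemapindex>"""


def extract_locs(xml_bytes):
    """Return the text of every <loc> element, leaving the namespace in place."""
    root = ET.fromstring(xml_bytes)
    # The "sm:" prefix is resolved through SITEMAP_NS, so no tags are rewritten
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


print(extract_locs(SAMPLE_INDEX))
```

The same function works on a regular sitemap as well as on a sitemap index, since both use the same namespace and both store their URLs in loc elements.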

The report on 301 redirects will look like this:

301s
1. >> http://www.nasa.gov/
    >>>> No error. Redirect to https://www.nasa.gov/
2. >> http://www.nasa.gov/connect/apps.html
    >>>> No error. Redirect to https://www.nasa.gov/connect/apps.html
3. >> http://www.nasa.gov/multimedia/imagegallery/iotd.html
    >>>> No error. Redirect to https://www.nasa.gov/multimedia/imagegallery/iotd.html
4. >> http://www.nasa.gov/archive/archive/about/career/index.html
    >>>> No error. Redirect to https://www.nasa.gov/archive/archive/about/career/index.html
5. >> http://www.nasa.gov/multimedia/videogallery/index.html
    >>>> No error. Redirect to https://www.nasa.gov/multimedia/videogallery/index.html

This indicates that the URLs using “http” will redirect to the same URLs using “https”. Sometimes, a URL will get redirected multiple times. That will look like this:

1. >> http://www.nasa.gov/index.html
    >> http://www.nasa.gov/
    >>>> No error. Redirect to https://www.nasa.gov/
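
When a report is long, it can help to separate plain http-to-https upgrades from redirects that actually change the address. Here is a minimal sketch of one way to tell them apart, using only the standard library. The is_https_upgrade helper is a hypothetical name and is not part of the script above.

```python
from urllib.parse import urlsplit


def is_https_upgrade(source_url, final_url):
    """Return True when the only difference between the URLs is http -> https."""
    src, dst = urlsplit(source_url), urlsplit(final_url)
    return (
        src.scheme == "http"
        and dst.scheme == "https"
        and src.netloc.lower() == dst.netloc.lower()
        # Treat an empty path and "/" as the same location
        and (src.path or "/") == (dst.path or "/")
        and src.query == dst.query
    )


print(is_https_upgrade("http://www.nasa.gov/", "https://www.nasa.gov/"))  # True
print(is_https_upgrade("http://www.nasa.gov/index.html", "https://www.nasa.gov/"))  # False
```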

If you find a page redirects more than once, you may want to fix that so that the URL redirects to its final destination in one step. How you do that will depend on how you are handling redirects. If you’re using Django, I recommend you check out django-redirects.
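
To pick out the multi-step chains without reading the whole report, you could filter the results list on the length of each history. This is a rough sketch that assumes the (status_code, history, url, msg) tuples returned by check_sitemap_urls. The multi_hop_redirects name and the sample data are made up for illustration, with SimpleNamespace objects standing in for real responses.

```python
from types import SimpleNamespace


def multi_hop_redirects(results):
    """Return (url, hop_count) pairs for results that were redirected more than once."""
    return [
        (url, len(history))
        for _status, history, url, _msg in results
        if len(history) > 1
    ]


# Stand-ins for requests.Response objects; only their number matters here
sample_results = [
    (200, [], "https://www.nasa.gov/", "No error. No redirect."),
    (
        200,
        [SimpleNamespace(url="http://www.nasa.gov/")],
        "http://www.nasa.gov/",
        "No error. Redirect to https://www.nasa.gov/",
    ),
    (
        200,
        [
            SimpleNamespace(url="http://www.nasa.gov/index.html"),
            SimpleNamespace(url="http://www.nasa.gov/"),
        ],
        "http://www.nasa.gov/index.html",
        "No error. Redirect to https://www.nasa.gov/",
    ),
]

for url, hops in multi_hop_redirects(sample_results):
    print(f"{url} redirects {hops} times")
```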

You can find the original script here.
