Checking your Sitemap for Broken Links with Python
When we redid our website in 2016, I wrote a Python script that checked our old sitemap to make sure that we had all our 301 redirects properly in place. Five years later, we are again launching a new site, so I revisited that script.
Originally, I used BeautifulSoup, but lately I prefer lxml for parsing HTML and XML documents.
Here’s my new code:
import os
from pathlib import Path

import requests
from lxml import etree

sitemaps = [
    "https://www.nasa.gov/sitemap-1.xml",
    "https://www.nasa.gov/sitemap-2.xml",
    "https://www.nasa.gov/sitemap-3.xml",
    "https://www.nasa.gov/sitemap-4.xml",
]


def check_sitemap_urls(sitemap, limit=50):
    """Attempts to resolve all urls in a sitemap and returns the results

    Args:
        sitemap (str): A URL
        limit (int, optional): The maximum number of URLs to check. Defaults to 50.
            Pass None for no limit.

    Returns:
        list of tuples: [(status_code, history, url, msg)].
            The history contains a list of redirects.
    """
    results = []
    res = requests.get(sitemap)
    doc = etree.XML(res.content)

    # xpath query for selecting all element nodes in namespace
    query = "descendant-or-self::*[namespace-uri()!='']"
    # for each element returned by the above xpath query...
    for element in doc.xpath(query):
        # replace element name with its local name
        element.tag = etree.QName(element).localname

    # get all the loc elements
    links = doc.xpath(".//loc")
    for i, link in enumerate(links, 1):
        try:
            url = link.text
            print(f"{i}. Checking {url}")
            r = requests.get(url)

            if r.history:
                result = (
                    r.status_code,
                    r.history,
                    url,
                    "No error. Redirect to " + r.url,
                )
            elif r.status_code == 200:
                result = (r.status_code, r.history, url, "No error. No redirect.")
            else:
                result = (r.status_code, r.history, url, "Error?")
        except Exception as e:
            result = (0, [], url, e)

        results.append(result)

        if limit and i >= limit:
            break

    # Sort by status and then by history length
    results.sort(key=lambda result: (result[0], len(result[1])))

    return results


def main():
    for sitemap in sitemaps:
        results = check_sitemap_urls(sitemap)

        name = os.path.basename(sitemap).split(".")[0]
        report_path = Path(f"{name}.txt")

        report = f"{sitemap}\n\n"

        # 301s - may want to clean up 301s if you have multiple redirects
        report += "301s\n"
        i = 0
        for result in results:
            if len(result[1]):  # history
                i += 1
                report += f"{i}. "
                for response in result[1]:
                    report += f">> {response.url}\n\t"
                report += f">>>> {result[3]}\n"

        # non-200s
        report += "\n\n==========\nERRORS\n"
        for result in results:
            if result[0] != 200:
                report += f"{result[0]} - {result[2]}\n"

        report_path.write_text(report)


if __name__ == "__main__":
    main()
A few explanatory notes:
- The list of sitemaps comes from https://www.nasa.gov/sitemap.xml. This is a sitemap index file, which contains links to the sitemaps that list the page URLs. For more information on this, see Google’s Split up your large sitemaps.
- If you’re new (or even not so new) to XPath, you may find this code a bit confusing:

      # xpath query for selecting all element nodes in namespace
      query = "descendant-or-self::*[namespace-uri()!='']"
      # for each element returned by the above xpath query...
      for element in doc.xpath(query):
          # replace element name with its local name
          element.tag = etree.QName(element).localname

  The sitemap XML files use a namespace:

      <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

  That namespace makes writing XPaths a little trickier. There are various ways to deal with this, but I find a simple way is just to remove the namespace, which is what the code above does.

  For more information on working with XPath and Namespaces using lxml, see the lxml documentation.
- The report on 301 redirects will look like this:

      301s
      1. >> http://www.nasa.gov/
          >>>> No error. Redirect to https://www.nasa.gov/
      2. >> http://www.nasa.gov/connect/apps.html
          >>>> No error. Redirect to https://www.nasa.gov/connect/apps.html
      3. >> http://www.nasa.gov/multimedia/imagegallery/iotd.html
          >>>> No error. Redirect to https://www.nasa.gov/multimedia/imagegallery/iotd.html
      4. >> http://www.nasa.gov/archive/archive/about/career/index.html
          >>>> No error. Redirect to https://www.nasa.gov/archive/archive/about/career/index.html
      5. >> http://www.nasa.gov/multimedia/videogallery/index.html
          >>>> No error. Redirect to https://www.nasa.gov/multimedia/videogallery/index.html

  This indicates that the URLs using “http” will redirect to the same URLs using “https”. Sometimes, a URL will get redirected multiple times. That will look like this:

      1. >> http://www.nasa.gov/index.html
          >> http://www.nasa.gov/
          >>>> No error. Redirect to https://www.nasa.gov/

  If you find a page redirects more than once, you may want to fix that so that the URL redirects to its final destination in one step. How you do that will depend on how you are handling redirects. If you’re using Django, I recommend you check out django-redirects.
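The multi-step redirects described in the last note can also be picked out programmatically rather than by reading the report. The following is a minimal sketch, not part of the original script: `find_redirect_chains` and its `min_hops` parameter are hypothetical names, and `FakeResponse` is a stand-in for `requests.Response` so the example runs without network access. It assumes the same `(status_code, history, url, msg)` tuple layout that `check_sitemap_urls()` returns.

```python
from collections import namedtuple

# Stand-in for requests.Response; only the .url attribute is used here.
FakeResponse = namedtuple("FakeResponse", ["url"])


def find_redirect_chains(results, min_hops=2):
    """Return (url, hops) pairs for URLs redirected at least min_hops times.

    results uses the tuple layout (status_code, history, url, msg),
    where history is a list of response objects with a .url attribute.
    """
    chains = []
    for status_code, history, url, msg in results:
        if len(history) >= min_hops:
            hops = [response.url for response in history]
            chains.append((url, hops))
    return chains


# Hand-written sample data modeled on the report excerpts above
sample_results = [
    (200, [], "https://www.nasa.gov/", "No error. No redirect."),
    (
        200,
        [FakeResponse("http://www.nasa.gov/connect/apps.html")],
        "http://www.nasa.gov/connect/apps.html",
        "No error. Redirect to https://www.nasa.gov/connect/apps.html",
    ),
    (
        200,
        [
            FakeResponse("http://www.nasa.gov/index.html"),
            FakeResponse("http://www.nasa.gov/"),
        ],
        "http://www.nasa.gov/index.html",
        "No error. Redirect to https://www.nasa.gov/",
    ),
]

for url, hops in find_redirect_chains(sample_results):
    print(f"{url} redirects {len(hops)} times: {' >> '.join(hops)}")
```

With real data, the list returned by `check_sitemap_urls()` could be passed in directly, since each entry in `r.history` is a response object with a `url` attribute.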
You can find the original script here.
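As noted above, the hard-coded list of sitemaps comes from a sitemap index file, and removing the namespace is only one of several ways to query such files. The sketch below illustrates an alternative under stated assumptions: it reads the sitemap URLs out of an index by keeping the namespace and mapping it to a prefix. It uses the standard library's `xml.etree.ElementTree` so it runs without extra dependencies (lxml's `xpath()` accepts a similar prefix map through its `namespaces` argument). The function name `get_sitemap_urls`, the `sm` prefix, and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace declared by sitemap and sitemap index files,
# mapped to an arbitrary prefix ("sm") for use in queries.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Small hand-written sitemap index for illustration only
SAMPLE_INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.nasa.gov/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>https://www.nasa.gov/sitemap-2.xml</loc></sitemap>
</sitemapindex>"""


def get_sitemap_urls(index_xml):
    """Return the sitemap URLs listed in a sitemap index document."""
    root = ET.fromstring(index_xml)
    # Select each <loc> inside a <sitemap>, both in the sitemap namespace
    locs = root.findall("sm:sitemap/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


for sitemap_url in get_sitemap_urls(SAMPLE_INDEX):
    print(sitemap_url)
```

In a real run, the bytes passed to `get_sitemap_urls()` would come from something like `requests.get("https://www.nasa.gov/sitemap.xml").content`, and the returned list could replace the hard-coded `sitemaps` list.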