Webucator Blog

Checking your Sitemap for Broken Links with Python

We are about to launch a new website and, in doing so, we have restructured our URLs and set up 301 redirects that point each old URL to its new location. To make sure that we caught all the changes, we compiled a list of URLs on our old site using our sitemap and the Landing Page report from Google Analytics. We then used Python’s requests library to check all the URLs.
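As a rough illustration of that kind of check, here is a minimal sketch that verifies a single old URL. It assumes you already know where each old URL should end up. The function name `check_redirect`, the `get` parameter, and the example.com URLs are hypothetical and exist only for this example. By default it calls `requests.get`, which follows redirects and records each intermediate response in `response.history`.

```python
def check_redirect(old_url, expected_url, get=None):
    """Return (ok, detail) describing how old_url resolves.

    `get` is any callable that behaves like requests.get. It is a
    hypothetical hook added for this sketch so the logic can be tried
    without touching the network.
    """
    if get is None:
        import requests  # third-party: pip install requests
        get = requests.get

    response = get(old_url, allow_redirects=True, timeout=10)

    # response.history holds the intermediate redirect responses, oldest first.
    if response.history:
        first_status = response.history[0].status_code
    else:
        first_status = response.status_code

    if response.status_code != 200:
        return False, f"final status {response.status_code}"
    if first_status != 301:
        return False, f"expected a 301, got {first_status}"
    if response.url != expected_url:
        return False, f"redirected to {response.url}"
    return True, "ok"
```

Calling `check_redirect("https://example.com/old-page", "https://example.com/new-page")` for every pair in your own list of old and new URLs would flag any old URL that is missing, is not a permanent redirect, or lands somewhere unexpected. The comparison is an exact string match, so a trailing slash on one side only counts as a mismatch.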

Using BeautifulSoup, you can easily check your own sitemap to make sure you don’t have any links pointing to missing pages. The code below shows how to do this using the NASA sitemap.
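A minimal sketch of such a check, not necessarily the exact script used for this post, might look like the following. It assumes the sitemap is a single XML file whose page addresses sit in `<loc>` tags (a sitemap index that links to other sitemaps would need one more loop). The function names are made up for this example, and the report format is modeled on the result listing in this post.

```python
def sitemap_urls(sitemap_xml):
    """Return every URL listed in a <loc> tag of the sitemap text."""
    # Imported here so the helpers below work without BeautifulSoup installed.
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    soup = BeautifulSoup(sitemap_xml, "html.parser")
    return [loc.get_text(strip=True) for loc in soup.find_all("loc")]


def redirect_chain(response):
    """Return every URL visited, from the requested URL to the final one."""
    return [hop.url for hop in response.history] + [response.url]


def check_sitemap(sitemap_url):
    """Fetch each sitemap URL and return (redirect chains, 404 URLs)."""
    import requests  # third-party: pip install requests

    sitemap = requests.get(sitemap_url, timeout=10)
    sitemap.raise_for_status()

    redirects, missing = [], []
    for url in sitemap_urls(sitemap.text):
        response = requests.get(url, timeout=10)
        if response.status_code == 404:
            missing.append(url)
        elif response.history and response.ok:
            # The request was redirected at least once and ended on a good page.
            redirects.append(redirect_chain(response))
        # Other outcomes (for example a 500) are ignored in this sketch.
    return redirects, missing


def format_report(redirects, missing):
    """Build a text report: numbered redirect chains, then the 404s."""
    lines = []
    for number, chain in enumerate(redirects, start=1):
        *hops, final = chain
        lines.append(f"{number}. >> {hops[0]}")
        lines.extend(f"\t>> {hop}" for hop in hops[1:])
        lines.append(f"\t>>>> No error. Redirect to {final}")
    lines.extend(f"404 - {url}" for url in missing)
    return "\n".join(lines)
```

To run it against your own site, you could call `print(format_report(*check_sitemap("https://example.com/sitemap.xml")))`, where the example.com address is a placeholder for your sitemap’s URL. For a large sitemap this makes one request per page, so expect it to take a while.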

And here’s the result:

  1. >> http://www.nasa.gov/exploration/home/index.html
  	>>>> No error. Redirect to http://www.nasa.gov/topics/journeytomars/index.html
  2. >> http://www.nasa.gov/topics/universe/index.html
  	>>>> No error. Redirect to http://www.nasa.gov/topics/solarsystem/index.html
  3. >> http://www.nasa.gov/topics/nasalife/index.html
  	>>>> No error. Redirect to http://www.nasa.gov/topics/benefits/index.html
  4. >> http://www.nasa.gov/about/directorates/index.html
  	>>>> No error. Redirect to http://www.nasa.gov/about/org_index.html
  5. >> http://www.nasa.gov/about/speakers/index.html
  	>>>> No error. Redirect to http://www.nasa.gov/about/exhibits/index.html
  6. >> http://www.nasa.gov/missions/schedule/index.html
  	>> http://www.nasa.gov/launchschedule
  	>>>> No error. Redirect to http://www.nasa.gov/launchschedule/
  7. >> http://www.nasa.gov/connect/index.html
  	>> http://www.nasa.gov/socialmedia
  	>>>> No error. Redirect to http://www.nasa.gov/socialmedia/

  404 - http://www.nasa.gov/worldbook/index.html
  404 - http://www.nasa.gov/centers/dryden/site9/index.html
  404 - http://www.nasa.gov/centers/johnson/business%20/index.html
  404 - http://www.nasa.gov/centers/wallops/business/index.html
  404 - http://www.nasa.gov/centers/wallops/about/history.html

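Notice that numbers 6 and 7 in the list of redirects each go through two redirects before reaching the final page. That’s probably not an issue, but if you have a lot of chained 301s, you may want to clean them up by pointing each old URL straight at its final destination.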
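If you want to pick out chained redirects automatically, one approach is to count the hops in each chain. The sketch below assumes each chain is a list of every URL visited, starting with the requested URL and ending with the final one. The helper name `long_chains` and its `max_hops` parameter are made up for this example, and the two chains are taken from the results above.

```python
def long_chains(chains, max_hops=1):
    """Return the chains that take more than max_hops redirects."""
    # A chain lists every URL visited, so hops = number of URLs - 1.
    return [chain for chain in chains if len(chain) - 1 > max_hops]


chains = [
    [
        "http://www.nasa.gov/about/speakers/index.html",
        "http://www.nasa.gov/about/exhibits/index.html",
    ],
    [
        "http://www.nasa.gov/missions/schedule/index.html",
        "http://www.nasa.gov/launchschedule",
        "http://www.nasa.gov/launchschedule/",
    ],
]

for chain in long_chains(chains):
    print(f"{len(chain) - 1} hops: " + " -> ".join(chain))
```

Only the second chain is printed, because it takes two hops. When you are working with a requests response directly, `len(response.history)` gives the same hop count.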

It looks like NASA has some 404s to fix.

In the video below, I explain the Python code that checks for 301s.
