Python Project: TextAnalyzer Class

In this project, you will create a TextAnalyzer class. The methods of the class are described below. You will do your work in the Analyzing Text IPython notebook included in the project in your Python class files. Be sure to comment your code well.

Be Sure to Download the Latest Project Files

Create a TextAnalyzer class with the following methods:


TextAnalyzer objects are instantiated by passing in one of the following to the src parameter:

  • A valid URL beginning with "http"
  • A path to a text file ending with the file extension "txt"
  • A string of text

The __init__() method also includes a src_type parameter, which is used to specify the type of the src argument. Options are:

  • discover (default) - You must write code to discover the type of src.
    • If the src begins with "http", it is a url.
    • If the src ends in "txt", it is a path.
    • Otherwise, it is text.
  • url
  • path
  • text

You should set self._src_type, self._content, and self._orig_content in the __init__() method.
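The discovery rules listed above can be sketched as a small standalone function. The name discover_src_type is a hypothetical helper, not part of the required class interface; in the project, the same checks would presumably live inside __init__().

```python
def discover_src_type(src):
    """Guess whether src is a URL, a path to a text file, or plain text.

    Hypothetical helper: applies the three discovery rules in order.
    """
    if src.startswith("http"):
        return "url"
    if src.endswith("txt"):
        return "path"
    return "text"


print(discover_src_type("https://www.example.com"))
print(discover_src_type("data/speech.txt"))
print(discover_src_type("Four score and seven years ago"))
```

The order of the checks matters: a URL that happens to end in "txt" is still treated as a url because the "http" test runs first.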

set_content_to_tag(self, tag, tag_id=None)

Changes _content to the text within a specific element of an HTML document.

Keyword arguments:

  • tag (str) -- Tag to read
  • tag_id (str) -- ID of tag to read

It's possible the HTML does not contain the tag being searched. You should use exception handling to catch any errors.
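One possible way to extract the text of an element and handle a missing tag is sketched below. It uses the standard library's html.parser module, which is an assumption made here to keep the example self-contained; the course may expect a different parsing library. The names _TagTextExtractor, text_in_tag, and safe_text_in_tag are hypothetical.

```python
from html.parser import HTMLParser


class _TagTextExtractor(HTMLParser):
    """Collects the text inside the first element matching tag (and id, if given)."""

    def __init__(self, tag, tag_id=None):
        super().__init__()
        self.tag = tag.lower()   # HTMLParser reports tag names in lowercase
        self.tag_id = tag_id
        self.depth = 0           # greater than 0 while inside the matching element
        self.found = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            # Track nested elements with the same name so we stop at the right end tag.
            if tag == self.tag:
                self.depth += 1
        elif not self.found and tag == self.tag:
            if self.tag_id is None or dict(attrs).get("id") == self.tag_id:
                self.found = True
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def text_in_tag(html, tag, tag_id=None):
    """Return the text within the first matching element; raise ValueError if absent."""
    parser = _TagTextExtractor(tag, tag_id)
    parser.feed(html)
    parser.close()
    if not parser.found:
        raise ValueError("No <{}> element found".format(tag))
    return "".join(parser.parts)


def safe_text_in_tag(html, tag, tag_id=None):
    """Same as text_in_tag(), but returns None instead of raising when the tag is missing."""
    try:
        return text_in_tag(html, tag, tag_id)
    except ValueError:
        return None


page = '<html><body><div id="a">Hi <b>there</b></div><div id="b">Bye</div></body></html>'
print(safe_text_in_tag(page, "div", "b"))
print(safe_text_in_tag(page, "span"))
```

In the class itself, the except branch would more likely leave _content unchanged and report the problem, rather than return None.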


Resets _content to full text that was originally loaded. Useful after a call to set_content_to_tag().

_words(self, casesensitive=False)

Returns words in _content as list.

Keyword arguments:

  • casesensitive (bool) -- If False makes all words uppercase.


  1. After splitting the text into words using the split() method, strip any leading and trailing punctuation using:
    [word.strip(string.punctuation) for word in words]
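As a minimal sketch of the split-then-strip step, written as a standalone function (split_words is a hypothetical name; in the project this logic belongs in _words()):

```python
import string


def split_words(text, casesensitive=False):
    """Return the words in text with leading and trailing punctuation stripped."""
    if not casesensitive:
        text = text.upper()
    words = text.split()
    # strip() only removes punctuation at the ends, so "IT'S" keeps its apostrophe.
    return [word.strip(string.punctuation) for word in words]


print(split_words("Hello, world! It's a test."))
```

Note that a token made up entirely of punctuation, such as "--", becomes an empty string after stripping. Whether to filter those out is not stated in the requirements.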

common_words(self, minlen=1, maxlen=100, count=10, casesensitive=False)

Returns a list of 2-element tuples of the structure (word, num), where num is the number of times word shows up in _content.

Keyword arguments:

  • minlen (int) -- Minimum length of words to include.
  • maxlen (int) -- Maximum length of words to include.
  • count (int) -- Number of words to include.
  • casesensitive (bool) -- If False makes all words uppercase.
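One way to build the (word, num) list is with collections.Counter. The sketch below is a standalone function that takes an already-split word list, which is an assumption made for illustration; the method described above would get its words from _words().

```python
from collections import Counter


def common_words(words, minlen=1, maxlen=100, count=10):
    """Return up to count (word, num) tuples, most frequent first."""
    # Keep only words whose length falls within [minlen, maxlen].
    filtered = [w for w in words if minlen <= len(w) <= maxlen]
    return Counter(filtered).most_common(count)


sample = ["A", "THE", "THE", "CAT", "A", "THE"]
print(common_words(sample, minlen=2, count=1))
```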

char_distribution(self, casesensitive=False, letters_only=False)

Returns a list of 2-element tuples of the format (char, num), where num is the number of times char shows up in _content. The list should be sorted by num in descending order.

Keyword arguments:

  • casesensitive (bool) -- Consider case?
  • letters_only (bool) -- Exclude non-letters?
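The same Counter approach can be applied to characters. This is a sketch under two assumptions that the description does not confirm: that casesensitive=False means converting to uppercase (as it does for words), and that "letters" means characters for which str.isalpha() is true.

```python
from collections import Counter


def char_distribution(text, casesensitive=False, letters_only=False):
    """Return (char, num) tuples sorted by num in descending order."""
    if not casesensitive:
        text = text.upper()
    if letters_only:
        # Assumption: a "letter" is any character where str.isalpha() is True.
        text = "".join(ch for ch in text if ch.isalpha())
    # most_common() with no argument returns every item, highest count first.
    return Counter(text).most_common()


print(char_distribution("aAb!", letters_only=True))
```

With letters_only=False, spaces and punctuation are counted too.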

plot_common_words(self, minlen=1, maxlen=100, count=10, casesensitive=False)

Plots most common words.

Keyword arguments:

  • minlen (int) -- Minimum length of words to include.
  • maxlen (int) -- Maximum length of words to include.
  • count (int) -- Number of words to include.
  • casesensitive (bool) -- If False makes all words uppercase.

plot_char_distribution(self, casesensitive=False, letters_only=False)

Plots character distribution.

Keyword arguments:

  • casesensitive (bool) -- If False makes all words uppercase.
  • letters_only (bool) -- Exclude non-letters?


In addition, the class must include these properties:


The average word length in _content rounded to the nearest hundredth (e.g., 3.82).


The number of words in _content.


The number of distinct words in _content. This should not be case sensitive: "You" and "you" should be considered the same word.


A list of all words used in _content, including repeats, in all uppercase letters.
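The four properties described so far could be written with the @property decorator, roughly as follows. WordStats is a toy class for illustration only. The names words and word_count appear in the positivity steps of this project; distinct_word_count and avg_word_length are hypothetical names chosen here.

```python
import string


class WordStats:
    """Toy class illustrating read-only properties computed from _content."""

    def __init__(self, text):
        self._content = text

    @property
    def words(self):
        """All words, including repeats, in uppercase."""
        tokens = self._content.upper().split()
        return [t.strip(string.punctuation) for t in tokens]

    @property
    def word_count(self):
        """Number of words."""
        return len(self.words)

    @property
    def distinct_word_count(self):
        """Number of distinct words; case-insensitive because words is uppercase."""
        return len(set(self.words))

    @property
    def avg_word_length(self):
        """Average word length rounded to two decimal places."""
        total = sum(len(w) for w in self.words)
        return round(total / self.word_count, 2)


stats = WordStats("You and you and me.")
print(stats.word_count, stats.distinct_word_count, stats.avg_word_length)
```

This sketch does not guard against empty content, where word_count is 0 and the average would raise ZeroDivisionError.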


A positivity score calculated as follows:

  1. Create a local tally variable with an initial value of 0.
  2. Increment tally by 1 for every word in self.words found in positive.txt (in the same directory).
  3. Decrement tally by 1 for every word in self.words found in negative.txt (in the same directory).
  4. Calculate score as follows:
    round( tally / self.word_count * 1000)
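The four steps above can be sketched as a plain function. It assumes the word lists have already been read into sets. The load_word_set helper shows one way to do that, assuming each file holds one word per line, which the project description does not confirm. Both function names are hypothetical.

```python
def load_word_set(path):
    """Read a word list into an uppercase set (assumes one word per line)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().upper() for line in f if line.strip()}


def positivity(words, positive_words, negative_words):
    """Return the positivity score for a list of words."""
    tally = 0
    for word in words:
        if word in positive_words:
            tally += 1
        if word in negative_words:
            tally -= 1
    # Same formula as step 4, with len(words) standing in for self.word_count.
    return round(tally / len(words) * 1000)


print(positivity(["GOOD", "BAD", "GOOD", "DAY"], {"GOOD"}, {"BAD"}))
```

Because self.words is uppercase, the words loaded from the files are converted to uppercase here so that the membership tests can match.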

Testing Your Class

When you have finished, you should run the tests included in the Analyzing Text IPython notebook. If you get errors, you should do your very best to fix those errors before submitting the project.

If you submit your project while still getting errors, you should explain that in your project submission email. The very first thing we will do to grade your project is run it through these tests. If it fails any of the tests, and you have not indicated that you are aware of specific test failures, we will stop grading and ask you to resubmit.

You should also run the last cell in the Analyzing Text IPython notebook to make sure your plot methods work. They should produce plots that look like the following images:

[Image: Common Words]

[Image: Character Distribution]

Finally, you should run help(TextAnalyzer) to make sure the help documentation for the class is useful and thorough.
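The output of help() is built from docstrings, so each method needs one. A minimal sketch of a docstring that follows the "Keyword arguments" layout used in this document is shown below; the Example class is a placeholder and its method body is deliberately omitted.

```python
import inspect


class Example:
    """Toy class used only to illustrate docstring layout."""

    def set_content_to_tag(self, tag, tag_id=None):
        """Changes _content to the text within a specific element of an HTML document.

        Keyword arguments:
        tag (str) -- Tag to read
        tag_id (str) -- ID of tag to read
        """
        raise NotImplementedError  # body omitted; only the docstring matters here


# inspect.getdoc() returns the cleaned-up docstring text that help() displays.
doc = inspect.getdoc(Example.set_content_to_tag)
print(doc)
```

Running help(Example) in the notebook would show the class docstring followed by each method's signature and docstring.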

Project Rules

This project is meant for you to use your own skills and knowledge. This means that we expect the work to be your own work. We also expect that you will want to look some stuff up. Please feel free to use your course manual and the course content to help you along. You may also use the Internet as a source of help, especially for looking up documentation and errors.

Note that the instructor is not a resource during this project. The purpose of the project is to evaluate how well you can do without access to the instructor.


The Python exam for this course is based on this project. You must:

  1. Complete the project before taking the exam.
  2. Complete the exam before submitting the project.

Once you have created the project and used your solution to answer the exam questions, you should then submit the project.

Submitting Project

  • After you have completed all of the requirements of the project and completed the exam for this course, email the Analyzing Text.ipynb file to If you haven't received confirmation of receipt within one business day, please let us know.
  • If you are having trouble sending the project attachment via email we suggest using a file sharing service such as or Google Drive.
  • Be sure to indicate in your email if any of the tests failed or if there are any other issues you had making the project work.
  • Please allow 10 business days for us to review the exam submission.
  • You must complete the project before the expiration date of your course. We estimate it will take about 20 hours. Be sure to leave yourself enough time.

Project Grading

Your project will be graded as follows:

  • 5 points for each unittest that passes (possible total: 55 points).
  • 5 points for each generated plot (possible total: 10 points).
  • 3 points for correct code showing how you got the answer to each exam question (possible total: 30 points).
  • 5 points for proper documentation (possible total: 5 points).

Author: Nat Dunn

Nat Dunn is the founder of Webucator, a company that has provided training for tens of thousands of students from thousands of organizations. Nat started the company in 2003 to combine his passion for technical training with his business expertise, and to help companies benefit from both. His previous experience was in sales, business and technical training, and management. Nat has an MBA from Harvard Business School and a BA in International Relations from Pomona College.