Python Project: TextAnalyzer Class

In Brief...

In this project, you will create a TextAnalyzer class. The methods of the class are described below. You will do your work in the Analyzing Text IPython notebook included in the project in your Python class files. Be sure to comment your code well.


Create a TextAnalyzer class with the following methods:


TextAnalyzer objects are instantiated by passing in one of the following to the src parameter:

  • A valid URL beginning with "http"
  • A path to a text file ending with the file extension "txt"
  • A string of text

The __init__() method also includes a src_type parameter, which is used to specify the type of the src argument. Options are:

  • discover (default) - You must write code to discover the type of src.
    • If the src begins with "http", it is a url.
    • If the src ends in "txt", it is a path.
    • Otherwise, it is text.
  • url
  • path
  • text

You should set self._src_type, self._content, and self._orig_content in the __init__() method.
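The discovery rules and the three attributes above might be sketched as follows. This is a minimal illustration, not the required implementation: the class name TextAnalyzerSketch, the helper _discover(), the string default "discover", and the use of urllib.request with UTF-8 decoding are all assumptions.

```python
import urllib.request


class TextAnalyzerSketch:
    """Hypothetical stand-in showing only the __init__() logic."""

    def __init__(self, src, src_type="discover"):
        # Work out the type of src when the caller did not specify it.
        if src_type == "discover":
            src_type = self._discover(src)
        self._src_type = src_type

        if src_type == "url":
            # Assumption: the page is UTF-8 encoded.
            with urllib.request.urlopen(src) as response:
                self._orig_content = response.read().decode("utf-8")
        elif src_type == "path":
            with open(src, encoding="utf-8") as f:
                self._orig_content = f.read()
        elif src_type == "text":
            self._orig_content = src
        else:
            raise ValueError(f"Unknown src_type: {src_type!r}")

        # _content starts out as the full original content.
        self._content = self._orig_content

    @staticmethod
    def _discover(src):
        """Apply the three discovery rules in order."""
        if src.startswith("http"):
            return "url"
        if src.endswith("txt"):
            return "path"
        return "text"
```

Keeping a separate _orig_content means _content can later be narrowed and then restored without reading the source again.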

set_content_to_tag(self, tag, tag_id=None)

Changes _content to the text within a specific element of an HTML document.

Keyword arguments:

  • tag (str) -- Tag to read
  • tag_id (str) -- ID of tag to read

The HTML may not contain the tag being searched for. You should use exception handling to catch any resulting errors.
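One way to approach this, sketched with only the standard library. The project does not say which HTML parser to use, so html.parser is an assumption here, as are the names _TagTextExtractor, text_in_tag(), and content_for_tag(), and the choice to keep the original content when the tag is missing.

```python
from html.parser import HTMLParser


class _TagTextExtractor(HTMLParser):
    """Collects the text inside the first element matching tag (and id)."""

    def __init__(self, tag, tag_id=None):
        super().__init__()
        self.tag = tag.lower()
        self.tag_id = tag_id
        self.depth = 0        # > 0 while inside the matched element
        self.found = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested elements with the same tag name.
            if tag == self.tag:
                self.depth += 1
        elif not self.found and tag == self.tag:
            if self.tag_id is None or dict(attrs).get("id") == self.tag_id:
                self.found = True
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def text_in_tag(html, tag, tag_id=None):
    """Return the text of the first matching element or raise ValueError."""
    parser = _TagTextExtractor(tag, tag_id)
    parser.feed(html)
    if not parser.found:
        raise ValueError(f"No <{tag}> element with id={tag_id!r} found")
    return "".join(parser.parts)


def content_for_tag(html, tag, tag_id=None):
    """Return the narrowed text, or the unchanged html if the tag is missing."""
    try:
        return text_in_tag(html, tag, tag_id)
    except ValueError as error:
        print(f"Content not changed: {error}")
        return html
```

Inside the class, the same try/except would wrap the assignment to self._content so that a missing tag leaves the content untouched.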


Resets _content to the full original text (_orig_content). Useful after a call to set_content_to_tag().

_words(self, casesensitive=False)

Returns words in _content as list.

Keyword arguments:

  • casesensitive (bool) -- If False, makes all words uppercase.


  1. Before splitting the string into words, strip any leading and trailing punctuation using:
    text.strip(string.whitespace + string.punctuation)
  2. You can use this regular expression pattern to split text into words that only include letters:
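A rough sketch of how the two hints might fit together. The pattern [^A-Za-z]+ used below is an assumption chosen for illustration and is not necessarily the pattern the project intends. The function name split_words() is also hypothetical, and the sketch assumes that the default (casesensitive=False) yields uppercase words, consistent with the all-uppercase word list described under the class properties.

```python
import re
import string


def split_words(text, casesensitive=False):
    """Return the words in text as a list (hypothetical helper)."""
    # Hint 1: strip leading and trailing whitespace and punctuation.
    stripped = text.strip(string.whitespace + string.punctuation)

    # Hint 2: split on runs of non-letter characters (assumed pattern).
    # Empty strings are dropped in case the split leaves any behind.
    words = [w for w in re.split(r"[^A-Za-z]+", stripped) if w]

    if not casesensitive:
        words = [w.upper() for w in words]
    return words
```

Note that a pattern like this splits contractions, so "It's" becomes two words; whether that is acceptable depends on the notebook's tests.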

common_words(self, minlen=1, maxlen=100, count=10, casesensitive=False)

Returns a list of 2-element tuples of the format (word, num), where num is the number of times word shows up in _content.

Keyword arguments:

  • minlen (int) -- Minimum length of words to include.
  • maxlen (int) -- Maximum length of words to include.
  • count (int) -- Number of words to include.
  • casesensitive (bool) -- If False, makes all words uppercase.
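The counting itself could be done with collections.Counter, as in this minimal sketch. It is written as a standalone function that takes an already-split word list, which is an assumption made to keep the example self-contained; in the class, the list would come from _words().

```python
from collections import Counter


def common_words(words, minlen=1, maxlen=100, count=10):
    """Return the count most common (word, num) tuples."""
    # Keep only the words whose length falls within the requested range.
    kept = [w for w in words if minlen <= len(w) <= maxlen]
    # most_common() returns (word, num) tuples, highest count first.
    return Counter(kept).most_common(count)
```

For example, common_words(["A", "THE", "THE", "CAT", "THE", "CAT"], minlen=2, count=2) returns [("THE", 3), ("CAT", 2)].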

char_distribution(self, casesensitive=False, letters_only=False)

Returns a list of 2-element tuples of the format (char, num), where num is the number of times char shows up in _content.

Keyword arguments:

  • casesensitive (bool) -- Consider case?
  • letters_only (bool) -- Exclude non-letters?
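A possible sketch of the character count, again as a standalone function for illustration. Converting to uppercase when casesensitive is False and using str.isalpha() to decide what counts as a letter are both assumptions; the project does not spell out either detail.

```python
from collections import Counter


def char_distribution(text, casesensitive=False, letters_only=False):
    """Return (char, num) tuples, most frequent character first."""
    if not casesensitive:
        # Assumption: ignoring case means folding everything to uppercase.
        text = text.upper()
    if letters_only:
        text = "".join(ch for ch in text if ch.isalpha())
    return Counter(text).most_common()
```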

plot_common_words(self, minlen=1, maxlen=100, count=10, casesensitive=False)

Plots most common words.

Keyword arguments:

  • minlen (int) -- Minimum length of words to include.
  • maxlen (int) -- Maximum length of words to include.
  • count (int) -- Number of words to include.
  • casesensitive (bool) -- If False, makes all words uppercase.
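Both plot methods can reuse the (label, num) tuples returned by the corresponding counting method. The sketch below assumes matplotlib is the plotting library, which the project description does not state, and the names split_pairs() and plot_pairs() are hypothetical.

```python
def split_pairs(pairs):
    """Turn [(label, num), ...] into separate label and value lists."""
    labels = [label for label, _ in pairs]
    values = [num for _, num in pairs]
    return labels, values


def plot_pairs(pairs, title, xlabel):
    """Draw a bar chart of (label, num) tuples (assumes matplotlib)."""
    # Imported here so split_pairs() stays usable without matplotlib.
    import matplotlib.pyplot as plt

    labels, values = split_pairs(pairs)
    fig, ax = plt.subplots()
    ax.bar(labels, values)
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel("Count")
    plt.show()
```

A call such as plot_pairs([("THE", 3), ("CAT", 2)], "Common Words", "Word") would draw one bar per word.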

plot_char_distribution(self, casesensitive=False, letters_only=False)

Plots character distribution.

Keyword arguments:

  • casesensitive (bool) -- Consider case?
  • letters_only (bool) -- Exclude non-letters?


In addition, the class must include these properties:


The average word length in _content, rounded to the hundredths place (e.g., 3.82).


The number of words in _content (self.word_count).


The number of distinct words in _content.


A list of all words used in _content, including repeats, in all uppercase letters (self.words).
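These values lend themselves to read-only properties built on the word list. In the sketch below, words and word_count match the names used in the positivity formula, while the class name WordStatsSketch and the property names avg_word_length and distinct_word_count are assumptions for illustration only. The constructor takes a ready-made word list to keep the example short.

```python
class WordStatsSketch:
    """Hypothetical class showing only the property logic."""

    def __init__(self, words):
        self._word_list = list(words)

    @property
    def words(self):
        # All words, including repeats, in uppercase.
        return [w.upper() for w in self._word_list]

    @property
    def word_count(self):
        return len(self.words)

    @property
    def distinct_word_count(self):
        # A set drops the repeats.
        return len(set(self.words))

    @property
    def avg_word_length(self):
        total = sum(len(w) for w in self.words)
        # Round to two decimal places (the hundredths place).
        return round(total / self.word_count, 2)
```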


A positivity score calculated as follows:

  1. Create a local tally variable with an initial value of 0.
  2. Increment tally by 1 for every word in self.words found in positivity.txt (in the same directory).
  3. Decrement tally by 1 for every word in self.words found in negativity.txt (in the same directory).
  4. Calculate the score as follows:
    round(tally / self.word_count * 1000)
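The four steps above can be sketched as a standalone function. The layout of positivity.txt and negativity.txt is not described here, so load_word_set() assumes one word per line and uppercases each entry to match self.words; both function names are hypothetical.

```python
def load_word_set(path):
    """Read a word list file into a set (assumes one word per line)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().upper() for line in f if line.strip()}


def positivity_score(words, positive, negative):
    """Apply the tally steps to a list of uppercase words."""
    tally = 0
    for word in words:
        if word in positive:
            tally += 1
        if word in negative:
            tally -= 1
    # Same formula as step 4, with len(words) standing in for word_count.
    return round(tally / len(words) * 1000)
```

With words ["GOOD", "BAD", "GREAT", "DAY"], positive words {"GOOD", "GREAT"}, and negative words {"BAD"}, the tally is 1 and the score is round(1 / 4 * 1000), or 250. Loading each file into a set once keeps every membership check fast; an empty word list would need separate handling to avoid dividing by zero.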

Testing Your Class

When you have finished, you should run the tests included in the Analyzing Text IPython notebook. If you get errors, you should do your very best to fix those errors before submitting the project.

If you submit your project while still getting errors, you should explain that in your project submission email. The very first thing we will do to grade your project is run it through these tests. If it fails any of the tests, and you have not indicated that you are aware of specific test failures, we will stop grading and ask you to resubmit.

You should also run the last cell in the Analyzing Text IPython notebook to make sure your plot methods work. They should produce two plots: Character Distribution and Common Words.

Finally, you should run help(TextAnalyzer) to make sure the help documentation for the class is useful and thorough.
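help() displays the docstrings of the class and its methods, so each one should describe the behavior and its arguments. The sketch below reuses the "Keyword arguments" layout from this description; the class name DocumentedExample is hypothetical and the method body is deliberately left unimplemented.

```python
class DocumentedExample:
    """Illustrates the docstring layout that help() displays."""

    def common_words(self, minlen=1, maxlen=100, count=10,
                     casesensitive=False):
        """Returns a list of (word, num) tuples for the most common words.

        Keyword arguments:
        minlen (int) -- Minimum length of words to include.
        maxlen (int) -- Maximum length of words to include.
        count (int) -- Number of words to include.
        casesensitive (bool) -- Whether to consider case.
        """
        raise NotImplementedError
```

Running help(DocumentedExample) prints the class docstring followed by each method's signature and docstring.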

Project Rules

This project is meant for you to use your own skills and knowledge. This means that we expect the work to be your own work. We also expect that you will want to look some stuff up. Please feel free to use your course textbook and the course content to help you along. You may also use the Internet as a source of help, especially for looking up documentation and errors.

Note that the instructor is not a resource during this project. The purpose of the project is to evaluate how well you can do without access to the instructor.


The Python exam for this course is based on this project. You must:

  1. Complete the project before taking the exam.
  2. Complete the exam before submitting the project.

Once you have created the project and used your solution to answer the exam questions, you should then submit the project.

Submitting Project

  • After you have completed all of the requirements of the project and completed the exam for this course, email the Analyzing Text.ipynb file to
  • If you are having trouble sending the project attachment via email, we suggest using a file sharing service such as Google Drive.
  • Be sure to indicate in your email if any of the tests failed or if there are any other issues you had making the project work.
  • Please allow 10 business days for us to review the exam submission.
  • You must complete the project before the expiration date of your course. We estimate it will take about 20 hours. Be sure to leave yourself enough time.

Project Grading

Your project will be graded as follows:

  • 5 points for each unittest that passes (possible total: 55 points).
  • 5 points for each generated plot (possible total: 10 points).
  • 3 points for correct code showing how you got the answer to each exam question (possible total: 30 points).
  • 5 points for proper documentation (possible total: 5 points).

Author: Nat Dunn