Webucator Blog

Python Project for Job Application

You have an opening for a Python developer in your company. Your company is awesome, so you expect to get a lot of applications. You only want to interview developers who have some experience writing object-oriented Python code. You don’t need to test the breadth of their immediate knowledge as much as their ability to get a project done using all the resources available to them.

Here’s an idea for a project you can give the candidates to demonstrate their proficiency with Python.

Associated Files

The attached archive includes the following text files:

  1. pride-and-prejudice.txt (for testing)
  2. positive.txt – list of positive opinion words
  3. negative.txt – list of negative opinion words
  4. opinion-lexicon.txt – explanation of source of positive.txt and negative.txt

TextAnalyzer Class

Your job is to create a TextAnalyzer class with the following methods. Be sure to document your code using docstring conventions.

__init__()

TextAnalyzer objects are instantiated by passing in one of the following to the src parameter:

  • A valid URL beginning with “http”
  • A path to a text file ending with the file extension “txt”
  • A string of text

The __init__() method also includes a src_type parameter, which is used to specify the type of the src argument. Options are:

  • discover (default) – You must write code to discover the type of src. While you can make this as thorough as you like, you can also just use these simple rules:
    • If the src begins with “http”, it is a url.
    • If the src ends in “txt”, it is a path.
    • Otherwise, it is text.
  • url
  • path
  • text

You should set self._src_type, self._content, and self._orig_content in the __init__() method.
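The simple discovery rules above could be sketched as a small helper. The function name discover_src_type is hypothetical and only illustrates the three rules; a real __init__() would also need to load the content for each type.

```python
def discover_src_type(src):
    """Guess whether src is a URL, a file path, or raw text.

    Hypothetical helper: applies only the three simple rules from the spec.
    """
    if src.startswith('http'):
        return 'url'
    if src.endswith('txt'):
        return 'path'
    return 'text'


print(discover_src_type('http://example.com/speech'))  # url
print(discover_src_type('pride-and-prejudice.txt'))    # path
print(discover_src_type('Just a plain sentence.'))     # text
```

The order of the checks matters: a URL that happens to end in "txt" is still treated as a URL because the "http" rule is tested first.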

set_content_to_tag(self, tag, tag_id=None)

Changes _content to the text within a specific element of an HTML document.

Keyword arguments:

  • tag (str) — Tag to read
  • tag_id (str) — ID of tag to read

It’s possible the HTML does not contain the tag being searched. You should use exception handling to catch any errors.
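One possible way to approach this, sketched with only the standard library's html.parser module. The names TagTextExtractor and text_in_tag are hypothetical, and many candidates would reach for a third-party HTML library instead; the exact whitespace in the extracted text may differ between approaches.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text inside the first element matching a tag and optional id."""

    def __init__(self, tag, tag_id=None):
        super().__init__()
        self.tag = tag.lower()   # HTMLParser reports tag names in lowercase
        self.tag_id = tag_id
        self.depth = 0           # nesting depth of same-named tags inside the match
        self.found = False       # True once a matching element has been entered
        self.done = False        # True once that element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag != self.tag:
            return
        if self.depth:
            self.depth += 1      # nested tag with the same name
        elif self.tag_id is None or dict(attrs).get('id') == self.tag_id:
            self.depth = 1       # entered the element we want
            self.found = True

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_in_tag(html, tag, tag_id=None):
    """Return the text inside the matching element; raise ValueError if absent."""
    parser = TagTextExtractor(tag, tag_id)
    parser.feed(html)
    parser.close()
    if not parser.found:
        raise ValueError('no <%s> element with id %r' % (tag, tag_id))
    return ''.join(parser.chunks)


sample_html = '<div id="nav">Menu</div><div id="content-main"><p>Hello</p></div>'
try:
    content = text_in_tag(sample_html, 'div', 'no-such-id')
except ValueError:
    content = sample_html  # tag not found: leave the content unchanged
```

Raising a specific exception in the helper and catching it in the caller keeps the "tag not found" case explicit instead of silently setting the content to an empty string.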

reset_content(self)

Resets _content to the full original text (_orig_content). Useful after a call to set_content_to_tag().

_words(self, casesensitive=False)

Returns the words in _content as a list.

Keyword arguments:

  • casesensitive (bool) – If False (the default), all words are converted to uppercase

Hints

  1. Before splitting the string into words, strip any leading and trailing punctuation using:
    text.strip(string.whitespace + string.punctuation)
  2. You can use this regular expression pattern to split text into words that only include letters:
    [_\W\d]+
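Put together, the two hints might look like the following sketch. The standalone function split_words is a hypothetical stand-in for the _words() method, and the filter for empty strings is an added safeguard rather than part of the hints.

```python
import re
import string


def split_words(text, casesensitive=False):
    """Split text into words made up of letters only."""
    # Trim punctuation and whitespace from both ends so the split
    # does not produce empty strings at the edges.
    text = text.strip(string.whitespace + string.punctuation)
    if not casesensitive:
        text = text.upper()
    # Any run of underscores, non-word characters, or digits is a separator.
    words = re.split(r'[_\W\d]+', text)
    # Guard against empty strings (e.g. empty input or a leading digit).
    return [w for w in words if w]


print(split_words("The outlook wasn't brilliant for the Mudville Nine that day;"))
```

Note that this pattern treats the apostrophe as a separator, so "wasn't" becomes the two words WASN and T.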

common_words(self, minlen=1, maxlen=100, count=10, casesensitive=False)

Returns a list of 2-element tuples of the format (word, num), where num is the number of times word shows up in _content.

Keyword arguments:

  • minlen (int) – Minimum length of words to include.
  • maxlen (int) – Maximum length of words to include.
  • count (int) – Number of words to include.
  • casesensitive (bool) – If False (the default), all words are converted to uppercase
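A minimal sketch of the counting step, assuming the words have already been split and case-normalized. The standalone function below is hypothetical; it uses collections.Counter from the standard library, which is one convenient option among several.

```python
from collections import Counter


def common_words(words, minlen=1, maxlen=100, count=10):
    """Return the `count` most frequent words as (word, num) tuples."""
    # Keep only words whose length falls within [minlen, maxlen].
    filtered = (w for w in words if minlen <= len(w) <= maxlen)
    # most_common() sorts by frequency, highest first.
    return Counter(filtered).most_common(count)


sample = ['A', 'BB', 'BB', 'CCC', 'CCC', 'CCC']
print(common_words(sample, minlen=2, count=2))  # [('CCC', 3), ('BB', 2)]
```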

char_distribution(self, casesensitive=False, letters_only=False)

Returns a list of 2-element tuples of the format (char, num), where num is the number of times char shows up in _content.

Keyword arguments:

  • casesensitive (bool) — Consider case?
  • letters_only (bool) — Exclude non-letters?
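The same counting idea applies to characters. This sketch is a hypothetical standalone version of char_distribution(); it assumes the result should be ordered from most to least frequent, which the spec does not state explicitly.

```python
from collections import Counter


def char_distribution(text, casesensitive=False, letters_only=False):
    """Return (char, num) tuples ordered from most to least frequent."""
    if not casesensitive:
        text = text.upper()
    # str.isalpha() keeps letters and drops digits, spaces, and punctuation.
    chars = (c for c in text if c.isalpha()) if letters_only else text
    return Counter(chars).most_common()


print(char_distribution('Hello, World', letters_only=True)[:2])  # [('L', 3), ('O', 2)]
```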

plot_common_words(self, minlen=1, maxlen=100, count=10, casesensitive=False)

Plots most common words

Keyword arguments:

  • minlen (int) — Minimum length of words to include.
  • maxlen (int) — Maximum length of words to include.
  • count (int) — Number of words to include.
  • casesensitive (bool) – If False (the default), all words are converted to uppercase

plot_char_distribution(self, casesensitive=False, letters_only=False)

Plots character distribution

Keyword arguments:

  • casesensitive (bool) – Consider case?
  • letters_only (bool) — Exclude non-letters?
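Both plot methods can share one routine that turns a list of (label, count) tuples into a bar chart. The sketch below assumes matplotlib is the plotting library, which the spec does not require, and the names bar_chart_data and plot_pairs are hypothetical.

```python
def bar_chart_data(pairs):
    """Split [(label, count), ...] into parallel lists for plotting."""
    labels = [label for label, _ in pairs]
    counts = [count for _, count in pairs]
    return labels, counts


def plot_pairs(pairs, title):
    """Draw a bar chart of (label, count) pairs."""
    # Imported here so the rest of the module works without matplotlib.
    import matplotlib.pyplot as plt

    labels, counts = bar_chart_data(pairs)
    fig, ax = plt.subplots()
    ax.bar(labels, counts)
    ax.set_title(title)
    ax.set_ylabel('Count')
    # Rotate the x labels so longer words do not overlap.
    ax.tick_params(axis='x', labelrotation=90)
    plt.show()
```

With a helper like this, plot_common_words() and plot_char_distribution() would only need to call common_words() or char_distribution() and pass the result along with a title.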

Properties

In addition, the class must include these properties:

avg_word_length(self)

The average word length in _content, rounded to the hundredths place (e.g., 3.82)

word_count(self)

The number of words in _content

distinct_word_count(self)

The number of distinct words in _content

words(self)

A list of all words used in _content, including repeats, in all uppercase letters.
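These members are read as attributes (for example, ta.word_count with no parentheses), which in Python means using the @property decorator. The class below is a hypothetical stand-in that takes a ready-made word list, just to show the pattern.

```python
class WordStats:
    """Hypothetical stand-in for the word-related properties of TextAnalyzer."""

    def __init__(self, words):
        self._word_list = list(words)

    @property
    def word_count(self):
        """The number of words, including repeats."""
        return len(self._word_list)

    @property
    def distinct_word_count(self):
        """The number of distinct words."""
        return len(set(self._word_list))

    @property
    def avg_word_length(self):
        """The average word length, rounded to two decimal places."""
        if not self._word_list:
            return 0.0  # avoid dividing by zero on empty content
        total = sum(len(w) for w in self._word_list)
        return round(total / len(self._word_list), 2)


stats = WordStats(['THE', 'CAT', 'AND', 'THE', 'HAT'])
print(stats.word_count, stats.distinct_word_count, stats.avg_word_length)  # 5 4 3.0
```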

positivity(self)

A positivity score calculated as follows:

  1. Create local tally variable with initial value of 0.
  2. Increment tally by 1 for every word in self.words found in positive.txt (in same directory)
  3. Decrement tally by 1 for every word in self.words found in negative.txt (in same directory)
  4. Calculate score as follows:
    round( tally / self.word_count * 1000)
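The scoring steps above can be sketched as a standalone function. The name positivity_score is hypothetical, and the lexicons are passed in as plain collections rather than read from the text files. Uppercasing the lexicon entries is an assumption made here so that they can match the uppercase words in self.words.

```python
def positivity_score(words, positive_words, negative_words):
    """Score text as net positive words per 1,000 words."""
    if not words:
        return 0  # avoid dividing by zero on empty content
    # Assumption: normalize the lexicons to uppercase to match the word list.
    positive = {w.upper() for w in positive_words}
    negative = {w.upper() for w in negative_words}
    tally = 0
    for word in words:
        if word in positive:
            tally += 1
        if word in negative:
            tally -= 1
    return round(tally / len(words) * 1000)


print(positivity_score(['GOOD', 'BAD', 'BAD', 'DAY'], ['good'], ['bad']))  # -250
```

Using sets for the lexicons keeps each membership check fast, which matters when scoring a full novel against lists with thousands of entries.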

Testing Your Class

When you have finished, you should run the unit tests below. If you get errors, you should do your very best to fix those errors before submitting the project.

import unittest

url = 'http://www.inaugural.senate.gov/swearing-in/address/address-by-bill-clinton-1997'
path = 'pride-and-prejudice.txt'
text = '''The outlook wasn't brilliant for the Mudville Nine that day;
the score stood four to two, with but one inning more to play.
And then when Cooney died at first, and Barrows did the same,
a sickly silence fell upon the patrons of the game.'''

class TestTextAnalyzer(unittest.TestCase):
    def test_discover_url(self):
        ta = TextAnalyzer(url)
        self.assertEqual(ta._src_type, 'url')
    def test_discover_path(self):
        ta = TextAnalyzer(path)
        self.assertEqual(ta._src_type, 'path')
    def test_discover_text(self):
        ta = TextAnalyzer(text)
        self.assertEqual(ta._src_type, 'text')
    def test_set_content_to_tag(self):
        ta = TextAnalyzer(url)
        ta.set_content_to_tag('div','content-main')
        self.assertEqual(ta._content[0:25], '\n\nAddress by Bill Clinton')
    def test_reset_content(self):
        ta = TextAnalyzer(url)
        ta.set_content_to_tag('div','content-main')
        ta.reset_content()
        self.assertEqual(ta._content[0], '<')
    def test_common_words(self):
        ta = TextAnalyzer(path, src_type='path')
        common_words = ta.common_words(minlen=4, maxlen=10)
        liz = common_words[3]
        self.assertEqual(liz[0],'ELIZABETH')
    def test_avg_word_length(self):
        ta = TextAnalyzer(text, src_type='text')
        self.assertEqual(ta.avg_word_length, 4.04)
    def test_word_count(self):
        ta = TextAnalyzer(text, src_type='text')
        self.assertEqual(ta.word_count, 46)
    def test_distinct_word_count(self):
        ta = TextAnalyzer(text, src_type='text')
        self.assertEqual(ta.distinct_word_count, 39)
    def test_char_distribution(self):
        ta = TextAnalyzer(text, src_type='text')
        char_dist = ta.char_distribution(letters_only=True)
        self.assertEqual(char_dist[1][1], 20)
    def test_positivity(self):
        ta = TextAnalyzer(text, src_type='text')
        positivity = ta.positivity
        self.assertEqual(positivity, -43)
        
suite = unittest.TestLoader().loadTestsFromTestCase(TestTextAnalyzer)
unittest.TextTestRunner().run(suite)

You should also run the following code to make sure your plot methods work.

ta = TextAnalyzer('pride-and-prejudice.txt', src_type='path')
ta.plot_common_words(minlen=5)
ta.plot_char_distribution(letters_only=True)

It should produce plots that look like the following images:

[Images: character distribution plot; common words plot]

Finally, you should run help(TextAnalyzer) to make sure the help documentation for the class is useful and thorough.

Project Rules

This project is meant for you to use your own skills and knowledge. This means that we expect the work to be your own. You may use any books and the Python documentation to help you.

Using the Class

Use your class to answer the following questions:

  1. How many words are in the text of William Henry Harrison's 1841 inaugural address? The contents of the address are in a div tag with the id 'content-main'.
  2. What is the least common letter in pride-and-prejudice.txt?
  3. What is the most common 11-letter word in pride-and-prejudice.txt?
  4. What is the average word length in pride-and-prejudice.txt?
  5. How many distinct words are there in pride-and-prejudice.txt?
  6. How many words, ignoring case, are used only once in pride-and-prejudice.txt?
  7. How many words in pride-and-prejudice.txt have fewer than five characters, at least one of which is a capital 'A'?
  8. A palindrome is a word spelled the same forwards and backwards, like BOB. How many distinct palindromes at least three letters long are there in pride-and-prejudice.txt?
  9. What is the positivity rating of 'pride-and-prejudice.txt'?
  10. Which of the following addresses has the lowest positivity rating?
    1. http://www.inaugural.senate.gov/swearing-in/address/address-by-george-w-bush-2001
    2. http://www.inaugural.senate.gov/swearing-in/address/address-by-harry-s-truman-1949
    3. http://www.inaugural.senate.gov/swearing-in/address/address-by-william-mckinley-1901
    4. http://www.inaugural.senate.gov/swearing-in/address/address-by-zachary-taylor-1849

Suggested Project Grading

To bucket your candidates, you could grade projects as follows:

  • 5 points for each unittest that passes (possible total: 55 points)
  • 5 points for each generated plot (possible total: 10 points)
  • 3 points for correct code showing how they got the answer to each exam question (possible total: 30 points)
  • 5 points for proper documentation (possible total: 5 points)

I would love to hear from anyone who uses or plans to use this project when interviewing Python developers.

