{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "dc4e571f",
   "metadata": {},
   "source": [
    "# Assignment 2: Parts-of-Speech Tagging (POS)\n",
    "\n",
    "Welcome to the second assignment of Course 2 in the Natural Language Processing specialization. This assignment will develop skills in part-of-speech (POS) tagging, the process of assigning a part-of-speech tag (Noun, Verb, Adjective...) to each word in an input text.  Tagging is difficult because some words can represent more than one part of speech at different times. They are  **Ambiguous**. Let's look at the following example: \n",
    "\n",
    "- The whole team played **well**. [adverb]\n",
    "- You are doing **well** for yourself. [adjective]\n",
    "- **Well**, this assignment took me forever to complete. [interjection]\n",
    "- The **well** is dry. [noun]\n",
    "- Tears were beginning to **well** in her eyes. [verb]\n",
    "\n",
    "Distinguishing the parts-of-speech of a word in a sentence will help you better understand the meaning of a sentence. This would be critically important in search queries. Identifying the proper noun, the organization, the stock symbol, or anything similar would greatly improve everything ranging from speech recognition to search. By completing this assignment, you will: \n",
    "\n",
    "- Learn how parts-of-speech tagging works\n",
    "- Compute the transition matrix A in a Hidden Markov Model\n",
    "- Compute the emission matrix B in a Hidden Markov Model\n",
    "- Compute the Viterbi algorithm \n",
    "- Compute the accuracy of your own model \n",
    "\n",
    "## Important Note on Submission to the AutoGrader\n",
    "\n",
    "Before submitting your assignment to the AutoGrader, please make sure you are not doing the following:\n",
    "\n",
    "1. You have not added any _extra_ `print` statement(s) in the assignment.\n",
    "2. You have not added any _extra_ code cell(s) in the assignment.\n",
    "3. You have not changed any of the function parameters.\n",
    "4. You are not using any global variables inside your graded exercises. Unless specifically instructed to do so, please refrain from it and use the local variables instead.\n",
    "5. You are not changing the assignment code where it is not required, like creating _extra_ variables.\n",
    "\n",
    "If you do any of the following, you will get something like, `Grader not found` (or similarly unexpected) error upon submitting your assignment. Before asking for help/debugging the errors in your assignment, check for these first. If this is the case, and you don't remember the changes you have made, you can get a fresh copy of the assignment by following these [instructions](https://www.coursera.org/learn/probabilistic-models-in-nlp/supplement/saGQf/how-to-refresh-your-workspace)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a16d60dd",
   "metadata": {},
   "source": [
    "## Outline\n",
    "\n",
    "- [0 Data Sources](#0)\n",
    "- [1 POS Tagging](#1)\n",
    "    - [1.1 Training](#1.1)\n",
    "        - [Exercise 01](#ex-01)\n",
    "    - [1.2 Testing](#1.2)\n",
    "        - [Exercise 02](#ex-02)\n",
    "- [2 Hidden Markov Models](#2)\n",
    "    - [2.1 Generating Matrices](#2.1)\n",
    "        - [Exercise 03](#ex-03)\n",
    "        - [Exercise 04](#ex-04)\n",
    "- [3 Viterbi Algorithm](#3)\n",
    "    - [3.1 Initialization](#3.1)\n",
    "        - [Exercise 05](#ex-05)\n",
    "    - [3.2 Viterbi Forward](#3.2)\n",
    "        - [Exercise 06](#ex-06)\n",
    "    - [3.3 Viterbi Backward](#3.3)\n",
    "        - [Exercise 07](#ex-07)\n",
    "- [4 Predicting on a data set](#4)\n",
    "    - [Exercise 08](#ex-08)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "986f1fd5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Importing packages and loading in the data set \n",
    "from utils_pos import get_word_tag, preprocess  \n",
    "import pandas as pd\n",
    "from collections import defaultdict\n",
    "import math\n",
    "import numpy as np\n",
    "import w2_unittest"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0b27ddd8",
   "metadata": {},
   "source": [
    "<a name='0'></a>\n",
    "## Part 0: Data Sources\n",
    "This assignment will use two tagged data sets collected from the **Wall Street Journal (WSJ)**. \n",
    "\n",
    "[Here](http://relearn.be/2015/training-common-sense/sources/software/pattern-2.6-critical-fork/docs/html/mbsp-tags.html) is an example 'tag-set' or Part of Speech designation describing the two or three letter tag and their meaning. \n",
    "- One data set (**WSJ-2_21.pos**) will be used for **training**.\n",
    "- The other (**WSJ-24.pos**) for **testing**. \n",
    "- The tagged training data has been preprocessed to form a vocabulary (**hmm_vocab.txt**). \n",
    "- The words in the vocabulary are words from the training set that were used two or more times. \n",
    "- The vocabulary is augmented with a set of 'unknown word tokens', described below. \n",
    "\n",
    "The training set will be used to create the emission, transmission and tag counts. \n",
    "\n",
    "The test set (WSJ-24.pos) is read in to create `y`. \n",
    "- This contains both the test text and the true tag. \n",
    "- The test set has also been preprocessed to remove the tags to form **test_words.txt**. \n",
    "- This is read in and further processed to identify the end of sentences and handle words not in the vocabulary using functions provided in **utils_pos.py**. \n",
    "- This forms the list `prep`, the preprocessed text used to test our  POS taggers.\n",
    "\n",
    "A POS tagger will necessarily encounter words that are not in its datasets. \n",
    "- To improve accuracy, these words are further analyzed during preprocessing to extract available hints as to their appropriate tag. \n",
    "- For example, the suffix 'ize' is a hint that the word is a verb, as in 'final-ize' or 'character-ize'. \n",
    "- A set of unknown-tokens, such as '--unk-verb--' or '--unk-noun--' will replace the unknown words in both the training and test corpus and will appear in the emission, transmission and tag data structures.\n",
    "\n",
    "\n",
    "<img src = \"images/DataSources1.PNG\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8153f2e0",
   "metadata": {},
   "source": [
    "Implementation note: \n",
    "\n",
    "- For python 3.6 and beyond, dictionaries retain the insertion order. \n",
    "- Furthermore, their hash-based lookup makes them suitable for rapid membership tests. \n",
    "    - If _di_ is a dictionary, `key in di` will return `True` if _di_ has a key _key_, else `False`. \n",
    "\n",
    "The dictionary `vocab` will utilize these features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "8025b64e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A few items of the training corpus list\n",
      "['In\\tIN\\n', 'an\\tDT\\n', 'Oct.\\tNNP\\n', '19\\tCD\\n', 'review\\tNN\\n']\n"
     ]
    }
   ],
   "source": [
    "# load in the training corpus\n",
    "with open(\"./data/WSJ_02-21.pos\", 'r') as f:\n",
    "    training_corpus = f.readlines()\n",
    "\n",
    "print(f\"A few items of the training corpus list\")\n",
    "print(training_corpus[0:5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "77de7e91",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A few items of the vocabulary list\n",
      "['!', '#', '$', '%', '&', \"'\", \"''\", \"'40s\", \"'60s\", \"'70s\", \"'80s\", \"'86\", \"'90s\", \"'N\", \"'S\", \"'d\", \"'em\", \"'ll\", \"'m\", \"'n'\", \"'re\", \"'s\", \"'til\", \"'ve\", '(', ')', ',', '-', '--', '--n--', '--unk--', '--unk_adj--', '--unk_adv--', '--unk_digit--', '--unk_noun--', '--unk_punct--', '--unk_upper--', '--unk_verb--', '.', '...', '0.01', '0.0108', '0.02', '0.03', '0.05', '0.1', '0.10', '0.12', '0.13', '0.15']\n",
      "\n",
      "A few items at the end of the vocabulary list\n",
      "['yards', 'yardstick', 'year', 'year-ago', 'year-before', 'year-earlier', 'year-end', 'year-on-year', 'year-round', 'year-to-date', 'year-to-year', 'yearlong', 'yearly', 'years', 'yeast', 'yelled', 'yelling', 'yellow', 'yen', 'yes', 'yesterday', 'yet', 'yield', 'yielded', 'yielding', 'yields', 'you', 'young', 'younger', 'youngest', 'youngsters', 'your', 'yourself', 'youth', 'youthful', 'yuppie', 'yuppies', 'zero', 'zero-coupon', 'zeroing', 'zeros', 'zinc', 'zip', 'zombie', 'zone', 'zones', 'zoning', '{', '}', '']\n"
     ]
    }
   ],
   "source": [
    "# read the vocabulary data, split by each line of text, and save the list\n",
    "with open(\"./data/hmm_vocab.txt\", 'r') as f:\n",
    "    voc_l = f.read().split('\\n')\n",
    "\n",
    "print(\"A few items of the vocabulary list\")\n",
    "print(voc_l[0:50])\n",
    "print()\n",
    "print(\"A few items at the end of the vocabulary list\")\n",
    "print(voc_l[-50:])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "bf966095",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Vocabulary dictionary, key is the word, value is a unique integer\n",
      ":0\n",
      "!:1\n",
      "#:2\n",
      "$:3\n",
      "%:4\n",
      "&:5\n",
      "':6\n",
      "'':7\n",
      "'40s:8\n",
      "'60s:9\n",
      "'70s:10\n",
      "'80s:11\n",
      "'86:12\n",
      "'90s:13\n",
      "'N:14\n",
      "'S:15\n",
      "'d:16\n",
      "'em:17\n",
      "'ll:18\n",
      "'m:19\n",
      "'n':20\n"
     ]
    }
   ],
   "source": [
    "# vocab: dictionary that has the index of the corresponding words\n",
    "vocab = {}\n",
    "\n",
    "# Get the index of the corresponding words. \n",
    "for i, word in enumerate(sorted(voc_l)): \n",
    "    vocab[word] = i       \n",
    "    \n",
    "print(\"Vocabulary dictionary, key is the word, value is a unique integer\")\n",
    "cnt = 0\n",
    "for k,v in vocab.items():\n",
    "    print(f\"{k}:{v}\")\n",
    "    cnt += 1\n",
    "    if cnt > 20:\n",
    "        break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "05aaad39",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A sample of the test corpus\n",
      "['The\\tDT\\n', 'economy\\tNN\\n', \"'s\\tPOS\\n\", 'temperature\\tNN\\n', 'will\\tMD\\n', 'be\\tVB\\n', 'taken\\tVBN\\n', 'from\\tIN\\n', 'several\\tJJ\\n', 'vantage\\tNN\\n']\n"
     ]
    }
   ],
   "source": [
    "# load in the test corpus\n",
    "with open(\"./data/WSJ_24.pos\", 'r') as f:\n",
    "    y = f.readlines()\n",
    "    \n",
    "print(\"A sample of the test corpus\")\n",
    "print(y[0:10])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "2ba0550c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The length of the preprocessed test corpus:  34199\n",
      "This is a sample of the test_corpus: \n",
      "['The', 'economy', \"'s\", 'temperature', 'will', 'be', 'taken', 'from', 'several', '--unk--']\n"
     ]
    }
   ],
   "source": [
    "#corpus without tags, preprocessed\n",
    "_, prep = preprocess(vocab, \"./data/test.words\")     \n",
    "\n",
    "print('The length of the preprocessed test corpus: ', len(prep))\n",
    "print('This is a sample of the test_corpus: ')\n",
    "print(prep[0:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e2c3a9d4",
   "metadata": {},
   "source": [
    "<a name='1'></a>\n",
    "# Part 1: Parts-of-speech tagging \n",
    "\n",
    "<a name='1.1'></a>\n",
    "## Part 1.1 - Training\n",
    "You will start with the simplest possible parts-of-speech tagger and we will build up to the state of the art. \n",
    "\n",
    "In this section, you will find the words that are not ambiguous. \n",
    "- For example, the word `is` is a verb and it is not ambiguous. \n",
    "- In the `WSJ` corpus, $86$% of the token are unambiguous (meaning they have only one tag) \n",
    "- About $14\\%$ are ambiguous (meaning that they have more than one tag)\n",
    "\n",
    "<img src = \"images/pos.png\" style=\"width:400px;height:250px;\"/>\n",
    "\n",
    "Before you start predicting the tags of each word, you will need to compute a few dictionaries that will help you to generate the tables. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c58dfdcf",
   "metadata": {},
   "source": [
    "#### Transition counts\n",
    "- The first dictionary is the `transition_counts` dictionary which computes the number of times each tag happened next to another tag. \n",
    "\n",
    "This dictionary will be used to compute: \n",
    "$$P(t_i |t_{i-1}) \\tag{1}$$\n",
    "\n",
    "This is the probability of a tag at position $i$ given the tag at position $i-1$.\n",
    "\n",
    "In order for you to compute equation 1, you will create a `transition_counts` dictionary where \n",
    "- The keys are `(prev_tag, tag)`\n",
    "- The values are the number of times those two tags appeared in that order. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27576bac",
   "metadata": {},
   "source": [
    "#### Emission counts\n",
    "\n",
    "The second dictionary you will compute is the `emission_counts` dictionary. This dictionary will be used to compute:\n",
    "\n",
    "$$P(w_i|t_i)\\tag{2}$$\n",
    "\n",
    "In other words, you will use it to compute the probability of a word given its tag. \n",
    "\n",
    "In order for you to compute equation 2, you will create an `emission_counts` dictionary where \n",
    "- The keys are `(tag, word)` \n",
    "- The values are the number of times that pair showed up in your training set. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7014aff6",
   "metadata": {},
   "source": [
    "#### Tag counts\n",
    "\n",
    "The last dictionary you will compute is the `tag_counts` dictionary. \n",
    "- The key is the tag \n",
    "- The value is the number of times each tag appeared."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38d8f7c4",
   "metadata": {},
   "source": [
    "<a name='ex-01'></a>\n",
    "### Exercise 01\n",
    "\n",
    "**Instructions:** Write a program that takes in the `training_corpus` and returns the three dictionaries mentioned above `transition_counts`, `emission_counts`, and `tag_counts`. \n",
    "- `emission_counts`: maps (tag, word) to the number of times it happened. \n",
    "- `transition_counts`: maps (prev_tag, tag) to the number of times it has appeared. \n",
    "- `tag_counts`: maps (tag) to the number of times it has occured. \n",
    "\n",
    "Implementation note: This routine utilises *defaultdict*, which is a subclass of *dict*. \n",
    "- A standard Python dictionary throws a *KeyError* if you try to access an item with a key that is not currently in the dictionary. \n",
    "- In contrast, the *defaultdict* will create an item of the type of the argument, in this case an integer with the default value of 0. \n",
    "- See [defaultdict](https://docs.python.org/3.3/library/collections.html#defaultdict-objects)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "8ae2eb80",
   "metadata": {},
   "outputs": [],
   "source": [
    "# UNQ_C1 GRADED FUNCTION: create_dictionaries\n",
    "def create_dictionaries(training_corpus, vocab, verbose=True):\n",
    "    \"\"\"\n",
    "    Input: \n",
    "        training_corpus: a corpus where each line has a word followed by its tag.\n",
    "        vocab: a dictionary where keys are words in vocabulary and value is an index\n",
    "    Output: \n",
    "        emission_counts: a dictionary where the keys are (tag, word) and the values are the counts\n",
    "        transition_counts: a dictionary where the keys are (prev_tag, tag) and the values are the counts\n",
    "        tag_counts: a dictionary where the keys are the tags and the values are the counts\n",
    "    \"\"\"\n",
    "    \n",
    "    # initialize the dictionaries using defaultdict\n",
    "    emission_counts = defaultdict(int)\n",
    "    transition_counts = defaultdict(int)\n",
    "    tag_counts = defaultdict(int)\n",
    "    \n",
    "    # Initialize \"prev_tag\" (previous tag) with the start state, denoted by '--s--'\n",
    "    prev_tag = '--s--' \n",
    "    \n",
    "    # use 'i' to track the line number in the corpus\n",
    "    i = 0 \n",
    "    \n",
    "    # Each item in the training corpus contains a word and its POS tag\n",
    "    # Go through each word and its tag in the training corpus\n",
    "    for word_tag in training_corpus:\n",
    "        \n",
    "        # Increment the word_tag count\n",
    "        i += 1\n",
    "        \n",
    "        # Every 50,000 words, print the word count\n",
    "        if i % 50000 == 0 and verbose:\n",
    "            print(f\"word count = {i}\")\n",
    "            \n",
    "        ### START CODE HERE ###\n",
    "        # get the word and tag using the get_word_tag helper function (imported from utils_pos.py)\n",
    "        # the function is defined as: get_word_tag(line, vocab)\n",
    "        word, tag = get_word_tag(word_tag, vocab)\n",
    "        \n",
    "        # Increment the transition count for the previous word and tag\n",
    "        transition_counts[(prev_tag, tag)] += 1\n",
    "        \n",
    "        # Increment the emission count for the tag and word\n",
    "        emission_counts[(tag, word)] += 1\n",
    "\n",
    "        # Increment the tag count\n",
    "        tag_counts[tag] += 1\n",
    "\n",
    "        # Set the previous tag to this tag (for the next iteration of the loop)\n",
    "        prev_tag = tag\n",
    "        \n",
    "        ### END CODE HERE ###\n",
    "        \n",
    "    return emission_counts, transition_counts, tag_counts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "740a6ef8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "word count = 50000\n",
      "word count = 100000\n",
      "word count = 150000\n",
      "word count = 200000\n",
      "word count = 250000\n",
      "word count = 300000\n",
      "word count = 350000\n",
      "word count = 400000\n",
      "word count = 450000\n",
      "word count = 500000\n",
      "word count = 550000\n",
      "word count = 600000\n",
      "word count = 650000\n",
      "word count = 700000\n",
      "word count = 750000\n",
      "word count = 800000\n",
      "word count = 850000\n",
      "word count = 900000\n",
      "word count = 950000\n"
     ]
    }
   ],
   "source": [
    "emission_counts, transition_counts, tag_counts = create_dictionaries(training_corpus, vocab)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "91b3f6ec",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of POS tags (number of 'states'): 46\n",
      "View these POS tags (states)\n",
      "['#', '$', \"''\", '(', ')', ',', '--s--', '.', ':', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', '``']\n"
     ]
    }
   ],
   "source": [
    "# get all the POS states\n",
    "states = sorted(tag_counts.keys())\n",
    "print(f\"Number of POS tags (number of 'states'): {len(states)}\")\n",
    "print(\"View these POS tags (states)\")\n",
    "print(states)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1cc8fb5c",
   "metadata": {},
   "source": [
    "##### Expected Output\n",
    "\n",
    "```CPP\n",
    "Number of POS tags (number of 'states'46\n",
    "View these states\n",
    "['#', '$', \"''\", '(', ')', ',', '--s--', '.', ':', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', '``']\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "a2115330",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m All tests passed\n"
     ]
    }
   ],
   "source": [
    "# Test your function\n",
    "w2_unittest.test_create_dictionaries(create_dictionaries, training_corpus, vocab)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3bd1d930",
   "metadata": {},
   "source": [
    "The 'states' are the Parts-of-speech designations found in the training data. They will also be referred to as 'tags' or POS in this assignment. \n",
    "\n",
    "- \"NN\" is noun, singular, \n",
    "- 'NNS' is noun, plural. \n",
    "- In addition, there are helpful tags like '--s--' which indicate a start of a sentence.\n",
    "- You can get a more complete description at [Penn Treebank II tag set](https://www.clips.uantwerpen.be/clips.bak/pages/mbsp-tags). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "a46a5ca5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "transition examples: \n",
      "(('--s--', 'IN'), 5050)\n",
      "(('IN', 'DT'), 32364)\n",
      "(('DT', 'NNP'), 9044)\n",
      "\n",
      "emission examples: \n",
      "(('DT', 'any'), 721)\n",
      "(('NN', 'decrease'), 7)\n",
      "(('NN', 'insider-trading'), 5)\n",
      "\n",
      "ambiguous word example: \n",
      "('RB', 'back') 304\n",
      "('VB', 'back') 20\n",
      "('RP', 'back') 84\n",
      "('JJ', 'back') 25\n",
      "('NN', 'back') 29\n",
      "('VBP', 'back') 4\n"
     ]
    }
   ],
   "source": [
    "print(\"transition examples: \")\n",
    "for ex in list(transition_counts.items())[:3]:\n",
    "    print(ex)\n",
    "print()\n",
    "\n",
    "print(\"emission examples: \")\n",
    "for ex in list(emission_counts.items())[200:203]:\n",
    "    print (ex)\n",
    "print()\n",
    "\n",
    "print(\"ambiguous word example: \")\n",
    "for tup,cnt in emission_counts.items():\n",
    "    if tup[1] == 'back': print (tup, cnt) "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "140f3090",
   "metadata": {},
   "source": [
    "##### Expected Output\n",
    "\n",
    "```CPP\n",
    "transition examples: \n",
    "(('--s--', 'IN'), 5050)\n",
    "(('IN', 'DT'), 32364)\n",
    "(('DT', 'NNP'), 9044)\n",
    "\n",
    "emission examples: \n",
    "(('DT', 'any'), 721)\n",
    "(('NN', 'decrease'), 7)\n",
    "(('NN', 'insider-trading'), 5)\n",
    "\n",
    "ambiguous word example: \n",
    "('RB', 'back') 304\n",
    "('VB', 'back') 20\n",
    "('RP', 'back') 84\n",
    "('JJ', 'back') 25\n",
    "('NN', 'back') 29\n",
    "('VBP', 'back') 4\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3108e68a",
   "metadata": {},
   "source": [
    "<a name='1.2'></a>\n",
    "### Part 1.2 - Testing\n",
    "\n",
    "Now you will test the accuracy of your parts-of-speech tagger using your `emission_counts` dictionary. \n",
    "- Given your preprocessed test corpus `prep`, you will assign a parts-of-speech tag to every word in that corpus. \n",
    "- Using the original tagged test corpus `y`, you will then compute what percent of the tags you got correct. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "859f1ba9",
   "metadata": {},
   "source": [
    "<a name='ex-02'></a>\n",
    "### Exercise 02\n",
    "\n",
    "**Instructions:** Implement `predict_pos` that computes the accuracy of your model. \n",
    "\n",
    "- This is a warm up exercise. \n",
    "- To assign a part of speech to a word, assign the most frequent POS for that word in the training set. \n",
    "- Then evaluate how well this approach works.  Each time you predict based on the most frequent POS for the given word, check whether the actual POS of that word is the same.  If so, the prediction was correct!\n",
    "- Calculate the accuracy as the number of correct predictions divided by the total number of words for which you predicted the POS tag."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "ec089bcb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# UNQ_C2 GRADED FUNCTION: predict_pos\n",
    "def predict_pos(prep, y, emission_counts, vocab, states):\n",
    "    '''\n",
    "    Input: \n",
    "        prep: a preprocessed version of 'y'. A list with the 'word' component of the tuples.\n",
    "        y: a corpus composed of a list of tuples where each tuple consists of (word, POS)\n",
    "        emission_counts: a dictionary where the keys are (tag,word) tuples and the value is the count\n",
    "        vocab: a dictionary where keys are words in vocabulary and value is an index\n",
    "        states: a sorted list of all possible tags for this assignment\n",
    "    Output: \n",
    "        accuracy: Number of times you classified a word correctly\n",
    "    '''\n",
    "    \n",
    "    # Initialize the number of correct predictions to zero\n",
    "    num_correct = 0\n",
    "    \n",
    "    # Get the (tag, word) tuples, stored as a set\n",
    "    all_words = set(emission_counts.keys())\n",
    "    \n",
    "    # Get the number of (word, POS) tuples in the corpus 'y'\n",
    "    total = len(y)\n",
    "    for word, y_tup in zip(prep, y): \n",
    "\n",
    "        # Split the (word, POS) string into a list of two items\n",
    "        y_tup_l = y_tup.split()\n",
    "        \n",
    "        # Verify that y_tup contain both word and POS\n",
    "        if len(y_tup_l) == 2:\n",
    "            \n",
    "            # Set the true POS label for this word\n",
    "            true_label = y_tup_l[1]\n",
    "\n",
    "        else:\n",
    "            # If the y_tup didn't contain word and POS, go to next word\n",
    "            continue\n",
    "    \n",
    "        count_final = 0\n",
    "        pos_final = ''\n",
    "        \n",
    "        # If the word is in the vocabulary...\n",
    "        if word in vocab:\n",
    "            for pos in states:\n",
    "\n",
    "            ### START CODE HERE (Replace instances of 'None' with your code) ###\n",
    "            \n",
    "                # define the key as the tuple containing the POS and word\n",
    "                key = (pos, word)\n",
    "\n",
    "                # check if the (pos, word) key exists in the emission_counts dictionary\n",
    "                if key in emission_counts: # Replace None in this line with the proper condition.\n",
    "\n",
    "                # get the emission count of the (pos,word) tuple \n",
    "                    count = emission_counts[key]\n",
    "\n",
    "                    # keep track of the POS with the largest count\n",
    "                    if count > count_final: # Replace None in this line with the proper condition.\n",
    "\n",
    "                        # update the final count (largest count)\n",
    "                        count_final = count\n",
    "\n",
    "                        # update the final POS\n",
    "                        pos_final = pos\n",
    "\n",
    "            # If the final POS (with the largest count) matches the true POS:\n",
    "            if pos_final == true_label: # Replace None in this line with the proper condition.\n",
    "                # Update the number of correct predictions\n",
    "                num_correct += 1\n",
    "            \n",
    "    ### END CODE HERE ###\n",
    "    accuracy = num_correct / total\n",
    "    \n",
    "    return accuracy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "36c3c317",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy of prediction using predict_pos is 0.8889\n"
     ]
    }
   ],
   "source": [
    "accuracy_predict_pos = predict_pos(prep, y, emission_counts, vocab, states)\n",
    "print(f\"Accuracy of prediction using predict_pos is {accuracy_predict_pos:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c09e3a84",
   "metadata": {},
   "source": [
    "##### Expected Output\n",
    "```CPP\n",
    "Accuracy of prediction using predict_pos is 0.8889\n",
    "```\n",
    "\n",
    "88.9% is really good for this warm up exercise. With hidden markov models, you should be able to get **95% accuracy.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "ce0e95a7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m All tests passed\n"
     ]
    }
   ],
   "source": [
    "# Test your function\n",
    "w2_unittest.test_predict_pos(predict_pos, prep, y, emission_counts, vocab, states)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64f7d58a",
   "metadata": {},
   "source": [
    "<a name='2'></a>\n",
    "# Part 2: Hidden Markov Models for POS\n",
    "\n",
    "Now you will build something more context specific. Concretely, you will be implementing a Hidden Markov Model (HMM) with a Viterbi decoder\n",
    "- The HMM is one of the most commonly used algorithms in Natural Language Processing, and is a foundation to many deep learning techniques you will see in this specialization. \n",
    "- In addition to parts-of-speech tagging, HMM is used in speech recognition, speech synthesis, etc. \n",
    "- By completing this part of the assignment you will get a 95% accuracy on the same dataset you used in Part 1.\n",
    "\n",
    "The Markov Model contains a number of states and the probability of transition between those states. \n",
    "- In this case, the states are the parts-of-speech. \n",
    "- A Markov Model utilizes a transition matrix, `A`. \n",
    "- A Hidden Markov Model adds an observation or emission matrix `B` which describes the probability of a visible observation when we are in a particular state. \n",
    "- In this case, the emissions are the words in the corpus\n",
    "- The state, which is hidden, is the POS tag of that word."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1269eca0",
   "metadata": {},
   "source": [
    "<a name='2.1'></a>\n",
    "## Part 2.1 Generating Matrices\n",
    "\n",
    "### Creating the 'A' transition probabilities matrix\n",
    "Now that you have your `emission_counts`, `transition_counts`, and `tag_counts`, you will start implementing the Hidden Markov Model. \n",
    "\n",
    "This will allow you to quickly construct the \n",
    "- `A` transition probabilities matrix.\n",
    "- and the `B` emission probabilities matrix. \n",
    "\n",
    "You will also use some smoothing when computing these matrices. \n",
    "\n",
    "Here is an example of what the `A` transition matrix would look like (it is simplified to 5 tags for viewing. It is 46x46 in this assignment.):\n",
    "\n",
    "\n",
    "|**A**  |...|         RBS  |          RP  |         SYM  |      TO  |          UH|...\n",
    "| --- ||---:-------------| ------------ | ------------ | -------- | ---------- |----\n",
    "|**RBS**  |...|2.217069e-06  |2.217069e-06  |2.217069e-06  |0.008870  |2.217069e-06|...\n",
    "|**RP**   |...|3.756509e-07  |7.516775e-04  |3.756509e-07  |0.051089  |3.756509e-07|...\n",
    "|**SYM**  |...|1.722772e-05  |1.722772e-05  |1.722772e-05  |0.000017  |1.722772e-05|...\n",
    "|**TO**   |...|4.477336e-05  |4.472863e-08  |4.472863e-08  |0.000090  |4.477336e-05|...\n",
    "|**UH**  |...|1.030439e-05  |1.030439e-05  |1.030439e-05  |0.061837  |3.092348e-02|...\n",
    "| ... |...| ...          | ...          | ...          | ...      | ...        | ...\n",
    "\n",
    "Note that the matrix above was computed with smoothing. \n",
    "\n",
    "Each cell gives you the probability to go from one part of speech to another. \n",
    "- In other words, there is a 4.47e-8 chance of going from parts-of-speech `TO` to `RP`. \n",
    "- The sum of each row has to equal 1, because we assume that the next POS tag must be one of the available columns in the table.\n",
    "\n",
    "The smoothing was done as follows: \n",
    "\n",
    "$$ P(t_i | t_{i-1}) = \\frac{C(t_{i-1}, t_{i}) + \\alpha }{C(t_{i-1}) +\\alpha * N}\\tag{3}$$\n",
    "\n",
    "- $N$ is the total number of tags\n",
    "- $C(t_{i-1}, t_{i})$ is the count of the tuple (previous POS, current POS) in `transition_counts` dictionary.\n",
    "- $C(t_{i-1})$ is the count of the previous POS in the `tag_counts` dictionary.\n",
    "- $\\alpha$ is a smoothing parameter."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36e70c73",
   "metadata": {},
   "source": [
    "<a name='ex-03'></a>\n",
    "### Exercise 03\n",
    "\n",
    "**Instructions:** Implement the `create_transition_matrix` below for all tags. Your task is to output a matrix that computes equation 3 for each cell in matrix `A`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "84cdfdfa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# UNQ_C3 GRADED FUNCTION: create_transition_matrix\n",
    "def create_transition_matrix(alpha, tag_counts, transition_counts):\n",
    "    ''' \n",
    "    Input: \n",
    "        alpha: number used for smoothing\n",
    "        tag_counts: a dictionary mapping each tag to its respective count\n",
    "        transition_counts: a dictionary where the keys are (prev_tag, tag) and the values are the counts\n",
    "    Output:\n",
    "        A: matrix of dimension (num_tags,num_tags)\n",
    "    '''\n",
    "    # Get a sorted list of unique POS tags\n",
    "    all_tags = sorted(tag_counts.keys())\n",
    "    \n",
    "    # Count the number of unique POS tags\n",
    "    num_tags = len(all_tags)\n",
    "    \n",
    "    # Initialize the transition matrix 'A'\n",
    "    A = np.zeros((num_tags,num_tags))\n",
    "    \n",
    "    # Get the unique transition tuples (previous POS, current POS)\n",
    "    trans_keys = set(transition_counts.keys())\n",
    "    \n",
    "    ### START CODE HERE ### \n",
    "    \n",
    "    # Go through each row of the transition matrix A\n",
    "    for i in range(num_tags):\n",
    "        \n",
    "        # Go through each column of the transition matrix A\n",
    "        for j in range(num_tags):\n",
    "\n",
    "            # Initialize the count of the (prev POS, current POS) to zero\n",
    "            count = 0\n",
    "        \n",
    "            # Define the tuple (prev POS, current POS)\n",
    "            # Get the tag at position i and tag at position j (from the all_tags list)\n",
    "            key = (all_tags[i], all_tags[j]) # tuple of form (tag,tag)\n",
    "\n",
    "            # Check if the (prev POS, current POS) tuple \n",
    "            # exists in the transition counts dictionary\n",
    "            if key in trans_keys: # Replace None in this line with the proper condition.\n",
    "                \n",
    "                # Get count from the transition_counts dictionary \n",
    "                # for the (prev POS, current POS) tuple\n",
    "                count = transition_counts[key]                \n",
    "\n",
    "            # Get the count of the previous tag (index position i) from tag_counts\n",
    "            count_prev_tag = tag_counts[all_tags[i]]\n",
    "            \n",
    "            # Apply smoothing using count of the tuple, alpha, \n",
    "            # count of previous tag, alpha, and total number of tags\n",
    "            A[i,j] = (count + alpha)/ (count_prev_tag+num_tags*alpha)\n",
    "\n",
    "    ### END CODE HERE ###\n",
    "    return A"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "303d5281",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A at row 0, col 0: 0.000007040\n",
      "A at row 3, col 1: 0.1691\n",
      "View a subset of transition matrix A\n",
      "              RBS            RP           SYM        TO            UH\n",
      "RBS  2.217069e-06  2.217069e-06  2.217069e-06  0.008870  2.217069e-06\n",
      "RP   3.756509e-07  7.516775e-04  3.756509e-07  0.051089  3.756509e-07\n",
      "SYM  1.722772e-05  1.722772e-05  1.722772e-05  0.000017  1.722772e-05\n",
      "TO   4.477336e-05  4.472863e-08  4.472863e-08  0.000090  4.477336e-05\n",
      "UH   1.030439e-05  1.030439e-05  1.030439e-05  0.061837  3.092348e-02\n"
     ]
    }
   ],
   "source": [
    "alpha = 0.001\n",
    "A = create_transition_matrix(alpha, tag_counts, transition_counts)\n",
    "# Testing your function\n",
    "print(f\"A at row 0, col 0: {A[0,0]:.9f}\")\n",
    "print(f\"A at row 3, col 1: {A[3,1]:.4f}\")\n",
    "\n",
    "print(\"View a subset of transition matrix A\")\n",
    "A_sub = pd.DataFrame(A[30:35,30:35], index=states[30:35], columns = states[30:35] )\n",
    "print(A_sub)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "da7d5899",
   "metadata": {},
   "source": [
    "##### Expected Output\n",
    "```CPP\n",
    "A at row 0, col 0: 0.000007040\n",
    "A at row 3, col 1: 0.1691\n",
    "View a subset of transition matrix A\n",
    "              RBS            RP           SYM        TO            UH\n",
    "RBS  2.217069e-06  2.217069e-06  2.217069e-06  0.008870  2.217069e-06\n",
    "RP   3.756509e-07  7.516775e-04  3.756509e-07  0.051089  3.756509e-07\n",
    "SYM  1.722772e-05  1.722772e-05  1.722772e-05  0.000017  1.722772e-05\n",
    "TO   4.477336e-05  4.472863e-08  4.472863e-08  0.000090  4.477336e-05\n",
    "UH   1.030439e-05  1.030439e-05  1.030439e-05  0.061837  3.092348e-02\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "7b0295b1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m All tests passed\n"
     ]
    }
   ],
   "source": [
    "# Test your function\n",
    "w2_unittest.test_create_transition_matrix(create_transition_matrix, tag_counts, transition_counts)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "099a4a7f",
   "metadata": {},
   "source": [
    "### Create the 'B' emission probabilities matrix\n",
    "\n",
    "Now you will create the `B` transition matrix which computes the emission probability. \n",
    "\n",
    "You will use smoothing as defined below: \n",
    "\n",
    "$$P(w_i | t_i) = \\frac{C(t_i, word_i)+ \\alpha}{C(t_{i}) +\\alpha * N}\\tag{4}$$\n",
    "\n",
    "- $C(t_i, word_i)$ is the number of times $word_i$ was associated with $tag_i$ in the training data (stored in `emission_counts` dictionary).\n",
    "- $C(t_i)$ is the number of times $tag_i$ was in the training data (stored in `tag_counts` dictionary).\n",
    "- $N$ is the number of words in the vocabulary\n",
    "- $\\alpha$ is a smoothing parameter. \n",
    "\n",
    "The matrix `B` is of dimension (num_tags, N), where num_tags is the number of possible parts-of-speech tags. \n",
    "\n",
    "Here is an example of the matrix, only a subset of tags and words are shown: \n",
    "<p style='text-align: center;'> <b>B Emissions Probability Matrix (subset)</b>  </p>\n",
    "\n",
    "|**B**| ...|          725 |     adroitly |    engineers |     promoted |      synergy| ...|\n",
    "|----|----|--------------|--------------|--------------|--------------|-------------|----|\n",
    "|**CD**  | ...| **8.201296e-05** | 2.732854e-08 | 2.732854e-08 | 2.732854e-08 | 2.732854e-08| ...|\n",
    "|**NN**  | ...| 7.521128e-09 | 7.521128e-09 | 7.521128e-09 | 7.521128e-09 | **2.257091e-05**| ...|\n",
    "|**NNS** | ...| 1.670013e-08 | 1.670013e-08 |**4.676203e-04** | 1.670013e-08 | 1.670013e-08| ...|\n",
    "|**VB**  | ...| 3.779036e-08 | 3.779036e-08 | 3.779036e-08 | 3.779036e-08 | 3.779036e-08| ...|\n",
    "|**RB**  | ...| 3.226454e-08 | **6.456135e-05** | 3.226454e-08 | 3.226454e-08 | 3.226454e-08| ...|\n",
    "|**RP**  | ...| 3.723317e-07 | 3.723317e-07 | 3.723317e-07 | **3.723317e-07** | 3.723317e-07| ...|\n",
    "| ...    | ...|     ...      |     ...      |     ...      |     ...      |     ...      | ...|\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5df20ff",
   "metadata": {},
   "source": [
    "<a name='ex-04'></a>\n",
    "### Exercise 04\n",
    "**Instructions:** Implement the `create_emission_matrix` below that computes the `B` emission probabilities matrix. Your function takes in $\\alpha$, the smoothing parameter, `tag_counts`, which is a dictionary mapping each tag to its respective count, the `emission_counts` dictionary where the keys are (tag, word) and the values are the counts. Your task is to output a matrix that computes equation 4 for each cell in matrix `B`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "cf6539ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "# UNQ_C4 GRADED FUNCTION: create_emission_matrix\n",
    "\n",
    "def create_emission_matrix(alpha, tag_counts, emission_counts, vocab):\n",
    "    '''\n",
    "    Input: \n",
    "        alpha: tuning parameter used in smoothing \n",
    "        tag_counts: a dictionary mapping each tag to its respective count\n",
    "        emission_counts: a dictionary where the keys are (tag, word) and the values are the counts\n",
    "        vocab: a dictionary where keys are words in vocabulary and value is an index.\n",
    "               within the function it'll be treated as a list\n",
    "    Output:\n",
    "        B: a matrix of dimension (num_tags, len(vocab))\n",
    "    '''\n",
    "    \n",
    "    # get the number of POS tag\n",
    "    num_tags = len(tag_counts)\n",
    "    \n",
    "    # Get a list of all POS tags\n",
    "    all_tags = sorted(tag_counts.keys())\n",
    "    \n",
    "    # Get the total number of unique words in the vocabulary\n",
    "    num_words = len(vocab)\n",
    "    \n",
    "    # Initialize the emission matrix B with places for\n",
    "    # tags in the rows and words in the columns\n",
    "    B = np.zeros((num_tags, num_words))\n",
    "    \n",
    "    # Get a set of all (POS, word) tuples \n",
    "    # from the keys of the emission_counts dictionary\n",
    "    emis_keys = set(list(emission_counts.keys()))\n",
    "    \n",
    "    ### START CODE HERE (Replace instances of 'None' with your code) ###\n",
    "    \n",
    "    # Go through each row (POS tags)\n",
    "    for i in range(num_tags): # Replace None in this line with the proper range.\n",
    "        \n",
    "        # Go through each column (words)\n",
    "        for j in range(num_words): # Replace None in this line with the proper range.\n",
    "\n",
    "            # Initialize the emission count for the (POS tag, word) to zero\n",
    "            count = 0 \n",
    "                    \n",
    "            # Define the (POS tag, word) tuple for this row and column\n",
    "            key = (all_tags[i], vocab[j]) # tuple of form (tag,word)\n",
    "\n",
    "            # check if the (POS tag, word) tuple exists as a key in emission counts\n",
    "            if key in emis_keys: # Replace None in this line with the proper condition.\n",
    "        \n",
    "                # Get the count of (POS tag, word) from the emission_counts d\n",
    "                count = emission_counts[key]\n",
    "                \n",
    "            # Get the count of the POS tag\n",
    "            count_tag = tag_counts[all_tags[i]]\n",
    "                \n",
    "            # Apply smoothing and store the smoothed value \n",
    "            # into the emission matrix B for this row and column\n",
    "            B[i,j] = (count+alpha)/ (count_tag+num_words*alpha)\n",
    "\n",
    "    ### END CODE HERE ###\n",
    "    return B"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "00f67d33",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "View Matrix position at row 0, column 0: 0.000006032\n",
      "View Matrix position at row 3, column 1: 0.000000720\n",
      "              725      adroitly     engineers      promoted       synergy\n",
      "CD   8.201296e-05  2.732854e-08  2.732854e-08  2.732854e-08  2.732854e-08\n",
      "NN   7.521128e-09  7.521128e-09  7.521128e-09  7.521128e-09  2.257091e-05\n",
      "NNS  1.670013e-08  1.670013e-08  4.676203e-04  1.670013e-08  1.670013e-08\n",
      "VB   3.779036e-08  3.779036e-08  3.779036e-08  3.779036e-08  3.779036e-08\n",
      "RB   3.226454e-08  6.456135e-05  3.226454e-08  3.226454e-08  3.226454e-08\n",
      "RP   3.723317e-07  3.723317e-07  3.723317e-07  3.723317e-07  3.723317e-07\n"
     ]
    }
   ],
   "source": [
    "# creating your emission probability matrix. this takes a few minutes to run. \n",
    "alpha = 0.001\n",
    "B = create_emission_matrix(alpha, tag_counts, emission_counts, list(vocab))\n",
    "\n",
    "print(f\"View Matrix position at row 0, column 0: {B[0,0]:.9f}\")\n",
    "print(f\"View Matrix position at row 3, column 1: {B[3,1]:.9f}\")\n",
    "\n",
    "# Try viewing emissions for a few words in a sample dataframe\n",
    "cidx  = ['725','adroitly','engineers', 'promoted', 'synergy']\n",
    "\n",
    "# Get the integer ID for each word\n",
    "cols = [vocab[a] for a in cidx]\n",
    "\n",
    "# Choose POS tags to show in a sample dataframe\n",
    "rvals =['CD','NN','NNS', 'VB','RB','RP']\n",
    "\n",
    "# For each POS tag, get the row number from the 'states' list\n",
    "rows = [states.index(a) for a in rvals]\n",
    "\n",
    "# Get the emissions for the sample of words, and the sample of POS tags\n",
    "B_sub = pd.DataFrame(B[np.ix_(rows,cols)], index=rvals, columns = cidx )\n",
    "print(B_sub)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d9c76c43",
   "metadata": {},
   "source": [
    "##### Expected Output\n",
    "\n",
    "```CPP\n",
    "View Matrix position at row 0, column 0: 0.000006032\n",
    "View Matrix position at row 3, column 1: 0.000000720\n",
    "              725      adroitly     engineers      promoted       synergy\n",
    "CD   8.201296e-05  2.732854e-08  2.732854e-08  2.732854e-08  2.732854e-08\n",
    "NN   7.521128e-09  7.521128e-09  7.521128e-09  7.521128e-09  2.257091e-05\n",
    "NNS  1.670013e-08  1.670013e-08  4.676203e-04  1.670013e-08  1.670013e-08\n",
    "VB   3.779036e-08  3.779036e-08  3.779036e-08  3.779036e-08  3.779036e-08\n",
    "RB   3.226454e-08  6.456135e-05  3.226454e-08  3.226454e-08  3.226454e-08\n",
    "RP   3.723317e-07  3.723317e-07  3.723317e-07  3.723317e-07  3.723317e-07\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "c5a664f8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m All tests passed\n"
     ]
    }
   ],
   "source": [
    "# Test your function\n",
    "w2_unittest.test_create_emission_matrix(create_emission_matrix, tag_counts, emission_counts, list(vocab))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0254def",
   "metadata": {},
   "source": [
    "<a name='3'></a>\n",
    "# Part 3: Viterbi Algorithm and Dynamic Programming\n",
    "\n",
    "In this part of the assignment you will implement the Viterbi algorithm which makes use of dynamic programming. Specifically, you will use your two matrices, `A` and `B` to compute the Viterbi algorithm. We have decomposed this process into three main steps for you. \n",
    "\n",
    "* **Initialization** - In this part you initialize the `best_paths` and `best_probabilities` matrices that you will be populating in `feed_forward`.\n",
    "* **Feed forward** - At each step, you calculate the probability of each path happening and the best paths up to that point. \n",
    "* **Feed backward**: This allows you to find the best path with the highest probabilities. \n",
    "\n",
    "<a name='3.1'></a>\n",
    "## Part 3.1:  Initialization \n",
    "\n",
    "You will start by initializing two matrices of the same dimension. \n",
    "\n",
    "- best_probs: Each cell contains the probability of going from one POS tag to a word in the corpus.\n",
    "\n",
    "- best_paths: A matrix that helps you trace through the best possible path in the corpus. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93550ed3",
   "metadata": {},
   "source": [
    "<a name='ex-05'></a>\n",
    "### Exercise 05\n",
    "**Instructions**: \n",
    "Write a program below that initializes the `best_probs` and the `best_paths` matrix. \n",
    "\n",
    "Both matrices will be initialized to zero except for column zero of `best_probs`.  \n",
    "- Column zero of `best_probs` is initialized with the assumption that the first word of the corpus was preceded by a start token (\"--s--\"). \n",
    "- This allows you to reference the **A** matrix for the transition probability\n",
    "\n",
    "Here is how to initialize column 0 of `best_probs`:\n",
    "- The probability of the best path going from the start index to a given POS tag indexed by integer $i$ is denoted by $\\textrm{best_probs}[s_{idx}, i]$.\n",
    "- This is estimated as the probability that the start tag transitions to the POS denoted by index $i$: $\\mathbf{A}[s_{idx}, i]$ AND that the POS tag denoted by $i$ emits the first word of the given corpus, which is $\\mathbf{B}[i, vocab[corpus[0]]]$.\n",
    "- Note that vocab[corpus[0]] refers to the first word of the corpus (the word at position 0 of the corpus). \n",
    "- **vocab** is a dictionary that returns the unique integer that refers to that particular word.\n",
    "\n",
    "Conceptually, it looks like this:\n",
    "$\\textrm{best_probs}[s_{idx}, i] = \\mathbf{A}[s_{idx}, i] \\times \\mathbf{B}[i, corpus[0] ]$\n",
    "\n",
    "\n",
    "In order to avoid multiplying and storing small values on the computer, we'll take the log of the product, which becomes the sum of two logs:\n",
    "\n",
    "$best\\_probs[i,0] = log(A[s_{idx}, i]) + log(B[i, vocab[corpus[0]]$\n",
    "\n",
    "Also, to avoid taking the log of 0 (which is defined as negative infinity), the code itself will just set $best\\_probs[i,0] = float('-inf')$ when $A[s_{idx}, i] == 0$\n",
    "\n",
    "\n",
    "So the implementation to initialize $best\\_probs$ looks like this:\n",
    "\n",
    "$ \\textrm{if}\\ A[s_{idx}, i] <> 0 : best\\_probs[i,0] = log(A[s_{idx}, i]) + log(B[i, vocab[corpus[0]]])$\n",
    "\n",
    "$ \\textrm{if}\\ A[s_{idx}, i] == 0 : best\\_probs[i,0] = float('-inf')$\n",
    "\n",
    "Please use [math.log](https://docs.python.org/3/library/math.html) to compute the natural logarithm."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "019010e5",
   "metadata": {},
   "source": [
    "The example below shows the initialization assuming the corpus starts with the phrase \"Loss tracks upward\".\n",
    "\n",
    "<img src = \"images/Initialize4.png\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1bf280ee",
   "metadata": {},
   "source": [
    "Represent infinity and negative infinity like this:\n",
    "\n",
    "```CPP\n",
    "float('inf')\n",
    "float('-inf')\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "cb56f605",
   "metadata": {},
   "outputs": [],
   "source": [
    "# UNQ_C5 GRADED FUNCTION: initialize\n",
    "def initialize(states, tag_counts, A, B, corpus, vocab):\n",
    "    '''\n",
    "    Input: \n",
    "        states: a list of all possible parts-of-speech\n",
    "        tag_counts: a dictionary mapping each tag to its respective count\n",
    "        A: Transition Matrix of dimension (num_tags, num_tags)\n",
    "        B: Emission Matrix of dimension (num_tags, len(vocab))\n",
    "        corpus: a sequence of words whose POS is to be identified in a list \n",
    "        vocab: a dictionary where keys are words in vocabulary and value is an index\n",
    "    Output:\n",
    "        best_probs: matrix of dimension (num_tags, len(corpus)) of floats\n",
    "        best_paths: matrix of dimension (num_tags, len(corpus)) of integers\n",
    "    '''\n",
    "    # Get the total number of unique POS tags\n",
    "    num_tags = len(tag_counts)\n",
    "    \n",
    "    # Initialize best_probs matrix \n",
    "    # POS tags in the rows, number of words in the corpus as the columns\n",
    "    best_probs = np.zeros((num_tags, len(corpus)))\n",
    "    \n",
    "    # Initialize best_paths matrix\n",
    "    # POS tags in the rows, number of words in the corpus as columns\n",
    "    best_paths = np.zeros((num_tags, len(corpus)), dtype=int)\n",
    "    \n",
    "    # Define the start token\n",
    "    s_idx = states.index(\"--s--\")\n",
    "    ### START CODE HERE (Replace instances of 'None' with your code) ###\n",
    "    \n",
    "    # Go through each of the POS tags\n",
    "    for i in range(num_tags): # Replace None in this line with the proper range.\n",
    "        \n",
    "        # Handle the special case when the transition from start token to POS tag i is zero\n",
    "        if A[s_idx, i] == 0: # Replace None in this line with the proper condition. # POS by word\n",
    "            \n",
    "            # Initialize best_probs at POS tag 'i', column 0, to negative infinity\n",
    "            best_probs[i,0] = float(\"-inf\")\n",
    "        \n",
    "        # For all other cases when transition from start token to POS tag i is non-zero:\n",
    "        else:\n",
    "            \n",
    "            # Initialize best_probs at POS tag 'i', column 0\n",
    "            # Check the formula in the instructions above\n",
    "            best_probs[i,0] = math.log(A[s_idx, i]) + math.log(B[i, vocab[corpus[0]]])\n",
    "            \n",
    "    ### END CODE HERE ### \n",
    "    return best_probs, best_paths"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "e3929c10",
   "metadata": {},
   "outputs": [],
   "source": [
    "best_probs, best_paths = initialize(states, tag_counts, A, B, prep, vocab)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "8cc8406f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "best_probs[0,0]: -22.6098\n",
      "best_paths[2,3]: 0.0000\n"
     ]
    }
   ],
   "source": [
    "# Test the function\n",
    "print(f\"best_probs[0,0]: {best_probs[0,0]:.4f}\")\n",
    "print(f\"best_paths[2,3]: {best_paths[2,3]:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dbf89225",
   "metadata": {},
   "source": [
    "##### Expected Output\n",
    "\n",
    "```CPP\n",
    "best_probs[0,0]: -22.6098\n",
    "best_paths[2,3]: 0.0000\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "6ff6a585",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m All tests passed\n"
     ]
    }
   ],
   "source": [
    "# Test your function\n",
    "w2_unittest.test_initialize(initialize, states, tag_counts, A, B, prep, vocab)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0835ecdc",
   "metadata": {},
   "source": [
    "<a name='3.2'></a>\n",
    "## Part 3.2 Viterbi Forward\n",
    "\n",
    "In this part of the assignment, you will implement the `viterbi_forward` segment. In other words, you will populate your `best_probs` and `best_paths` matrices.\n",
    "- Walk forward through the corpus.\n",
    "- For each word, compute a probability for each possible tag. \n",
    "- Unlike the previous algorithm `predict_pos` (the 'warm-up' exercise), this will include the path up to that (word,tag) combination. \n",
    "\n",
    "Here is an example with a three-word corpus \"Loss tracks upward\":\n",
    "- Note, in this example, only a subset of states (POS tags) are shown in the diagram below, for easier reading. \n",
    "- In the diagram below, the first word \"Loss\" is already initialized. \n",
    "- The algorithm will compute a probability for each of the potential tags in the second and future words. \n",
    "\n",
    "Compute the probability that the tag of the second work ('tracks') is a verb, 3rd person singular present (VBZ).  \n",
    "- In the `best_probs` matrix, go to the column of the second word ('tracks'), and row 40 (VBZ), this cell is highlighted in light orange in the diagram below.\n",
    "- Examine each of the paths from the tags of the first word ('Loss') and choose the most likely path.  \n",
    "- An example of the calculation for **one** of those paths is the path from ('Loss', NN) to ('tracks', VBZ).\n",
    "- The log of the probability of the path up to and including the first word 'Loss' having POS tag NN is $-14.32$.  The `best_probs` matrix contains this value -14.32 in the column for 'Loss' and row for 'NN'.\n",
    "- Find the probability that NN transitions to VBZ.  To find this probability, go to the `A` transition matrix, and go to the row for 'NN' and the column for 'VBZ'.  The value is $4.37e-02$, which is circled in the diagram, so add $-14.32 + log(4.37e-02)$. \n",
    "- Find the log of the probability that the tag VBS would 'emit' the word 'tracks'.  To find this, look at the 'B' emission matrix in row 'VBZ' and the column for the word 'tracks'.  The value $4.61e-04$ is circled in the diagram below.  So add $-14.32 + log(4.37e-02) + log(4.61e-04)$.\n",
    "- The sum of $-14.32 + log(4.37e-02) + log(4.61e-04)$ is $-25.13$. Store $-25.13$ in the `best_probs` matrix at row 'VBZ' and column 'tracks' (as seen in the cell that is highlighted in light orange in the diagram).\n",
    "- All other paths in best_probs are calculated.  Notice that $-25.13$ is greater than all of the other values in column 'tracks' of matrix `best_probs`, and so the most likely path to 'VBZ' is from 'NN'.  'NN' is in row 20 of the `best_probs` matrix, so $20$ is the most likely path.\n",
    "- Store the most likely path $20$ in the `best_paths` table.  This is highlighted in light orange in the diagram below."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fb02171c",
   "metadata": {},
   "source": [
    "The formula to compute the probability and path for the $i^{th}$ word in the $corpus$, the prior word $i-1$ in the corpus, current POS tag $j$, and previous POS tag $k$ is:\n",
    "\n",
    "$\\mathrm{prob} = \\mathbf{best\\_prob}_{k, i-1} + \\mathrm{log}(\\mathbf{A}_{k, j}) + \\mathrm{log}(\\mathbf{B}_{j, vocab(corpus_{i})})$\n",
    "\n",
    "where $corpus_{i}$ is the word in the corpus at index $i$, and $vocab$ is the dictionary that gets the unique integer that represents a given word.\n",
    "\n",
    "$\\mathrm{path} = k$\n",
    "\n",
    "where $k$ is the integer representing the previous POS tag.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e74cdf5",
   "metadata": {},
   "source": [
    "<a name='ex-06'></a>\n",
    "\n",
    "### Exercise 06\n",
    "\n",
    "Instructions: Implement the `viterbi_forward` algorithm and store the best_path and best_prob for every possible tag for each word in the matrices `best_probs` and `best_tags` using the pseudo code below.\n",
    "\n",
    "```\n",
    "for each word in the corpus\n",
    "\n",
    "    for each POS tag type that this word may be\n",
    "    \n",
    "        for POS tag type that the previous word could be\n",
    "        \n",
    "            compute the probability that the previous word had a given POS tag, that the current word has a given POS tag, and that the POS tag would emit this current word.\n",
    "            \n",
    "            retain the highest probability computed for the current word\n",
    "            \n",
    "            set best_probs to this highest probability\n",
    "            \n",
    "            set best_paths to the index 'k', representing the POS tag of the previous word which produced the highest probability \n",
    "```\n",
    "\n",
    "Please use [math.log](https://docs.python.org/3/library/math.html) to compute the natural logarithm."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4d1d2ac4",
   "metadata": {},
   "source": [
    "<img src = \"images/Forward4.PNG\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6cca86df",
   "metadata": {},
   "source": [
    "<details>    \n",
    "<summary>\n",
    "    <font size=\"3\" color=\"darkgreen\"><b>Hints</b></font>\n",
    "</summary>\n",
    "<p>\n",
    "<ul>\n",
    "    <li>Remember that when accessing emission matrix B, the column index is the unique integer ID associated with the word.  It can be accessed by using the 'vocab' dictionary, where the key is the word, and the value is the unique integer ID for that word.</li>\n",
    "</ul>\n",
    "</p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "5cfab197",
   "metadata": {},
   "outputs": [],
   "source": [
    "# UNQ_C6 GRADED FUNCTION: viterbi_forward\n",
    "def viterbi_forward(A, B, test_corpus, best_probs, best_paths, vocab, verbose=True):\n",
    "    '''\n",
    "    Input: \n",
    "        A, B: The transition and emission matrices respectively\n",
    "        test_corpus: a list containing a preprocessed corpus\n",
    "        best_probs: an initilized matrix of dimension (num_tags, len(corpus))\n",
    "        best_paths: an initilized matrix of dimension (num_tags, len(corpus))\n",
    "        vocab: a dictionary where keys are words in vocabulary and value is an index \n",
    "    Output: \n",
    "        best_probs: a completed matrix of dimension (num_tags, len(corpus))\n",
    "        best_paths: a completed matrix of dimension (num_tags, len(corpus))\n",
    "    '''\n",
    "    # Get the number of unique POS tags (which is the num of rows in best_probs)\n",
    "    num_tags = best_probs.shape[0]\n",
    "    \n",
    "    # Go through every word in the corpus starting from word 1\n",
    "    # Recall that word 0 was initialized in `initialize()`\n",
    "    for i in range(1, len(test_corpus)): \n",
    "        \n",
    "        # Print number of words processed, every 5000 words\n",
    "        if i % 5000 == 0 and verbose:\n",
    "            print(\"Words processed: {:>8}\".format(i))\n",
    "            \n",
    "        ### START CODE HERE  (Replace instances of 'None' with your code EXCEPT the first 'best_path_i = None')  ###\n",
    "        # For each unique POS tag that the current word can be\n",
    "        for j in range(num_tags): # Replace None in this line with the proper range. # for every pos tag\n",
    "            \n",
    "            # Initialize best_prob for word i to negative infinity\n",
    "            best_prob_i = float(\"-inf\")\n",
    "            \n",
    "            # Initialize best_path for current word i to None\n",
    "            best_path_i = None # Do not replace this None # @KEEPTHIS\n",
    "\n",
    "            # For each POS tag that the previous word can be:\n",
    "            for k in range(num_tags): # Replace None in this line with the proper range.\n",
    "            \n",
    "                # Calculate the probability = None\n",
    "                # best probs of POS tag k, previous word i-1 + \n",
    "                # log(prob of transition from POS k to POS j) + \n",
    "                # log(prob that emission of POS j is word i)\n",
    "                prob = best_probs[k,i-1] + math.log(A[k,j]) + math.log(B[j, vocab[test_corpus[i]]])\n",
    "\n",
    "                # check if this path's probability is greater than\n",
    "                # the best probability up to and before this point\n",
    "                if prob > best_prob_i: # Replace None in this line with the proper condition.\n",
    "                    \n",
    "                    # Keep track of the best probability\n",
    "                    best_prob_i = prob\n",
    "                    \n",
    "                    # keep track of the POS tag of the previous word\n",
    "                    # that is part of the best path.  \n",
    "                    # Save the index (integer) associated with \n",
    "                    # that previous word's POS tag\n",
    "                    best_path_i = k\n",
    "\n",
    "            # Save the best probability for the \n",
    "            # given current word's POS tag\n",
    "            # and the position of the current word inside the corpus\n",
    "            best_probs[j,i] = best_prob_i\n",
    "            \n",
    "            # Save the unique integer ID of the previous POS tag\n",
    "            # into best_paths matrix, for the POS tag of the current word\n",
    "            # and the position of the current word inside the corpus.\n",
    "            best_paths[j,i] = best_path_i\n",
    "\n",
    "        ### END CODE HERE ###\n",
    "    return best_probs, best_paths"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "687cb89f",
   "metadata": {},
   "source": [
    "Run the `viterbi_forward` function to fill in the `best_probs` and `best_paths` matrices.\n",
    "\n",
    "**Note** that this will take a few minutes to run.  There are about 30,000 words to process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "0927b52d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Words processed:     5000\n",
      "Words processed:    10000\n",
      "Words processed:    15000\n",
      "Words processed:    20000\n",
      "Words processed:    25000\n",
      "Words processed:    30000\n"
     ]
    }
   ],
   "source": [
    "# this will take a few minutes to run => processes ~ 30,000 words\n",
    "best_probs, best_paths = viterbi_forward(A, B, prep, best_probs, best_paths, vocab)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "49f8e4fe",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "best_probs[0,1]: -24.7822\n",
      "best_probs[0,4]: -49.5601\n"
     ]
    }
   ],
   "source": [
    "# Test this function \n",
    "print(f\"best_probs[0,1]: {best_probs[0,1]:.4f}\") \n",
    "print(f\"best_probs[0,4]: {best_probs[0,4]:.4f}\") "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f82266a",
   "metadata": {},
   "source": [
    "##### Expected Output\n",
    "\n",
    "```CPP\n",
    "best_probs[0,1]: -24.7822\n",
    "best_probs[0,4]: -49.5601\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "2e2698db",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m All tests passed\n"
     ]
    }
   ],
   "source": [
    "# Test your function: this test may take some time to run\n",
    "w2_unittest.test_viterbi_forward(viterbi_forward, A, B, prep, vocab)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31c52683",
   "metadata": {},
   "source": [
    "<a name='3.3'></a>\n",
    "## Part 3.3 Viterbi backward\n",
    "\n",
    "Now you will implement the Viterbi backward algorithm.\n",
    "- The Viterbi backward algorithm gets the predictions of the POS tags for each word in the corpus using the `best_paths` and the `best_probs` matrices.\n",
    "\n",
    "The example below shows how to walk backwards through the best_paths matrix to get the POS tags of each word in the corpus. Recall that this example corpus has three words: \"Loss tracks upward\".\n",
    "\n",
    "POS tag for 'upward' is `RB`\n",
    "- Select the the most likely POS tag for the last word in the corpus, 'upward' in the `best_prob` table.\n",
    "- Look for the row in the column for 'upward' that has the largest probability.\n",
    "- Notice that in row 28 of `best_probs`, the estimated probability is -34.99, which is larger than the other values in the column.  So the most likely POS tag for 'upward' is `RB` an adverb, at row 28 of `best_prob`. \n",
    "- The variable `z` is an array that stores the unique integer ID of the predicted POS tags for each word in the corpus.  In array z, at position 2, store the value 28 to indicate that the word 'upward' (at index 2 in the corpus), most likely has the POS tag associated with unique ID 28 (which is `RB`).\n",
    "- The variable `pred` contains the POS tags in string form.  So `pred` at index 2 stores the string `RB`.\n",
    "\n",
    "\n",
    "POS tag for 'tracks' is `VBZ`\n",
    "- The next step is to go backward one word in the corpus ('tracks').  Since the most likely POS tag for 'upward' is `RB`, which is uniquely identified by integer ID 28, go to the `best_paths` matrix in column 2, row 28.  The value stored in `best_paths`, column 2, row 28 indicates the unique ID of the POS tag of the previous word.  In this case, the value stored here is 40, which is the unique ID for POS tag `VBZ` (verb, 3rd person singular present).\n",
    "- So the previous word at index 1 of the corpus ('tracks'), most likely has the POS tag with unique ID 40, which is `VBZ`.\n",
    "- In array `z`, store the value 40 at position 1, and for array `pred`, store the string `VBZ` to indicate that the word 'tracks' most likely has POS tag `VBZ`.\n",
    "\n",
    "POS tag for 'Loss' is `NN`\n",
    "- In `best_paths` at column 1, the unique ID stored at row 40 is 20.  20 is the unique ID for POS tag `NN`.\n",
    "- In array `z` at position 0, store 20.  In array `pred` at position 0, store `NN`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45731382",
   "metadata": {},
   "source": [
    "<img src = \"images/Backwards5.PNG\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b786ba07",
   "metadata": {},
   "source": [
    "<a name='ex-07'></a>\n",
    "### Exercise 07\n",
    "Implement the `viterbi_backward` algorithm, which returns a list of predicted POS tags for each word in the corpus.\n",
    "\n",
    "- Note that the numbering of the index positions starts at 0 and not 1. \n",
    "- `m` is the number of words in the corpus.  \n",
    "    - So the indexing into the corpus goes from `0` to `m - 1`.\n",
    "    - Also, the columns in `best_probs` and `best_paths` are indexed from `0` to `m - 1`\n",
    "\n",
    "\n",
    "**In Step 1:**       \n",
    "Loop through all the rows (POS tags) in the last entry of `best_probs` and find the row (POS tag) with the maximum value.\n",
    "Convert the unique integer ID to a tag (a string representation) using the list `states`.  \n",
    "\n",
    "Referring to the three-word corpus described above:\n",
    "- `z[2] = 28`: For the word 'upward' at position 2 in the corpus, the POS tag ID is 28.  Store 28 in `z` at position 2.\n",
    "- `states[28]` is 'RB': The POS tag ID 28 refers to the POS tag 'RB'.\n",
    "- `pred[2] = 'RB'`: In array `pred`, store the POS tag for the word 'upward'.\n",
    "\n",
    "**In Step 2:**  \n",
    "- Starting at the last column of best_paths, use `best_probs` to find the most likely POS tag for the last word in the corpus.\n",
    "- Then use `best_paths` to find the most likely POS tag for the previous word. \n",
    "- Update the POS tag for each word in `z` and in `preds`.\n",
    "\n",
    "Referring to the three-word example from above, read best_paths at column 2 and fill in z at position 1.  \n",
    "`z[1] = best_paths[z[2],2]`  \n",
    "\n",
    "The small test following the routine prints the last few words of the corpus and their states to aid in debug."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "5f41281a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# UNQ_C7 GRADED FUNCTION: viterbi_backward\n",
    "def viterbi_backward(best_probs, best_paths, corpus, states):\n",
    "    '''\n",
    "    This function returns the best path.\n",
    "    \n",
    "    '''\n",
    "    # Get the number of words in the corpus\n",
    "    # which is also the number of columns in best_probs, best_paths\n",
    "    m = best_paths.shape[1] \n",
    "    \n",
    "    # Initialize array z, same length as the corpus\n",
    "    z = [None ] * m\n",
    "    \n",
    "    # Get the number of unique POS tags\n",
    "    num_tags = best_probs.shape[0]\n",
    "    \n",
    "    # Initialize the best probability for the last word\n",
    "    best_prob_for_last_word = float('-inf')\n",
    "    \n",
    "    # Initialize pred array, same length as corpus\n",
    "    pred = [None] * m\n",
    "    \n",
    "    ### START CODE HERE (Replace instances of 'None' with your code) ###\n",
    "    ## Step 1 ##\n",
    "    \n",
    "    # Go through each POS tag for the last word (last column of best_probs)\n",
    "    # in order to find the row (POS tag integer ID) \n",
    "    # with highest probability for the last word\n",
    "    for k in range(num_tags): # Replace None in this line with the proper range.\n",
    "\n",
    "        # If the probability of POS tag at row k\n",
    "        prob = best_probs[k, m-1]\n",
    "        # is better than the previously best probability for the last word:\n",
    "        if prob > best_prob_for_last_word: # Replace None in this line with the proper condition.\n",
    "            \n",
    "            # Store the new best probability for the last word\n",
    "            best_prob_for_last_word = prob\n",
    "\n",
    "            # Store the unique integer ID of the POS tag\n",
    "            # which is also the row number in best_probs\n",
    "            z[m - 1] = k\n",
    "            \n",
    "    # Convert the last word's predicted POS tag\n",
    "    # from its unique integer ID into the string representation\n",
    "    # using the 'states' list\n",
    "    # store this in the 'pred' array for the last word\n",
    "    pred[m - 1] = states[z[m-1]]\n",
    "    \n",
    "    ## Step 2 ##\n",
    "    # Find the best POS tags by walking backward through the best_paths\n",
    "    # From the last word in the corpus to the 0th word in the corpus\n",
    "    for i in range(m-1, 0, -1): # Replace None in this line with the proper range.\n",
    "        # Retrieve the unique integer ID of\n",
    "        # the POS tag for the word at position 'i' in the corpus\n",
    "        pos_tag_for_word_i = states[z[i]]\n",
    "        \n",
    "        # In best_paths, go to the row representing the POS tag of word i\n",
    "        # and the column representing the word's position in the corpus\n",
    "        # to retrieve the predicted POS for the word at position i-1 in the corpus\n",
    "        z[i - 1] = best_paths[z[i], i]\n",
    "        \n",
    "        # Get the previous word's POS tag in string form\n",
    "        # Use the 'states' list, \n",
    "        # where the key is the unique integer ID of the POS tag,\n",
    "        # and the value is the string representation of that POS tag\n",
    "        pred[i - 1] = states[z[i-1]]\n",
    "        \n",
    "     ### END CODE HERE ###\n",
    "    return pred"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "2687a52c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The prediction for pred[-7:m-1] is: \n",
      " ['see', 'them', 'here', 'with', 'us', '.'] \n",
      " ['VB', 'PRP', 'RB', 'IN', 'PRP', '.'] \n",
      "\n",
      "The prediction for pred[0:8] is: \n",
      " ['DT', 'NN', 'POS', 'NN', 'MD', 'VB', 'VBN'] \n",
      " ['The', 'economy', \"'s\", 'temperature', 'will', 'be', 'taken']\n"
     ]
    }
   ],
   "source": [
    "# Run and test your function\n",
    "pred = viterbi_backward(best_probs, best_paths, prep, states)\n",
    "m=len(pred)\n",
    "print('The prediction for pred[-7:m-1] is: \\n', prep[-7:m-1], \"\\n\", pred[-7:m-1], \"\\n\")\n",
    "print('The prediction for pred[0:8] is: \\n', pred[0:7], \"\\n\", prep[0:7])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fd3df1fc",
   "metadata": {},
   "source": [
    "**Expected Output:**   \n",
    "\n",
    "```CPP\n",
    "The prediction for pred[-7:m-1] is:  \n",
    " ['see', 'them', 'here', 'with', 'us', '.']  \n",
    " ['VB', 'PRP', 'RB', 'IN', 'PRP', '.']   \n",
    "The prediction for pred[0:8] is:    \n",
    " ['DT', 'NN', 'POS', 'NN', 'MD', 'VB', 'VBN']   \n",
    " ['The', 'economy', \"'s\", 'temperature', 'will', 'be', 'taken'] \n",
    "```\n",
    "\n",
    "Now you just have to compare the predicted labels to the true labels to evaluate your model on the accuracy metric!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "3c1fdd6c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m All tests passed\n"
     ]
    }
   ],
   "source": [
    "# Test your function\n",
    "w2_unittest.test_viterbi_backward(viterbi_backward, prep, states)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "492fd910",
   "metadata": {},
   "source": [
    "<a name='4'></a>\n",
    "# Part 4: Predicting on a data set\n",
    "\n",
    "Compute the accuracy of your prediction by comparing it with the true `y` labels. \n",
    "- `pred` is a list of predicted POS tags corresponding to the words of the `test_corpus`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "8be6fa6f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The third word is: temperature\n",
      "Your prediction is: NN\n",
      "Your corresponding label y is:  temperature\tNN\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print('The third word is:', prep[3])\n",
    "print('Your prediction is:', pred[3])\n",
    "print('Your corresponding label y is: ', y[3])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a72db4f8",
   "metadata": {},
   "source": [
    "<a name='ex-08'></a>\n",
    "### Exercise 08\n",
    "\n",
    "Implement a function to compute the accuracy of the viterbi algorithm's POS tag predictions.\n",
    "- To split y into the word and its tag you can use `y.split()`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "38fe6e9e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# UNQ_C8 GRADED FUNCTION: compute_accuracy\n",
    "def compute_accuracy(pred, y):\n",
    "    '''\n",
    "    Input: \n",
    "        pred: a list of the predicted parts-of-speech \n",
    "        y: a list of lines where each word is separated by a '\\t' (i.e. word \\t tag)\n",
    "    Output: \n",
    "        \n",
    "    '''\n",
    "    num_correct = 0\n",
    "    total = 0\n",
    "    \n",
    "    # Zip together the prediction and the labels\n",
    "    for prediction, y in zip(pred, y):\n",
    "        ### START CODE HERE (Replace instances of 'None' with your code) ###\n",
    "        # Split the label into the word and the POS tag\n",
    "        word_tag_tuple = y.split()\n",
    "        \n",
    "        # Check that there is actually a word and a tag\n",
    "        # no more and no less than 2 items\n",
    "        if len(word_tag_tuple) != 2: # Replace None in this line with the proper condition.\n",
    "            continue\n",
    "\n",
    "        # store the word and tag separately\n",
    "        word, tag = word_tag_tuple\n",
    "        \n",
    "        # Check if the POS tag label matches the prediction\n",
    "        if tag == prediction: # Replace None in this line with the proper condition.\n",
    "            \n",
    "            # count the number of times that the prediction\n",
    "            # and label match\n",
    "            num_correct += 1\n",
    "            \n",
    "        # keep track of the total number of examples (that have valid labels)\n",
    "        total += 1\n",
    "\n",
    "        ### END CODE HERE ###\n",
    "    return num_correct/total"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "31ff8774",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy of the Viterbi algorithm is 0.9531\n"
     ]
    }
   ],
   "source": [
    "print(f\"Accuracy of the Viterbi algorithm is {compute_accuracy(pred, y):.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fd0192e2",
   "metadata": {},
   "source": [
    "##### Expected Output\n",
    "\n",
    "```CPP\n",
    "Accuracy of the Viterbi algorithm is 0.9531\n",
    "```\n",
    "\n",
    "Congratulations you were able to classify the parts-of-speech with 95% accuracy. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "df6eaedd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m All tests passed\n"
     ]
    }
   ],
   "source": [
    "# Test your function\n",
    "w2_unittest.test_compute_accuracy(compute_accuracy, pred, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e535b29c",
   "metadata": {},
   "source": [
    "### Key Points and overview\n",
    "\n",
    "In this assignment you learned about parts-of-speech tagging. \n",
    "- In this assignment, you predicted POS tags by walking forward through a corpus and knowing the previous word.\n",
    "- There are other implementations that use bidirectional POS tagging.\n",
    "- Bidirectional POS tagging requires knowing the previous word and the next word in the corpus when predicting the current word's POS tag.\n",
    "- Bidirectional POS tagging would tell you more about the POS instead of just knowing the previous word. \n",
    "- Since you have learned to implement the unidirectional approach, you have the foundation to implement other POS taggers used in industry."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a576c3b9",
   "metadata": {},
   "source": [
    "### References\n",
    "\n",
    "- [\"Speech and Language Processing\", Dan Jurafsky and James H. Martin](https://web.stanford.edu/~jurafsky/slp3/)\n",
    "- We would like to thank Melanie Tosik for her help and inspiration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "433d8d1f",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "jupytext": {
   "encoding": "# -*- coding: utf-8 -*-"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}