{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "-Jv7Y4hXwt0j"
},
"source": [
"# Assignment 2: Deep N-grams\n",
"\n",
"Welcome to the second assignment of Course 3. In this assignment you will explore Recurrent Neural Networks (RNNs).\n",
"- You will be using the fundamentals of Google's [trax](https://github.com/google/trax) package to implement any kind of deep learning model. \n",
"\n",
"By completing this assignment, you will learn how to implement models from scratch:\n",
"- Convert a line of text into a tensor\n",
"- Create an iterator to feed data to the model\n",
"- Define a GRU model using `trax`\n",
"- Train the model using `trax`\n",
"- Evaluate your model using perplexity\n",
"- Make predictions with your own model\n",
"\n",
"## Important Note on Submission to the AutoGrader\n",
"\n",
"Before submitting your assignment to the AutoGrader, please make sure of the following:\n",
"\n",
"1. You have not added any _extra_ `print` statement(s) in the assignment.\n",
"2. You have not added any _extra_ code cell(s) in the assignment.\n",
"3. You have not changed any of the function parameters.\n",
"4. You are not using any global variables inside your graded exercises. Unless specifically instructed to do so, please refrain from using them and use local variables instead.\n",
"5. You are not changing the assignment code where it is not required, such as creating _extra_ variables.\n",
"\n",
"If you do any of the above, you will get something like a `Grader not found` (or similarly unexpected) error upon submitting your assignment. Before asking for help/debugging the errors in your assignment, check for these first. If this is the case, and you don't remember the changes you have made, you can get a fresh copy of the assignment by following these [instructions](https://www.coursera.org/learn/sequence-models-in-nlp/supplement/6ThZO/how-to-refresh-your-workspace)."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "V8_3fIOdkGv1"
},
"source": [
"## Outline\n",
"\n",
"- [Overview](#0)\n",
"- [Part 1: Importing the Data](#1)\n",
" - [1.1 Loading in the data](#1.1)\n",
" - [1.2 Convert a line to tensor](#1.2)\n",
" - [Exercise 01](#ex01)\n",
" - [1.3 Batch generator](#1.3)\n",
" - [Exercise 02](#ex02)\n",
" - [1.4 Repeating Batch generator](#1.4) \n",
"- [Part 2: Defining the GRU model](#2)\n",
" - [Exercise 03](#ex03)\n",
"- [Part 3: Training](#3)\n",
" - [3.1 Training the Model](#3.1)\n",
" - [Exercise 04](#ex04)\n",
"- [Part 4: Evaluation](#4)\n",
" - [4.1 Evaluating using the deep nets](#4.1)\n",
" - [Exercise 05](#ex05)\n",
"- [Part 5: Generating the language with your own model](#5) \n",
"- [Summary](#6)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "JP0inrk5kGv3"
},
"source": [
"\n",
"### Overview\n",
"\n",
"Your task will be to predict the next set of characters using the previous characters. \n",
"- Although this task sounds simple, it is pretty useful.\n",
"- You will start by converting a line of text into a tensor\n",
"- Then you will create a generator to feed data into the model\n",
"- You will train a neural network to predict a new set of characters of a defined length. \n",
"- You will use embeddings for each character and feed them as inputs to your model. \n",
" - Many natural language tasks rely on using embeddings for predictions. \n",
"- Your model will convert each character to its embedding, run the embeddings through a Gated Recurrent Unit (GRU), and pass the output through a linear layer to predict the next set of characters.\n",
"\n",
"\n",
"\n",
"The figure above gives you a summary of what you are about to implement. \n",
"- You will get the embeddings;\n",
"- Stack the embeddings on top of each other;\n",
"- Run them through two layers with a relu activation in the middle;\n",
"- Finally, you will compute the softmax. \n",
"\n",
"To predict the next character:\n",
"- Use the softmax output and identify the character with the highest probability.\n",
"- The character with the highest probability is the prediction for the next character."
]
},
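{
"cell_type": "markdown",
"metadata": {},
"source": [
"The prediction step above can be sketched with plain `numpy`. The 5-character vocabulary and the probability values below are illustrative only, not part of the assignment:\n",
"\n",
"```python\n",
"import numpy\n",
"\n",
"# Hypothetical softmax output over a tiny 5-character vocabulary\n",
"vocab = ['a', 'b', 'c', 'd', 'e']\n",
"probs = numpy.array([0.1, 0.05, 0.6, 0.2, 0.05])\n",
"\n",
"# The predicted next character is the one with the highest probability\n",
"next_char = vocab[int(numpy.argmax(probs))]\n",
"print(next_char)  # 'c'\n",
"```\n"
]
},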
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"colab_type": "code",
"id": "RVSwzQ5Bwt0m",
"outputId": "9b51a13e-cf54-457f-e1ea-2574f9d67453"
},
"outputs": [],
"source": [
"import os\n",
"import shutil\n",
"import trax\n",
"import trax.fastmath.numpy as np\n",
"import pickle\n",
"import numpy\n",
"import random as rnd\n",
"from trax import fastmath\n",
"from trax import layers as tl\n",
"\n",
"import w2_unittest\n",
"\n",
"# set random seed\n",
"rnd.seed(32)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "4sF9Hqzgwt0l"
},
"source": [
"\n",
"# Part 1: Importing the Data\n",
"\n",
"\n",
"### 1.1 Loading in the data\n",
"\n",
"\n",
"\n",
"Now import the dataset and do some processing. \n",
"- The dataset has one sentence per line.\n",
"- You will be doing character generation, so you have to process each sentence by converting each **character** (and not word) to a number. \n",
"- You will use the `ord` function to convert a unique character to a unique integer ID. \n",
"- Store each line in a list.\n",
"- Create a data generator that takes in the `batch_size` and the `max_length`. \n",
" - The `max_length` corresponds to the maximum length of the sentence."
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"dirname = 'data/'\n",
"filename = 'shakespeare_data.txt'\n",
"lines = [] # storing all the lines in a variable. \n",
"\n",
"counter = 0\n",
"\n",
"with open(os.path.join(dirname, filename)) as files:\n",
" for line in files: \n",
" # remove leading and trailing whitespace\n",
" pure_line = line.strip()\n",
"\n",
" # if pure_line is not the empty string,\n",
" if pure_line:\n",
" # append it to the list\n",
" lines.append(pure_line)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 68
},
"colab_type": "code",
"id": "-zMCe7aJkGwA",
"outputId": "c0eace05-246a-47d9-9542-f009a4940836"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of lines: 125097\n",
"Sample line at position 0 A LOVER'S COMPLAINT\n",
"Sample line at position 999 With this night's revels and expire the term\n"
]
}
],
"source": [
"n_lines = len(lines)\n",
"print(f\"Number of lines: {n_lines}\")\n",
"print(f\"Sample line at position 0 {lines[0]}\")\n",
"print(f\"Sample line at position 999 {lines[999]}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "G6XsiyHvkGwD"
},
"source": [
"Notice that the letters are both uppercase and lowercase. In order to reduce the complexity of the task, we will convert all characters to lowercase. This way, the model only needs to predict the likelihood that a letter is 'a' and not decide between uppercase 'A' and lowercase 'a'."
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 68
},
"colab_type": "code",
"id": "UBO9jI8EkGwE",
"outputId": "55b55d61-a5b1-4381-ff88-d146071ac671"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of lines: 125097\n",
"Sample line at position 0 a lover's complaint\n",
"Sample line at position 999 with this night's revels and expire the term\n"
]
}
],
"source": [
"# go through each line\n",
"for i, line in enumerate(lines):\n",
" # convert to all lowercase\n",
" lines[i] = line.lower()\n",
"\n",
"print(f\"Number of lines: {n_lines}\")\n",
"print(f\"Sample line at position 0 {lines[0]}\")\n",
"print(f\"Sample line at position 999 {lines[999]}\")"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of lines for training: 124097\n",
"Number of lines for validation: 1000\n"
]
}
],
"source": [
"eval_lines = lines[-1000:] # Create a holdout validation set\n",
"lines = lines[:-1000] # Leave the rest for training\n",
"\n",
"print(f\"Number of lines for training: {len(lines)}\")\n",
"print(f\"Number of lines for validation: {len(eval_lines)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "BDcxEmX31y3d"
},
"source": [
"\n",
"### 1.2 Convert a line to tensor\n",
"\n",
"Now that you have your list of lines, you will convert each character in that list to a number. You can use Python's `ord` function to do it. \n",
"\n",
"Given a string representing one Unicode character, the `ord` function returns an integer representing the Unicode code point of that character.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
},
"colab_type": "code",
"id": "Cc_B8ae3kGwI",
"outputId": "94eb6798-827a-4494-dec0-7bb84523ce34"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ord('a'): 97\n",
"ord('b'): 98\n",
"ord('c'): 99\n",
"ord(' '): 32\n",
"ord('x'): 120\n",
"ord('y'): 121\n",
"ord('z'): 122\n",
"ord('1'): 49\n",
"ord('2'): 50\n",
"ord('3'): 51\n"
]
}
],
"source": [
"# View the unique unicode integer associated with each character\n",
"print(f\"ord('a'): {ord('a')}\")\n",
"print(f\"ord('b'): {ord('b')}\")\n",
"print(f\"ord('c'): {ord('c')}\")\n",
"print(f\"ord(' '): {ord(' ')}\")\n",
"print(f\"ord('x'): {ord('x')}\")\n",
"print(f\"ord('y'): {ord('y')}\")\n",
"print(f\"ord('z'): {ord('z')}\")\n",
"print(f\"ord('1'): {ord('1')}\")\n",
"print(f\"ord('2'): {ord('2')}\")\n",
"print(f\"ord('3'): {ord('3')}\")"
]
},
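{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `chr` is the inverse of `ord`, which will be useful later when decoding integer predictions back into text. A small illustrative round trip (the string `'hello'` is just an example):\n",
"\n",
"```python\n",
"# Encode each character to its Unicode code point, then decode back\n",
"codes = [ord(c) for c in 'hello']\n",
"text = ''.join(chr(i) for i in codes)\n",
"print(codes)  # [104, 101, 108, 108, 111]\n",
"print(text)   # hello\n",
"```\n"
]
},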
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ZWB9qOLOkGwL"
},
"source": [
"\n",
"### Exercise 01\n",
"\n",
"**Instructions:** Write a function that takes in a single line and transforms each character into its unicode integer. This returns a list of integers, which we'll refer to as a tensor.\n",
"- Use a special integer to represent the end of the sentence (the end of the line).\n",
"- This will be the EOS_int (end of sentence integer) parameter of the function.\n",
"- Include the EOS_int as the last integer of the tensor.\n",
"- For this exercise, you will use the number `1` to represent the end of a sentence."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "IO4NSPkOITNK"
},
"outputs": [],
"source": [
"# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n",
"# GRADED FUNCTION: line_to_tensor\n",
"def line_to_tensor(line, EOS_int=1):\n",
" \"\"\"Turns a line of text into a tensor\n",
"\n",
" Args:\n",
" line (str): A single line of text.\n",
" EOS_int (int, optional): End-of-sentence integer. Defaults to 1.\n",
"\n",
" Returns:\n",
" list: a list of integers (unicode values) for the characters in the `line`.\n",
" \"\"\"\n",
" \n",
" # Initialize the tensor as an empty list\n",
" tensor = []\n",
" \n",
" ### START CODE HERE (Replace instances of 'None' with your code) ###\n",
" # for each character:\n",
" for c in line:\n",
" \n",
" # convert to unicode int\n",
" c_int = ord(c)\n",
" \n",
" # append the unicode integer to the tensor list\n",
" tensor.append(c_int)\n",
" \n",
" # include the end-of-sentence integer\n",
" tensor.append(EOS_int)\n",
" \n",
" ### END CODE HERE ###\n",
"\n",
" return tensor"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"id": "D9Z_vtI7tTcw",
"outputId": "0423ad21-af3e-4e6d-a558-472f4bf5f964"
},
"outputs": [
{
"data": {
"text/plain": [
"[97, 98, 99, 32, 120, 121, 122, 1]"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Testing your output\n",
"line_to_tensor('abc xyz')"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "7MwEspKCtTc4"
},
"source": [
"##### Expected Output\n",
"```CPP\n",
"[97, 98, 99, 32, 120, 121, 122, 1]\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[92m All tests passed\n"
]
}
],
"source": [
"# Test your function\n",
"w2_unittest.test_line_to_tensor(line_to_tensor)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "iFOR19cX2TQs"
},
"source": [
"\n",
"### 1.3 Batch generator \n",
"\n",
"Most of the time in Natural Language Processing, and in AI in general, we use batches when training our models. Here, you will build a data generator that takes in a text and returns a batch of text lines (lines are sentences).\n",
"- The generator converts text lines (sentences) into numpy arrays of integers, padded with zeros to `max_length` so that all arrays in a batch have the same length.\n",
"\n",
"Once you create the generator, you can iterate on it like this:\n",
"\n",
"```\n",
"next(data_generator)\n",
"```\n",
"\n",
"This generator returns the data in a format that you can directly use in your model when computing the feed-forward pass of your algorithm. It returns a batch of lines and a per-token mask. The batch is a tuple of three parts: inputs, targets, mask. The inputs and targets are identical. The targets will be used to evaluate your predictions, and the mask is 1 for non-padding tokens.\n",
"\n",
"\n",
"### Exercise 02\n",
"**Instructions:** Implement the data generator below. Here are some things you will need. \n",
"\n",
"- While True loop: this will yield one batch at a time.\n",
"- if index >= num_lines, set index to 0. \n",
"- The generator should return shuffled batches of data. To achieve this without modifying the actual lines, create a list containing the indexes of `data_lines`. This list can be shuffled and used to get random batches every time the index is reset.\n",
"- if len(line) < max_length append line to cur_batch.\n",
" - Note that a line that has length equal to max_length should not be appended to the batch. \n",
" - This is because when converting the characters into a tensor of integers, an additional end of sentence token id will be added. \n",
" - So if max_length is 5, and a line has 4 characters, the tensor representing those 4 characters plus the end of sentence character will be of length 5, which is the max length.\n",
"- if len(cur_batch) == batch_size, go over every line in the batch, convert it to a tensor of ints, and store it.\n",
"\n",
"**Remember that when calling np you are really calling trax.fastmath.numpy which is trax’s version of numpy that is compatible with JAX. As a result of this, where you used to encounter the type numpy.ndarray now you will find the type jax.interpreters.xla.DeviceArray.**"
]
},
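{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before writing the full generator, it may help to see the padding and mask logic in isolation. This is only a sketch with plain `numpy`; the `max_length` value and the two example tensors are hypothetical:\n",
"\n",
"```python\n",
"import numpy\n",
"\n",
"max_length = 6  # hypothetical value for this sketch\n",
"batch = [[97, 98, 1], [97, 98, 99, 100, 1]]  # two tensors ending in EOS (1)\n",
"\n",
"# Pad each tensor with zeros up to max_length\n",
"padded = numpy.array([t + [0] * (max_length - len(t)) for t in batch])\n",
"\n",
"# The mask is 1 for non-padding tokens and 0 for padding\n",
"mask = (padded != 0).astype(numpy.int32)\n",
"print(mask)  # [[1 1 1 0 0 0]\n",
"             #  [1 1 1 1 1 0]]\n",
"```\n"
]
},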
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "_ekSEQlvtTc7"
},
"source": [
" \n",
"\n",
" Hints\n",
"\n",
"\n",
"\n",
"\n", "