Assignment 4 - Naive Machine Translation and LSH#

You will now implement your first machine translation system and then you will see how locality sensitive hashing works. Let’s get started by importing the required functions!

If you are running this notebook in your local computer, don’t forget to download the twitter samples and stopwords from nltk.

nltk.download('stopwords')
nltk.download('twitter_samples')

Important Note on Submission to the AutoGrader#

Before submitting your assignment to the AutoGrader, please make sure of the following:

  1. You have not added any extra print statement(s) in the assignment.

  2. You have not added any extra code cell(s) in the assignment.

  3. You have not changed any of the function parameters.

  4. You are not using any global variables inside your graded exercises. Unless specifically instructed to do so, please refrain from using them and use local variables instead.

  5. You are not changing the assignment code where it is not required, like creating extra variables.

If you have done any of the above, you will get something like a Grader not found (or similarly unexpected) error when submitting your assignment. Before asking for help or debugging the errors in your assignment, check for these first. If this is the case, and you don’t remember the changes you have made, you can get a fresh copy of the assignment by following these instructions.

This assignment covers the following topics:#

import pdb
import pickle
import string

import time

import nltk
import numpy as np
from nltk.corpus import stopwords, twitter_samples

from utils import (cosine_similarity, get_dict,
                   process_tweet)
from os import getcwd

import w4_unittest
# add folder, tmp2, from our local workspace containing pre-downloaded corpora files to nltk's data path
filePath = f"{getcwd()}/tmp2/"
nltk.data.path.append(filePath)

1. The word embeddings data for English and French words#

Write a program that translates English to French.

The data#

The full dataset for English embeddings is about 3.64 gigabytes, and the French embeddings are about 629 megabytes. To prevent the Coursera workspace from crashing, we’ve extracted a subset of the embeddings for the words that you’ll use in this assignment.

The subset of data#

To do the assignment on the Coursera workspace, we’ll use the subset of word embeddings.

en_embeddings_subset = pickle.load(open("./data/en_embeddings.p", "rb"))
fr_embeddings_subset = pickle.load(open("./data/fr_embeddings.p", "rb"))

Look at the data#

  • en_embeddings_subset: the key is an English word, and the value is a 300 dimensional array, which is the embedding for that word.

'the': array([ 0.08007812,  0.10498047,  0.04980469,  0.0534668 , -0.06738281, ....
  • fr_embeddings_subset: the key is a French word, and the value is a 300 dimensional array, which is the embedding for that word.

'la': array([-6.18250e-03, -9.43867e-04, -8.82648e-03,  3.24623e-02,...
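As a quick, optional sanity check (not part of the graded code), you can confirm that both subsets are dictionaries of 300-dimensional NumPy vectors; the words 'the' and 'la' are used here only because they appear in the examples above.

# Optional sanity check: sizes of the subsets and the embedding dimension.
print(len(en_embeddings_subset), len(fr_embeddings_subset))
print(en_embeddings_subset['the'].shape)  # expected: (300,)
print(fr_embeddings_subset['la'].shape)   # expected: (300,)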

Load two dictionaries mapping the English to French words#

  • A training dictionary

  • and a testing dictionary.

# loading the english to french dictionaries
en_fr_train = get_dict('./data/en-fr.train.txt')
print('The length of the English to French training dictionary is', len(en_fr_train))
en_fr_test = get_dict('./data/en-fr.test.txt')
print('The length of the English to French test dictionary is', len(en_fr_test))
The length of the English to French training dictionary is 5000
The length of the English to French test dictionary is 1500

Looking at the English French dictionary#

  • en_fr_train is a dictionary where the key is the English word and the value is the French translation of that English word.

{'the': 'la',
 'and': 'et',
 'was': 'était',
 'for': 'pour',
  • en_fr_test is similar to en_fr_train, but is a test set. We won’t look at it until we get to testing.

1.1 Generate embedding and transform matrices#

Exercise 01: Translating English dictionary to French by using embeddings#

You will now implement a function get_matrices, which takes the loaded data and returns matrices X and Y.

Inputs:

  • en_fr : English to French dictionary

  • en_embeddings : English to embeddings dictionary

  • fr_embeddings : French to embeddings dictionary

Returns:

  • Matrix X and matrix Y, where each row in X is the word embedding for an English word, and the same row in Y is the word embedding for the French version of that English word.

Figure 1

Use the en_fr dictionary to ensure that the ith row in the X matrix corresponds to the ith row in the Y matrix.

Instructions: Complete the function get_matrices():

  • Iterate over English words in en_fr dictionary.

  • Check that the word has both an English and a French embedding.

Hints

  • Sets are useful data structures that can be used to check if an item is a member of a group.
  • You can get the words that have embeddings by using the keys method.
  • Keep the vectors for `X` and `Y` in lists, in matching order. You can use np.vstack() to merge them into a numpy matrix.
  • numpy.vstack stacks the items in a list as rows in a matrix.

# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_matrices(en_fr, french_vecs, english_vecs):
    """
    Input:
        en_fr: English to French dictionary
        french_vecs: French words to their corresponding word embeddings.
        english_vecs: English words to their corresponding word embeddings.
    Output: 
        X: a matrix where the rows are the English embeddings.
        Y: a matrix where the rows correspond to the French embeddings.
    """

    ### START CODE HERE ###

    # X_l and Y_l are lists of the english and french word embeddings
    X_l = list()
    Y_l = list()

    # get the english words (the keys in the embeddings dictionary) and store them in a set()
    english_set = set(english_vecs.keys())

    # get the french words (keys in the embeddings dictionary) and store them in a set()
    french_set = set(french_vecs.keys())

    # store the french words that are part of the english-french dictionary (these are the values of the dictionary)
    french_words = set(en_fr.values())

    # loop through all english, french word pairs in the english french dictionary
    for en_word, fr_word in en_fr.items():

        # check that the french word has an embedding and that the english word has an embedding
        if fr_word in french_set and en_word in english_set:

            # get the english embedding
            en_vec = english_vecs[en_word]

            # get the french embedding
            fr_vec = french_vecs[fr_word]

            # add the english embedding to the list
            X_l.append(en_vec)

            # add the french embedding to the list
            Y_l.append(fr_vec)

    # stack the vectors of X_l into a matrix X
    X = np.vstack(X_l)

    # stack the vectors of Y_l into a matrix Y
    Y = np.vstack(Y_l)
    ### END CODE HERE ###

    return X, Y

Now we will use the function get_matrices() to obtain the matrices X_train and Y_train of English and French word embeddings for the training dictionary.

# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

# getting the training set:
X_train, Y_train = get_matrices(
    en_fr_train, fr_embeddings_subset, en_embeddings_subset)
# Test your function
w4_unittest.test_get_matrices(get_matrices)
 All tests passed

2. Translations#

Figure 2

Write a program that translates English words to French words using word embeddings and vector space models.

2.1 Translation as linear transformation of embeddings#

Given dictionaries of English and French word embeddings, you will create a transformation matrix R.

  • Given an English word embedding, \(\mathbf{e}\), you can multiply \(\mathbf{eR}\) to get a new word embedding \(\mathbf{f}\).

    • Both \(\mathbf{e}\) and \(\mathbf{f}\) are row vectors.

  • You can then compute the nearest neighbors to f in the french embeddings and recommend the word that is most similar to the transformed word embedding.

Describing translation as the minimization problem#

Find a matrix R that minimizes the following equation.

\[\arg \min _{\mathbf{R}}\| \mathbf{X R} - \mathbf{Y}\|_{F}\tag{1} \]

Frobenius norm#

The Frobenius norm of a matrix \(A\) (assuming it is of dimension \(m,n\)) is defined as the square root of the sum of the absolute squares of its elements:

\[\|\mathbf{A}\|_{F} \equiv \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n}\left|a_{i j}\right|^{2}}\tag{2}\]
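As a small illustration (not part of the graded code), the Frobenius norm can be computed either directly from the definition above or with np.linalg.norm, whose default behavior for a matrix is the Frobenius norm.

# Illustrative only: the Frobenius norm of a small matrix, computed two equivalent ways.
A = np.array([[2.0, -2.0],
              [1.0,  3.0]])

# from the definition: square root of the sum of squared entries
norm_manual = np.sqrt(np.sum(np.square(A)))  # sqrt(4 + 4 + 1 + 9) = sqrt(18)

# np.linalg.norm of a matrix defaults to the Frobenius norm
norm_np = np.linalg.norm(A)

print(norm_manual, norm_np)  # both are approximately 4.2426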

Actual loss function#

In real-world applications, the Frobenius norm loss:

\[\| \mathbf{XR} - \mathbf{Y}\|_{F}\]

is often replaced by its squared value divided by \(m\):

\[ \frac{1}{m} \| \mathbf{X R} - \mathbf{Y} \|_{F}^{2}\]

where \(m\) is the number of examples (rows in \(\mathbf{X}\)).

  • The same R is found when using this loss function versus the original Frobenius norm.

  • The reason for taking the square is that it’s easier to compute the gradient of the squared Frobenius.

  • The reason for dividing by \(m\) is that we’re more interested in the average loss per embedding than the loss for the entire training set.

    • The loss for the entire training set increases with more words (training examples), so taking the average helps us track the loss per embedding regardless of the size of the training set.

[Optional] Detailed explanation why we use norm squared instead of the norm:#

Click for optional details

  • The norm is always nonnegative (we're summing up absolute values), and so is the square.
  • When we take the square of all non-negative (positive or zero) numbers, the order of the data is preserved.
  • For example, if 3 > 2, 3^2 > 2^2
  • Using the norm or squared norm in gradient descent results in the same location of the minimum.
  • Squaring cancels the square root in the Frobenius norm formula. Because of the chain rule, keeping the square root would add extra terms to the gradient computation.
  • Dividing the function value by a positive number doesn't change the location of the optimum, for the same reason as described above.
  • We're interested in transforming English embedding into the French. Thus, it is more important to measure average loss per embedding than the loss for the entire dictionary (which increases as the number of words in the dictionary increases).
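A tiny numeric illustration (a toy example, not part of the assignment) that the plain norm and its scaled square share the same minimizer: for a one-column X and Y, scan candidate values of R and check where each loss is smallest.

# Toy example: one-dimensional X and Y, scan scalar values of r.
x = np.array([[1.0], [2.0], [3.0]])   # X with a single column
y = np.array([[2.1], [3.9], [6.2]])   # Y, roughly 2 * X
r_values = np.linspace(0.0, 4.0, 401)

norms = [np.linalg.norm(x * r - y) for r in r_values]
scaled_squares = [np.linalg.norm(x * r - y) ** 2 / x.shape[0] for r in r_values]

# Both losses are minimized at the same value of r.
print(r_values[np.argmin(norms)], r_values[np.argmin(scaled_squares)])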

Exercise 02: Implementing translation mechanism described in this section.#

Step 1: Computing the loss#

  • The loss function will be the squared Frobenius norm of the difference between the matrix \(\mathbf{Y}\) and its approximation \(\mathbf{XR}\), divided by the number of training examples \(m\).

  • Its formula is:

\[ L(X, Y, R)=\frac{1}{m}\sum_{i=1}^{m} \sum_{j=1}^{n}\left( a_{i j} \right)^{2}\]

where \(a_{i j}\) is value in \(i\)th row and \(j\)th column of the matrix \(\mathbf{XR}-\mathbf{Y}\).

Instructions: complete the compute_loss() function#

  • Compute the approximation of Y by matrix multiplying X and R

  • Compute difference XR - Y

  • Compute the squared Frobenius norm of the difference and divide it by \(m\).

Hints

  • Useful functions: Numpy dot , Numpy sum, Numpy square, Numpy norm
  • Be careful about which operation is elementwise and which operation is a matrix multiplication.
  • Try to use matrix operations instead of the numpy norm function. If you choose to use norm function, take care of extra arguments and that it's returning loss squared, and not the loss itself.

# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def compute_loss(X, Y, R):
    '''
    Inputs: 
        X: a matrix of dimension (m,n) where the rows are the English embeddings.
        Y: a matrix of dimension (m,n) where the rows correspond to the French embeddings.
        R: a matrix of dimension (n,n) - transformation matrix from English to French vector space embeddings.
    Outputs:
        L: a scalar - the value of the loss function for given X, Y and R.
    '''
    ### START CODE HERE ###
    # m is the number of rows in X
    m = X.shape[0]
        
    # diff is XR - Y
    diff = X @ R - Y

    # diff_squared is the element-wise square of the difference    
    diff_squared = diff**2

    # sum_diff_squared is the sum of the squared elements
    sum_diff_squared = np.sum(diff_squared)

    # loss is the sum_diff_squared divided by the number of examples (m)
    loss = sum_diff_squared/m
    ### END CODE HERE ###
    return loss
# Testing your implementation.
np.random.seed(123)
m = 10
n = 5
X = np.random.rand(m, n)
Y = np.random.rand(m, n) * .1
R = np.random.rand(n, n)
print(f"Expected loss for an experiment with random matrices: {compute_loss(X, Y, R):.4f}" ) 
Expected loss for an experiment with random matrices: 8.1866

Expected output:

Expected loss for an experiment with random matrices: 8.1866
# Test your function
w4_unittest.test_compute_loss(compute_loss)
 All tests passed

Exercise 03#

Step 2: Computing the gradient of the loss with respect to the transformation matrix R#

  • Calculate the gradient of the loss with respect to the transformation matrix R.

  • The gradient is a matrix that encodes how much a small change in R affects the loss function.

  • The gradient gives us the direction in which we should change R to decrease the loss.

  • \(m\) is the number of training examples (number of rows in \(X\)).

  • The formula for the gradient of the loss function \(𝐿(𝑋,𝑌,𝑅)\) is:

\[\frac{d}{dR}𝐿(𝑋,𝑌,𝑅)=\frac{d}{dR}\Big(\frac{1}{m}\| X R -Y\|_{F}^{2}\Big) = \frac{2}{m}X^{T} (X R - Y)\]

Instructions: Complete the compute_gradient function below.

Hints

# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def compute_gradient(X, Y, R):
    '''
    Inputs: 
        X: a matrix of dimension (m,n) where the rows are the English embeddings.
        Y: a matrix of dimension (m,n) where the rows correspond to the French embeddings.
        R: a matrix of dimension (n,n) - transformation matrix from English to French vector space embeddings.
    Outputs:
        g: a matrix of dimension (n,n) - gradient of the loss function L for given X, Y and R.
    '''
    ### START CODE HERE ###
    # m is the number of rows in X
    m = X.shape[0]

    # gradient is X^T(XR - Y) * 2/m
    gradient = (2 / m) * (X.T @ (X @ R - Y))
    
    ### END CODE HERE ###
    return gradient
# Testing your implementation.
np.random.seed(123)
m = 10
n = 5
X = np.random.rand(m, n)
Y = np.random.rand(m, n) * .1
R = np.random.rand(n, n)
gradient = compute_gradient(X, Y, R)
print(f"First row of the gradient matrix: {gradient[0]}")
First row of the gradient matrix: [1.3498175  1.11264981 0.69626762 0.98468499 1.33828969]

Expected output:

First row of the gradient matrix: [1.3498175  1.11264981 0.69626762 0.98468499 1.33828969]
# Test your function
w4_unittest.test_compute_gradient(compute_gradient)
 All tests passed
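As an optional, ungraded sanity check, you can compare the analytic gradient with a central finite-difference estimate at a single entry of R; the small random matrices below are an assumption made just for this check.

# Optional check: finite-difference estimate vs. analytic gradient at R[0, 0].
np.random.seed(0)
Xc = np.random.rand(4, 3)
Yc = np.random.rand(4, 3)
Rc = np.random.rand(3, 3)

eps = 1e-6
R_plus = Rc.copy();  R_plus[0, 0] += eps
R_minus = Rc.copy(); R_minus[0, 0] -= eps

numeric = (compute_loss(Xc, Yc, R_plus) - compute_loss(Xc, Yc, R_minus)) / (2 * eps)
analytic = compute_gradient(Xc, Yc, Rc)[0, 0]
print(numeric, analytic)  # the two values should agree to several decimal places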

Step 3: Finding the optimal R with gradient descent algorithm#

Gradient descent#

Gradient descent is an iterative algorithm which is used in searching for the optimum of the function.

  • Earlier, we’ve mentioned that the gradient of the loss with respect to the matrix encodes how much a tiny change in some coordinate of that matrix affects the loss function.

  • Gradient descent uses that information to iteratively change matrix R until we reach a point where the loss is minimized.

Training with a fixed number of iterations#

Most of the time we iterate for a fixed number of training steps rather than iterating until the loss falls below a threshold.

OPTIONAL: explanation for fixed number of iterations#
click here for detailed discussion

  • You cannot rely on training loss getting low -- what you really want is the validation loss to go down, or validation accuracy to go up. And indeed - in some cases people train until validation accuracy reaches a threshold, or -- commonly known as "early stopping" -- until the validation accuracy starts to go down, which is a sign of over-fitting.
  • Why not always do "early stopping"? Well, mostly because well-regularized models on larger data-sets never stop improving. Especially in NLP, you can often continue training for months and the model will keep getting slightly better. This is also the reason why it's hard to just stop at a threshold -- unless there's an external customer setting the threshold, why stop, and where do you put the threshold?
  • Stopping after a certain number of steps has the advantage that you know how long your training will take - so you can keep some sanity and not train for months. You can then try to get the best performance within this time budget. Another advantage is that you can fix your learning rate schedule -- e.g., lower the learning rate when 10% of the steps remain, and again when only 1% remain. Such learning rate schedules help a lot, but they are harder to set up if you don't know how long you're training.

Pseudocode:

  1. Calculate gradient \(g\) of the loss with respect to the matrix \(R\).

  2. Update \(R\) with the formula: \(R_{\text{new}}= R_{\text{old}}-\alpha g\)

Where \(\alpha\) is the learning rate, which is a scalar.

Learning rate#

  • The learning rate or “step size” \(\alpha\) is a coefficient which decides how much we want to change \(R\) in each step.

  • If we change \(R\) too much, we could skip the optimum by taking too large of a step.

  • If we make only small changes to \(R\), we will need many steps to reach the optimum.

  • Learning rate \(\alpha\) is used to control those changes.

  • Values of \(\alpha\) are chosen depending on the problem, and we’ll use learning_rate\(=0.0003\) as the default value for our algorithm.

Exercise 04#

Instructions: Implement align_embeddings()#

Hints

  • Use the 'compute_gradient()' function to get the gradient in each step

# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def align_embeddings(X, Y, train_steps=100, learning_rate=0.0003, verbose=True, compute_loss=compute_loss, compute_gradient=compute_gradient):
    '''
    Inputs:
        X: a matrix of dimension (m,n) where the rows are the English embeddings.
        Y: a matrix of dimension (m,n) where the rows correspond to the French embeddings.
        train_steps: positive int - the number of gradient descent steps to perform.
        learning_rate: positive float - the step size used by gradient descent.
    Outputs:
        R: a matrix of dimension (n,n) - the projection matrix that minimizes the F norm ||X R -Y||^2
    '''
    np.random.seed(129)

    # the number of columns in X is the number of dimensions for a word vector (e.g. 300)
    # R is a square matrix with length equal to the number of dimensions in the word embedding
    R = np.random.rand(X.shape[1], X.shape[1])

    for i in range(train_steps):
        if verbose and i % 25 == 0:
            print(f"loss at iteration {i} is: {compute_loss(X, Y, R):.4f}")
        ### START CODE HERE ###
        # use the function that you defined to compute the gradient
        gradient = compute_gradient(X, Y, R)

        # update R by subtracting the learning rate times gradient
        R -= learning_rate*gradient
        ### END CODE HERE ###
    return R
# UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

# Testing your implementation.
np.random.seed(129)
m = 10
n = 5
X = np.random.rand(m, n)
Y = np.random.rand(m, n) * .1
R = align_embeddings(X, Y)
loss at iteration 0 is: 3.7242
loss at iteration 25 is: 3.6283
loss at iteration 50 is: 3.5350
loss at iteration 75 is: 3.4442

Expected Output:

loss at iteration 0 is: 3.7242
loss at iteration 25 is: 3.6283
loss at iteration 50 is: 3.5350
loss at iteration 75 is: 3.4442
# Test your function
w4_unittest.test_align_embeddings(align_embeddings)
 All tests passed
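Following up on the learning rate discussion above, here is a small, ungraded experiment (assuming the random X and Y from the test cell above are still defined) that compares the loss after 100 steps for a few learning rates.

# Ungraded experiment: larger learning rates reach a lower loss in the same number of steps
# (as long as the steps are not so large that gradient descent diverges).
for lr in (0.0003, 0.003, 0.03):
    R_lr = align_embeddings(X, Y, train_steps=100, learning_rate=lr, verbose=False)
    print(f"learning_rate={lr}: loss after 100 steps = {compute_loss(X, Y, R_lr):.4f}")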

Calculate transformation matrix R#

Using the training set, find the transformation matrix \(\mathbf{R}\) by calling the function align_embeddings().

NOTE: The code cell below will take a few minutes to fully execute (~3 mins)

# UNQ_C7 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything
R_train = align_embeddings(X_train, Y_train, train_steps=400, learning_rate=0.8)
loss at iteration 0 is: 963.0146
loss at iteration 25 is: 97.8292
loss at iteration 50 is: 26.8329
loss at iteration 75 is: 9.7893
loss at iteration 100 is: 4.3776
loss at iteration 125 is: 2.3281
loss at iteration 150 is: 1.4480
loss at iteration 175 is: 1.0338
loss at iteration 200 is: 0.8251
loss at iteration 225 is: 0.7145
loss at iteration 250 is: 0.6534
loss at iteration 275 is: 0.6185
loss at iteration 300 is: 0.5981
loss at iteration 325 is: 0.5858
loss at iteration 350 is: 0.5782
loss at iteration 375 is: 0.5735

Expected Output#

loss at iteration 0 is: 963.0146
loss at iteration 25 is: 97.8292
loss at iteration 50 is: 26.8329
loss at iteration 75 is: 9.7893
loss at iteration 100 is: 4.3776
loss at iteration 125 is: 2.3281
loss at iteration 150 is: 1.4480
loss at iteration 175 is: 1.0338
loss at iteration 200 is: 0.8251
loss at iteration 225 is: 0.7145
loss at iteration 250 is: 0.6534
loss at iteration 275 is: 0.6185
loss at iteration 300 is: 0.5981
loss at iteration 325 is: 0.5858
loss at iteration 350 is: 0.5782
loss at iteration 375 is: 0.5735

2.2 Testing the translation#

k-Nearest neighbors algorithm#


  • k-NN is a method which takes a vector as input and finds the other vectors in the dataset that are closest to it.

  • The ‘k’ is the number of “nearest neighbors” to find (e.g. k=2 finds the closest two neighbors).

Searching for the translation embedding#

Since we’re approximating the translation function from English to French embeddings by a linear transformation matrix \(\mathbf{R}\), most of the time we won’t get the exact embedding of a French word when we transform embedding \(\mathbf{e}\) of some particular English word into the French embedding space.

  • This is where \(k\)-NN becomes really useful! By using \(1\)-NN with \(\mathbf{eR}\) as input, we can search for an embedding \(\mathbf{f}\) (as a row) in the matrix \(\mathbf{Y}\) which is the closest to the transformed vector \(\mathbf{eR}\).

Cosine similarity#

The cosine similarity between vectors \(u\) and \(v\) is calculated as the cosine of the angle between them. The formula is

\[\cos(u,v)=\frac{u\cdot v}{\left\|u\right\|\left\|v\right\|}\]
  • \(\cos(u,v)\) = \(1\) when \(u\) and \(v\) lie on the same line and have the same direction.

  • \(\cos(u,v)\) is \(-1\) when they have exactly opposite directions.

  • \(\cos(u,v)\) is \(0\) when the vectors are orthogonal (perpendicular) to each other.

Note: Distance and similarity are pretty much opposite things.#

  • We can obtain a distance metric from cosine similarity, but cosine similarity itself can’t be used directly as a distance metric.

  • When the cosine similarity increases (towards \(1\)), the “distance” between the two vectors decreases (towards \(0\)).

  • We can define the cosine distance between \(u\) and \(v\) as \(d_{\text{cos}}(u,v)=1-\cos(u,v)\)
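To make the relationship concrete, here is a minimal sketch (not the graded code) of cosine similarity and the derived cosine distance; the assignment’s own cosine_similarity is provided in utils, and this illustrative version simply assumes nonzero input vectors.

# Illustrative sketch only; the graded exercises use cosine_similarity from utils.
def cosine_similarity_sketch(u, v):
    # cosine of the angle between u and v (assumes neither vector is all zeros)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 0.0, 1.0])
v = np.array([2.0, 0.0, 1.0])
sim = cosine_similarity_sketch(u, v)
print(f"cosine similarity: {sim:.4f}, cosine distance: {1 - sim:.4f}")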

Exercise 05: Complete the function nearest_neighbor()

Inputs:

  • Vector v,

  • A set of possible nearest neighbor candidates

  • k nearest neighbors to find.

  • The distance metric should be based on cosine similarity.

  • The cosine_similarity function is already implemented and imported for you. Its arguments are two vectors and it returns the cosine of the angle between them.

  • Iterate over rows in candidates, and save the result of similarities between current row and vector v in a python list. Take care that similarities are in the same order as row vectors of candidates.

  • Now you can use numpy argsort to sort the indices for the rows of candidates.

Hints

  • numpy.argsort sorts values from most negative to most positive (smallest to largest)
  • The candidates that are nearest to 'v' should have the highest cosine similarity
  • To reverse the order of the result of numpy.argsort to get the element with highest cosine similarity as the first element of the array you can use tmp[::-1]. This reverses the order of an array. Then, you can extract the first k elements.

# UNQ_C8 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def nearest_neighbor(v, candidates, k=1, cosine_similarity=cosine_similarity):
    """
    Input:
      - v, the vector you are going to find the nearest neighbor for
      - candidates: a set of vectors where we will find the neighbors
      - k: top k nearest neighbors to find
    Output:
      - k_idx: the indices of the top k closest vectors in sorted form
    """
    ### START CODE HERE ###
    similarity_l = []

    # for each candidate vector...
    for row in candidates:
        # get the cosine similarity
        cos_similarity = cosine_similarity(row, v)

        # append the similarity to the list
        similarity_l.append(cos_similarity)

    # sort the similarity list and get the indices of the sorted list    
    sorted_ids = np.argsort(similarity_l)
    
    # Reverse the order of the sorted_ids array
    sorted_ids = sorted_ids[::-1]
    
    # get the indices of the k most similar candidate vectors
    k_idx = sorted_ids[:k]
    ### END CODE HERE ###
    return k_idx
# UNQ_C9 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

# Test your implementation:
v = np.array([1, 0, 1])
candidates = np.array([[1, 0, 5], [-2, 5, 3], [2, 0, 1], [6, -9, 5], [9, 9, 9]])
print(candidates[nearest_neighbor(v, candidates, 3)])
[[2 0 1]
 [1 0 5]
 [9 9 9]]

Expected Output:

[[2 0 1]
 [1 0 5]
 [9 9 9]]

# Test your function
w4_unittest.test_nearest_neighbor(nearest_neighbor)
 All tests passed

Test your translation and compute its accuracy#

Exercise 06: Complete the function test_vocabulary which takes in English embedding matrix \(X\), French embedding matrix \(Y\) and the \(R\) matrix and returns the accuracy of translations from \(X\) to \(Y\) by \(R\).

  • Iterate over transformed English word embeddings and check if the closest French word vector belongs to French word that is the actual translation.

  • Obtain an index of the closest French embedding by using nearest_neighbor (with argument k=1), and compare it to the index of the English embedding you have just transformed.

  • Keep track of the number of times you get the correct translation.

  • Calculate accuracy as \(\text{accuracy}=\frac{\#(\text{correct predictions})}{\#(\text{total predictions})}\)

# UNQ_C10 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def test_vocabulary(X, Y, R, nearest_neighbor=nearest_neighbor):
    '''
    Input:
        X: a matrix where the rows are the English embeddings.
        Y: a matrix where the rows correspond to the French embeddings.
        R: the transform matrix which translates word embeddings from
        English to French word vector space.
    Output:
        accuracy: the fraction of English words whose nearest French embedding is the correct translation
    '''

    ### START CODE HERE ###
    # The prediction is X times R
    pred = X@R

    # initialize the number correct to zero
    num_correct = 0

    # loop through each row in pred (each transformed embedding)
    for i in range(len(pred)):
        # get the index of the nearest neighbor of pred at row 'i'; also pass in the candidates in Y
        pred_idx = nearest_neighbor(pred[i], Y, k=1).item()

        # if the index of the nearest neighbor equals the row index i...
        if pred_idx == i:
            # increment the number correct by 1.
            num_correct += 1

    # accuracy is the number correct divided by the number of rows in 'pred' (also number of rows in X)
    accuracy = num_correct/pred.shape[0]

    ### END CODE HERE ###

    return accuracy

Let’s see how your translation mechanism works on unseen data:

X_val, Y_val = get_matrices(en_fr_test, fr_embeddings_subset, en_embeddings_subset)
# UNQ_C11 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

acc = test_vocabulary(X_val, Y_val, R_train)  # this might take a minute or two
print(f"accuracy on test set is {acc:.3f}")
accuracy on test set is 0.557

Expected Output:

0.557

You managed to translate words from one language to another without ever seeing them, with almost 56% accuracy, by using some basic linear algebra and learning a mapping of words from one language to another!

# Test your function
w4_unittest.unittest_test_vocabulary(test_vocabulary)
 All tests passed
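As a final, ungraded illustration, you can translate a single word end to end with the pieces built above; the snippet makes a few assumptions for the sake of the example (it picks the first training word that has an embedding and searches over the full French subset, so it may take a little while to run).

# Ungraded example: translate one English word end to end with R_train and nearest_neighbor.
fr_words = list(fr_embeddings_subset.keys())
fr_matrix = np.vstack([fr_embeddings_subset[w] for w in fr_words])

# pick an English word from the training dictionary that has an embedding (chosen only for illustration)
word = next(w for w in en_fr_train if w in en_embeddings_subset)

# transform the English embedding into the French embedding space and find its nearest neighbor
transformed = en_embeddings_subset[word] @ R_train
idx = nearest_neighbor(transformed, fr_matrix, k=1).item()

print(f"'{word}' -> '{fr_words[idx]}' (reference translation: '{en_fr_train[word]}')")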

4 Conclusion#

Congratulations - Now you can look up vectors that are similar to the encoding of your tweet using LSH!