"*Figure 1*\n",
"\n",
"Use the `en_fr` dictionary to ensure that the ith row in the `X` matrix\n",
"corresponds to the ith row in the `Y` matrix."
]
},
{
"cell_type": "markdown",
"id": "bb6efd79",
"metadata": {},
"source": [
"**Instructions**: Complete the function `get_matrices()`:\n",
"* Iterate over the English words in the `en_fr` dictionary.\n",
"* Check if the word has both an English and a French embedding."
]
},
{
"cell_type": "markdown",
"id": "9ffbf84a",
"metadata": {},
"source": [
"\n",
"**Hints**\n",
"\n",
"* Sets are useful data structures that can be used to check if an item is a member of a group.\n",
"* You can get the words that have embeddings in a language by using the `keys()` method.\n",
"* Keep the vectors in `X` and `Y` sorted in lists. You can use `np.vstack()` to merge them into a numpy matrix.\n",
"* `numpy.vstack` stacks the items in a list as rows of a matrix.\n",
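The hint steps above can be sketched as follows. This is a minimal sketch, assuming `en_fr` maps English words to French words and that `english_vecs`/`french_vecs` map words to numpy vectors; the exact signatures in the assignment may differ:

```python
import numpy as np

def get_matrices(en_fr, french_vecs, english_vecs):
    # Rows are aligned: the i-th row of X (an English embedding)
    # pairs with the i-th row of Y (its French translation's embedding).
    X_l, Y_l = [], []
    english_words = set(english_vecs.keys())
    french_words = set(french_vecs.keys())
    for en_word, fr_word in en_fr.items():
        # Keep the pair only if both words have embeddings.
        if en_word in english_words and fr_word in french_words:
            X_l.append(english_vecs[en_word])
            Y_l.append(french_vecs[fr_word])
    # np.vstack stacks the list of vectors as rows of a matrix.
    return np.vstack(X_l), np.vstack(Y_l)
```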

"\n",
"*Figure 2*\n",
"\n",
"Write a program that translates English words to French words using word embeddings and vector space models. \n",
"\n",
"\n",
"## 2.1 Translation as linear transformation of embeddings\n",
"\n",
"Given dictionaries of English and French word embeddings you will create a transformation matrix `R`\n",
"* Given an English word embedding, $\\mathbf{e}$, you can multiply $\\mathbf{eR}$ to get a new word embedding $\\mathbf{f}$.\n",
" * Both $\\mathbf{e}$ and $\\mathbf{f}$ are [row vectors](https://en.wikipedia.org/wiki/Row_and_column_vectors).\n",
"* You can then compute the nearest neighbors to `f` in the French embeddings and recommend the word that is most similar to the transformed word embedding."
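A toy numeric example of the transformation step, with made-up values for `e` and `R`:

```python
import numpy as np

# An English embedding e (row vector) times R gives
# the predicted French embedding f.
e = np.array([[1.0, 2.0]])
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])
f = np.dot(e, R)
print(f)  # [[2. 1.]]
```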
]
},
{
"cell_type": "markdown",
"id": "f8c25134",
"metadata": {},
"source": [
"### Describing translation as the minimization problem\n",
"\n",
"Find a matrix `R` that minimizes the following equation. \n",
"\n",
"$$\\arg \\min _{\\mathbf{R}}\\| \\mathbf{X R} - \\mathbf{Y}\\|_{F}\\tag{1} $$\n",
"\n",
"### Frobenius norm\n",
"\n",
"The Frobenius norm of a matrix $A$ (assuming it is of dimension $m,n$) is defined as the square root of the sum of the absolute squares of its elements:\n",
"\n",
"$$\\|\\mathbf{A}\\|_{F} \\equiv \\sqrt{\\sum_{i=1}^{m} \\sum_{j=1}^{n}\\left|a_{i j}\\right|^{2}}\\tag{2}$$"
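As a quick sanity check of the definition, computed two ways on a small example matrix:

```python
import numpy as np

A = np.array([[2.0, 2.0],
              [2.0, 2.0]])
# Square root of the sum of squared entries: sqrt(16) = 4.
manual = np.sqrt(np.sum(np.square(A)))
# np.linalg.norm computes the Frobenius norm of a 2-D array by default.
builtin = np.linalg.norm(A)
print(manual, builtin)
```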
]
},
{
"cell_type": "markdown",
"id": "b6d49bc3",
"metadata": {},
"source": [
"### Actual loss function\n",
"In real-world applications, the Frobenius norm loss:\n",
"\n",
"$$\\| \\mathbf{XR} - \\mathbf{Y}\\|_{F}$$\n",
"\n",
"is often replaced by its squared value divided by $m$:\n",
"\n",
"$$ \\frac{1}{m} \\| \\mathbf{X R} - \\mathbf{Y} \\|_{F}^{2}$$\n",
"\n",
"where $m$ is the number of examples (rows in $\\mathbf{X}$).\n",
"\n",
"* The same R is found when using this loss function versus the original Frobenius norm.\n",
"* The reason for taking the square is that it's easier to compute the gradient of the squared Frobenius norm than of the norm itself.\n",
"* The reason for dividing by $m$ is that we're more interested in the average loss per embedding than the loss for the entire training set.\n",
" * The loss for all training set increases with more words (training examples),\n",
" so taking the average helps us to track the average loss regardless of the size of the training set."
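A minimal sketch of this averaged squared-Frobenius loss (the name `compute_loss` is an assumption, not necessarily the assignment's exact name):

```python
import numpy as np

def compute_loss(X, Y, R):
    # m is the number of training examples (rows of X).
    m = X.shape[0]
    diff = np.dot(X, R) - Y
    # Squared Frobenius norm (sum of squared entries), averaged over m.
    return np.sum(np.square(diff)) / m
```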
]
},
{
"cell_type": "markdown",
"id": "453879ff",
"metadata": {},
"source": [
"##### [Optional] Detailed explanation why we use norm squared instead of the norm:\n",
"\n",
"**Click for optional details**\n",
"\n",
"* The norm is always nonnegative (it is the square root of a sum of squares), and so is its square.\n",
"* Squaring non-negative numbers preserves their order: for example, since $3 > 2$, also $3^2 > 2^2$.\n",
"* Using the norm or the squared norm in gradient descent results in the same *location* of the minimum.\n",
"* Squaring cancels the square root in the Frobenius norm formula. Because of the chain rule, we would have to do more calculations if the expression kept the square root.\n",
"* Dividing the function value by a positive number doesn't change the location of the optimum, for the same reason as above.\n",
"* We're interested in transforming English embeddings into French ones, so it is more meaningful to measure the average loss per embedding than the loss over the entire dictionary (which increases as the number of words in the dictionary grows).\n",
\n",
"**Hints**\n",
"\n",
"* Useful functions: `numpy.dot`, `numpy.sum`, `numpy.square`, `numpy.linalg.norm`.\n",
"* Be careful about which operations are elementwise and which are matrix multiplications.\n",
"* Try to use matrix operations instead of the numpy norm function. If you choose to use the norm function, take care of its extra arguments and make sure your function returns the squared loss, not the loss itself.\n",
\n",
"**Hints**\n",
"\n",
"* Transposing in numpy: `numpy.transpose` or the `.T` attribute.\n",
"* Finding out the dimensions of matrices in numpy: the `.shape` attribute.\n",
"* Remember to use `numpy.dot` for matrix multiplication.\n",
\n",
"**Click here for a detailed discussion**\n",
"\n",
"* You cannot rely on the training loss getting low -- what you really want is the validation loss to go down, or the validation accuracy to go up. Indeed, in some cases people train until the validation accuracy reaches a threshold, or -- in what is commonly known as \"early stopping\" -- until the validation accuracy starts to go down, which is a sign of over-fitting.\n",
"* Why not always do \"early stopping\"? Mostly because well-regularized models on larger data sets never stop improving. Especially in NLP, you can often continue training for months and the model will keep getting slightly better. This is also why it's hard to just stop at a threshold -- unless there's an external customer setting the threshold, why stop, and where do you put it?\n",
"* Stopping after a certain number of steps has the advantage that you know how long your training will take, so you can keep some sanity and not train for months. You can then try to get the best performance within this time budget. Another advantage is that you can fix your learning rate schedule -- e.g., lower the learning rate at 10% before the finish, and again at 1% before the finish. Such learning rate schedules help a lot, but they are harder to set up if you don't know how long you'll be training.\n",
\n",
"**Hints**\n",
"\n",
"* Use the `compute_gradient()` function to get the gradient at each step.\n",
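For reference, the gradient of the averaged squared-Frobenius loss with respect to `R` is `(2/m) * X.T @ (X R - Y)`. A minimal sketch of a `compute_gradient()` along those lines (the exact signature in the assignment may differ):

```python
import numpy as np

def compute_gradient(X, Y, R):
    # d/dR of (1/m) * ||XR - Y||_F^2  =  (2/m) * X^T (XR - Y)
    m = X.shape[0]
    return np.dot(X.T, np.dot(X, R) - Y) * (2.0 / m)
```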

\n",
"**Hints**\n",
"\n",
"* `numpy.argsort` sorts values from most negative to most positive (smallest to largest).\n",
"* The candidates that are nearest to `v` should have the highest cosine similarity.\n",
"* To reverse the order of the result of `numpy.argsort`, so that the element with the highest cosine similarity comes first, you can use `tmp[::-1]`. This reverses the order of an array. Then you can extract the first `k` elements.\n",
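Putting those hints together, a minimal sketch (the helper names `cosine_similarity` and `nearest_neighbor` are assumptions, not necessarily the assignment's exact signatures):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest_neighbor(v, candidates, k=1):
    # Similarity of v to every candidate row.
    similarity_l = [cosine_similarity(v, row) for row in candidates]
    # argsort gives ascending order; reverse so the most similar come first.
    sorted_ids = np.argsort(similarity_l)[::-1]
    # Keep the indices of the k most similar candidates.
    return sorted_ids[:k]
```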

\n",
"**Hints**\n",
"\n",
"* You can handle missing words more easily by using the `get()` method of the Python dictionary instead of the bracket notation (i.e. `[ ]`).\n",
"* The default value for a missing word should be the zero vector. Numpy will broadcast a simple 0 scalar into a vector of zeros during the summation.\n",
"* Alternatively, skip the addition if a word is not in the dictionary.\n",
"* You can use your `process_tweet()` function to process the tweet; it takes in a tweet and returns a list of words.\n",
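A minimal sketch of building a document embedding this way, assuming 300-dimensional word vectors in an `en_embeddings` dictionary and a `process_tweet` function that returns a list of words (both names follow the text above):

```python
import numpy as np

def get_document_embedding(tweet, en_embeddings, process_tweet):
    doc_embedding = np.zeros(300)
    for word in process_tweet(tweet):
        # get() returns 0 for a missing word; numpy broadcasts
        # the scalar 0 during the addition.
        doc_embedding = doc_embedding + en_embeddings.get(word, 0)
    return doc_embedding
```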

"*Figure 3*\n",
"\n",
"You can divide the vector space into regions and search within one region for nearest neighbors of a given vector.\n",
"\n",
"*Figure 4*"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "aa9337e1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of vectors is 10000 and each has 300 dimensions.\n"
]
}
],
"source": [
"N_VECS = len(all_tweets) # This many vectors.\n",
"N_DIMS = len(ind2Tweet[1]) # Vector dimensionality.\n",
"print(f\"Number of vectors is {N_VECS} and each has {N_DIMS} dimensions.\")"
]
},
{
"cell_type": "markdown",
"id": "80bbfbcf",
"metadata": {},
"source": [
"#### Choosing the number of planes\n",
"\n",
"* Each plane divides the space into $2$ parts.\n",
"* So $n$ planes divide the space into $2^{n}$ hash buckets.\n",
"* We want to organize the 10,000 document vectors into buckets so that every bucket has about $16$ vectors.\n",
"* For that we need $\\frac{10000}{16}=625$ buckets.\n",
"* We're interested in $n$, the number of planes, such that $2^{n} \\geq 625$. We can calculate $n=\\log_{2}625 \\approx 9.29$ and round up to $10$."
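A quick numeric check of that calculation:

```python
import math

n_buckets = 10000 / 16            # 625 buckets needed
n_planes = math.log2(n_buckets)   # about 9.29
print(n_buckets, math.ceil(n_planes))
```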
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "a05f53b3",
"metadata": {},
"outputs": [],
"source": [
"# The number of planes. log2(10000/16) = log2(625) is roughly 10, giving ~16 vectors/bucket.\n",
"N_PLANES = 10\n",
"# Number of times to repeat the hashing to improve the search.\n",
"N_UNIVERSES = 25"
]
},
{
"cell_type": "markdown",
"id": "481dab0f",
"metadata": {},
"source": [
"\n",
"\n",
"## 3.4 Getting the hash number for a vector\n",
"\n",
"For each vector, we need to get a unique number associated to that vector in order to assign it to a \"hash bucket\".\n",
"\n",
"### Hyperplanes in vector spaces\n",
"* In $3$-dimensional vector space, the hyperplane is a regular plane. In $2$ dimensional vector space, the hyperplane is a line.\n",
"* Generally, a hyperplane is a subspace whose dimension is $1$ lower than that of the original vector space.\n",
"* A hyperplane is uniquely defined by its normal vector.\n",
"* The normal vector $n$ of a plane $\\pi$ is the vector to which all vectors in the plane $\\pi$ are orthogonal (perpendicular, in the $3$-dimensional case).\n",
"\n",
"### Using Hyperplanes to split the vector space\n",
"We can use a hyperplane to split the vector space into $2$ parts.\n",
"* All vectors whose dot product with a plane's normal vector is positive are on one side of the plane.\n",
"* All vectors whose dot product with the plane's normal vector is negative are on the other side of the plane.\n",
"\n",
"### Encoding hash buckets\n",
"* For a vector, we can take its dot product with all the planes, then encode this information to assign the vector to a single hash bucket.\n",
"* When the vector points to the opposite side of the hyperplane from the normal vector, encode it as 0.\n",
"* Otherwise, if the vector is on the same side as the normal vector, encode it as 1.\n",
"* If you calculate the dot product with each plane in the same order for every vector, you've encoded each vector's unique hash ID as a binary number, like [0, 1, 1, ... 0]."
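The side-of-plane test above can be sketched as follows, with the plane given by its normal vector (the helper name `side_of_plane` is an assumption):

```python
import numpy as np

def side_of_plane(normal, v):
    # 1 if v is on the same side as the normal vector
    # (non-negative dot product), 0 otherwise.
    return 1 if np.dot(normal, v) >= 0 else 0
```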
]
},
{
"cell_type": "markdown",
"id": "d92c9e5b",
"metadata": {},
"source": [
"\n",
"\n",
"### Exercise 09: Implementing hash buckets\n",
"\n",
"We've initialized the hash table `hashes` for you. It is a list of `N_UNIVERSES` matrices, each describing its own hash table. Each matrix has `N_DIMS` rows and `N_PLANES` columns. Every column of a matrix is an `N_DIMS`-dimensional normal vector for one of the `N_PLANES` hyperplanes used to create the buckets of that particular hash table.\n",
"\n",
"*Exercise*: Your task is to complete the function `hash_value_of_vector` which places vector `v` in the correct hash bucket.\n",
"\n",
"* First multiply your vector `v` with a corresponding plane. This will give you a vector of dimension $(1,\\text{N_planes})$.\n",
"* You will then convert every element in that vector to 0 or 1.\n",
"* You create a hash vector by doing the following: if the element is negative, it becomes a 0, otherwise you change it to a 1.\n",
"* You then compute the unique number for the vector by iterating over `N_PLANES`\n",
"* Then you multiply $2^i$ times the corresponding bit (0 or 1).\n",
"* You will then store that sum in the variable `hash_value`.\n",
"\n",
"**Instructions:** Create a hash for the vector in the function below.\n",
"Use this formula:\n",
"\n",
"$$ hash = \\sum_{i=0}^{N-1} \\left( 2^{i} \\times h_{i} \\right) $$"
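A minimal sketch of `hash_value_of_vector` following the steps and formula above, assuming `v` is a row vector of shape (1, N_DIMS) and `planes` has shape (N_DIMS, N_PLANES):

```python
import numpy as np

def hash_value_of_vector(v, planes):
    # Dot product of v with each plane's normal vector: shape (1, N_PLANES).
    dot_product = np.dot(v, planes)
    # Sign -> bit: the negative side encodes 0, the non-negative side 1.
    h = (np.squeeze(dot_product) >= 0).astype(int)
    # hash = sum over i of 2^i * h_i
    hash_value = 0
    for i, bit in enumerate(h):
        hash_value += (2 ** i) * int(bit)
    return hash_value
```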
]
},
{
"cell_type": "markdown",
"id": "7900bb6f",
"metadata": {},
"source": [
"#### Create the sets of planes\n",
"* Create multiple (25) sets of planes (the planes that divide up the region).\n",
"* You can think of these as 25 separate ways of dividing up the vector space with a different set of planes.\n",
"* Each element of this list contains a matrix with 300 rows (the word vectors have 300 dimensions) and 10 columns (there are 10 planes in each \"universe\").\n",
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "b767a68f",
"metadata": {},
"outputs": [],
"source": [
"np.random.seed(0)\n",
"planes_l = [np.random.normal(size=(N_DIMS, N_PLANES))\n",
" for _ in range(N_UNIVERSES)]"
]
},
{
"cell_type": "markdown",
"id": "889ba14c",
"metadata": {},
"source": [
"\n",
"\n",
"We have given you the `make_hash_table` function, which maps the tweet vectors to a bucket and stores the vector there. It returns the `hash_table` and the `id_table`. The `id_table` lets you know which vector in a certain bucket corresponds to which tweet."
]
},
{
"cell_type": "markdown",
"id": "7241ad85",
"metadata": {},
"source": [
"**Hints**\n",
"\n",
"* `numpy.squeeze()` removes single-dimensional entries from an array; for instance, it converts a (10,1) 2D array into a (10,) 1D array.\n",
\n",
"**Hints**\n",
"\n",
"* A dictionary comprehension, similar to a list comprehension, looks like this: `{i: 0 for i in range(10)}`, where the key is `i` and the value is zero for all key-value pairs.\n",
\n",
"**Hints**\n",
"\n",
"* There are many dictionaries used in this function. Try printing out `planes_l`, `hash_tables`, and `id_tables` to understand how they are structured, what the keys represent, and what the values contain.\n",
"* To remove an item from a list, use `.remove()`.\n",
"* To append to a list, use `.append()`.\n",
"* To add to a set, use `.add()`.\n",