Assignment 2: Deep N-grams

Welcome to the second assignment of Course 3. In this assignment you will explore Recurrent Neural Networks (RNNs).

  • You will be using the fundamentals of Google's trax package to implement any kind of deep learning model.

By completing this assignment, you will learn how to implement models from scratch:

  • How to convert a line of text into a tensor
  • How to create an iterator to feed data to the model
  • How to define a GRU model using trax
  • How to train the model using trax
  • How to evaluate your model using perplexity
  • How to predict using your own model

Outline:

  • Part 1: Importing the Data
      • 1.1 Loading in the data
      • 1.2 Convert a line to tensor (Exercise 01)
      • 1.3 Batch generator (Exercise 02)
      • 1.4 Repeating batch generator
  • Part 2: Defining the GRU model (Exercise 03)
  • Part 3: Training
      • 3.1 Training the model (Exercise 04)
  • Part 4: Evaluation
      • 4.1 Evaluating using the deep nets (Exercise 05)
  • Part 5: Generating the language with your own model

Your task will be to predict the next set of characters using the previous characters.

  • Although this task sounds simple, it is pretty useful.
  • You will start by converting a line of text into a tensor.
  • Then you will create a generator to feed data into the model.
  • You will train a neural network to predict the next set of characters of a defined length.
  • You will use embeddings for each character and feed them as inputs to your model. Many natural language tasks rely on using embeddings for predictions.
  • Your model will convert each character to its embedding, run the embeddings through a Gated Recurrent Unit (GRU), and run the output through a linear layer to predict the next set of characters.

[Figure: summary of the model you are about to implement]

The figure above gives you a summary of what you are about to implement.

  • You will get the embeddings;
  • Stack the embeddings on top of each other;
  • Run them through two layers with a ReLU activation in the middle;
  • Finally, you will compute the softmax.

To predict the next character:

  • Use the softmax output and identify the character with the highest probability.
  • The character with the highest probability is the prediction for the next character.

Part 1: Importing the Data


Now import the dataset and do some processing.

  • The dataset has one sentence per line.
  • You will be doing character generation, so you have to process each sentence by converting each character (and not word) to a number.
  • You will use the ord function to convert a unique character to a unique integer ID.
  • Store each line in a list.
  • The max_length corresponds to the maximum length of the sentence.
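As a rough illustration of that preprocessing step, here is a minimal sketch (the helper name line_to_tensor and the EOS_int default are assumptions; the notebook defines the exact signature):

    def line_to_tensor(line, EOS_int=1):
        """Convert a line of text to a list of integer character IDs, ending with an end-of-sentence ID."""
        tensor = [ord(c) for c in line]   # ord maps each character to a unique integer
        tensor.append(EOS_int)            # mark the end of the line
        return tensor

    # line_to_tensor("abc") -> [97, 98, 99, 1]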

Welcome to Assignment 2! ¶

In this assignment, your primary goal is to implement unigram and bigram language models and evaluate their performance. You'll use the equations from Chapter 3 of SLP; in particular you will implement maximum likelihood estimation (equations 3.11 and 3.12) with add-k smoothing (equation 3.25), as well as a perplexity calculation to test your models (equation 3.16, but explained more in this document and skeleton code).

The skeleton code for this assignment is available at https://faculty.wcas.northwestern.edu/robvoigt/courses/2021_fall/ling334/assignments/a2.zip . I expect this assignment may be trickier for many of you than the last one, so start early!

There is a small interface given so you can test your program by running it from the command line; it will run your model on some data accompanying the assignment (specifically, Sam I Am) and report its performance.

A Note on Object-Oriented Programming ¶

Many of you may not be familiar with the idea of object-oriented programming, or how it plays out in python. You don't need to go deep into the details, but if you're interested in getting more detail on what's going on, you can start with Chapters 15-17 of Think Python .

For the purposes of this assignment, notice that there is a large structure at the top level (leftmost indentation) that is defined with the keyword class. This is us making a definition of a new type of object, an NgramLanguageModel. Doing so allows us to associate all the various data (for instance, counts from a corpus) and functions (for instance, to accumulate those counts or produce a probability) with a given "instance" of that object in a persistent manner.

Once the class is defined, we can produce an instance as follows:

    ngram_lm = NgramLanguageModel()

The parens on the end look like a function call, and that's because they are - specifically a special "constructor" function that creates an object of the NgramLanguageModel type. In the above, ngram_lm now contains an instance of that object, set up according to the special __init__(self) function at the top of the class (e.g., it has the *_counts dicts and the k value set).

A note on self: this is a special self-referential keyword, by which a class can reference the variables (== attributes of the class) and functions (== methods of the class) it contains from inside itself. In this assignment, you'll be updating the self.unigram_counts and self.bigram_counts variables - you have to use self. as a prefix to access them.
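For orientation, here is a stripped-down sketch of what a class along these lines might look like (attribute names follow the description above; the actual skeleton code is more complete and its details may differ):

    class NgramLanguageModel:
        def __init__(self, k=0.01):
            # attributes live on the instance and are reached through self
            self.unigram_counts = {}
            self.bigram_counts = {}
            self.k = k          # add-k smoothing constant

        def train(self, corpus_path):
            # methods can read and update those attributes via self
            pass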

This is all for your information - the object-oriented setup is done for you in the skeleton code, your job is to modify the functions inside the class to do the various things we need to do.

Your Jobs ¶

There are a number of ways to build a language model; I've set up a basic interface which I'd like you to stick with for the basic submission, but you can change things around wildly in any extensions.

For this assignment the main things you need to do are to write the train, predict_unigram, predict_bigram, and test_perplexity functions.

train ¶

The way it's set up, the train function need only accumulate counts; the resulting probabilities can be calculated at test time in the predict_* functions. This is slower at test time though, so one simple extension is to modify this (see below).

You can expect the training corpus to contain one sentence per line, already tokenized, so you can split it up on whitespace (e.g., sentence.split()). Important reminder that you must add <s> tokens to the beginning and </s> tokens to the end of every sentence.
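As a sketch of what that accumulation might look like, assuming a train(self, corpus_path) signature and plain dict counting (the skeleton's docstring defines the real interface):

    def train(self, corpus_path):
        """Accumulate unigram and bigram counts from a tokenized, one-sentence-per-line corpus."""
        with open(corpus_path) as f:
            for line in f:
                tokens = ['<s>'] + line.split() + ['</s>']   # add start and end tokens
                for i, token in enumerate(tokens):
                    self.unigram_counts[token] = self.unigram_counts.get(token, 0) + 1
                    if i > 0:
                        bigram = (tokens[i - 1], token)
                        self.bigram_counts[bigram] = self.bigram_counts.get(bigram, 0) + 1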

predict_unigram, predict_bigram ¶

These functions should take a sentence (as a string), split it into tokens on whitespace, add start and end tokens, and then calculate the probability of that sentence sequence using a unigram or bigram language model, respectively.

  • Use add-k smoothing in this calculation. This is very similar to maximum likelihood estimation, but adding k to the numerator and k * vocab_size to the denominator (see Equation 3.25 in the textbook).
  • Return log probabilities! As talked about in class, we want to do these calculations in log-space because of floating point underflow problems. Do this per word! Even a long sentence could easily get us to an underflow. So create a float that starts at 0.0, and for each word where you get the probability, do += math.log(prob) onto that float variable. Return this sum in the end.
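A sketch of the bigram case under those instructions (taking the vocabulary size to be the number of unigram types is an assumption; handle <s> and the vocabulary however your solution specifies):

    import math

    def predict_bigram(self, sentence):
        """Return the add-k smoothed log probability of a sentence under the bigram model."""
        tokens = ['<s>'] + sentence.split() + ['</s>']
        vocab_size = len(self.unigram_counts)
        log_prob = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            numerator = self.bigram_counts.get((prev, word), 0) + self.k
            denominator = self.unigram_counts.get(prev, 0) + self.k * vocab_size
            log_prob += math.log(numerator / denominator)   # accumulate per word, in log space
        return log_prob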

test_perplexity ¶

This function takes the path to a test corpus as input and calculates your model's perplexity (a normalization of the total log-likelihood) on that corpus.

The basic gist here is quite simple - use your predict_* functions to calculate sentence-level log probabilities and sum them up, then convert to perplexity by taking the exponential of the negative average log probability:

$PP = \exp\left(-\frac{1}{N} \sum \log P\right)$

Where N is the total number of words seen. There are some additional details given in the function docstring in the skeleton code.
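A sketch of that computation, assuming a model_type flag to switch between the two predictors and counting every token plus the </s> marker toward N (the skeleton's docstring is authoritative):

    import math

    def test_perplexity(self, test_path, model_type='bigram'):
        """Compute perplexity of a test corpus under the unigram or bigram model."""
        predict = self.predict_bigram if model_type == 'bigram' else self.predict_unigram
        total_log_prob = 0.0
        num_tokens = 0
        with open(test_path) as f:
            for line in f:
                total_log_prob += predict(line)
                num_tokens += len(line.split()) + 1   # + 1 for the </s> token
        return math.exp(-total_log_prob / num_tokens)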

Autograder ¶

The autograder will create an instance of your NgramLanguageModel class, train it on the first part of Sam I Am, test it on the last part, and check that the results roughly line up with what I get in my solution.

At a minimum, your assignment should run and pass all tests using this autograder before you submit!

Extensions ¶

If you've got more gas in the tank, journey onward!

When implementing any of these, please leave your working NgramLanguageModel class intact and perhaps copy-paste it to a new class or new script to be modified, so the autograder still works. I suggest doing cp language_modeling.py language_modeling_ext.py once your initial class works, and editing the _ext.py file instead of the original.

  • Add generation. Flip it around! Add a method to the class which uses your model to generate text. This is a bit tricky but I think very useful for better understanding these models. To generate a sentence with a bigram model, for instance, start with the <s> token and sample the next word in proportion to each candidate word's probability of following that token. The easiest way to do this in my view is using numpy, specifically the numpy.random.choice function - you can put the words into the a argument and the corresponding probabilities into the p argument and sample the word. Now look at all the words that follow that word, and sample again. Continue sampling until you happen to sample the end-of-sentence token </s>, at which point complete the generation. (A rough sketch appears after this list.)
  • Try with more and varied data. In the course data directory there is an a2 subfolder ( /projects/e31408/data/a2 ) where there are a few additional corpora you could try out your models on. In particular there is a copy of the Brown Corpus split into train/dev/test splits (1.1M words total), a large set of presidential speeches split into train/dev/test splits (4.5M words total), all the works of Shakespeare, and all the lyrics of Beyonce and TSwift. In addition to simply trying more data, you can see how models trained on certain data work when applied to other data.
  • Speed up test time performance. You can change the structure of the NgramLanguageModel class to calculate all possible probabilities at training time, which will substantially increase the speed of test-time inference since that will become a matter of looking up log probabilities and adding them up rather than performing the calculation each time a given n-gram appears.
  • Add trigrams. Extend your LM to include trigram estimation as well. Is it better or worse on the Sam-I-Am data? What about on larger datasets? Why or why not? Does generation look qualitatively different compared to bigrams?
  • Add arbitrary n-grams. While unigram LMs are something of a special case, n-gram LMs where $n \geq 2$ can be coded in an arbitrary manner where you can simply pass the integer for the n-gram size you want to use. Try refactoring your code to do this.
  • Add backoff, interpolation, or more complex smoothing. The book notes that add-k smoothing is actually not ideal in practice and lists several follow-ups of various types that improve model performance. Pick one or more, try implementing them, and see how low you can get your perplexity!
  • Neural LMs. If you want to go buck wild here, you could read about neural language models like in SLP here or in this D2L book and have a go at implementing, for instance, a basic RNN! Something along these lines could easily be substantial enough to be a final project as well, so keep this in your back pocket if you're interested.
  • Read and report. Read a relevant article, post about it on Ed!
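For the generation extension mentioned above, a rough sketch (this assumes bigram_counts keyed by (previous, word) tuples and samples from unsmoothed observed continuations; adapt it to your own class):

    import numpy as np

    def generate_sentence(self):
        """Sample a sentence from the bigram model, starting at <s> and stopping at </s>."""
        current, words = '<s>', []
        while True:
            # every word observed after `current`, with its count
            followers = {w: c for (prev, w), c in self.bigram_counts.items() if prev == current}
            candidates = list(followers)
            probs = np.array(list(followers.values()), dtype=float)
            probs = probs / probs.sum()                        # counts -> probabilities
            current = np.random.choice(candidates, p=probs)    # sample the next word
            if current == '</s>':
                return ' '.join(words)
            words.append(current)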

And as usual, whatever else you can dream up is also welcome!

Submission ¶

To officially submit your assignment, please fill out this form:

https://forms.gle/FCdncfDdetuZ43ud7

This will let us know your assignment is ready for us to go check out!

Deep N-Grams: Evaluating the Model

Cloistered Monkey

2021-01-05 16:48

Table of Contents

  • Pre-built model
  • Evaluating the model


Now that you have learned how to train a model, you will learn how to evaluate it. To evaluate language models, we usually use perplexity, which is a measure of how well a probability model predicts a sample. Note that perplexity is defined as:

\[ P(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})}} \]

As an implementation hack, you would usually take the log of that formula (to enable us to use the log probabilities we get as output of our RNN, convert exponents to products, and products into sums, which makes computations less complicated and computationally more efficient). You should also take care of the padding, since you do not want to include the padding when calculating the perplexity (because we do not want to have a perplexity measure that is artificially good).

Instructions: Write a program that will help evaluate your model. Implementation hack: your program takes in preds and target. Preds is a tensor of log probabilities. You can use tl.one_hot to transform the target into the same dimension. You then multiply them and sum.

You also have to create a mask to only get the non-padded probabilities. Good luck!

  • To convert the target into the same dimension as the predictions tensor, use tl.one_hot with target and preds.shape[-1].
  • You will also need the np.equal function in order to unpad the data and properly compute perplexity.
  • Keep in mind while implementing the formula above that \(w_i\) represents a character from our 256-character alphabet.
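A rough numpy-only sketch of that masked log-perplexity computation (the graded version works with trax tensors and tl.one_hot; the function name, the pad_id argument, and the exact reduction order here are assumptions):

    import numpy as np

    def log_perplexity(preds, target, pad_id=0):
        """preds: log probabilities, shape (batch, seq_len, vocab_size); target: character IDs, shape (batch, seq_len)."""
        vocab_size = preds.shape[-1]
        one_hot = np.eye(vocab_size)[target]        # same shape as preds, 1.0 at each target ID
        log_p = np.sum(preds * one_hot, axis=-1)    # log probability of each target character
        non_pad = 1.0 - np.equal(target, pad_id)    # 1.0 for real characters, 0.0 for padding
        log_ppx = np.sum(log_p * non_pad, axis=-1) / np.sum(non_pad, axis=-1)  # mean over non-pad positions
        return -np.mean(log_ppx)                    # negative mean log probability = log perplexity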

We're going to start with a pre-built file and see how it does relative to our model.

On the one hand I over-trained my model, on the other hand… why such a big difference?


Assignment 4: Word Embeddings


Welcome to the fourth (and last) programming assignment of Course 2!

In this assignment, you will practice how to compute word embeddings and use them for sentiment analysis.

To implement sentiment analysis, you can go beyond counting the number of positive words and negative words.

You can find a way to represent each word numerically, by a vector.

The vector could then represent syntactic (i.e. parts of speech) and semantic (i.e. meaning) structures.

In this assignment, you will explore a classic way of generating word embeddings or representations.

You will implement a famous model called the continuous bag of words (CBOW) model.

By completing this assignment you will:

  • Train word vectors from scratch.
  • Learn how to create batches of data.
  • Understand how backpropagation works.
  • Plot and visualize your learned word vectors.

Knowing how to train these models will give you a better understanding of word vectors, which are building blocks to many applications in natural language processing.

Important Note on Submission to the AutoGrader #

Before submitting your assignment to the AutoGrader, please make sure of the following:

  • You have not added any extra print statement(s) in the assignment.
  • You have not added any extra code cell(s) in the assignment.
  • You have not changed any of the function parameters.
  • You are not using any global variables inside your graded exercises. Unless specifically instructed to do so, please refrain from it and use the local variables instead.
  • You are not changing the assignment code where it is not required, like creating extra variables.

If you do any of the above, you will get something like a Grader not found (or similarly unexpected) error upon submitting your assignment. Before asking for help or debugging the errors in your assignment, check for these first. If this is the case, and you don’t remember the changes you have made, you can get a fresh copy of the assignment by following these instructions.

Outline:

  • 1 The Continuous bag of words model
  • 2 Training the Model
      • 2.0 Initialize the model (Exercise 01)
      • 2.1 Softmax Function (Exercise 02)
      • 2.2 Forward Propagation (Exercise 03)
      • 2.3 Cost Function
      • 2.4 Backpropagation (Exercise 04)
      • 2.5 Gradient Descent (Exercise 05)
  • 3 Visualizing the word vectors

1. The Continuous bag of words model #

Let’s take a look at the following sentence:

‘I am happy because I am learning’.

In continuous bag of words (CBOW) modeling, we try to predict the center word given a few context words (the words around the center word).

For example, if you were to choose a context half-size of, say, \(C = 2\), then you would try to predict the word happy given the context that includes 2 words before and 2 words after the center word:

  • \(C\) words before: [I, am]
  • \(C\) words after: [because, I]

In other words, the context is [I, am, because, I] and the center word to predict is happy.

The structure of your model will look like this:

[Figure: structure of the CBOW model]

Where \(\bar x\) is the average of all the one hot vectors of the context words.

[Figure: the one-hot vectors of the context words are averaged into \(\bar x\)]

Once you have encoded all the context words, you can use \(\bar x\) as the input to your model.

The architecture you will be implementing is as follows:

Mapping words to indices and indices to words #

We provide a helper function to create a dictionary that maps words to indices and indices to words.
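Conceptually, the helper does something like the following (a sketch only; the provided function is the one to use, and the name get_dict and its exact behavior here are assumptions):

    def get_dict(tokens):
        """Build word-to-index and index-to-word mappings from a list of tokens (sketch)."""
        words = sorted(set(tokens))
        word2Ind = {word: i for i, word in enumerate(words)}
        Ind2word = {i: word for i, word in enumerate(words)}
        return word2Ind, Ind2word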

2 Training the Model #

2.0 Initialize the model #

You will now initialize two matrices and two vectors.

The first matrix (\(W_1\)) is of dimension \(N \times V\), where \(V\) is the number of words in your vocabulary and \(N\) is the dimension of your word vector.

The second matrix (\(W_2\)) is of dimension \(V \times N\).

Vector \(b_1\) has dimensions \(N \times 1\).

Vector \(b_2\) has dimensions \(V \times 1\).

\(b_1\) and \(b_2\) are the bias vectors of the linear layers from matrices \(W_1\) and \(W_2\).

The overall structure of the model will look as in Figure 1, but at this stage we are just initializing the parameters.

Exercise 01 #

Please use numpy.random.rand to generate matrices that are initialized with random values from a uniform distribution, ranging between 0 and 1.

Note: In the next cell you will encounter a random seed. Please DO NOT modify this seed so your solution can be tested correctly.
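A minimal sketch of such an initializer, following the dimensions above (the function name, argument order, and seed handling are assumptions; the graded cell defines the exact signature):

    import numpy as np

    def initialize_model(N, V, random_seed=1):
        """Initialize W1 (N x V), W2 (V x N), b1 (N x 1), b2 (V x 1) with uniform values in [0, 1)."""
        np.random.seed(random_seed)
        W1 = np.random.rand(N, V)
        W2 = np.random.rand(V, N)
        b1 = np.random.rand(N, 1)
        b2 = np.random.rand(V, 1)
        return W1, W2, b1, b2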

Expected Output #

2.1 Softmax #

Before we can start training the model, we need to implement the softmax function as defined in equation 5:

\[ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}} \]

  • Array indexing in code starts at 0.
  • \(V\) is the number of words in the vocabulary (which is also the number of rows of \(z\)).
  • \(i\) goes from 0 to \(|V| - 1\).

Exercise 02 #

Instructions : Implement the softmax function below.

Assume that the input \(z\) to softmax is a 2D array

Each training example is represented by a vector of shape (V, 1) in this 2D array.

There may be more than one column in the 2D array, because you can put in a batch of examples to increase efficiency. Let’s call the batch size lowercase \(m\), so the \(z\) array has shape (V, m).

When taking the sum over \(i = 0 \cdots |V|-1\), take the sum for each column (each example) separately.

numpy.sum (set the axis so that you take the sum of each column in z)
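One way to write this, as a sketch (the graded function may differ in details such as naming):

    import numpy as np

    def softmax(z):
        """Column-wise softmax for z of shape (V, m): one column per training example."""
        e_z = np.exp(z)
        return e_z / np.sum(e_z, axis=0, keepdims=True)   # normalize each column so it sums to 1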

Expected Output #

2.2 Forward Propagation #

Exercise 03 #

Implement the forward propagation \(z\) according to equations (1) to (3).

For that, you will use as activation the Rectified Linear Unit (ReLU), given by \(\mathrm{ReLU}(x) = \max(0, x)\):

  • You can use numpy.maximum(x1,x2) to get the maximum of two values
  • Use numpy.dot(A,B) to matrix multiply A and B
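Based on the architecture described above (a ReLU hidden layer followed by a linear layer), a sketch might look like this; the exact equations and return values come from the notebook, so treat the signature as an assumption:

    import numpy as np

    def relu(z):
        """Element-wise Rectified Linear Unit."""
        return np.maximum(0, z)

    def forward_prop(x, W1, W2, b1, b2):
        """x has shape (V, m); returns the pre-softmax scores z and the hidden activations h."""
        h = relu(np.dot(W1, x) + b1)   # hidden layer, shape (N, m)
        z = np.dot(W2, h) + b2         # output scores, shape (V, m)
        return z, h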

Expected Output #

2.3 Cost Function #

We have implemented the cross-entropy cost function for you.

2.4 Training the Model - Backpropagation #

Exercise 04 #

Now that you have understood how the CBOW model works, you will train it. You created a function for the forward propagation. Now you will implement a function that computes the gradients to backpropagate the errors.

2.5 Gradient Descent #

Exercise 05 #

Now that you have implemented a function to compute the gradients, you will implement batch gradient descent over your training set.

Hint: For that, you will use the initialize_model and back_prop functions which you just created (and the compute_cost function). You can also use the provided get_batches helper function:

for x, y in get_batches(data, word2Ind, V, C, batch_size):

Also: print the cost after each batch is processed (use batch size = 128)
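A sketch of the training loop under these hints; the signatures of back_prop, compute_cost, and get_batches, the learning rate alpha, and the stopping condition are all assumptions, so follow the notebook's docstrings for the graded version:

    def gradient_descent(data, word2Ind, N, V, num_iters, alpha=0.03, C=2, batch_size=128):
        """Batch gradient descent for the CBOW model (sketch built on the helpers named above)."""
        W1, W2, b1, b2 = initialize_model(N, V)
        iteration = 0
        for x, y in get_batches(data, word2Ind, V, C, batch_size):
            z, h = forward_prop(x, W1, W2, b1, b2)
            yhat = softmax(z)
            cost = compute_cost(y, yhat, batch_size)
            print(f"iteration {iteration}, cost {cost:.4f}")   # print the cost after each batch
            grad_W1, grad_W2, grad_b1, grad_b2 = back_prop(x, yhat, y, h, W1, W2, b1, b2, batch_size)
            # step each parameter a small amount against its gradient
            W1 = W1 - alpha * grad_W1
            W2 = W2 - alpha * grad_W2
            b1 = b1 - alpha * grad_b1
            b2 = b2 - alpha * grad_b2
            iteration += 1
            if iteration == num_iters:
                break
        return W1, W2, b1, b2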

Your numbers may differ a bit depending on which version of Python you’re using.

3.0 Visualizing the word vectors #

In this part you will visualize the word vectors trained using the function you just coded above.

You can see that man and king are next to each other. However, we have to be careful with the interpretation of these projected word vectors, since the PCA result depends on the chosen projection.
