The second coding assignment asks you to implement a simple natural language processing model for sentiment analysis on the Amazon Review Dataset (see the Kaggle page of this coding assignment).
You may use a deep learning library (e.g., PyTorch, TensorFlow) to accelerate your code with a CUDA back end.
Note: we will use Python 3.x for the project.
- Push your code to the CA2 section of the GitHub Classroom page
- Submit your report to the 'CA2 (Hackathon) Report' section on Gradescope
- Submit your entry to Kaggle
Push to your GitHub Classroom:
- All of the Python files listed above (under "Files you'll edit").
- Caution: DO NOT UPLOAD THE DATASET
Construct the training set T for the Amazon Review Dataset as instructed and report the following statistics.
REPORT1
: Please fill in the table below in the report
Statistics | Value |
---|---|
the total number of unique words in T | Plz, fill this |
the total number of training examples in T | Plz, fill this |
the ratio of positive examples to negative examples in T | Plz, fill this |
the average document length in T | Plz, fill this |
the maximum document length in T | Plz, fill this |
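The statistics in the table above could be computed along the following lines. This is only a minimal sketch under stated assumptions: documents are already tokenized (here by whitespace), labels are encoded as 1 for positive and 0 for negative, and document length is counted in tokens. The function name `training_set_stats` and the toy examples are hypothetical and are not part of the assignment or the dataset.

```python
from collections import Counter


def training_set_stats(examples):
    """Compute the REPORT1 statistics.

    `examples` is assumed to be a list of (tokens, label) pairs,
    where label 1 means positive and label 0 means negative.
    """
    vocab = Counter()
    lengths = []
    n_pos = 0
    n_neg = 0
    for tokens, label in examples:
        vocab.update(tokens)          # count every token occurrence
        lengths.append(len(tokens))   # document length in tokens
        if label == 1:
            n_pos += 1
        else:
            n_neg += 1
    return {
        "unique_words": len(vocab),
        "num_examples": len(examples),
        # Guard against a training set with no negative examples.
        "pos_neg_ratio": n_pos / n_neg if n_neg else float("inf"),
        "avg_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_length": max(lengths, default=0),
    }


# Toy data for illustration only; not taken from the real dataset.
toy = [
    ("great product works well".split(), 1),
    ("terrible waste of money".split(), 0),
    ("works great".split(), 1),
]
stats = training_set_stats(toy)
```

Whatever tokenization you actually use, it is worth stating it in the report, since the vocabulary size and the length statistics both depend on it.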
Suggested hyperparameters:
- Data processing
  - Word embedding dimension: 100
  - Word index: keep the most frequent 10k words
- CNN
  - Network: word embedding lookup layer -> 1D CNN layer -> fully connected layer -> output prediction
  - Number of filters: 100
  - Filter length: 3
  - CNN activation: ReLU
  - Fully connected layer dimension: 100, activation: none (i.e., this layer is linear)
- RNN
  - Network: word embedding lookup layer -> LSTM layer -> fully connected layer (on the hidden state of the last LSTM cell) -> output prediction
  - Hidden dimension for LSTM cell: 100
  - Activation for LSTM cell: tanh
  - Fully connected layer dimension: 100, activation: none (i.e., this layer is linear)
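For the data processing step, one possible way to keep only the most frequent words and turn documents into fixed-length id sequences is sketched below. Several details are assumptions that the assignment does not specify: ids 0 and 1 are reserved for padding and out-of-vocabulary words, these two reserved ids are not counted toward the 10k limit, and documents are truncated or right-padded to a chosen `max_len`. The helper names `build_word_index` and `encode` are hypothetical.

```python
from collections import Counter

# Reserved ids (an assumption, not specified by the assignment).
PAD = 0  # used to pad short documents
UNK = 1  # used for words outside the kept vocabulary


def build_word_index(token_lists, max_words=10_000):
    """Map the `max_words` most frequent words to integer ids starting at 2."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {
        word: idx + 2  # shift by 2 because ids 0 and 1 are reserved
        for idx, (word, _) in enumerate(counts.most_common(max_words))
    }


def encode(tokens, word_index, max_len):
    """Convert tokens to ids, then truncate or right-pad to `max_len`."""
    ids = [word_index.get(tok, UNK) for tok in tokens][:max_len]
    return ids + [PAD] * (max_len - len(ids))


# Toy corpus for illustration: "good" x3, "book" x2, "bad" x1.
docs = [["good", "good", "good", "book"], ["bad", "book"]]
word_index = build_word_index(docs, max_words=2)
encoded = encode(["good", "bad", "book"], word_index, max_len=5)
```

With `max_words=2`, "bad" falls outside the vocabulary and is mapped to `UNK`. The resulting id sequences are what an embedding lookup layer would take as input.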
REPORT2
: Please fill in the table below in the report
Model | Accuracy | Training time (in seconds) |
---|---|---|
RNN w/o pretrained embedding | Plz, fill this | Plz, fill this |
RNN w/ pretrained embedding | Plz, fill this | Plz, fill this |
CNN w/o pretrained embedding | Plz, fill this | Plz, fill this |
CNN w/ pretrained embedding | Plz, fill this | Plz, fill this |
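The two quantities in the table above can be measured with a few lines of Python. The sketch below is framework-agnostic and makes some assumptions: predictions and labels are plain sequences of class ids, and the training routine is any callable. `dummy_train` is a hypothetical stand-in for a real training loop, and the values passed to `accuracy` are toy inputs, not results.

```python
import time


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


def timed(train_fn, *args, **kwargs):
    """Run `train_fn` and return (its result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = train_fn(*args, **kwargs)
    return result, time.perf_counter() - start


def dummy_train(n_steps):
    # Hypothetical stand-in for a real training loop.
    total = 0
    for step in range(n_steps):
        total += step
    return total


result, seconds = timed(dummy_train, 1000)
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])  # toy inputs, 3 of 4 correct
```

If you train on a GPU with PyTorch, note that CUDA kernels run asynchronously, so calling `torch.cuda.synchronize()` before reading the clock helps avoid under-reporting the training time.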
Plot the training/testing objective and the training/testing accuracy over time for the 4 model combinations (corresponding to the 4 rows in the above table). In other words, there should be 2*4=8 graphs in total, each of which contains two curves (training and testing).
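One way to collect the data for these 8 graphs is to record the four quantities once per epoch for each model combination and export them for plotting afterwards. The sketch below assumes that "over time" is measured in epochs, which the assignment does not state. The helper names are hypothetical, and the logged numbers are placeholders rather than real results.

```python
import csv
import io

MODEL_NAMES = [
    "RNN w/o pretrained embedding",
    "RNN w/ pretrained embedding",
    "CNN w/o pretrained embedding",
    "CNN w/ pretrained embedding",
]


def new_history():
    """One list per curve; each graph pairs a train curve with a test curve."""
    return {"train_loss": [], "test_loss": [], "train_acc": [], "test_acc": []}


def log_epoch(history, train_loss, test_loss, train_acc, test_acc):
    """Append the measurements of one epoch to the history."""
    history["train_loss"].append(train_loss)
    history["test_loss"].append(test_loss)
    history["train_acc"].append(train_acc)
    history["test_acc"].append(test_acc)


def history_to_csv(history):
    """Serialize a history as CSV text with one row per epoch."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["epoch", *history.keys()])
    for epoch, row in enumerate(zip(*history.values()), start=1):
        writer.writerow([epoch, *row])
    return buf.getvalue()


histories = {name: new_history() for name in MODEL_NAMES}
# Placeholder values for illustration only; not experimental results.
log_epoch(histories["RNN w/o pretrained embedding"], 1.0, 1.0, 0.5, 0.5)
csv_text = history_to_csv(histories["RNN w/o pretrained embedding"])
```

Each history then yields two graphs: one with `train_loss` and `test_loss`, and one with `train_acc` and `test_acc`.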
REPORT3
: RNN w/o pretrained embedding
- training/testing objective over time
- training/testing accuracy over time
REPORT4
: RNN w/ pretrained embedding
- training/testing objective over time
- training/testing accuracy over time
REPORT5
: CNN w/o pretrained embedding
- training/testing objective over time
- training/testing accuracy over time
REPORT6
: CNN w/ pretrained embedding
- training/testing objective over time
- training/testing accuracy over time
REPORT7
: Discuss the complete set of experimental results, comparing the algorithms to each other.
REPORT8
: Discuss your observations about the various algorithms, i.e., differences in how they performed, different parameters, what worked well and didn't, patterns/trends you observed across the set of experiments, etc.
REPORT9
: Try to explain why certain algorithms or approaches behaved the way they did.
Add detailed descriptions of your software implementation and data preprocessing, including:
REPORT10
: A description of what you did to preprocess the dataset to make your implementations easier or more efficient.
REPORT11
: A description of major data structures (if any) and any programming tools or libraries that you used.
REPORT12
: Strengths and weaknesses of your design, and any problems that your system encountered.