A PyTorch-based image captioning model using ResNet50 as the encoder and LSTM as the decoder. This model generates captions for images by learning from the Flickr30k dataset.
You can run the inference of the trained model on 🤗 Hugging Face - ImageCaption.
To get started, you'll need Git installed on your system. Clone this repository:
git clone https://github.com/nssharmaofficial/ImageCaption_Flickr30k
Install the required dependencies by running:
pip install -r requirements.txt
- Download the Flickr30k images and captions from this Kaggle dataset.
- Split captions into
train_list.txt
(90%) andval_list.txt
(10%).
To create the vocabulary with word2index and index2word mappings, run:
python code/vocab.py captions.txt vocabulary.txt 5000
- The first argument: Input file containing captions.
- The second argument: Output file for the vocabulary.
- The third argument: Maximum vocabulary size (includes 4 special tokens:
<pad>
,<sos>
,<eos>
, and<unk>
).
The custom class ImageCaptionDataset()
holds samples, where each sample contains:
- Image File ID
- Caption: As a list of word indices, enriched with
<sos>
and<eos>
tokens at the beginning and end, respectively.
The __getitem__()
method returns:
- Preprocessed image (as a tensor)
- Caption as a list of word indices
To load data efficiently, use:
get_data_loader(train_data, batch_size=config.BATCH, pad_index=vocab.PADDING_INDEX)
Captions are padded to the length of the longest caption in the batch using vocab.PADDING_INDEX
(0).
Run dataset.py to check the data loader as well as to sort the training and validation captions.
The model consists of three main components:
- Encoder
- Embeddings
- Decoder
You can test the model by running model.py
, which performs a forward pass with random inputs, helping you understand the output dimensions.
The encoder extracts features from images using a pretrained ResNet50 model (with the last layer removed) and a linear layer. The final output dimension is (IMAGE_EMB_DIM)
.
An embedding layer transforms captions into dense vector representations of dimension (WORD_EMB_DIM)
. During training, the embedding layer learns better word representations.
The decoder uses an LSTM to generate captions. It takes the word embeddings as input and outputs probabilities for the next word in the sequence.
Implementation Details:
- LSTM: Processes the sequence of word embeddings.
- Fully connected layer: Outputs word probabilities for each word in the vocabulary.
- LogSoftmax: Applied to convert raw scores to probabilities.
Customize settings in config.py
as needed.
Note: Ensure you have created the
checkpoints
andsaved
directories in thecode
folder to store the model's weights.
To train the model, run train.py
During training, captions are generated word by word in a loop of length SEQ_LENGTH-1
. The hidden and cell states are initialized as tensors of size (NUM_LAYER, BATCH, HIDDEN_DIM)
, where HIDDEN_DIM
is set to IMAGE_EMB_DIM
.
You can visualize results on validation data by running test_show.py
This script will display images along with real captions, generated captions, and BLEU scores (BLEU-1 and BLEU-2) to evaluate caption quality.
Captions are generated word by word starting with the <sos>
token. Each predicted word is used as input for the next step. The process continues until the <eos>
token is generated or the maximum length is reached.
Below are examples with a trained model:
To generate captions for a new image, run:
python code/predict_sample.py sample_image.jpg
This will output captions generated by the model. Below are some examples: