
Personal Project
Image Description Generator
Overview
An image captioning system that generates natural language descriptions for arbitrary input images. The model uses a CNN encoder to extract visual features and an RNN decoder to process text and autoregressively generate descriptive captions word by word.
Role
Sole Researcher & Engineer
Problem
Image captioning bridges computer vision and natural language processing — a model must simultaneously understand visual content and generate grammatically coherent descriptions. The project aimed to implement a multimodal architecture from scratch to understand the cross-modal alignment between vision and language.
Solution
Built a merge-model architecture: a pre-trained InceptionV3 CNN extracts global image features, while pre-trained GloVe embeddings represent text tokens. An LSTM processes the text sequences, and the visual and textual features are merged using an addition layer before dense layers predict the next token in the caption.
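The merge model described above can be sketched in Keras roughly as follows. The layer sizes match the write-up (2048-d InceptionV3 features, 200-d GloVe embeddings); the vocabulary size, caption length, and hidden-unit count are placeholder assumptions, and in the real project the embedding layer would be initialized with the GloVe matrix and frozen.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Placeholder hyperparameters (assumptions, not the project's actual values)
VOCAB_SIZE, MAX_LEN, EMBED_DIM, FEAT_DIM, UNITS = 5000, 34, 200, 2048, 256

# Image branch: dense projection of the global CNN feature vector
img_in = layers.Input(shape=(FEAT_DIM,))
img_feat = layers.Dropout(0.5)(img_in)
img_feat = layers.Dense(UNITS, activation="relu")(img_feat)

# Text branch: word embedding (GloVe-initialized and non-trainable in the
# real project) followed by an LSTM over the partial caption
txt_in = layers.Input(shape=(MAX_LEN,))
txt_emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(txt_in)
txt_emb = layers.Dropout(0.5)(txt_emb)
txt_feat = layers.LSTM(UNITS)(txt_emb)

# Merge by element-wise addition, then predict the next caption token
merged = layers.add([img_feat, txt_feat])
merged = layers.Dense(UNITS, activation="relu")(merged)
out = layers.Dense(VOCAB_SIZE, activation="softmax")(merged)

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

The addition merge forces both modalities into the same `UNITS`-dimensional space, which is why each branch is projected before combining.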
Architecture
A CNN-RNN merge model: InceptionV3 encodes global image features, pre-trained GloVe embeddings and an LSTM process the text, and their outputs are combined to predict the next word.
Key Design Decisions
- InceptionV3 CNN encoder (pre-trained on ImageNet) extracting 2048-dimensional global feature vectors
- Pre-trained 200-dimensional GloVe word embeddings (glove.6B.200d) loaded as frozen, non-trainable embedding weights
- LSTM layer processing sequences of caption tokens
- Merge architecture adding dense projections of image features and LSTM outputs
- Greedy search implemented to generate final image descriptions token by token
- Trained on MS-COCO dataset captions using a custom data generator to yield batches progressively
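The greedy search mentioned above can be sketched as follows: at each step the model predicts a distribution over the vocabulary and the argmax token is appended until an end token appears. The `startseq`/`endseq` markers and the `wordtoix`/`ixtoword` lookup names are assumptions about the project's actual identifiers.

```python
import numpy as np

def greedy_caption(model, photo_feat, wordtoix, ixtoword, max_len):
    """Generate a caption token by token with greedy (argmax) search.
    The start/end token names and index mappings are hypothetical."""
    seq = [wordtoix["startseq"]]
    for _ in range(max_len):
        # Zero-pad the partial caption to the model's fixed input length
        padded = np.zeros((1, max_len), dtype="int32")
        padded[0, :len(seq)] = seq[:max_len]
        probs = model.predict([photo_feat, padded], verbose=0)[0]
        next_ix = int(np.argmax(probs))  # greedy: always take the best token
        if ixtoword.get(next_ix) == "endseq":
            break
        seq.append(next_ix)
    return " ".join(ixtoword[i] for i in seq[1:])
```

Greedy decoding is fast but can miss globally better captions; beam search is the usual upgrade.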
Challenges
- Combining global CNN features with sequential LSTM states via a merge architecture
- Integrating pre-trained GloVe embeddings efficiently into the embedding layer
- Managing large MS-COCO dataset size with a custom Python generator to avoid memory exhaustion
- Preprocessing sequence data to create multiple input-output pairs for next-token prediction
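The last two challenges above can be illustrated together: each encoded caption is expanded into multiple (partial-sequence, next-token) training pairs, and a Python generator yields batches progressively so MS-COCO never has to sit fully in memory. The dictionary layout (`features` mapping image id to CNN vector, `captions` mapping id to encoded captions) is an assumption about the project's data structures.

```python
import numpy as np

def make_pairs(caption_ixs, max_len, vocab_size):
    """Expand one encoded caption into (padded partial sequence, one-hot
    next token) pairs; a caption of N tokens yields N-1 examples."""
    X_seq, y = [], []
    for i in range(1, len(caption_ixs)):
        padded = np.zeros(max_len, dtype="int32")
        padded[: i] = caption_ixs[:i]
        target = np.zeros(vocab_size, dtype="float32")
        target[caption_ixs[i]] = 1.0  # one-hot encode the next token
        X_seq.append(padded)
        y.append(target)
    return np.array(X_seq), np.array(y)

def data_generator(features, captions, max_len, vocab_size, batch_size=32):
    """Yield ([image_features, partial_sequences], next_tokens) batches
    lazily to avoid loading the whole dataset at once (names hypothetical)."""
    while True:
        X_img, X_seq, y = [], [], []
        for img_id, caps in captions.items():
            for cap in caps:
                xs, ys = make_pairs(cap, max_len, vocab_size)
                X_img.extend([features[img_id]] * len(xs))
                X_seq.extend(xs)
                y.extend(ys)
                if len(y) >= batch_size:
                    yield [np.array(X_img), np.array(X_seq)], np.array(y)
                    X_img, X_seq, y = [], [], []
```

This pairing scheme is why a single image contributes many training rows: one per predicted word in each of its captions.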
Impact
- Produced a working image captioning pipeline from visual input to natural language output
- Demonstrated a practical merge-based multimodal architecture for vision and language
- Published as an accessible open-source reference for image captioning with TensorFlow