
Personal Project
Image Description Generator
Overview
An image captioning system that generates natural language descriptions for arbitrary input images. The model uses a CNN encoder to extract visual features and an RNN decoder to process text and autoregressively generate descriptive captions word by word.
Role
Sole Researcher & Engineer
Problem
Image captioning bridges computer vision and natural language processing — a model must simultaneously understand visual content and generate grammatically coherent descriptions. The project aimed to implement a multimodal architecture from scratch to understand the cross-modal alignment between vision and language.
Solution
Built a merge-model architecture: a pre-trained InceptionV3 CNN extracts global image features, while pre-trained GloVe embeddings represent text tokens. An LSTM processes the text sequences, and the visual and textual features are merged using an addition layer before dense layers predict the next token in the caption.
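The merge model described above can be sketched in Keras roughly as follows. The layer sizes match the write-up (2048-d InceptionV3 features, 200-d GloVe embeddings); the vocabulary size, caption length, and hidden-unit count are placeholder assumptions, and in the real project the embedding layer would be initialized with the GloVe matrix and frozen.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Placeholder hyperparameters (assumptions, not the project's actual values)
VOCAB_SIZE, MAX_LEN, EMBED_DIM, FEAT_DIM, UNITS = 5000, 34, 200, 2048, 256

# Image branch: dense projection of the global CNN feature vector
img_in = layers.Input(shape=(FEAT_DIM,))
img_feat = layers.Dropout(0.5)(img_in)
img_feat = layers.Dense(UNITS, activation="relu")(img_feat)

# Text branch: word embedding (GloVe-initialized and non-trainable in the
# real project) followed by an LSTM over the partial caption
txt_in = layers.Input(shape=(MAX_LEN,))
txt_emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(txt_in)
txt_emb = layers.Dropout(0.5)(txt_emb)
txt_feat = layers.LSTM(UNITS)(txt_emb)

# Merge by element-wise addition, then predict the next caption token
merged = layers.add([img_feat, txt_feat])
merged = layers.Dense(UNITS, activation="relu")(merged)
out = layers.Dense(VOCAB_SIZE, activation="softmax")(merged)

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

The addition merge forces both modalities into the same `UNITS`-dimensional space, which is why each branch is projected before combining.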
Architecture
A CNN-RNN merge model: InceptionV3 encodes global image features, pre-trained GloVe embeddings and an LSTM process the text, and their outputs are combined to predict the next word.
Key Design Decisions
- InceptionV3 CNN encoder (pre-trained on ImageNet) extracting 2048-dimensional global feature vectors
- Pre-trained 200-dimensional GloVe word embeddings (glove.6B.200d) loaded as frozen, non-trainable embedding weights
- LSTM layer processing sequences of caption tokens
- Merge architecture adding dense projections of image features and LSTM outputs
- Greedy search implemented to generate final image descriptions token by token
- Trained on MS-COCO dataset captions using a custom data generator to yield batches progressively
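The greedy search mentioned above can be sketched as follows: at each step the model predicts a distribution over the vocabulary and the argmax token is appended until an end token appears. The `startseq`/`endseq` markers and the `wordtoix`/`ixtoword` lookup names are assumptions about the project's actual identifiers.

```python
import numpy as np

def greedy_caption(model, photo_feat, wordtoix, ixtoword, max_len):
    """Generate a caption token by token with greedy (argmax) search.
    The start/end token names and index mappings are hypothetical."""
    seq = [wordtoix["startseq"]]
    for _ in range(max_len):
        # Zero-pad the partial caption to the model's fixed input length
        padded = np.zeros((1, max_len), dtype="int32")
        padded[0, :len(seq)] = seq[:max_len]
        probs = model.predict([photo_feat, padded], verbose=0)[0]
        next_ix = int(np.argmax(probs))  # greedy: always take the best token
        if ixtoword.get(next_ix) == "endseq":
            break
        seq.append(next_ix)
    return " ".join(ixtoword[i] for i in seq[1:])
```

Greedy decoding is fast but can miss globally better captions; beam search is the usual upgrade.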
Challenges
- Combining global CNN features with sequential LSTM states via a merge architecture
- Integrating pre-trained GloVe embeddings efficiently into the embedding layer
- Managing large MS-COCO dataset size with a custom Python generator to avoid memory exhaustion
- Preprocessing sequence data to create multiple input-output pairs for next-token prediction
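The last two challenges above can be illustrated together: each encoded caption is expanded into multiple (partial-sequence, next-token) training pairs, and a Python generator yields batches progressively so MS-COCO never has to sit fully in memory. The dictionary layout (`features` mapping image id to CNN vector, `captions` mapping id to encoded captions) is an assumption about the project's data structures.

```python
import numpy as np

def make_pairs(caption_ixs, max_len, vocab_size):
    """Expand one encoded caption into (padded partial sequence, one-hot
    next token) pairs; a caption of N tokens yields N-1 examples."""
    X_seq, y = [], []
    for i in range(1, len(caption_ixs)):
        padded = np.zeros(max_len, dtype="int32")
        padded[: i] = caption_ixs[:i]
        target = np.zeros(vocab_size, dtype="float32")
        target[caption_ixs[i]] = 1.0  # one-hot encode the next token
        X_seq.append(padded)
        y.append(target)
    return np.array(X_seq), np.array(y)

def data_generator(features, captions, max_len, vocab_size, batch_size=32):
    """Yield ([image_features, partial_sequences], next_tokens) batches
    lazily to avoid loading the whole dataset at once (names hypothetical)."""
    while True:
        X_img, X_seq, y = [], [], []
        for img_id, caps in captions.items():
            for cap in caps:
                xs, ys = make_pairs(cap, max_len, vocab_size)
                X_img.extend([features[img_id]] * len(xs))
                X_seq.extend(xs)
                y.extend(ys)
                if len(y) >= batch_size:
                    yield [np.array(X_img), np.array(X_seq)], np.array(y)
                    X_img, X_seq, y = [], [], []
```

This pairing scheme is why a single image contributes many training rows: one per predicted word in each of its captions.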
Impact
- Produced a working image captioning pipeline from visual input to natural language output
- Demonstrated a practical merge-based multimodal architecture for vision and language
- Published as an accessible open-source reference for image captioning with TensorFlow