Technology Stack & Methodology
Core Components
1. Deep Learning Model
The core recognition engine relies on a deep neural network architecture commonly used for sequence recognition tasks, particularly OCR. This typically involves:
- Convolutional Neural Network (CNN) Layers: These layers act as feature extractors, automatically learning relevant visual patterns (like strokes, curves, loops) directly from the pixel data of the input word images.
- Recurrent Neural Network (RNN) Layers: Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) layers are often used after the CNN. They process the sequence of features extracted along the width of the image, capturing contextual information and dependencies between different parts of the word. Bidirectional RNNs are commonly employed to capture context in both the left-to-right and right-to-left directions.
- Connectionist Temporal Classification (CTC) Loss: This loss function is essential for training OCR models without requiring pre-segmented character locations. It calculates the probability of the predicted sequence given the input image features, summing over all possible alignments between the input sequence and the target text label.
Specific Library Used: TensorFlow 1.15 (as indicated by logs)
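For illustration, the sketch below wires these three pieces together in TensorFlow 1.x: a small CNN feature extractor, a bidirectional LSTM over the width axis, and CTC loss with greedy decoding. The layer counts, input dimensions, and character-set size are assumptions for this example and do not necessarily match the project's actual network.

```python
import tensorflow as tf  # TensorFlow 1.x API

NUM_CLASSES = 80                  # assumed: character set size + 1 for the CTC blank
IMG_HEIGHT, IMG_WIDTH = 32, 128   # assumed fixed input size

# Inputs: grayscale word images, sparse target labels, and per-sample sequence lengths
images = tf.placeholder(tf.float32, [None, IMG_HEIGHT, IMG_WIDTH, 1])
labels = tf.sparse_placeholder(tf.int32)
seq_len = tf.placeholder(tf.int32, [None])

# CNN feature extractor: each conv/pool block halves the spatial resolution
x = tf.layers.conv2d(images, 32, 3, padding="same", activation=tf.nn.relu)
x = tf.layers.max_pooling2d(x, 2, 2)              # -> (batch, 16, 64, 32)
x = tf.layers.conv2d(x, 64, 3, padding="same", activation=tf.nn.relu)
x = tf.layers.max_pooling2d(x, 2, 2)              # -> (batch, 8, 32, 64)

# Treat the width axis as the time dimension and collapse height into features
x = tf.transpose(x, [0, 2, 1, 3])                 # -> (batch, 32, 8, 64)
features = tf.reshape(x, [-1, 32, 8 * 64])        # -> (batch, time, feature)

# Bidirectional LSTM captures context from both reading directions
cell_fw = tf.nn.rnn_cell.LSTMCell(256)
cell_bw = tf.nn.rnn_cell.LSTMCell(256)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, features, dtype=tf.float32)
rnn_out = tf.concat([out_fw, out_bw], axis=2)     # -> (batch, time, 512)

# Per-time-step character scores; CTC sums over all alignments with the label
logits = tf.layers.dense(rnn_out, NUM_CLASSES)
logits_tm = tf.transpose(logits, [1, 0, 2])       # CTC expects time-major input
loss = tf.reduce_mean(tf.nn.ctc_loss(labels, logits_tm, seq_len))
decoded, _ = tf.nn.ctc_greedy_decoder(logits_tm, seq_len)
```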
2. Image Preprocessing
Raw input images are processed before being fed into the neural network to ensure consistency and optimal performance. Steps (handled by `SamplePreprocessor.py`) include:
- Grayscale Conversion: Converting the image to a single channel (grayscale) reduces complexity.
- Resizing: Images are typically resized to a fixed height (e.g., 32 or 64 pixels) while maintaining the aspect ratio. Width may vary or be padded to a maximum length.
- Normalization: Pixel values are scaled to a standard range (e.g., [0, 1] or [-1, 1]) to stabilize training.
- Additional steps such as binarization or slant correction may also be applied, depending on the input data. A minimal sketch of the pipeline follows below.
Primary Library Used: OpenCV (`cv2`)
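The snippet below illustrates the steps listed above with OpenCV. It is a minimal sketch, not the project's exact `SamplePreprocessor.py` implementation, and the target dimensions are assumptions for the example.

```python
import cv2
import numpy as np

def preprocess(path, target_h=32, target_w=128):
    """Sketch of the preprocessing pipeline: grayscale, resize, pad, normalize."""
    # Grayscale conversion: load the image as a single channel
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # Resize to the fixed height while preserving the aspect ratio
    h, w = img.shape
    scale = target_h / h
    new_w = min(target_w, max(1, int(w * scale)))
    img = cv2.resize(img, (new_w, target_h))

    # Pad the width with white pixels up to the maximum length
    canvas = np.full((target_h, target_w), 255, dtype=np.uint8)
    canvas[:, :new_w] = img

    # Normalize pixel values to the [0, 1] range
    return canvas.astype(np.float32) / 255.0
```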
3. Backend Framework
The server-side logic, API handling, and coordination are managed by **Flask**, a Python microframework known for its simplicity and flexibility. Flask's responsibilities in this project include:
- Defining URL routes (e.g., `/`, `/recognize`, `/predict`, `/api/examples`).
- Handling HTTP requests (GET for pages, POST for predictions).
- Receiving and securely saving uploaded image files.
- Calling the Python functions in `main.py` to trigger image preprocessing and model inference.
- Formatting the prediction results into JSON for the API endpoint.
- Rendering HTML templates (using the Jinja2 engine) to display web pages.
- Serving static assets like CSS, JavaScript, and images.
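The sketch below shows how these responsibilities typically map onto a handful of Flask routes. The upload field name and the `infer` helper imported from `main.py` are hypothetical placeholders, not the project's confirmed API.

```python
import os
from flask import Flask, jsonify, render_template, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
UPLOAD_DIR = "uploads"  # assumed upload directory
os.makedirs(UPLOAD_DIR, exist_ok=True)

@app.route("/")
def index():
    # Render the landing page via the Jinja2 template engine
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    # Receive and securely save the uploaded image
    file = request.files["image"]  # assumed form field name
    path = os.path.join(UPLOAD_DIR, secure_filename(file.filename))
    file.save(path)

    # Hypothetical call into main.py for preprocessing + model inference
    from main import infer  # assumed helper name
    text, probability = infer(path)

    # Return the prediction as JSON for the frontend
    return jsonify({"text": text, "probability": probability})
```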
4. Frontend Interface
The user interacts with the application through a web interface built using:
- HTML5: Provides the fundamental structure and content of the web pages.
- Tailwind CSS: A utility-first CSS framework used for rapidly building the user interface. It provides low-level utility classes (like `p-6`, `rounded-lg`, `text-center`, `font-semibold`, `hidden`) to style elements directly in the HTML markup, promoting consistency and a custom design without writing extensive custom CSS.
- Vanilla JavaScript: Handles all client-side interactivity, including:
- Managing file input selection and example image clicks.
- Displaying image previews.
- Making asynchronous API calls (`fetch`) to the Flask backend's `/predict` endpoint to send the image data.
- Receiving and parsing the JSON response from the backend.
- Dynamically updating the HTML to show prediction results, loading states, or error messages.
- Handling basic UI state changes (e.g., enabling/disabling buttons).
- Remixicon / Google Fonts: Used for icons and specific typography (Inter, Noto Sans Devanagari, Pacifico).