OCR Pipeline Guide

Overview

The OCR pipeline is a specialized pipeline for training Optical Character Recognition (OCR) models within the Protege framework. It provides a streamlined workflow specifically designed for OCR tasks while maintaining compatibility with the same data input sources as the main pipeline.

Key Features

  • Specialized OCR Components: Uses OCR-specific training artifacts (LMDB) and training components
  • Unified Data Input: Supports the same data sources as the main pipeline (encord, gcs, manifest, demo_dataset)
  • Simplified Workflow: Omits lineage tracking to keep the pipeline lean
  • EasyOCR Integration: Based on EasyOCR architecture and training methodology
  • Automatic Hyperparameter Management: Handles OCR-specific parameters automatically
  • License Plate Optimization: Pre-configured for US license plate recognition
  • Production-Ready: Tested with real-world datasets and optimized for deployment

Pipeline Architecture

The OCR pipeline consists of four main steps:

  1. Data Ingestion: Loads data from various sources (encord, gcs, manifest, demo_dataset)
  2. Training Artifact Creation: Converts manifest to LMDB format for OCR training with automatic train/validation splits
  3. Model Training: Trains the OCR model using EasyOCR-based architecture with TPS-ResNet-BiLSTM-Attn
  4. Model Export: Exports the trained model to the specified location in standard Protege format

Local Testing and Debugging

Before deploying to Vertex AI, you can test the OCR components locally using the provided debug script.

Quick Local Test

cd protege-pipelines/tests/ocr
python debug_ocr.py

This script will:

  1. Call ocr_manifest_component() to generate a manifest
  2. Call ocr_build_training_artifact_with_splits() to create LMDB dataset with train/validation splits
  3. Call ocr_train_component_mock() to simulate training
  4. Call ocr_model_export_mock() to simulate model export

Mock Components for Testing

For testing purposes, the pipeline includes mock versions of training and export components:

  • ocr_train_component_mock(): Creates dummy model files without actual training
  • ocr_model_export_mock(): Creates dummy export files without actual export

These allow you to test the full pipeline flow without requiring GPU resources or actual training time.

Manual Component Testing

You can also test individual components:

from unittest.mock import Mock

from protege.pipelines.ocr_components import ocr_manifest_component

# Create a mock artifact with a local output path
manifest_artifact = Mock()
manifest_artifact.path = "/tmp/test_manifest.jsonl"

# Test the component
ocr_manifest_component(
    export_root="gs://test-bucket",
    dataset_source="test_dataset",
    manifest=manifest_artifact,
)

Supported Data Sources

1. Demo Dataset (demo_dataset)

  • Format: train_manifest.txt file with image_name,text pairs
  • Structure:
    gs://your-bucket/demo_dataset/
    ├── train_manifest.txt
    └── imgs/
        ├── image1.png
        ├── image2.png
        └── ...
  • Example manifest:
    image1.png,Hello World
    image2.png,OCR Training
    image3.png,Protege Framework
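Before uploading, it can help to sanity-check the manifest locally. A minimal sketch (the `validate_manifest` helper is illustrative, not a pipeline API; it assumes the `image_name,text` format above and splits on the first comma so the label text may itself contain commas):

```python
from pathlib import Path

def validate_manifest(path: str) -> list[tuple[str, str]]:
    # Parse image_name,text pairs; split on the first comma only,
    # so the label text may contain commas.
    pairs = []
    for lineno, line in enumerate(Path(path).read_text().splitlines(), 1):
        if not line.strip():
            continue  # skip blank lines
        if "," not in line:
            raise ValueError(f"line {lineno}: expected 'image_name,text'")
        name, text = line.split(",", 1)
        pairs.append((name.strip(), text.strip()))
    return pairs
```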

2. Encord (encord)

  • Uses existing Encord integration
  • Supports OCR annotations from Encord projects

3. GCS with COCO Labels (gcs)

  • Supports COCO format labels
  • Requires label_path and images_path configuration
  • Automatically converts COCO annotations to OCR format

4. Manifest (manifest)

  • Direct manifest file input
  • Supports existing manifest format with OCR annotations

Configuration

Basic Configuration

[dataset]
source = "demo_dataset"

[model]
task = "ocr"
architecture = "tps_resnet_bilstm_attn"

[training]
duration = 100
batch_size = 32
num_workers = 4
learning_rate = 0.001
optimizer_type = "adam"

[cloud_provider]
platform = "GCP"
config.project = "your-project"
config.location = "us-central1"
config.machine_type = "n1-standard-4"

[export]
path = "gs://your-bucket/models/ocr_model.pth"
root = "gs://your-bucket/pipelines"

[wandb]
project = "protege-ocr"

OCR-Specific Parameters

The pipeline automatically handles OCR-specific parameters with sensible defaults:

  • Image Processing: 32x250 grayscale images (optimized for license plates)
  • Model Architecture: TPS + ResNet + BiLSTM + Attention
  • Vocabulary: Alphanumeric characters (the license plate preset uses digits 0-9 and uppercase A-Z)
  • Training: Attention-based sequence decoding (the Attn prediction head)
  • Data Splits: Automatic 80/20 train/validation split
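The automatic 80/20 split can be pictured roughly as below (a deterministic shuffle-then-slice sketch; the pipeline's actual seeding and split logic may differ):

```python
import random

def train_val_split(samples, val_fraction=0.2, seed=42):
    # Shuffle deterministically so repeated runs reproduce the same split.
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]  # (train, validation)
```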

Advanced Configuration

You can customize OCR parameters by modifying the training configuration:

[training]
# Standard parameters
duration = 100
batch_size = 32
learning_rate = 0.001

# OCR-specific parameters (handled automatically)
ocr.transformation = "TPS" # Spatial transformation
ocr.feature_extraction = "ResNet" # Feature extractor
ocr.sequence_modeling = "BiLSTM" # Sequence model
ocr.prediction = "Attn" # Prediction method
ocr.img_height = 32 # Image height
ocr.img_width = 250 # Image width (optimized for license plates)
ocr.input_channel = 1 # Grayscale input
ocr.output_channel = 512 # Feature extractor output
ocr.hidden_size = 256 # LSTM hidden state
ocr.batch_max_length = 10 # Max characters per text
ocr.character = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ" # Character set
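For example, labels can be pre-checked against the configured character set and `batch_max_length` before building the LMDB artifact (an illustrative check, not a pipeline API; the constants mirror the values above):

```python
ALLOWED_CHARS = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ")  # ocr.character
MAX_LENGTH = 10  # ocr.batch_max_length

def is_trainable(label: str) -> bool:
    # A label is usable only if it fits within the max length and
    # contains no characters outside the configured vocabulary.
    return 0 < len(label) <= MAX_LENGTH and set(label) <= ALLOWED_CHARS
```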

Usage

1. Prepare Your Data

For demo dataset:

# Upload your data to GCS
gsutil cp -r your_ocr_data gs://your-bucket/demo_dataset/

2. Create Configuration File

Use the provided examples:

  • examples/ocr.toml - Basic OCR training
  • examples/ocr_lp.toml - License plate optimized training

3. Run the Pipeline

# Using the protege CLI
protege run examples/ocr_lp.toml

# Or using the Python API
from protege.pipelines import run_pipeline
from protege.pipelines.job_spec import JobSpec

job_spec = JobSpec.from_toml("examples/ocr_lp.toml")
pipeline = run_pipeline(job_spec)

Pipeline Differences from Main Pipeline

| Feature               | Main Pipeline         | OCR Pipeline                 |
| --------------------- | --------------------- | ---------------------------- |
| Lineage Tracking      | ✅ Full support       | ❌ Removed for simplicity    |
| Training Artifact     | H5 format             | LMDB format                  |
| Model Architecture    | Task-specific         | OCR-specific (EasyOCR-based) |
| Hyperparameter Search | ✅ Supported          | ❌ Not implemented           |
| Evaluation            | Task-specific metrics | OCR-specific metrics         |
| Data Sources          | All supported         | All supported                |
| Data Splits           | Manual configuration  | Automatic 80/20 split        |
| Character Set         | N/A                   | Configurable (alphanumeric)  |

Runtime Integration

The OCR pipeline integrates with protege-runtime for inference:

from protege.model_runtime import Runtime, Device

# Load trained OCR model
runtime = Runtime(
    artifact_path="path/to/ocr_model.zip",
    device=Device.CPU,
    confidence_threshold=0.5,
)

# Run inference
results = runtime.inference(images, post_process=True)

Troubleshooting

Common Issues

  1. LMDB Creation Fails

    • Ensure images are accessible from GCS
    • Check manifest format (image_name,text pairs)
    • Verify image files exist and are readable
    • Check service account permissions
  2. Training Fails

    • Check GPU availability if using accelerators
    • Verify batch size fits in memory
    • Ensure vocabulary matches your data
    • Check character set configuration
  3. Data Loading Issues

    • Verify GCS paths are correct
    • Check service account permissions
    • Ensure manifest format is valid
    • Verify train/validation split generation

Debugging

Enable verbose logging by setting environment variables:

export LOG_LEVEL=DEBUG
export WANDB_MODE=online

Local Testing Issues

  1. Import Errors

    • Ensure you're running from the correct directory
    • Check that PYTHONPATH includes the src directory
    • Verify all __init__.py files exist
  2. Mock Component Issues

    • Mock components create dummy files for testing
    • They don't perform actual training or export
    • Use for pipeline flow testing only

Examples

See the following examples for complete working configurations:

  • examples/ocr.toml - Basic OCR training
  • examples/ocr_lp.toml - License plate optimized training

Integration with Protege Framework

The OCR pipeline integrates seamlessly with the Protege framework:

  • JobSpec Compatibility: Uses the same JobSpec structure
  • Component Reuse: Reuses data ingestion components
  • Cloud Provider Support: Full GCP support
  • Wandb Integration: Automatic experiment tracking
  • Model Export: Standard model export format
  • Runtime Integration: Compatible with protege-runtime for inference

Current Status

✅ Implemented Features

  • LMDB training artifact creation with automatic splits
  • TPS-ResNet-BiLSTM-Attn architecture
  • License plate optimization
  • Multi-source data ingestion (GCS, Encord, Manifest)
  • Automatic hyperparameter management
  • Runtime integration for inference
  • Local testing and debugging tools
  • Complete documentation and examples