# OCR Pipeline Guide
## Overview
The OCR pipeline is a specialized workflow for training Optical Character Recognition (OCR) models within the Protege framework. It is streamlined for OCR tasks while remaining compatible with the same data input sources as the main pipeline.
## Key Features
- Specialized OCR Components: Uses OCR-specific training artifacts (LMDB) and training components
- Unified Data Input: Supports the same data sources as the main pipeline (encord, gcs, manifest, demo_dataset)
- Simplified Workflow: Removes lineage tracking for simplicity
- EasyOCR Integration: Based on EasyOCR architecture and training methodology
- Automatic Hyperparameter Management: Handles OCR-specific parameters automatically
- License Plate Optimization: Pre-configured for US license plate recognition
- Production-Ready: Tested with real-world datasets and optimized for deployment
## Pipeline Architecture
The OCR pipeline consists of four main steps:
- Data Ingestion: Loads data from various sources (encord, gcs, manifest, demo_dataset)
- Training Artifact Creation: Converts manifest to LMDB format for OCR training with automatic train/validation splits
- Model Training: Trains the OCR model using EasyOCR-based architecture with TPS-ResNet-BiLSTM-Attn
- Model Export: Exports the trained model to the specified location in standard Protege format
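The four steps above can be sketched as a simple orchestration flow. This is a minimal illustration with placeholder step functions, not the actual Protege component API:

```python
# Minimal sketch of the four-step OCR pipeline flow.
# All step functions are illustrative placeholders, not real Protege components.

def ingest_data(source: str) -> list:
    """Load (image_path, text) pairs from a source such as gcs or demo_dataset."""
    return [("imgs/image1.png", "Hello World"), ("imgs/image2.png", "OCR Training")]

def build_lmdb_artifact(samples: list, split_ratio: float = 0.8) -> dict:
    """Convert manifest entries into train/validation splits (LMDB stands in as a dict here)."""
    cut = int(len(samples) * split_ratio)
    return {"train": samples[:cut], "val": samples[cut:]}

def train_model(artifact: dict) -> dict:
    """Train the TPS-ResNet-BiLSTM-Attn model (stubbed out)."""
    return {"arch": "tps_resnet_bilstm_attn", "train_size": len(artifact["train"])}

def export_model(model: dict, path: str) -> str:
    """Export the trained model to the target location."""
    return f"{path}/ocr_model.pth"

samples = ingest_data("demo_dataset")
artifact = build_lmdb_artifact(samples)
model = train_model(artifact)
exported = export_model(model, "gs://your-bucket/models")
```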
## Local Testing and Debugging
Before deploying to VertexAI, you can test the OCR components locally using the provided debug script.
### Quick Local Test
```shell
cd protege-pipelines/tests/ocr
python debug_ocr.py
```
This script will:
- Call `ocr_manifest_component()` to generate a manifest
- Call `ocr_build_training_artifact_with_splits()` to create an LMDB dataset with train/validation splits
- Call `ocr_train_component_mock()` to simulate training
- Call `ocr_model_export_mock()` to simulate model export
### Mock Components for Testing
For testing purposes, the pipeline includes mock versions of training and export components:
- `ocr_train_component_mock()`: Creates dummy model files without actual training
- `ocr_model_export_mock()`: Creates dummy export files without actual export
These allow you to test the full pipeline flow without requiring GPU resources or actual training time.
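The idea behind a mock component can be sketched as follows. This is a hypothetical illustration of what the mocks do, not the actual implementation:

```python
# Hypothetical sketch of a mock training component: it writes a dummy model
# file so downstream pipeline steps can run without a GPU or real training.
import os
import tempfile

def ocr_train_component_mock_sketch(model_dir: str) -> str:
    os.makedirs(model_dir, exist_ok=True)
    model_path = os.path.join(model_dir, "ocr_model.pth")
    with open(model_path, "wb") as f:
        f.write(b"dummy-ocr-model")  # placeholder bytes, not real weights
    return model_path

path = ocr_train_component_mock_sketch(tempfile.mkdtemp())
```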
### Manual Component Testing
You can also test individual components:
```python
import tempfile
from unittest.mock import Mock

from protege.pipelines.ocr_components import ocr_manifest_component

# Create mock artifact
manifest_artifact = Mock()
manifest_artifact.path = "/tmp/test_manifest.jsonl"

# Test the component
ocr_manifest_component(
    export_root="gs://test-bucket",
    dataset_source="test_dataset",
    manifest=manifest_artifact,
)
```
## Supported Data Sources
### 1. Demo Dataset (`demo_dataset`)
- Format: a `train_manifest.txt` file with `image_name,text` pairs
- Structure:

  ```
  gs://your-bucket/demo_dataset/
  ├── train_manifest.txt
  └── imgs/
      ├── image1.png
      ├── image2.png
      └── ...
  ```

- Example manifest:

  ```
  image1.png,Hello World
  image2.png,OCR Training
  image3.png,Protege Framework
  ```
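A manifest in the format above can be generated and parsed with a few lines of Python. This is a sketch; note the assumption that a label may itself contain commas, so parsing splits on the first comma only:

```python
# Sketch: write and re-read a demo_dataset-style train_manifest.txt.
import os
import tempfile

pairs = [
    ("image1.png", "Hello World"),
    ("image2.png", "OCR Training"),
    ("image3.png", "Protege Framework"),
]

root = tempfile.mkdtemp()
manifest_path = os.path.join(root, "train_manifest.txt")
with open(manifest_path, "w") as f:
    for name, text in pairs:
        f.write(f"{name},{text}\n")

# Split on the first comma only, in case label text contains commas.
with open(manifest_path) as f:
    parsed = [line.rstrip("\n").split(",", 1) for line in f]
```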
### 2. Encord (`encord`)
- Uses existing Encord integration
- Supports OCR annotations from Encord projects
### 3. GCS with COCO Labels (`gcs`)
- Supports COCO format labels
- Requires `label_path` and `images_path` configuration
- Automatically converts COCO annotations to OCR format
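The COCO-to-OCR conversion can be illustrated as flattening annotations into `image_name,text` pairs. This sketch assumes each annotation carries its transcription under `attributes["text"]`; that field name is an assumption, so check your actual COCO export:

```python
# Sketch: convert COCO-style annotations into image_name,text manifest lines.
# Assumes each annotation stores its transcription under attributes["text"];
# adjust the key to match your real COCO export.

coco = {
    "images": [
        {"id": 1, "file_name": "plate1.png"},
        {"id": 2, "file_name": "plate2.png"},
    ],
    "annotations": [
        {"image_id": 1, "attributes": {"text": "ABC1234"}},
        {"image_id": 2, "attributes": {"text": "XYZ9876"}},
    ],
}

# Map image ids to file names, then emit one manifest line per annotation.
id_to_name = {img["id"]: img["file_name"] for img in coco["images"]}
manifest_lines = [
    f'{id_to_name[ann["image_id"]]},{ann["attributes"]["text"]}'
    for ann in coco["annotations"]
]
```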
### 4. Manifest (`manifest`)
- Direct manifest file input
- Supports existing manifest format with OCR annotations
## Configuration
### Basic Configuration
```toml
[dataset]
source = "demo_dataset"

[model]
task = "ocr"
architecture = "tps_resnet_bilstm_attn"

[training]
duration = 100
batch_size = 32
num_workers = 4
learning_rate = 0.001
optimizer_type = "adam"

[cloud_provider]
platform = "GCP"
config.project = "your-project"
config.location = "us-central1"
config.machine_type = "n1-standard-4"

[export]
path = "gs://your-bucket/models/ocr_model.pth"
root = "gs://your-bucket/pipelines"

[wandb]
project = "protege-ocr"
```
### OCR-Specific Parameters
The pipeline automatically handles OCR-specific parameters with sensible defaults:
- Image Processing: 32x250 grayscale images (optimized for license plates)
- Model Architecture: TPS + ResNet + BiLSTM + Attention
- Vocabulary: Alphanumeric characters (0-9, a-z, A-Z for license plates)
- Training: Attention-based sequence prediction (the `Attn` head)
- Data Splits: Automatic 80/20 train/validation split
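The automatic 80/20 split can be illustrated as a seeded shuffle followed by a cut. This is a sketch; the actual splitting code and seed handling live inside the pipeline:

```python
# Sketch of an automatic 80/20 train/validation split over manifest entries.
import random

def split_samples(samples: list, train_fraction: float = 0.8, seed: int = 42):
    shuffled = samples[:]                  # avoid mutating the caller's list
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, val = split_samples([f"img_{i}.png" for i in range(10)])
```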
### Advanced Configuration
You can customize OCR parameters by modifying the training configuration:
```toml
[training]
# Standard parameters
duration = 100
batch_size = 32
learning_rate = 0.001

# OCR-specific parameters (handled automatically)
ocr.transformation = "TPS"        # Spatial transformation
ocr.feature_extraction = "ResNet" # Feature extractor
ocr.sequence_modeling = "BiLSTM"  # Sequence model
ocr.prediction = "Attn"           # Prediction method
ocr.img_height = 32               # Image height
ocr.img_width = 250               # Image width (optimized for license plates)
ocr.input_channel = 1             # Grayscale input
ocr.output_channel = 512          # Feature extractor output
ocr.hidden_size = 256             # LSTM hidden state
ocr.batch_max_length = 10         # Max characters per text
ocr.character = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ" # Character set
```
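The `ocr.character` string defines the model vocabulary; attention decoders additionally prepend special tokens. This sketch follows the `[GO]`/`[s]` token convention used by EasyOCR and deep-text-recognition-benchmark — treat those token names as an assumption and confirm against your training code:

```python
# Sketch: build the character-to-index vocabulary for an attention decoder.
# The [GO] and [s] (end-of-sequence) tokens follow the EasyOCR convention;
# confirm against your actual training code.

character = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
vocab = ["[GO]", "[s]"] + list(character)
char_to_idx = {ch: i for i, ch in enumerate(vocab)}

def encode(text: str, batch_max_length: int = 10) -> list:
    """Map a label to token ids, appending the end-of-sequence token."""
    if len(text) > batch_max_length:
        raise ValueError(f"text longer than batch_max_length={batch_max_length}")
    return [char_to_idx[c] for c in text] + [char_to_idx["[s]"]]

ids = encode("ABC1234")
```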
## Usage
### 1. Prepare Your Data
For demo dataset:
```shell
# Upload your data to GCS
gsutil cp -r your_ocr_data gs://your-bucket/demo_dataset/
```
### 2. Create Configuration File
Use the provided examples:
- `examples/ocr.toml` - Basic OCR training
- `examples/ocr_lp.toml` - License plate optimized training
### 3. Run the Pipeline
Using the protege CLI:

```shell
protege run examples/ocr_lp.toml
```

Or using the Python API:

```python
from protege.pipelines import run_pipeline
from protege.pipelines.job_spec import JobSpec

job_spec = JobSpec.from_toml("examples/ocr_lp.toml")
pipeline = run_pipeline(job_spec)
```
## Pipeline Differences from Main Pipeline
| Feature | Main Pipeline | OCR Pipeline |
|---|---|---|
| Lineage Tracking | ✅ Full support | ❌ Removed for simplicity |
| Training Artifact | H5 format | LMDB format |
| Model Architecture | Task-specific | OCR-specific (EasyOCR-based) |
| Hyperparameter Search | ✅ Supported | ❌ Not implemented |
| Evaluation | Task-specific metrics | OCR-specific metrics |
| Data Sources | All supported | All supported |
| Data Splits | Manual configuration | Automatic 80/20 split |
| Character Set | N/A | Configurable (alphanumeric) |
## Runtime Integration
The OCR pipeline integrates with protege-runtime for inference:
```python
from protege.model_runtime import Runtime, Device

# Load trained OCR model
runtime = Runtime(
    artifact_path="path/to/ocr_model.zip",
    device=Device.CPU,
    confidence_threshold=0.5,
)

# Run inference
results = runtime.inference(images, post_process=True)
```
## Troubleshooting
### Common Issues
- **LMDB Creation Fails**
  - Ensure images are accessible from GCS
  - Check manifest format (`image_name,text` pairs)
  - Verify image files exist and are readable
  - Check service account permissions
- **Training Fails**
  - Check GPU availability if using accelerators
  - Verify batch size fits in memory
  - Ensure vocabulary matches your data
  - Check character set configuration
- **Data Loading Issues**
  - Verify GCS paths are correct
  - Check service account permissions
  - Ensure manifest format is valid
  - Verify train/validation split generation
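For manifest-related failures, a quick local sanity check before building the LMDB artifact can save a pipeline run. This is a sketch; the validation rules and paths are assumptions to adapt to your data:

```python
# Sketch: validate a demo-style manifest (one image_name,text pair per line).
import os

def validate_manifest(manifest_path: str, images_dir: str = "") -> list:
    """Return a list of human-readable problems; empty means the manifest looks valid."""
    problems = []
    with open(manifest_path) as f:
        for lineno, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            if "," not in line:
                problems.append(f"line {lineno}: missing ',' separator")
                continue
            name, text = line.split(",", 1)
            if not text:
                problems.append(f"line {lineno}: empty label for {name}")
            # Only check file existence when a local images directory is given.
            if images_dir and not os.path.exists(os.path.join(images_dir, name)):
                problems.append(f"line {lineno}: image not found: {name}")
    return problems
```

Run it against your `train_manifest.txt` and fix any reported lines before ingestion.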
### Debugging
Enable verbose logging by setting environment variables:
```shell
export LOG_LEVEL=DEBUG
export WANDB_MODE=online
```
### Local Testing Issues
- **Import Errors**
  - Ensure you're running from the correct directory
  - Check that `PYTHONPATH` includes the `src` directory
  - Verify all `__init__.py` files exist
- **Mock Component Issues**
  - Mock components create dummy files for testing
  - They don't perform actual training or export
  - Use them for pipeline flow testing only
## Examples
See the following examples for complete working configurations:
- `examples/ocr.toml` - Basic OCR training
- `examples/ocr_lp.toml` - License plate optimized training
## Integration with Protege Framework
The OCR pipeline integrates seamlessly with the Protege framework:
- JobSpec Compatibility: Uses the same JobSpec structure
- Component Reuse: Reuses data ingestion components
- Cloud Provider Support: Full GCP support
- Wandb Integration: Automatic experiment tracking
- Model Export: Standard model export format
- Runtime Integration: Compatible with protege-runtime for inference
## Current Status
### ✅ Implemented Features
- LMDB training artifact creation with automatic splits
- TPS-ResNet-BiLSTM-Attn architecture
- License plate optimization
- Multi-source data ingestion (GCS, Encord, Manifest)
- Automatic hyperparameter management
- Runtime integration for inference
- Local testing and debugging tools
- Complete documentation and examples