# Sweep Parameters

## Overview

The sweep automatically tests all supported model architectures to find the best configuration for your dataset. This document explains the key parameters you need to configure.
## Main Parameters

### `[sweep]` - Main Section

#### `epochs_per_sweep`

- Type: Integer
- Required: Yes
- Description: Number of epochs for each sweep experiment
- Recommendation:
  - Small datasets (<1000 images): 20-50 epochs
  - Medium datasets (1000-5000 images): 10-30 epochs
  - Large datasets (>5000 images): 5-20 epochs
#### `num_sweep_workers`

- Type: Integer
- Required: Yes
- Description: Number of parallel workers used to run experiments
- Recommendation:
  - Limited resources: 2-4 workers
  - Moderate resources: 4-6 workers
  - Abundant resources: 6-8 workers
#### `method`

- Type: String
- Required: No (defaults to `"random"`)
- Values: `"random"`, `"grid"`, or `"bayes"`
- Description: Sweep method to use for hyperparameter search
- Recommendation:
  - `"random"`: Faster, tests a subset of combinations (~117 runs for classification)
  - `"grid"`: Slower, tests all possible combinations (624 runs for classification)
  - `"bayes"`: Bayesian optimization, tests ~50 runs
#### `run_cap`

- Type: Integer
- Required: No (calculated automatically if not specified)
- Description: Maximum number of runs for the entire sweep
- Behavior:
  - If specified: uses the provided value
  - If not specified: calculated automatically based on task and method
  - If the calculation fails: defaults to 100 runs
- Recommendation: Leave empty for automatic calculation, or specify a value for custom control
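The precedence above (explicit value, then automatic calculation, then the 100-run fallback) can be sketched as follows. The function and table names are illustrative, not the actual implementation; the per-task values are the empirical run counts quoted later in this document.

```python
from typing import Optional

DEFAULT_RUN_CAP = 100  # fallback used when automatic calculation fails

# Empirical random-method run counts quoted in this document, keyed by task.
RANDOM_RUN_CAPS = {
    "classification": 117,
    "object_detection": 60,
    "instance_segmentation": 20,
    "semantic_segmentation": 30,
}

def resolve_run_cap(task: str, method: str, run_cap: Optional[int] = None) -> int:
    """Mirror the precedence described above: explicit value first,
    then an automatic per-task/per-method value, then the 100-run default."""
    if run_cap is not None:
        return run_cap  # user-specified value wins
    if method == "bayes":
        return 50  # ~50 runs for Bayesian optimization
    if method == "random" and task in RANDOM_RUN_CAPS:
        return RANDOM_RUN_CAPS[task]
    # For grid, the real system enumerates all combinations for the task;
    # this sketch simply falls back to the default.
    return DEFAULT_RUN_CAP
```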
## Supported Architectures

### ResNet Family

- `resnet18`
- `resnet26`
- `resnet34`
- `resnet50`
- `resnet101`

### EfficientNet Family

- `efficientnet_b0`
- `efficientnet_b1`
- `efficientnet_b2`
- `efficientnet_b3`
- `efficientnet_b4`
- `efficientnet_b5`

### Swin Transformer Family

- `swinv2_base_window8_256`
- `swinv2_base_window12_192`
- `swinv2_base_window12to16_192to256`
- `swinv2_base_window12to24_192to384`
- `swinv2_base_window16_256`
- `swinv2_cr_small_224`
- `swinv2_cr_small_ns_224`
- `swinv2_cr_tiny_ns_224`
- `swinv2_large_window12_192`
- `swinv2_large_window12to16_192to256`
- `swinv2_large_window12to24_192to384`
- `swinv2_small_window8_256`
- `swinv2_small_window16_256`
- `swinv2_tiny_window8_256`
- `swinv2_tiny_window16_256`
## Configuration Examples

### 1. Standard Sweep (Random Method)

```toml
[sweep]
epochs_per_sweep = 10
num_sweep_workers = 4
method = "random"
# Tests a subset of combinations for faster results
# run_cap calculated automatically (~117 runs for classification)
```

### 2. Comprehensive Sweep (Grid Method)

```toml
[sweep]
epochs_per_sweep = 10
num_sweep_workers = 4
method = "grid"
# Tests all possible combinations for best results
# run_cap calculated automatically (all combinations)
```

### 3. Custom Run Cap

```toml
[sweep]
epochs_per_sweep = 10
num_sweep_workers = 4
method = "random"
run_cap = 50  # Custom limit of 50 runs
```

### 4. Bayesian Optimization

```toml
[sweep]
epochs_per_sweep = 10
num_sweep_workers = 4
method = "bayes"
# Bayesian optimization with ~50 runs
```

### 5. Small Datasets

```toml
[sweep]
epochs_per_sweep = 30
num_sweep_workers = 3
method = "random"
# More epochs for better results on small datasets
```

### 6. Limited Resources

```toml
[sweep]
epochs_per_sweep = 50
num_sweep_workers = 2
method = "random"
# Fewer workers, more epochs per experiment
```

### 7. Large Datasets

```toml
[sweep]
epochs_per_sweep = 20
num_sweep_workers = 6
method = "grid"
# More workers for faster parallelization, grid for comprehensive search
```
## Sweep Methods

### Random Method (`method = "random"`)

- Behavior: Tests a random subset of hyperparameter combinations
- Speed: Faster execution
- Coverage: Limited but diverse sampling
- Use Case: Quick exploration, limited resources, initial screening
- Typical Runs:
  - Classification: ~117 runs
  - Object Detection: ~60 runs
  - Instance Segmentation: ~20 runs
  - Semantic Segmentation: ~30 runs

### Grid Method (`method = "grid"`)

- Behavior: Tests all possible hyperparameter combinations
- Speed: Slower execution
- Coverage: Comprehensive search
- Use Case: Thorough optimization, final tuning, research
- Typical Runs: All possible combinations (e.g., 624 runs for classification tasks)

### Bayes Method (`method = "bayes"`)

- Behavior: Bayesian optimization for efficient hyperparameter search
- Speed: Moderate execution
- Coverage: Intelligent sampling based on previous results
- Use Case: Efficient optimization, limited computational budget
- Typical Runs: ~50 runs
### Method Selection Guide

| Scenario | Recommended Method | Reasoning |
|---|---|---|
| Quick exploration | `"random"` | Fast results for initial assessment |
| Limited resources | `"random"` | Reduced computational cost |
| Efficient optimization | `"bayes"` | Intelligent sampling for better results |
| Final optimization | `"grid"` | Comprehensive search for best results |
| Research projects | `"grid"` | Complete coverage of parameter space |
| Production deployment | `"grid"` | Best possible model selection |
## Resource Calculation

### Formula for Total Experiments

#### Grid Method (Comprehensive Search)

Total Experiments = All Supported Architectures for Task × Number of Hyperparameter Combinations

#### Random Method (Subset Search)

Total Experiments ≈ Task-specific empirical values:

- Classification: ~117 runs
- Object Detection: ~60 runs
- Instance Segmentation: ~20 runs
- Semantic Segmentation: ~30 runs

#### Bayes Method (Intelligent Search)

Total Experiments ≈ 50 runs (Bayesian optimization)
### Practical Examples

#### Classification Task (Grid Method)

- 26 architectures × 4 backbone layers × 2 optimizers × 3 learning rates = 624 experiments
- With 4 `num_sweep_workers`: 156 experiments per worker
- With 10 `epochs_per_sweep`: 1,560 epochs per worker

#### Classification Task (Random Method)

- ~117 experiments (empirical value)
- With 4 `num_sweep_workers`: ~29 experiments per worker
- With 10 `epochs_per_sweep`: ~290 epochs per worker

#### Object Detection Task (Random Method)

- ~60 experiments (empirical value)
- With 4 `num_sweep_workers`: ~15 experiments per worker
- With 10 `epochs_per_sweep`: ~150 epochs per worker

#### Bayes Method (Any Task)

- ~50 experiments (Bayesian optimization)
- With 4 `num_sweep_workers`: ~12 experiments per worker
- With 10 `epochs_per_sweep`: ~120 epochs per worker
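The per-worker figures above follow from dividing the run count across workers and multiplying by the epochs per experiment. A minimal sketch (the helper name is illustrative, not part of the actual codebase):

```python
def worker_load(total_runs: int, num_sweep_workers: int,
                epochs_per_sweep: int) -> tuple[int, int]:
    """Approximate experiments and training epochs handled by each worker."""
    runs_per_worker = round(total_runs / num_sweep_workers)
    return runs_per_worker, runs_per_worker * epochs_per_sweep

# Grid classification:   624 runs / 4 workers -> 156 runs, 1,560 epochs per worker
# Random classification: ~117 runs / 4 workers -> ~29 runs, ~290 epochs per worker
```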
## Recommended Configurations by Dataset Size

### Small Datasets (<1000 images)

```toml
[sweep]
epochs_per_sweep = 50
num_sweep_workers = 2
method = "random"  # Faster exploration
# run_cap calculated automatically
```

### Medium Datasets (1000-5000 images)

```toml
[sweep]
epochs_per_sweep = 30
num_sweep_workers = 4
method = "bayes"  # Efficient optimization
# run_cap calculated automatically (~50 runs)
```

### Large Datasets (>5000 images)

```toml
[sweep]
epochs_per_sweep = 20
num_sweep_workers = 6
method = "grid"  # Best model selection
# run_cap calculated automatically (all combinations)
```
## Optimization Tips

### 1. For Limited Resources

- Use `method = "random"` for faster exploration
- Increase `epochs_per_sweep` for better results with fewer experiments
- Decrease `num_sweep_workers` to reduce parallel load
- Let `run_cap` be calculated automatically

### 2. For Limited Time

- Use `method = "random"` or `method = "bayes"` for quicker results
- Increase `num_sweep_workers` if possible for faster parallelization
- Reduce `epochs_per_sweep` to shorten each experiment
- Set a custom `run_cap` if needed (e.g., `run_cap = 30`)

### 3. For Better Performance

- Use `method = "grid"` for comprehensive search
- Use `method = "bayes"` for efficient optimization
- Increase `epochs_per_sweep` for small datasets
- Use more `num_sweep_workers` for parallelization
- Let `run_cap` be calculated automatically
## Troubleshooting

### Problem: Sweep too slow

Solution: Use `method = "random"` or increase `num_sweep_workers` for faster parallelization.

### Problem: Poor results

Solution: Use `method = "grid"` for comprehensive search or increase `epochs_per_sweep` for better training.

### Problem: Insufficient memory

Solution: Reduce `batch_size` or `num_sweep_workers`.

### Problem: Overfitting

Solution: Increase augmentation or reduce model complexity.

### Problem: Too many runs with grid method

Solution: Switch to `method = "random"` or `method = "bayes"` for faster exploration.

### Problem: Need custom run limit

Solution: Set `run_cap = <desired_number>` in the TOML configuration.

### Problem: Automatic calculation fails

Solution: The system defaults to 100 runs automatically, or specify `run_cap` manually.
## Important Considerations

- Execution Time: A complete sweep can take several hours or days
- Resources: Monitor CPU/GPU/memory usage
- Storage: Each experiment saves checkpoints
- Selection: The best model is selected automatically
- Reproducibility: Use fixed seeds for consistent results
- Method Choice:
  - Random for exploration
  - Bayes for efficient optimization
  - Grid for comprehensive search
- Run Cap: Calculated automatically based on task and method, or specified manually
## Technical Implementation

### How Parameters Are Used Internally

#### `epochs_per_sweep` Flow

1. **TOML Configuration** → `job_spec.sweep.epochs_per_sweep`
2. **Pipeline** → Passed to `hyperparam_search` as `epochs`
3. **Sweep Config** → Used in the W&B sweep configuration
4. **Sweep Worker** → Executed in the training loop

```python
# Pipeline (pipeline.py)
"epochs": job_spec.sweep.epochs_per_sweep,
"run_cap": job_spec.sweep.run_cap,  # Optional

# Sweep Config (sweep.py)
"epochs": {"value": epochs},  # Fixed for all experiments
"run_cap": run_cap,  # Limits total experiments

# Sweep Worker (utils.py)
for epoch in range(1, config["epochs"] + 1):
    train_loss = train_model(dataloader=train_dl, model=model, optimizer=optimizer)
    wandb.log({"train_loss": train_loss, "epoch": epoch})
```
#### `num_sweep_workers` Flow
1. **TOML Configuration** → `job_spec.sweep.num_sweep_workers`
2. **Pipeline** → Passed to `hyperparam_search`
3. **Hyperparam Search** → Creates multiple Vertex AI jobs
4. **Vertex AI** → Executes workers in parallel
#### `run_cap` Flow
1. **TOML Configuration** → `job_spec.sweep.run_cap` (optional)
2. **Pipeline** → Passed to `hyperparam_search`
3. **Hyperparam Search** → Calculated automatically if not specified
4. **Sweep Config** → Used in W&B sweep configuration
5. **W&B** → Limits total number of experiments
```python
# Pipeline (pipeline.py)
"num_sweep_workers": job_spec.sweep.num_sweep_workers,

# Hyperparam Search (components.py)
for i in range(num_sweep_workers):
    job = create_job(
        client=vertex_client,
        name=f"sweep_worker_{i}",  # Unique name for each worker
        container=container,
        command=["python3", "-m", "protege.pipelines.sweep_worker"],
        args=[sweep_id, wandb_project],
        # ...
    )
    futs.append(executor.submit(monitor_job, vertex_client, job.name))
wait(futs, return_when=ALL_COMPLETED)  # Wait for all workers to finish
```
### Parameter Interaction

#### Resource Calculation Example

```python
# Example with typical values:
epochs_per_sweep = 10
num_sweep_workers = 4

# Each experiment trains for the full epoch count:
epochs_per_experiment = epochs_per_sweep  # = 10

# Epochs in flight across all workers at any one time:
total_parallel_epochs = num_sweep_workers * epochs_per_sweep  # = 40

# Estimated time per batch of parallel experiments (assuming 1 minute per
# epoch): epochs within an experiment run sequentially, so each batch takes
# about epochs_per_sweep minutes regardless of the worker count.
estimated_minutes_per_batch = epochs_per_sweep  # = 10
```
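Extending this, a rough wall-clock estimate for an entire sweep multiplies the number of parallel batches by the time per batch. A sketch under the same one-minute-per-epoch assumption (the function name and the per-epoch figure are illustrative):

```python
import math

def estimate_sweep_minutes(total_runs: int, num_sweep_workers: int,
                           epochs_per_sweep: int,
                           minutes_per_epoch: float = 1.0) -> float:
    """Rough wall-clock estimate: workers parallelize across experiments,
    while each experiment runs all of its epochs sequentially."""
    rounds = math.ceil(total_runs / num_sweep_workers)  # batches of parallel runs
    return rounds * epochs_per_sweep * minutes_per_epoch

# ~117 random-method classification runs, 4 workers, 10 epochs each:
# 30 rounds of ~10 minutes, i.e. roughly 300 minutes (~5 hours).
```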
#### Parallel Execution Strategy

- Workers run simultaneously on Vertex AI
- Each worker picks different hyperparameter combinations
- `ThreadPoolExecutor` manages parallel execution
- W&B coordinates experiment distribution
- All workers are waited on before the pipeline returns
### Validation and Error Handling

#### Required Parameters

```python
# job_spec.py validation
assert (
    "epochs_per_sweep" in job_spec_dict["sweep"]
), "must specify sweep.epochs_per_sweep when hyperparam sweep is enabled"
assert (
    "num_sweep_workers" in job_spec_dict["sweep"]
), "must specify sweep.num_sweep_workers when hyperparam sweep is enabled"
```
#### Common Error Scenarios

| Problem | Cause | Solution |
|---|---|---|
| Sweep too slow | Few workers | Increase `num_sweep_workers` |
| Poor results | Few epochs | Increase `epochs_per_sweep` |
| Memory issues | Too many workers | Reduce `num_sweep_workers` |
| Timeout | Too many epochs | Reduce `epochs_per_sweep` |
| Too many runs | Grid method | Use `method = "random"` or `method = "bayes"` |
| Need custom limit | Default calculation | Set `run_cap = <number>` |
## Monitoring and Logging

### W&B Integration

- Each worker logs to a separate W&B run
- Metrics are aggregated automatically
- The best run is selected automatically
- Progress is tracked in real time

### Vertex AI Monitoring

- Job states are monitored continuously
- Error logs are captured and reported
- Resource usage is tracked
- Failed jobs are retried automatically
## Performance Optimization

### For Limited Resources

```toml
[sweep]
epochs_per_sweep = 50  # More epochs per experiment
num_sweep_workers = 2  # Fewer parallel workers
method = "random"      # Faster exploration
# run_cap calculated automatically
```

### For Limited Time

```toml
[sweep]
epochs_per_sweep = 10  # Fewer epochs per experiment
num_sweep_workers = 6  # More parallel workers
method = "bayes"       # Efficient optimization
run_cap = 30           # Custom limit for faster completion
```

### For Best Results

```toml
[sweep]
epochs_per_sweep = 30  # Balanced epochs
num_sweep_workers = 4  # Balanced workers
method = "grid"        # Comprehensive search
# run_cap calculated automatically (all combinations)
```