Sweep Parameters

Overview

The sweep automatically tests all supported model architectures to find the best configuration for your dataset. This document explains the key parameters you need to configure.

Main Parameters

[sweep] - Main Section

epochs_per_sweep

  • Type: Integer
  • Required: Yes
  • Description: Number of epochs for each sweep experiment
  • Recommendation:
    • Small datasets (<1000 images): 20-50 epochs
    • Medium datasets (1000-5000 images): 10-30 epochs
    • Large datasets (>5000 images): 5-20 epochs

num_sweep_workers

  • Type: Integer
  • Required: Yes
  • Description: Number of parallel workers to run experiments
  • Recommendation:
    • Limited resources: 2-4 workers
    • Moderate resources: 4-6 workers
    • Abundant resources: 6-8 workers

method

  • Type: String
  • Required: No (defaults to "random")
  • Values: "random", "grid", or "bayes"
  • Description: Sweep method to use for hyperparameter search
  • Recommendation:
    • "random": Faster, tests a subset of combinations (~117 runs for classification)
    • "grid": Slower, tests all possible combinations (624 runs for classification)
    • "bayes": Bayesian optimization, tests ~50 runs

run_cap

  • Type: Integer
  • Required: No (calculated automatically if not specified)
  • Description: Maximum number of runs for the entire sweep
  • Behavior:
    • If specified: Uses the provided value
    • If not specified: Calculated automatically based on task and method
    • If calculation fails: Defaults to 100 runs
  • Recommendation: Leave empty for automatic calculation, or specify for custom control
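The fallback behavior described above can be sketched as a small helper. This is a hypothetical illustration built from the run counts quoted in this document (`resolve_run_cap`, the task names, and the lookup table are assumptions, not the pipeline's actual code):

```python
# Hypothetical sketch of run_cap resolution, using the run counts
# quoted in this document; not the pipeline's real implementation.
EMPIRICAL_RUNS = {
    ("classification", "random"): 117,
    ("object_detection", "random"): 60,
    ("instance_segmentation", "random"): 20,
    ("semantic_segmentation", "random"): 30,
    ("classification", "grid"): 624,  # all combinations for classification
}

def resolve_run_cap(run_cap, task, method):
    if run_cap is not None:
        return run_cap  # a user-specified value always wins
    if method == "bayes":
        return 50  # ~50 runs for Bayesian optimization
    runs = EMPIRICAL_RUNS.get((task, method))
    if runs is None:
        return 100  # fallback when the calculation fails
    return runs
```

For example, `resolve_run_cap(None, "classification", "random")` returns 117, while an unknown task falls back to 100.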

Supported Architectures

ResNet Family

  • resnet18
  • resnet26
  • resnet34
  • resnet50
  • resnet101

EfficientNet Family

  • efficientnet_b0
  • efficientnet_b1
  • efficientnet_b2
  • efficientnet_b3
  • efficientnet_b4
  • efficientnet_b5

Swin Transformer Family

  • swinv2_base_window8_256
  • swinv2_base_window12_192
  • swinv2_base_window12to16_192to256
  • swinv2_base_window12to24_192to384
  • swinv2_base_window16_256
  • swinv2_cr_small_224
  • swinv2_cr_small_ns_224
  • swinv2_cr_tiny_ns_224
  • swinv2_large_window12_192
  • swinv2_large_window12to16_192to256
  • swinv2_large_window12to24_192to384
  • swinv2_small_window8_256
  • swinv2_small_window16_256
  • swinv2_tiny_window8_256
  • swinv2_tiny_window16_256

Configuration Examples

1. Standard Sweep (Random Method)

[sweep]
epochs_per_sweep = 10
num_sweep_workers = 4
method = "random"
# Tests a subset of combinations for faster results
# run_cap calculated automatically (~117 runs for classification)

2. Comprehensive Sweep (Grid Method)

[sweep]
epochs_per_sweep = 10
num_sweep_workers = 4
method = "grid"
# Tests all possible combinations for best results
# run_cap calculated automatically (all combinations)

3. Custom Run Cap

[sweep]
epochs_per_sweep = 10
num_sweep_workers = 4
method = "random"
run_cap = 50 # Custom limit of 50 runs

4. Bayesian Optimization

[sweep]
epochs_per_sweep = 10
num_sweep_workers = 4
method = "bayes"
# Bayesian optimization with ~50 runs

5. Small Datasets

[sweep]
epochs_per_sweep = 30
num_sweep_workers = 3
method = "random"
# More epochs for better results on small datasets

6. Limited Resources

[sweep]
epochs_per_sweep = 50
num_sweep_workers = 2
method = "random"
# Fewer workers, more epochs per experiment

7. Large Datasets

[sweep]
epochs_per_sweep = 20
num_sweep_workers = 6
method = "grid"
# More workers for faster parallelization, grid for comprehensive search

Sweep Methods

Random Method (method = "random")

  • Behavior: Tests a random subset of hyperparameter combinations
  • Speed: Faster execution
  • Coverage: Limited but diverse sampling
  • Use Case: Quick exploration, limited resources, initial screening
  • Typical Runs:
    • Classification: ~117 runs
    • Object Detection: ~60 runs
    • Instance Segmentation: ~20 runs
    • Semantic Segmentation: ~30 runs

Grid Method (method = "grid")

  • Behavior: Tests all possible hyperparameter combinations
  • Speed: Slower execution
  • Coverage: Comprehensive search
  • Use Case: Thorough optimization, final tuning, research
  • Typical Runs: All possible combinations (e.g., 624 runs for classification tasks)

Bayes Method (method = "bayes")

  • Behavior: Bayesian optimization for efficient hyperparameter search
  • Speed: Moderate execution
  • Coverage: Intelligent sampling based on previous results
  • Use Case: Efficient optimization, limited computational budget
  • Typical Runs: ~50 runs

Method Selection Guide

| Scenario | Recommended Method | Reasoning |
| --- | --- | --- |
| Quick exploration | "random" | Fast results for initial assessment |
| Limited resources | "random" | Reduced computational cost |
| Efficient optimization | "bayes" | Intelligent sampling for better results |
| Final optimization | "grid" | Comprehensive search for best results |
| Research projects | "grid" | Complete coverage of parameter space |
| Production deployment | "grid" | Best possible model selection |

Resource Calculation

Formula for Total Experiments

The total number of experiments depends on the sweep method:

Grid method: Total Experiments = All Supported Architectures for Task × Number of Hyperparameter Combinations

Random method (task-specific empirical values):
  • Classification: ~117 runs
  • Object Detection: ~60 runs
  • Instance Segmentation: ~20 runs
  • Semantic Segmentation: ~30 runs

Bayes method: Total Experiments ≈ 50 runs (Bayesian optimization)
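For the grid method, the total is simply the product of the option counts. A minimal sketch using the classification factors cited later in this document (the factor names are taken from that example; the variable names here are illustrative):

```python
# Grid search size for classification, per the factors cited in this document:
architectures = 26    # ResNet (5) + EfficientNet (6) + SwinV2 (15)
backbone_layers = 4
optimizers = 2
learning_rates = 3

total_grid_runs = architectures * backbone_layers * optimizers * learning_rates
print(total_grid_runs)  # 624
```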

Practical Examples

Classification Task (Grid Method)

  • 26 architectures × 4 backbone layers × 2 optimizers × 3 learning rates = 624 experiments
  • With 4 num_sweep_workers: 156 experiments per worker
  • With 10 epochs_per_sweep: 1,560 epochs per worker

Classification Task (Random Method)

  • ~117 experiments (empirical value)
  • With 4 num_sweep_workers: ~29 experiments per worker
  • With 10 epochs_per_sweep: ~290 epochs per worker

Object Detection Task (Random Method)

  • ~60 experiments (empirical value)
  • With 4 num_sweep_workers: ~15 experiments per worker
  • With 10 epochs_per_sweep: ~150 epochs per worker

Bayes Method (Any Task)

  • ~50 experiments (Bayesian optimization)
  • With 4 num_sweep_workers: ~12 experiments per worker
  • With 10 epochs_per_sweep: ~120 epochs per worker
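The per-worker figures in the examples above follow from integer division. A small hypothetical helper (not part of the pipeline) that reproduces them:

```python
def per_worker_load(total_runs: int, num_workers: int, epochs_per_run: int):
    """Approximate experiments and epochs handled by each worker.

    Hypothetical helper for illustration; uses floor division, so the
    results are lower bounds when total_runs is not evenly divisible.
    """
    runs_per_worker = total_runs // num_workers
    return runs_per_worker, runs_per_worker * epochs_per_run

# Classification with the random method (~117 runs, 4 workers, 10 epochs):
print(per_worker_load(117, 4, 10))  # (29, 290)
```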

Small Datasets (<1000 images)

[sweep]
epochs_per_sweep = 50
num_sweep_workers = 2
method = "random" # Faster exploration
# run_cap calculated automatically

Medium Datasets (1000-5000 images)

[sweep]
epochs_per_sweep = 30
num_sweep_workers = 4
method = "bayes" # Efficient optimization
# run_cap calculated automatically (~50 runs)

Large Datasets (>5000 images)

[sweep]
epochs_per_sweep = 20
num_sweep_workers = 6
method = "grid" # Best model selection
# run_cap calculated automatically (all combinations)

Optimization Tips

1. For Limited Resources

  • Use method = "random" for faster exploration
  • Increase epochs_per_sweep for better results with fewer experiments
  • Decrease num_sweep_workers to reduce parallel load
  • Let run_cap be calculated automatically

2. For Limited Time

  • Use method = "random" or method = "bayes" for quicker results
  • Increase num_sweep_workers if possible for faster parallelization
  • Reduce epochs_per_sweep for quicker results
  • Set custom run_cap if needed (e.g., run_cap = 30)

3. For Better Performance

  • Use method = "grid" for comprehensive search
  • Use method = "bayes" for efficient optimization
  • Increase epochs_per_sweep for small datasets
  • Use more num_sweep_workers for parallelization
  • Let run_cap be calculated automatically

Troubleshooting

Problem: Sweep too slow

Solution: Use method = "random" or increase num_sweep_workers for faster parallelization

Problem: Poor results

Solution: Use method = "grid" for comprehensive search or increase epochs_per_sweep for better training

Problem: Insufficient memory

Solution: Reduce batch_size or num_sweep_workers

Problem: Overfitting

Solution: Increase augmentation or reduce model complexity

Problem: Too many runs with grid method

Solution: Switch to method = "random" or method = "bayes" for faster exploration

Problem: Need custom run limit

Solution: Set run_cap = <desired_number> in the TOML configuration

Problem: Automatic calculation fails

Solution: The system defaults to 100 runs automatically, or specify run_cap manually

Important Considerations

  1. Execution Time: A complete sweep can take several hours or days
  2. Resources: Monitor CPU/GPU/memory usage
  3. Storage: Each experiment saves checkpoints
  4. Selection: Best model is automatically selected
  5. Reproducibility: Use fixed seeds for consistent results
  6. Method Choice:
    • Random for exploration
    • Bayes for efficient optimization
    • Grid for comprehensive search
  7. Run Cap: Automatically calculated based on task and method, or specify manually
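For the reproducibility point above, fixing seeds at the start of each experiment is the usual approach. A minimal, stdlib-only sketch (the `set_seed` helper is illustrative; a real training pipeline would also seed numpy and torch):

```python
import random

def set_seed(seed: int) -> None:
    """Seed Python's RNG; a real pipeline would also seed numpy/torch,
    e.g. np.random.seed(seed) and torch.manual_seed(seed)."""
    random.seed(seed)

set_seed(42)
first = random.random()
set_seed(42)
assert random.random() == first  # same seed, same sequence
```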

Technical Implementation

How Parameters Are Used Internally

epochs_per_sweep Flow

  1. TOML Configuration → job_spec.sweep.epochs_per_sweep
  2. Pipeline → Passed to hyperparam_search as epochs
  3. Sweep Config → Used in W&B sweep configuration
  4. Sweep Worker → Executed in training loop
# Pipeline (pipeline.py)
"epochs": job_spec.sweep.epochs_per_sweep,
"run_cap": job_spec.sweep.run_cap, # Optional

# Sweep Config (sweep.py)
"epochs": {"value": epochs}, # Fixed for all experiments
"run_cap": run_cap, # Limits total experiments

# Sweep Worker (utils.py)
for epoch in range(1, config["epochs"] + 1):
    train_loss = train_model(dataloader=train_dl, model=model, optimizer=optimizer)
    wandb.log({"train_loss": train_loss, "epoch": epoch})

num_sweep_workers Flow

  1. TOML Configuration → job_spec.sweep.num_sweep_workers
  2. Pipeline → Passed to hyperparam_search
  3. Hyperparam Search → Creates multiple Vertex AI jobs
  4. Vertex AI → Executes workers in parallel

run_cap Flow

  1. TOML Configuration → job_spec.sweep.run_cap (optional)
  2. Pipeline → Passed to hyperparam_search
  3. Hyperparam Search → Calculated automatically if not specified
  4. Sweep Config → Used in W&B sweep configuration
  5. W&B → Limits total number of experiments

# Pipeline (pipeline.py)
"num_sweep_workers": job_spec.sweep.num_sweep_workers,

# Hyperparam Search (components.py)
for i in range(num_sweep_workers):
    job = create_job(
        client=vertex_client,
        name=f"sweep_worker_{i}",  # Unique name for each worker
        container=container,
        command=["python3", "-m", "protege.pipelines.sweep_worker"],
        args=[sweep_id, wandb_project],
        # ...
    )
    futs.append(executor.submit(monitor_job, vertex_client, job.name))

wait(futs, return_when=ALL_COMPLETED)  # Wait for all workers to finish

Parameter Interaction

Resource Calculation Example

# Example with typical values:
epochs_per_sweep = 10
num_sweep_workers = 4

# Epochs per experiment:
epochs_per_experiment = epochs_per_sweep  # = 10

# Epochs training in parallel at any moment:
total_parallel_epochs = num_sweep_workers * epochs_per_sweep  # = 40

# Estimated wall-clock time for ~117 random-method runs
# (assuming 1 minute per epoch):
estimated_time = (117 // num_sweep_workers) * epochs_per_sweep  # ≈ 290 minutes

Parallel Execution Strategy

  • Workers run simultaneously on Vertex AI
  • Each worker picks different hyperparameter combinations
  • ThreadPoolExecutor manages parallel execution
  • W&B coordinates experiment distribution
  • All workers wait for completion before returning
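The fan-out-and-wait pattern described above can be illustrated with the standard library alone. Here `monitor_stub` stands in for `monitor_job` and the Vertex AI client, so this is a sketch of the shape, not the pipeline's code:

```python
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

def monitor_stub(worker_id: int) -> str:
    # Stand-in for monitor_job(vertex_client, job.name)
    return f"sweep_worker_{worker_id}: done"

num_sweep_workers = 4
with ThreadPoolExecutor(max_workers=num_sweep_workers) as executor:
    futs = [executor.submit(monitor_stub, i) for i in range(num_sweep_workers)]
    # Block until every worker finishes, mirroring the pipeline's wait() call
    done, _ = wait(futs, return_when=ALL_COMPLETED)

results = sorted(f.result() for f in futs)
print(results[0])  # sweep_worker_0: done
```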

Validation and Error Handling

Required Parameters

# job_spec.py validation
assert (
    "epochs_per_sweep" in job_spec_dict["sweep"]
), "must specify sweep.epochs_per_sweep when hyperparam sweep is enabled"

assert (
    "num_sweep_workers" in job_spec_dict["sweep"]
), "must specify sweep.num_sweep_workers when hyperparam sweep is enabled"

Common Error Scenarios

| Problem | Cause | Solution |
| --- | --- | --- |
| Sweep too slow | Few workers | Increase num_sweep_workers |
| Poor results | Few epochs | Increase epochs_per_sweep |
| Memory issues | Too many workers | Reduce num_sweep_workers |
| Timeout | Too many epochs | Reduce epochs_per_sweep |
| Too many runs | Grid method | Use method = "random" or method = "bayes" |
| Need custom limit | Default calculation | Set run_cap = <number> |

Monitoring and Logging

W&B Integration

  • Each worker logs to separate W&B runs
  • Metrics are aggregated automatically
  • Best run is selected automatically
  • Real-time progress tracking

Vertex AI Monitoring

  • Job states are monitored continuously
  • Error logs are captured and reported
  • Resource usage is tracked
  • Automatic retry on failures

Performance Optimization

For Limited Resources

[sweep]
epochs_per_sweep = 50 # More epochs per experiment
num_sweep_workers = 2 # Fewer parallel workers
method = "random" # Faster exploration
# run_cap calculated automatically

For Limited Time

[sweep]
epochs_per_sweep = 10 # Fewer epochs per experiment
num_sweep_workers = 6 # More parallel workers
method = "bayes" # Efficient optimization
run_cap = 30 # Custom limit for faster completion

For Best Results

[sweep]
epochs_per_sweep = 30 # Balanced epochs
num_sweep_workers = 4 # Balanced workers
method = "grid" # Comprehensive search
# run_cap calculated automatically (all combinations)