# Sweep Parameters

## Overview

The sweep automatically tests all supported model architectures to find the best configuration for your dataset. This document explains the key parameters you need to configure.
## Main Parameters

### `[sweep]` - Main Section

#### `epochs_per_sweep`

- Type: Integer
- Required: Yes
- Description: Number of epochs for each sweep experiment
- Recommendation:
  - Small datasets (<1000 images): 20-50 epochs
  - Medium datasets (1000-5000 images): 10-30 epochs
  - Large datasets (>5000 images): 5-20 epochs
#### `num_sweep_workers`

- Type: Integer
- Required: Yes
- Description: Number of parallel workers used to run experiments
- Recommendation:
  - Limited resources: 2-4 workers
  - Moderate resources: 4-6 workers
  - Abundant resources: 6-8 workers
#### `method`

- Type: String
- Required: No (defaults to `"random"`)
- Values: `"random"`, `"grid"`, or `"bayes"`
- Description: Sweep method to use for hyperparameter search
- Recommendation:
  - `"random"`: Faster, tests a subset of combinations (~117 runs for classification)
  - `"grid"`: Slower, tests all possible combinations (624 runs for classification)
  - `"bayes"`: Bayesian optimization, tests ~50 runs
#### `run_cap`

- Type: Integer
- Required: No (calculated automatically if not specified)
- Description: Maximum number of runs for the entire sweep
- Behavior:
  - If specified: uses the provided value
  - If not specified: calculated automatically based on task and method
  - If the calculation fails: defaults to 100 runs
- Recommendation: Leave empty for automatic calculation, or specify a value for custom control
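The precedence above (explicit value, then automatic calculation, then the 100-run fallback) can be sketched as follows. The function and table names are illustrative, not the actual implementation; the per-task values are the empirical run counts quoted later in this document.

```python
from typing import Optional

DEFAULT_RUN_CAP = 100  # fallback used when automatic calculation fails

# Empirical random-method run counts quoted in this document, keyed by task.
RANDOM_RUN_CAPS = {
    "classification": 117,
    "object_detection": 60,
    "instance_segmentation": 20,
    "semantic_segmentation": 30,
}

def resolve_run_cap(task: str, method: str, run_cap: Optional[int] = None) -> int:
    """Mirror the precedence described above: explicit value first,
    then an automatic per-task/per-method value, then the 100-run default."""
    if run_cap is not None:
        return run_cap  # user-specified value wins
    if method == "bayes":
        return 50  # ~50 runs for Bayesian optimization
    if method == "random" and task in RANDOM_RUN_CAPS:
        return RANDOM_RUN_CAPS[task]
    # For grid, the real system enumerates all combinations for the task;
    # this sketch simply falls back to the default.
    return DEFAULT_RUN_CAP
```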
## Supported Architectures

### ResNet Family

- `resnet18`
- `resnet26`
- `resnet34`
- `resnet50`
- `resnet101`

### EfficientNet Family

- `efficientnet_b0`
- `efficientnet_b1`
- `efficientnet_b2`
- `efficientnet_b3`
- `efficientnet_b4`
- `efficientnet_b5`

### Swin Transformer Family

- `swinv2_base_window8_256`
- `swinv2_base_window12_192`
- `swinv2_base_window12to16_192to256`
- `swinv2_base_window12to24_192to384`
- `swinv2_base_window16_256`
- `swinv2_cr_small_224`
- `swinv2_cr_small_ns_224`
- `swinv2_cr_tiny_ns_224`
- `swinv2_large_window12_192`
- `swinv2_large_window12to16_192to256`
- `swinv2_large_window12to24_192to384`
- `swinv2_small_window8_256`
- `swinv2_small_window16_256`
- `swinv2_tiny_window8_256`
- `swinv2_tiny_window16_256`
## Configuration Examples

### 1. Standard Sweep (Random Method)

```toml
[sweep]
epochs_per_sweep = 10
num_sweep_workers = 4
method = "random"
# Tests a subset of combinations for faster results
# run_cap calculated automatically (~117 runs for classification)
```

### 2. Comprehensive Sweep (Grid Method)

```toml
[sweep]
epochs_per_sweep = 10
num_sweep_workers = 4
method = "grid"
# Tests all possible combinations for best results
# run_cap calculated automatically (all combinations)
```

### 3. Custom Run Cap

```toml
[sweep]
epochs_per_sweep = 10
num_sweep_workers = 4
method = "random"
run_cap = 50  # Custom limit of 50 runs
```

### 4. Bayesian Optimization

```toml
[sweep]
epochs_per_sweep = 10
num_sweep_workers = 4
method = "bayes"
# Bayesian optimization with ~50 runs
```

### 5. Small Datasets

```toml
[sweep]
epochs_per_sweep = 30
num_sweep_workers = 3
method = "random"
# More epochs for better results on small datasets
```

### 6. Limited Resources

```toml
[sweep]
epochs_per_sweep = 50
num_sweep_workers = 2
method = "random"
# Fewer workers, more epochs per experiment
```

### 7. Large Datasets

```toml
[sweep]
epochs_per_sweep = 20
num_sweep_workers = 6
method = "grid"
# More workers for faster parallelization, grid for comprehensive search
```
## Sweep Methods

### Random Method (`method = "random"`)

- Behavior: Tests a random subset of hyperparameter combinations
- Speed: Faster execution
- Coverage: Limited but diverse sampling
- Use Case: Quick exploration, limited resources, initial screening
- Typical Runs:
  - Classification: ~117 runs
  - Object Detection: ~60 runs
  - Instance Segmentation: ~20 runs
  - Semantic Segmentation: ~30 runs

### Grid Method (`method = "grid"`)

- Behavior: Tests all possible hyperparameter combinations
- Speed: Slower execution
- Coverage: Comprehensive search
- Use Case: Thorough optimization, final tuning, research
- Typical Runs: All possible combinations (e.g., 624 runs for classification tasks)

### Bayes Method (`method = "bayes"`)

- Behavior: Bayesian optimization for efficient hyperparameter search
- Speed: Moderate execution
- Coverage: Intelligent sampling based on previous results
- Use Case: Efficient optimization, limited computational budget
- Typical Runs: ~50 runs
### Method Selection Guide

| Scenario | Recommended Method | Reasoning |
|---|---|---|
| Quick exploration | `"random"` | Fast results for initial assessment |
| Limited resources | `"random"` | Reduced computational cost |
| Efficient optimization | `"bayes"` | Intelligent sampling for better results |
| Final optimization | `"grid"` | Comprehensive search for best results |
| Research projects | `"grid"` | Complete coverage of parameter space |
| Production deployment | `"grid"` | Best possible model selection |
## Resource Calculation

### Formula for Total Experiments

#### Grid Method (Comprehensive Search)

Total Experiments = All Supported Architectures for Task × Number of Hyperparameter Combinations

#### Random Method (Subset Search)

Total Experiments ≈ Task-specific empirical values:

- Classification: ~117 runs
- Object Detection: ~60 runs
- Instance Segmentation: ~20 runs
- Semantic Segmentation: ~30 runs

#### Bayes Method (Intelligent Search)

Total Experiments ≈ 50 runs (Bayesian optimization)
### Practical Examples

#### Classification Task (Grid Method)

- 26 architectures × 4 backbone layers × 2 optimizers × 3 learning rates = 624 experiments
- With 4 `num_sweep_workers`: 156 experiments per worker
- With 10 `epochs_per_sweep`: 1,560 epochs per worker

#### Classification Task (Random Method)

- ~117 experiments (empirical value)
- With 4 `num_sweep_workers`: ~29 experiments per worker
- With 10 `epochs_per_sweep`: ~290 epochs per worker

#### Object Detection Task (Random Method)

- ~60 experiments (empirical value)
- With 4 `num_sweep_workers`: ~15 experiments per worker
- With 10 `epochs_per_sweep`: ~150 epochs per worker

#### Bayes Method (Any Task)

- ~50 experiments (Bayesian optimization)
- With 4 `num_sweep_workers`: ~12 experiments per worker
- With 10 `epochs_per_sweep`: ~120 epochs per worker
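The per-worker figures above follow from dividing the run count across workers and multiplying by the epochs per experiment. A minimal sketch (the helper name is illustrative, not part of the actual codebase):

```python
def worker_load(total_runs: int, num_sweep_workers: int,
                epochs_per_sweep: int) -> tuple[int, int]:
    """Approximate experiments and training epochs handled by each worker."""
    runs_per_worker = round(total_runs / num_sweep_workers)
    return runs_per_worker, runs_per_worker * epochs_per_sweep

# Grid classification:   624 runs / 4 workers -> 156 runs, 1,560 epochs per worker
# Random classification: ~117 runs / 4 workers -> ~29 runs, ~290 epochs per worker
```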
## Recommended Configurations by Dataset Size

### Small Datasets (<1000 images)

```toml
[sweep]
epochs_per_sweep = 50
num_sweep_workers = 2
method = "random"  # Faster exploration
# run_cap calculated automatically
```

### Medium Datasets (1000-5000 images)

```toml
[sweep]
epochs_per_sweep = 30
num_sweep_workers = 4
method = "bayes"  # Efficient optimization
# run_cap calculated automatically (~50 runs)
```

### Large Datasets (>5000 images)

```toml
[sweep]
epochs_per_sweep = 20
num_sweep_workers = 6
method = "grid"  # Best model selection
# run_cap calculated automatically (all combinations)
```
## Optimization Tips

### 1. For Limited Resources

- Use `method = "random"` for faster exploration
- Increase `epochs_per_sweep` for better results with fewer experiments
- Decrease `num_sweep_workers` to reduce parallel load
- Let `run_cap` be calculated automatically

### 2. For Limited Time

- Use `method = "random"` or `method = "bayes"` for quicker results
- Increase `num_sweep_workers` if possible for faster parallelization
- Reduce `epochs_per_sweep` to shorten each experiment
- Set a custom `run_cap` if needed (e.g., `run_cap = 30`)

### 3. For Better Performance

- Use `method = "grid"` for comprehensive search
- Use `method = "bayes"` for efficient optimization
- Increase `epochs_per_sweep` for small datasets
- Use more `num_sweep_workers` for parallelization
- Let `run_cap` be calculated automatically
## Troubleshooting

### Problem: Sweep too slow

Solution: Use `method = "random"` or increase `num_sweep_workers` for faster parallelization.

### Problem: Poor results

Solution: Use `method = "grid"` for comprehensive search or increase `epochs_per_sweep` for better training.

### Problem: Insufficient memory

Solution: Reduce `batch_size` or `num_sweep_workers`.

### Problem: Overfitting

Solution: Increase augmentation or reduce model complexity.

### Problem: Too many runs with grid method

Solution: Switch to `method = "random"` or `method = "bayes"` for faster exploration.

### Problem: Need custom run limit

Solution: Set `run_cap = <desired_number>` in the TOML configuration.

### Problem: Automatic calculation fails

Solution: The system defaults to 100 runs automatically, or specify `run_cap` manually.
## Important Considerations

- Execution Time: A complete sweep can take several hours or days
- Resources: Monitor CPU/GPU/memory usage
- Storage: Each experiment saves checkpoints
- Selection: The best model is selected automatically
- Reproducibility: Use fixed seeds for consistent results
- Method Choice:
  - Random for exploration
  - Bayes for efficient optimization
  - Grid for comprehensive search
- Run Cap: Calculated automatically based on task and method, or specified manually
## Technical Implementation

### How Parameters Are Used Internally

#### `epochs_per_sweep` Flow

1. **TOML Configuration** → `job_spec.sweep.epochs_per_sweep`
2. **Pipeline** → Passed to `hyperparam_search` as `epochs`
3. **Sweep Config** → Used in the W&B sweep configuration
4. **Sweep Worker** → Executed in the training loop

```python
# Pipeline (pipeline.py)
"epochs": job_spec.sweep.epochs_per_sweep,
"run_cap": job_spec.sweep.run_cap,  # Optional

# Sweep Config (sweep.py)
"epochs": {"value": epochs},  # Fixed for all experiments
"run_cap": run_cap,  # Limits total experiments

# Sweep Worker (utils.py)
for epoch in range(1, config["epochs"] + 1):
    train_loss = train_model(dataloader=train_dl, model=model, optimizer=optimizer)
    wandb.log({"train_loss": train_loss, "epoch": epoch})
```
#### `num_sweep_workers` Flow
1. **TOML Configuration** → `job_spec.sweep.num_sweep_workers`
2. **Pipeline** → Passed to `hyperparam_search`
3. **Hyperparam Search** → Creates multiple Vertex AI jobs
4. **Vertex AI** → Executes workers in parallel
#### `run_cap` Flow
1. **TOML Configuration** → `job_spec.sweep.run_cap` (optional)
2. **Pipeline** → Passed to `hyperparam_search`
3. **Hyperparam Search** → Calculated automatically if not specified
4. **Sweep Config** → Used in W&B sweep configuration
5. **W&B** → Limits total number of experiments
```python
# Pipeline (pipeline.py)
"num_sweep_workers": job_spec.sweep.num_sweep_workers,

# Hyperparam Search (components.py)
for i in range(num_sweep_workers):
    job = create_job(
        client=vertex_client,
        name=f"sweep_worker_{i}",  # Unique name for each worker
        container=container,
        command=["python3", "-m", "protege.pipelines.sweep_worker"],
        args=[sweep_id, wandb_project],
        # ...
    )
    futs.append(executor.submit(monitor_job, vertex_client, job.name))
wait(futs, return_when=ALL_COMPLETED)  # Wait for all workers to finish
```
### Parameter Interaction

#### Resource Calculation Example

```python
# Example with typical values:
epochs_per_sweep = 10
num_sweep_workers = 4

# Each experiment trains for the full epoch count:
epochs_per_experiment = epochs_per_sweep  # = 10

# Epochs in flight across all workers at any one time:
total_parallel_epochs = num_sweep_workers * epochs_per_sweep  # = 40

# Estimated time per batch of parallel experiments (assuming 1 minute per
# epoch): epochs within an experiment run sequentially, so each batch takes
# about epochs_per_sweep minutes regardless of the worker count.
estimated_minutes_per_batch = epochs_per_sweep  # = 10
```
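Extending this, a rough wall-clock estimate for an entire sweep multiplies the number of parallel batches by the time per batch. A sketch under the same one-minute-per-epoch assumption (the function name and the per-epoch figure are illustrative):

```python
import math

def estimate_sweep_minutes(total_runs: int, num_sweep_workers: int,
                           epochs_per_sweep: int,
                           minutes_per_epoch: float = 1.0) -> float:
    """Rough wall-clock estimate: workers parallelize across experiments,
    while each experiment runs all of its epochs sequentially."""
    rounds = math.ceil(total_runs / num_sweep_workers)  # batches of parallel runs
    return rounds * epochs_per_sweep * minutes_per_epoch

# ~117 random-method classification runs, 4 workers, 10 epochs each:
# 30 rounds of ~10 minutes, i.e. roughly 300 minutes (~5 hours).
```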
#### Parallel Execution Strategy

- Workers run simultaneously on Vertex AI
- Each worker picks different hyperparameter combinations
- `ThreadPoolExecutor` manages parallel execution
- W&B coordinates experiment distribution
- All workers are waited on before the pipeline returns
### Validation and Error Handling

#### Required Parameters

```python
# job_spec.py validation
assert (
    "epochs_per_sweep" in job_spec_dict["sweep"]
), "must specify sweep.epochs_per_sweep when hyperparam sweep is enabled"
assert (
    "num_sweep_workers" in job_spec_dict["sweep"]
), "must specify sweep.num_sweep_workers when hyperparam sweep is enabled"
```
#### Common Error Scenarios

| Problem | Cause | Solution |
|---|---|---|
| Sweep too slow | Few workers | Increase `num_sweep_workers` |
| Poor results | Few epochs | Increase `epochs_per_sweep` |
| Memory issues | Too many workers | Reduce `num_sweep_workers` |
| Timeout | Too many epochs | Reduce `epochs_per_sweep` |
| Too many runs | Grid method | Use `method = "random"` or `method = "bayes"` |
| Need custom limit | Default calculation | Set `run_cap = <number>` |
## Monitoring and Logging

### W&B Integration

- Each worker logs to a separate W&B run
- Metrics are aggregated automatically
- The best run is selected automatically
- Progress is tracked in real time

### Vertex AI Monitoring

- Job states are monitored continuously
- Error logs are captured and reported
- Resource usage is tracked
- Failed jobs are retried automatically
## Performance Optimization

### For Limited Resources

```toml
[sweep]
epochs_per_sweep = 50  # More epochs per experiment
num_sweep_workers = 2  # Fewer parallel workers
method = "random"      # Faster exploration
# run_cap calculated automatically
```

### For Limited Time

```toml
[sweep]
epochs_per_sweep = 10  # Fewer epochs per experiment
num_sweep_workers = 6  # More parallel workers
method = "bayes"       # Efficient optimization
run_cap = 30           # Custom limit for faster completion
```

### For Best Results

```toml
[sweep]
epochs_per_sweep = 30  # Balanced epochs
num_sweep_workers = 4  # Balanced workers
method = "grid"        # Comprehensive search
# run_cap calculated automatically (all combinations)
```