Unexpected issues can sometimes cause SmartML training jobs to fail. Common errors during the model training process include "Resources exhausted" and "Unknown error".
See below for steps to resolve these issues.
Troubleshooting steps and support level will vary depending on your account type.
- 1. The error you received during training, such as "Resources exhausted" or "Unknown error"
- 2. The URL of your model version, located here:
To troubleshoot model training failures, please provide support with this URL.
- "Resources exhausted" error:
This error occurs when a training virtual machine (VM) instance runs out of memory during training. You can view the memory usage of your training VMs in the Cloud Console. Even when you get this error, you might not see 100% memory usage on the VM, because services other than your training application also run on the VM and consume resources. On machine types with less memory, these services can consume a relatively large percentage of the total. For example, on an n1-standard-4 VM, services can consume up to 40% of the memory. To resolve the error, optimize the memory consumption of your training application or choose a larger machine type with more memory.
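As a rough illustration of the n1-standard-4 example above, the sketch below estimates how much memory is left for the training application once other VM services take their share. The 15 GB total comes from the published Compute Engine specification for n1-standard-4; the function name and the idea of treating 40% as a worst-case overhead are assumptions made for illustration, not something SmartML reports.

```python
def usable_memory_gb(total_gb: float, overhead_fraction: float) -> float:
    """Estimate memory left for training after other VM services take their share.

    overhead_fraction is the share of memory used by non-training services
    (hypothetical input; 0.40 is the worst case quoted for n1-standard-4).
    """
    if not 0.0 <= overhead_fraction < 1.0:
        raise ValueError("overhead_fraction must be in [0, 1)")
    return total_gb * (1.0 - overhead_fraction)


# n1-standard-4 has 15 GB of memory according to the Compute Engine docs.
N1_STANDARD_4_GB = 15.0

worst_case_gb = usable_memory_gb(N1_STANDARD_4_GB, 0.40)
print(f"Worst-case memory available for training: {worst_case_gb:.1f} GB")
```

If your training application's peak memory is close to or above this estimate, a larger machine type is likely the simpler fix.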
- Other training errors:
Your cloud administrator will need to access the Vertex AI logs to determine the cause:
- 1. Select your model version ID from the Name column. The status should be "Failed".
- 2. Click the "View Logs" button.
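A cloud administrator who prefers to query the logs programmatically could use something like the following sketch. It assumes the `google-cloud-logging` Python client is installed, that the training run is a Vertex AI custom training job logged under the `ml_job` resource type, and that you already know the underlying job ID. How a SmartML model version maps to that job ID is not covered here, so `job_id` and `project_id` are placeholders.

```python
def build_training_log_filter(job_id: str, min_severity: str = "ERROR") -> str:
    """Build a Cloud Logging filter for one training job (assumed ml_job resource)."""
    return (
        'resource.type="ml_job" '
        f'AND resource.labels.job_id="{job_id}" '
        f"AND severity>={min_severity}"
    )


def fetch_training_errors(project_id: str, job_id: str, limit: int = 20):
    """Return the most recent error-level log entries for the job.

    Requires Google Cloud credentials with permission to read logs.
    """
    # Imported here so the filter helper works without the client library.
    from google.cloud import logging as cloud_logging

    client = cloud_logging.Client(project=project_id)
    entries = client.list_entries(
        filter_=build_training_log_filter(job_id),
        order_by=cloud_logging.DESCENDING,
        max_results=limit,
    )
    return [(e.timestamp, e.severity, e.payload) for e in entries]


# Inspect the filter string before running a query.
print(build_training_log_filter("1234567890"))
```

The filter string can also be pasted into the Logs Explorer query box, which avoids the need for any client code.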
For further support in diagnosing the issue, contact your Plainsight account representative. To speed up the troubleshooting process, please provide the following:
- 1. The error you received during training, such as "Resources exhausted" or "Unknown error"
- 3. The following files from your model output, located in your application's storage bucket: