Training Errors

Errors in Training

SmartML training jobs can occasionally fail because of unexpected issues. Common errors during model training include:
Error message                   Description
Resources exhausted             Out of memory or out of capacity.
Training service unavailable    Training service is currently down.
Unknown error                   Training could not complete. Please try again. Contact support if the problem persists.
See below for steps on how to resolve these issues.

Troubleshooting

Troubleshooting steps and support level will vary depending on your account type.

On-Demand Users

Please contact support and provide:
  1. The training error you received during training, such as "Resources exhausted" or "Unknown error"
  2. The URL of your model version, located here:
To troubleshoot model training failures, please provide support with this URL.

Private Cloud/Enterprise Users

  • "Resources exhausted" error:
This error occurs if a training virtual machine (VM) instance runs out of memory during training. You can view the memory usage of your training VMs in the Cloud Console. Even when you get this error, you might not see 100% memory usage on the VM, because services other than your training application that run on the VM also consume resources. For machine types that have less memory, other services might consume a relatively large percentage of memory. For example, on an n1-standard-4 VM, services can consume up to 40% of the memory. You can optimize the memory consumption of your training application, or you can choose a larger machine type with more memory.
  • Other training errors:
Your cloud administrator will need to access the Vertex AI logs to determine the cause:
  1. Select your model version ID from the Name column. The status should be "Failed".
  2. Click the "View Logs" button.
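As a rough illustration of the "Resources exhausted" case above, the following sketch estimates how much memory is left for the training application once other services on the VM take their share. The function name and parameters are hypothetical and not part of any SmartML or Vertex AI API. The 15 GB figure assumes the standard n1-standard-4 configuration, and 40% is the upper bound quoted above, so treat the result as a worst-case estimate rather than a measured value.

```python
def usable_memory_gb(total_gb: float, overhead_fraction: float) -> float:
    """Estimate memory left for training after other VM services take their share.

    Hypothetical helper for illustration only; both inputs are assumptions
    supplied by the caller, not values read from the VM.
    """
    if not 0.0 <= overhead_fraction < 1.0:
        raise ValueError("overhead_fraction must be in the range [0, 1)")
    return total_gb * (1.0 - overhead_fraction)


# Assumption: an n1-standard-4 VM has 15 GB of memory, and other services
# consume up to 40% of it (the worst case described in the text).
worst_case_gb = usable_memory_gb(15.0, 0.40)
print(f"Worst-case memory available for training: {worst_case_gb:.1f} GB")
```

If your training application's peak memory is close to or above this estimate, that would be consistent with seeing the error while the Cloud Console shows less than 100% usage.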
For further support in diagnosing the issue, contact your Plainsight account representative. To speed up the troubleshooting process, please provide the following:
  1. The training error you received during training, such as "Resources exhausted" or "Unknown error"
  2. Your downloaded Vertex AI model logs.
  3. The following files from your model output, located in your application's storage bucket:
<googleProjectID>-uploads/organizations/<orgID>/models/<modelID>/versions/<modelVersionID>/training_config.json
<googleProjectID>-uploads/organizations/<orgID>/models/<modelID>/versions/<modelVersionID>/job_state.json
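To avoid typos when filling in the placeholders, the two paths above can be assembled from your IDs with a small helper. This is a minimal sketch: the function name and the example ID values are hypothetical, and the code only builds the path strings following the pattern shown above. It does not access the storage bucket.

```python
def support_file_paths(project_id: str, org_id: str, model_id: str, version_id: str) -> list[str]:
    """Build the two file paths requested by support, following the pattern above.

    Hypothetical helper for illustration; it only formats strings.
    """
    prefix = (
        f"{project_id}-uploads/organizations/{org_id}"
        f"/models/{model_id}/versions/{version_id}"
    )
    return [f"{prefix}/{name}" for name in ("training_config.json", "job_state.json")]


# Example with made-up IDs; substitute your own values.
for path in support_file_paths("my-project", "org-1", "model-2", "version-3"):
    print(path)
```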