TL;DR: You need to explicitly wait for the AI Foundry Hub to be ready before creating a private endpoint. Add a time delay in your Terraform code.
I have just recently come out of a several-day troubleshooting session with a somewhat complex AI Foundry deployment issue: private endpoints for secure networking, model deployments, and a fair amount of Terraform code.
A bespoke AI Foundry module (needed for some custom requirements not covered by this post) failed when used as part of a larger deployment, but succeeded when used on its own. Not an intermittent or arbitrary failure, but a consistent failure every time the module was used in a larger deployment, starting a couple of weeks ago with no apparent changes in code or environment.
This is the error message I got when running terraform apply:
| Error: creating Private Endpoint (Subscription: "---redacted---"
│ Resource Group Name: "---redacted---"
│ Private Endpoint Name: "---redacted---"): polling after CreateOrUpdate: polling failed: the Azure API returned the following error:
│
│ Status: "InternalServerError"
│ Code: ""
│ Message: "Call to Microsoft.MachineLearningServices/workspaces failed. Error message: InternalServerError"
│ Activity Id: ""
│
│ ---
│
│ API Response:
│
│ ----[start]----
│ {"status":"Failed","error":{"code":"InternalServerError","message":"Call to Microsoft.MachineLearningServices/workspaces failed. Error message: InternalServerError","details":[]}}
│ -----[end]-----
Like so many of my other troubleshooting sessions, this one started with a somewhat lacking error message: “Something has gone wrong” / “There was an error” - which honestly is not very helpful.
I started looking for changes in my environment over the past couple of weeks. The module test pipeline had been green just two weeks earlier, so I focused my attention on the last 14 days.
Nothing stood out 🤔
The module testing pipeline deploys all resources and cleans up after itself when everything works correctly, but this time it was failing to clean up the private endpoint. In an act of desperation I checked the private endpoint itself, and noticed that no DNS records had been created 👀
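If you want to check the same thing in your own environment, the DNS configuration a private endpoint reports can be inspected with the az CLI. The resource names below are placeholders, not the actual names from my deployment:

```shell
# Show the DNS configuration reported by the private endpoint;
# an empty customDnsConfigs array is a hint that no records were created
az network private-endpoint show \
  --name pep-aifoundry-hub \
  --resource-group rg-aifoundry \
  --query "customDnsConfigs"
```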
This led me to check why the DNS records were missing. In this environment there are central private link DNS zones associated with the central hub, which in turn gives all spokes access to the private link DNS records. The records are maintained by an Azure Policy that automatically creates DNS records for certain types of private endpoints.
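As a concrete example, the records in the central zone can be listed like this. I'm assuming the zone used for AI Foundry / AML workspaces here is privatelink.api.azureml.ms, and the resource group name is a placeholder:

```shell
# List A records in the central private link DNS zone;
# a missing record for the workspace suggests the policy never ran, or failed
az network private-dns record-set a list \
  --zone-name privatelink.api.azureml.ms \
  --resource-group rg-central-dns \
  --output table
```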
After digging some more into the Azure Activity Log, I found several failed deployIfNotExists operations for the private endpoint resource. Following one of the activity log entries led me to the resource group deployment log. There I found a more detailed error message, which in turn pointed me to the solution! The error message looked something like this - I can’t find the exact error message now, as the resource group has been deleted:
{
  "code": "ReferencedResourceNotProvisioned",
  "message": "Cannot proceed with operation because resource x used by resource y is not in Succeeded state. Resource is in Updating state and the last operation that updated/is updating the resource is PutSubnetOperation."
}
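If you want to pull this kind of detail out yourself, the individual operations of a resource group deployment can be queried with the az CLI. The resource group name is a placeholder, and you'll need the deployment name from the resource group's deployment history:

```shell
# Show the status messages of failed operations in a specific deployment
az deployment operation group list \
  --resource-group rg-aifoundry \
  --name <deployment-name> \
  --query "[?properties.provisioningState=='Failed'].properties.statusMessage"
```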
It seemed like Terraform was missing something important here and was trying to create the private endpoint before the AI Foundry Hub was ready. This had not happened before, so I was flabbergasted. Is this an issue with AzApi? 🤔
The private endpoint module call should have an implicit dependency on the AI Foundry Hub because of how I have used the output from AI Foundry:
module "ai_foundry_hubs_pep" {
  source  = "module/private-endpoint-source"
  version = "module-version"

  resource_group_name            = azurerm_resource_group.main_rg.name
  location                       = local.location
  subresource_names              = ["amlworkspace"]
  private_endpoint_name          = local.private_endpoint_name
  private_connection_resource_id = azapi_resource.hub.id # This should create an implicit dependency
  subnet_id                      = azurerm_subnet.private_endpoint_subnet.id
  tags                           = local.tags
}
You’ll notice there is no explicit dependency there, because I assumed the property reference would be enough. Luckily we can use depends_on together with a time_sleep resource to add an explicit dependency and a wait period.
After I added the depends_on in the example below, this worked like a charm. Terraform now waits for the AI Foundry Hub to be ready before it tries to create the private endpoint.
resource "azapi_resource" "hub" {
  # ...resource deployment code...
}

# Wait for hub configuration
resource "time_sleep" "after_hub_creation" {
  create_duration = "1m"

  depends_on = [azapi_resource.hub]
}

module "ai_foundry_hubs_pep" {
  source  = "module/private-endpoint-source"
  version = "module-version"

  resource_group_name            = azurerm_resource_group.main_rg.name
  location                       = local.location
  subresource_names              = ["amlworkspace"]
  private_endpoint_name          = local.private_endpoint_name
  private_connection_resource_id = azapi_resource.hub.id # This should create an implicit dependency
  subnet_id                      = azurerm_subnet.private_endpoint_subnet.id
  tags                           = local.tags

  # Wait for hub configuration
  depends_on = [time_sleep.after_hub_creation]
}
Sometimes you need to help Terraform understand the dependencies in your code. In this case, the AI Foundry Hub was not ready when Terraform tried to create the private endpoint, leading to an internal server error.
Use the available tools to troubleshoot your deployment issues. In my case these resources were crucial:
- The Azure Activity Log, which surfaced the failed deployIfNotExists operations
- The resource group deployment log, which contained the detailed error message
With these tools I was able to eliminate some of the potential issues, and find a path to the correct solution.
Leave a comment if you have any questions or suggestions for improvements. 🙂 I hope this post finds someone who needs it, and saves them some time in their troubleshooting session!