jbmurphy.com

A subscription tripwire: detecting and auto-locking suspicious Azure changes

The goal was a subscription-wide tripwire: watch every control-plane change in an Azure subscription, decide whether it looks like someone loosening security, and automatically contain the blast radius – with the detector itself holding no keys and exposing no public endpoint.

Contain here means a ReadOnly lock on the affected resource group. The change has already happened, so the value lies in freezing that group quickly (about forty seconds end to end in the test described under Result) so nothing else can move, then alerting. It is detect-and-contain, not prevention; preventing a change outright is what Azure Policy deny is for.

What it does

The path a change takes, top to bottom: capture, private delivery, classify, contain.

Every resource write, delete, or action in the subscription is captured by an Event Grid system topic and handed to a classifier function. The function decides three ways:

Changes it caused itself, and lock writes, are skipped – otherwise applying a lock would re-trigger the pipeline on itself.
A short list of unambiguous operations goes straight to CRITICAL with no model call: writing an NSG rule, granting a role assignment, deleting a lock, changing a Key Vault access policy, writing a storage account. These are the classic quietly-loosen-security moves.
Everything else is enriched with caller, time of day, and resource hierarchy, then rated NORMAL, SUSPICIOUS, or CRITICAL by Azure OpenAI.
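The enrichment step can be pictured roughly as follows. This is a minimal sketch, not the function's actual code: it assumes the event is already a dict in the Event Grid schema with `eventTime`, `data.operationName`, `data.resourceUri`, and `data.claims`, that the timestamp is UTC, and that the caller appears under a UPN claim or an `appid` claim. The names `enrich`, `parse_resource_id`, and `UPN_CLAIM` are made up for illustration.

```python
from datetime import datetime

# Assumed claim key for a user's UPN; service principals are assumed to carry "appid".
UPN_CLAIM = "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/upn"


def parse_resource_id(resource_id: str) -> dict:
    """Pull subscription, resource group, and provider out of an ARM resource ID."""
    parts = [p for p in resource_id.split("/") if p]
    lowered = [p.lower() for p in parts]
    found = {}
    for key, marker in (
        ("subscription", "subscriptions"),
        ("resource_group", "resourcegroups"),
        ("provider", "providers"),
    ):
        if marker in lowered:
            idx = lowered.index(marker)
            if idx + 1 < len(parts):
                found[key] = parts[idx + 1]
    return found


def enrich(event: dict) -> dict:
    """Collect the context handed to the model: caller, hour of day, hierarchy."""
    data = event.get("data", {})
    claims = data.get("claims", {})
    # Keep only YYYY-MM-DDTHH:MM:SS; fractional seconds are not needed for hour of day.
    when = datetime.strptime(event["eventTime"][:19], "%Y-%m-%dT%H:%M:%S")
    return {
        "operation": data.get("operationName"),
        "caller": claims.get(UPN_CLAIM) or claims.get("appid") or "unknown",
        "hour_utc": when.hour,
        **parse_resource_id(data.get("resourceUri", "")),
    }
```

The resulting dict is what a prompt for the model would be built from; how the prompt itself is worded is not covered here.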

A genuine CRITICAL gets a ReadOnly lock on the resource group, stamped with the reason and a 60-minute expiry. A model or infrastructure failure alerts a human instead of locking – auto-locking during an OpenAI outage would freeze every group touched during the outage, a self-inflicted denial of service. A companion timer function removes the locks once they expire, because Azure locks do not expire on their own.
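One way to make the expiry work is to stamp it into the lock's notes and have the timer function read it back. The sketch below assumes a notes layout of `tripwire; expires=<UTC timestamp>; <reason>`; that layout, the `tripwire` marker, and the helper names are assumptions for illustration, not the format the real locks use.

```python
from datetime import datetime, timedelta, timezone

LOCK_TTL = timedelta(minutes=60)   # the 60-minute expiry described above
NOTE_PREFIX = "tripwire"           # hypothetical marker for locks this tool owns
STAMP = "%Y-%m-%dT%H:%M:%SZ"


def build_notes(reason: str, now: datetime) -> str:
    """Notes for a new lock; `now` must be timezone-aware."""
    expires = (now.astimezone(timezone.utc) + LOCK_TTL).strftime(STAMP)
    return f"{NOTE_PREFIX}; expires={expires}; {reason}"


def is_expired(notes, now: datetime) -> bool:
    """True only for locks carrying our marker whose expiry has passed."""
    fields = [f.strip() for f in (notes or "").split(";")]
    if fields[0] != NOTE_PREFIX:
        return False                      # someone else's lock: never remove it
    for field in fields[1:]:
        if field.startswith("expires="):
            try:
                expiry = datetime.strptime(field[len("expires="):], STAMP)
            except ValueError:
                return False              # unreadable stamp: leave it for a human
            return now >= expiry.replace(tzinfo=timezone.utc)
    return False
```

The timer function would list the locks in the subscription, call `is_expired` on each lock's notes, and delete only the ones that return True, so locks placed by people are left alone.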

HARD_DENY = {
    "Microsoft.Authorization/roleAssignments/write",
    "Microsoft.Authorization/locks/delete",
    "Microsoft.Network/networkSecurityGroups/securityRules/write",
    "Microsoft.KeyVault/vaults/accessPolicies/write",
    "Microsoft.Storage/storageAccounts/write",
}

def decide(event):
    if is_self_or_lock_write(event):
        return None                     # never trigger on our own actions
    if event.operation in HARD_DENY:
        return "CRITICAL"               # deterministic, skip the model
    return classify_with_openai(event)  # grey area, ask the model
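The pseudocode above leaves the self-filter and the failure path implicit. A runnable sketch of both follows, under some assumptions: the detector can compare the caller's identity with its own, operation names are matched case-insensitively, the model call is passed in as a plain function, and a fourth outcome named ALERT (a label invented here) stands for "tell a human, do not lock".

```python
from typing import Callable, Optional

HARD_DENY = {
    op.lower()
    for op in (
        "Microsoft.Authorization/roleAssignments/write",
        "Microsoft.Authorization/locks/delete",
        "Microsoft.Network/networkSecurityGroups/securityRules/write",
        "Microsoft.KeyVault/vaults/accessPolicies/write",
        "Microsoft.Storage/storageAccounts/write",
    )
}
LOCK_WRITE = "microsoft.authorization/locks/write"
VERDICTS = {"NORMAL", "SUSPICIOUS", "CRITICAL"}


def is_self_or_lock_write(operation: str, caller_id: str, own_id: str) -> bool:
    """Skip the detector's own actions and any lock write, to avoid a feedback loop."""
    return caller_id == own_id or operation.lower() == LOCK_WRITE


def decide(
    operation: str,
    caller_id: str,
    own_id: str,
    classify: Callable[[str], str],
) -> Optional[str]:
    if is_self_or_lock_write(operation, caller_id, own_id):
        return None
    if operation.lower() in HARD_DENY:
        return "CRITICAL"                 # deterministic, no model call
    try:
        verdict = classify(operation)
    except Exception:
        return "ALERT"                    # model or infrastructure failure: do not lock
    # An unexpected answer is treated like a failure rather than trusted.
    return verdict if verdict in VERDICTS else "ALERT"
```

Only a CRITICAL return leads to a lock; ALERT goes to a person, which is what keeps an outage on the model side from freezing every resource group that happens to change while it lasts.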

Keeping the detector locked down

Trust boundaries, top to bottom: keyless, no public inbound, storage reachable only over the vnet.

The point of the exercise was to run all of this with no soft spot an attacker could use to blind it:

No keys. The function authenticates to Azure OpenAI, to the queue, and to the locks API with a managed identity. Event Grid delivers with its own managed identity. There are no connection strings or API keys anywhere.
No public inbound. The function has public network access disabled. Event Grid cannot call a private webhook, so instead of a webhook the event subscription delivers events to a Storage Queue, and the function drains that queue over its virtual-network integration.
Firewalled storage. The storage account defaults to Deny and trusts only the function subnet and the deploy runner subnet. Event Grid still writes to the queue because it is a trusted service using its identity; the function reads it back over the vnet.

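# The event subscription on the system topic delivers to a queue, not a webhook,
# and Event Grid authenticates with its own managed identity.
# The subscription name and the topic and resource-group references below are
# placeholders; substitute the ones from your own configuration.
resource "azurerm_eventgrid_system_topic_event_subscription" "this" {
  name                = "tripwire-events"
  system_topic        = azurerm_eventgrid_system_topic.this.name
  resource_group_name = azurerm_resource_group.this.name

  storage_queue_endpoint {
    storage_account_id = azurerm_storage_account.this.id
    queue_name         = "events"
  }
  delivery_identity { type = "SystemAssigned" }
}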
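On the consuming side, draining the queue without keys might look like the sketch below. It assumes each message body is base64-encoded JSON, and it builds the client with `QueueClient` from `azure-storage-queue` and `DefaultAzureCredential` from `azure-identity`; the helper names and the way the storage account name is passed in are illustrative, and a queue-triggered function with an identity-based connection would be an equally valid way to do this.

```python
import base64
import json
from typing import Callable


def decode_event(content: str) -> dict:
    """Decode one queue message body (assumed base64-encoded JSON) into a dict."""
    return json.loads(base64.b64decode(content).decode("utf-8"))


def drain(queue, handle: Callable[[dict], None]) -> int:
    """Handle every waiting message, deleting each only after its handler succeeds."""
    handled = 0
    for message in queue.receive_messages():
        handle(decode_event(message.content))
        queue.delete_message(message)     # a failed handler leaves the message to retry
        handled += 1
    return handled


def make_queue(account_name: str, queue_name: str = "events"):
    """Build a keyless client; the managed identity supplies the token."""
    # Imported here so the decoding logic above works without the Azure SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.storage.queue import QueueClient

    return QueueClient(
        account_url=f"https://{account_name}.queue.core.windows.net",
        queue_name=queue_name,
        credential=DefaultAzureCredential(),
    )
```

Usage would be along the lines of `drain(make_queue("mystorageaccount"), process_event)`, where `process_event` is whatever runs the classification; both names are placeholders.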

Result

Writing an NSG rule from a normal user account trips it end to end in about forty seconds: the change is captured, delivered to the private queue, drained over the vnet, matched against the hard-deny list, and the resource group comes back with a ReadOnly lock whose note reads hard_deny: Microsoft.Network/networkSecurityGroups/securityRules/write. A benign change, a resource-group tag for example, flows through the same path, is classified as normal, and nothing is locked. False positives thaw themselves after an hour; a real one leaves responders a frozen resource group to investigate instead of a moving target.

Posted July 3, 2026 by jbmurphy, in Uncategorized
