Hey community!
A Step Functions workflow that usually takes a few minutes to finish has been stuck in RUNNING for hours.
All of the Lambda functions inside it ran without any problems, and the logs show that there were no retries, errors, or timeouts. But the state machine never gets to the last state.
Why it might be happening:
-
A Task Token or Lambda callback never came back
-
The Wait state or timeout is set up wrong
-
The “Next” transition in the ASL definition is wrong
-
Step Functions waiting on an external service that never responded
-
Possible delays in service-level propagation after a deployment
Has anyone ever had Step Functionsexecutions that “hang” for no clear reason?
What troubleshooting approach should I take to identify the stuck state or missing callback?