Hey community!
A Step Functions workflow that usually takes a few minutes to finish has been stuck in RUNNING for hours.
All of the Lambda functions inside it ran without any problems, and the logs show that there were no retries, errors, or timeouts. But the state machine never gets to the last state.
Why it might be happening:
- A Task Token or Lambda callback never came back
- The Wait state or timeout is set up wrong
- The “Next” transition in the ASL definition is wrong
- Step Functions waiting on an external service that never responded
- Possible delays in service-level propagation after a deployment
Has anyone ever had Step Functions executions that “hang” for no clear reason?
What troubleshooting approach should I take to identify the stuck state or missing callback?
Hi @Carol_01
This doesn’t seem Bitrise related. Are you sure you wanted to post it on Bitrise’s forum?
Hey! Thank you for your answer! Yes, we know this isn’t strictly about Bitrise, and we know that the forum is usually about CI/CD topics.
We posted here on purpose because we wanted to reach a wider group of engineers who might have used AWS Step Functions in real life.
Thanks for your time. Any tips on how to fix stuck Step Functions executions would be very helpful!
this almost always means AWS Step Functions is waiting on one state that never finished.
check the execution event history, not the graph. the last emitted event shows where it’s stuck.
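A minimal sketch of that check in Python, assuming boto3 and permission to call `states:GetExecutionHistory`. The helper name `last_events` is made up for illustration. It asks for the history in reverse order so the newest events come back first.

```python
def last_events(sfn_client, execution_arn, limit=5):
    """Return the newest `limit` history events of an execution, newest first.

    `sfn_client` is expected to be a boto3 Step Functions client,
    e.g. boto3.client("stepfunctions"); it is passed in so the helper
    can also be exercised with a stub.
    """
    response = sfn_client.get_execution_history(
        executionArn=execution_arn,
        maxResults=limit,
        reverseOrder=True,  # newest events first
    )
    # Keep only the fields that identify where the execution stopped.
    return [
        {"id": e["id"], "type": e["type"], "timestamp": e["timestamp"]}
        for e in response["events"]
    ]
```

Calling `last_events(boto3.client("stepfunctions"), execution_arn)` with the ARN of the stuck execution should show the state it is sitting in. For example, a `TaskSubmitted` or `TaskStarted` event with nothing after it points at a task that is still waiting, and a `WaitStateEntered` without a matching `WaitStateExited` points at a Wait state.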
most common causes i’ve seen:
- task token / callback never sent (SendTaskSuccess never called)
- Wait state using bad or null timestamp
- ResultPath / OutputPath wiping input for the next state
- service integration completed externally but never signaled back
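For the missing-callback case, here is a rough sketch of a worker wrapper that reports back on every code path, assuming boto3 and the callback (`.waitForTaskToken`) pattern. The name `run_and_report` and the `work` callable are hypothetical; only `send_task_success` and `send_task_failure` are real client calls.

```python
import json


def run_and_report(sfn_client, task_token, work):
    """Run `work()` and always send a callback for `task_token`.

    Returns True if success was reported, False if failure was reported.
    `sfn_client` is expected to be boto3.client("stepfunctions").
    """
    try:
        # Serialising inside the try block means a non-JSON result
        # is also reported as a failure instead of being lost.
        output = json.dumps(work())
    except Exception as exc:
        sfn_client.send_task_failure(
            taskToken=task_token,
            error=type(exc).__name__,
            cause=str(exc)[:1000],  # keep the message short
        )
        return False
    sfn_client.send_task_success(taskToken=task_token, output=output)
    return True
```

The point of the wrapper is that an exception in the worker still ends the task (as a failure) rather than leaving the state machine waiting on a token nobody will ever return.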
step functions don’t hang randomly — they’re waiting forever on something specific. event history will point straight to it.
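To stop a lost callback from waiting that long in the first place, one option is to scan the state machine definition for callback tasks with no timeout. The sketch below is Python over the ASL definition loaded as a dict (for example with `json.load`). The function name is made up, and it only looks at top-level states, Parallel branches, and Map `ItemProcessor` / `Iterator` blocks.

```python
def callback_tasks_without_timeout(definition):
    """Return names of callback Task states that set no timeout or heartbeat."""
    guards = (
        "TimeoutSeconds",
        "TimeoutSecondsPath",
        "HeartbeatSeconds",
        "HeartbeatSecondsPath",
    )
    missing = []

    def walk(states):
        for name, state in states.items():
            resource = state.get("Resource", "")
            is_callback = (
                state.get("Type") == "Task"
                and isinstance(resource, str)
                and resource.endswith(".waitForTaskToken")
            )
            if is_callback and not any(key in state for key in guards):
                missing.append(name)
            # Parallel states nest sub-workflows under "Branches".
            for branch in state.get("Branches", []):
                walk(branch.get("States", {}))
            # Map states nest theirs under "ItemProcessor" (or legacy "Iterator").
            for key in ("ItemProcessor", "Iterator"):
                if key in state:
                    walk(state[key].get("States", {}))

    walk(definition.get("States", {}))
    return missing
```

Any state name this returns is a task that will sit in RUNNING until a callback arrives or the execution itself is stopped, so adding `TimeoutSeconds` or `HeartbeatSeconds` there turns a silent hang into a visible `States.Timeout` error.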