Most current Agent frameworks have fundamental flaws in handling fault recovery. Agent workflows differ from traditional data pipelines: Agent steps call model sampling and may change the external world—invoking tools, accessing APIs, writing to external systems. These operations cannot be safely recomputed: if a worker node crashes during execution, it may lead to duplicate payments, duplicate emails, or data with silent errors.
Existing frameworks usually use simple retry mechanisms or do not handle this at all, leading to hidden state leaking into worker node memory, fragile coordination via message queues, and no persistent records of attempts or successes. When problems occur, developers can only debug by reading logs.
Kortecx-core takes the opposite stance: every Agent attempt is a persistent fact appended to the log. During recovery, committed steps are never re-run—instead, results are re-read.