Catacloud Refactoring and Event Sourcing Challenges

2025-08-03
3 minute read

Today I finished refactoring Catacloud to incorporate the changes from your epoch. Some additional modifications turned out to be necessary, so I made the interface of the PostgreSQL state store more generic. I've discussed this in previous journal entries; the goal is to avoid binding the aggregate to one specific table and ID. The PGState interface now requires getByID, persist, abstract, and delete methods. The underlying state also no longer has to be a PgRow; it can be anything, as long as its parts can be persisted in PostgreSQL, potentially across multiple tables. These methods now accept an executor, which will let us pass in a transaction once transactions are implemented. Hopefully, this means the state implementations won't need to change when transaction support arrives.
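A minimal sketch of what such a state store interface might look like, assuming TypeScript. The Executor shape, the OrganizationState type, and all table and column names are assumptions made up for illustration, and the abstract method is left out because its signature isn't described here.

```typescript
// Hypothetical executor: a connection pool today, a transaction handle later.
interface Executor {
  query(sql: string, params: unknown[]): Promise<Array<Record<string, unknown>>>;
}

// Sketch of the generic state store. `abstract` is omitted because the
// entry does not describe what it takes or returns.
interface PGState<S> {
  getByID(exec: Executor, id: string): Promise<S | null>;
  persist(exec: Executor, state: S): Promise<void>;
  delete(exec: Executor, id: string): Promise<void>;
}

// Illustrative state that is stored across two (made-up) tables.
interface OrganizationState {
  id: string;
  name: string;
  storageBytes: number;
}

const organizationState: PGState<OrganizationState> = {
  async getByID(exec, id) {
    const rows = await exec.query(
      "SELECT o.id, o.name, u.storage_bytes FROM organizations o " +
        "JOIN organization_usage u ON u.organization_id = o.id WHERE o.id = $1",
      [id],
    );
    const row = rows[0];
    if (row === undefined) return null;
    return {
      id: String(row["id"]),
      name: String(row["name"]),
      storageBytes: Number(row["storage_bytes"]),
    };
  },
  async persist(exec, state) {
    await exec.query(
      "INSERT INTO organizations (id, name) VALUES ($1, $2) " +
        "ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name",
      [state.id, state.name],
    );
    await exec.query(
      "INSERT INTO organization_usage (organization_id, storage_bytes) VALUES ($1, $2) " +
        "ON CONFLICT (organization_id) DO UPDATE SET storage_bytes = EXCLUDED.storage_bytes",
      [state.id, state.storageBytes],
    );
  },
  async delete(exec, id) {
    await exec.query("DELETE FROM organization_usage WHERE organization_id = $1", [id]);
    await exec.query("DELETE FROM organizations WHERE id = $1", [id]);
  },
};
```

Because every method receives the executor as an argument, the same implementation would work whether the caller hands in a pool or a transaction.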

Furthermore, in Catacloud, almost everything now uses event sourcing, except for users. I've added organizations and job configurations as aggregates. I also implemented a saga to compute storage usage.

Event Sourcing Challenges and Race Conditions

While testing, I encountered an interesting, or rather annoying, error in the logs. The saga that computes storage usage listens for events from the files aggregate, specifically "part uploaded" and "file deleted". Based on these events, it sends commands to the organization aggregate to update the organization's storage usage.
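To make the flow concrete, here is a small sketch of such a saga as a pure mapping from file events to organization commands, assuming TypeScript. The event and command shapes (PartUploaded, FileDeleted, AdjustStorageUsage, sizeBytes, deltaBytes) are hypothetical; only the two event names come from the description above.

```typescript
// Event and command shapes are assumptions; the entry only names the events.
type FileEvent =
  | { type: "PartUploaded"; organizationId: string; sizeBytes: number }
  | { type: "FileDeleted"; organizationId: string; sizeBytes: number };

interface AdjustStorageUsage {
  type: "AdjustStorageUsage";
  organizationId: string;
  deltaBytes: number;
}

// The saga keeps no state in this sketch: each file event maps to exactly
// one command addressed to the organization aggregate.
function storageUsageSaga(event: FileEvent): AdjustStorageUsage {
  switch (event.type) {
    case "PartUploaded":
      // An uploaded part increases the organization's usage.
      return {
        type: "AdjustStorageUsage",
        organizationId: event.organizationId,
        deltaBytes: event.sizeBytes,
      };
    case "FileDeleted":
      // A deleted file decreases it by the file's size.
      return {
        type: "AdjustStorageUsage",
        organizationId: event.organizationId,
        deltaBytes: -event.sizeBytes,
      };
  }
}
```

Every command produced this way targets the same organization aggregate, so several uploads in quick succession all contend for the same state.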

The issue is a race condition. When the handler sends a command, we read the state from the store, apply the events, and then attempt to persist the state. That persist fails with a "duplicate key" error, that is, a unique constraint violation. This indicates that two events with the same version are being written for the same aggregate at nearly the same time: two commands read the same state, and both compute the same next version. Because reading the state and persisting the state and events are not done as one atomic operation, nothing prevents this interleaving.
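The interleaving can be reproduced deterministically with an in-memory stand-in for the events table. This is only a sketch, assuming TypeScript and a unique key on (aggregate ID, version); InMemoryEventStore and the event type name are made up for illustration.

```typescript
interface StoredEvent {
  aggregateId: string;
  version: number;
  type: string;
}

// In-memory stand-in for an events table with a unique (aggregate_id, version) key.
class InMemoryEventStore {
  private readonly events: StoredEvent[] = [];

  // The current version is the number of events stored for the aggregate.
  currentVersion(aggregateId: string): number {
    return this.events.filter((e) => e.aggregateId === aggregateId).length;
  }

  // Rejects an event whose (aggregateId, version) pair already exists.
  append(event: StoredEvent): void {
    const clash = this.events.some(
      (e) => e.aggregateId === event.aggregateId && e.version === event.version,
    );
    if (clash) {
      throw new Error(`duplicate key: (${event.aggregateId}, ${event.version})`);
    }
    this.events.push(event);
  }
}

const store = new InMemoryEventStore();

// Two command handlers read the state before either one has persisted...
const versionSeenByA = store.currentVersion("org-1");
const versionSeenByB = store.currentVersion("org-1");

// ...so both compute the same next version. The first append succeeds.
store.append({ aggregateId: "org-1", version: versionSeenByA + 1, type: "StorageUsageAdjusted" });

// The second append collides with the first and its event is lost.
let conflict = "";
try {
  store.append({ aggregateId: "org-1", version: versionSeenByB + 1, type: "StorageUsageAdjusted" });
} catch (err) {
  conflict = err instanceof Error ? err.message : String(err);
}
console.log(conflict); // prints "duplicate key: (org-1, 1)"
```

The unique key is doing its job here: it turns a silent lost update into a visible error, at the cost of dropping the second command unless something retries it.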

The solution is to wrap everything from reading the organization's state to persisting the events in a single transaction. This will ensure the event versions are correct. While this is fixable, it's a bit annoying because I wasn't planning on implementing transactions yet, and it's a significant amount of work given my limited time. The issue has been happening quite consistently, and it's why the storage usage computation is incorrect: the projection that listens for the storage usage event isn't being triggered because the events aren't being persisted. This is a critical issue that needs to be addressed.
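A rough sketch of what the transactional version could look like, assuming TypeScript. The Executor interface, the withTransaction helper, and the table and column names are all hypothetical. One caveat worth noting: at PostgreSQL's default READ COMMITTED isolation level, a transaction alone does not stop two readers from seeing the same version, so this sketch also locks the organization row with SELECT ... FOR UPDATE. Retrying the command on a version conflict would be another option.

```typescript
type Row = Record<string, unknown>;

// Hypothetical executor: here it stands for one dedicated connection.
interface Executor {
  query(sql: string, params?: unknown[]): Promise<Row[]>;
}

// Runs `work` between BEGIN and COMMIT, rolling back if anything throws.
async function withTransaction<T>(
  conn: Executor,
  work: (tx: Executor) => Promise<T>,
): Promise<T> {
  await conn.query("BEGIN");
  try {
    const result = await work(conn);
    await conn.query("COMMIT");
    return result;
  } catch (err) {
    await conn.query("ROLLBACK");
    throw err;
  }
}

// Reads the organization's version under a row lock, then appends the event
// and updates the state in the same transaction. Table names are made up.
async function adjustStorageUsage(
  conn: Executor,
  organizationId: string,
  deltaBytes: number,
): Promise<number> {
  return withTransaction(conn, async (tx) => {
    // FOR UPDATE makes a concurrent command wait here until this one commits.
    const rows = await tx.query(
      "SELECT version FROM organizations WHERE id = $1 FOR UPDATE",
      [organizationId],
    );
    const row = rows[0];
    if (row === undefined) {
      throw new Error(`unknown organization ${organizationId}`);
    }
    const nextVersion = Number(row["version"]) + 1;
    await tx.query(
      "INSERT INTO events (aggregate_id, version, type, payload) VALUES ($1, $2, $3, $4)",
      [organizationId, nextVersion, "StorageUsageAdjusted", JSON.stringify({ deltaBytes })],
    );
    await tx.query(
      "UPDATE organizations SET storage_bytes = storage_bytes + $2, version = $3 WHERE id = $1",
      [organizationId, deltaBytes, nextVersion],
    );
    return nextVersion;
  });
}
```

The row lock only helps once the organization row exists; creating an aggregate for the first time would still rely on the unique key to catch a collision.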

Overall, once I fix this, we are moving in the right direction. However, I must admit this whole system is quite complex now. It's entirely event-driven: we're not calling things directly, but sending commands and listening to events everywhere. This makes the coupling a lot looser than a direct function call, which I believe is one of the main difficulties with event-driven systems. I'm currently using an in-memory event bus. With an external event bus, some of these issues would be more apparent because of the added latency. These are not trivial problems, but I'm committed to seeing this through.