Machine Life Cycle on Catacloud
Today's focus is on implementing the machine life cycle for Catacloud. This new system will monitor job states, provision new machines for pending jobs, and de-provision those machines once the jobs are complete.
New Entities: Machine Type and Machine Instance
To support this, we are introducing new entities:
- Machine Type: This entity will describe the specifications of a machine. It will serve as the mapping that the machine life cycle uses to identify and provision the correct machine. The initial focus is Hetzner Cloud, with the potential to support other cloud providers, local Docker instances, or even permanently running bare-metal machines in the future.
- Organization and Machine Description (intermediary table): This table will link organizations to the machine types they have access to.
- Machine Instance: The machine life cycle will create a new machine instance when a job is in a pending state. It will track the job's progress and then turn off the machine instance when the job finishes.
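A minimal sketch of how these three entities could be modeled, written in Python for illustration. Every field name, the status values, and the `provider_server_type` placeholder are assumptions, not the final Catacloud schema.

```python
from dataclasses import dataclass
from enum import Enum


class Provider(Enum):
    # Only the initial target from the text; other providers would be added later.
    HETZNER_CLOUD = "hetzner_cloud"


class InstanceStatus(Enum):
    # Hypothetical states for a machine instance.
    PROVISIONING = "provisioning"
    RUNNING = "running"
    TERMINATED = "terminated"


@dataclass(frozen=True)
class MachineType:
    """Describes the specifications of a machine and how to provision it."""
    id: str
    provider: Provider
    provider_server_type: str  # assumed provider-side identifier (placeholder)
    cpu_cores: int
    memory_gb: int
    has_gpu: bool = False


@dataclass(frozen=True)
class OrganizationMachineType:
    """Intermediary row: grants one organization access to one machine type."""
    organization_id: str
    machine_type_id: str


@dataclass
class MachineInstance:
    """One provisioned machine, created for a pending job."""
    id: str
    machine_type_id: str
    job_id: str
    status: InstanceStatus = InstanceStatus.PROVISIONING
```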
Job Scheduling and Machine Types
Jobs will be scheduled to specific machine types through their job configuration (previously referred to as pipeline configuration or job type). This means that jobs with a particular configuration will be defined to run on a specific machine type. This binding is crucial for scenarios where certain job types might require specific resources, such as GPUs. When defining a new job configuration, users will select from the machine types their organization has access to.
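The binding between a job configuration and a machine type could be enforced at creation time. The sketch below assumes the intermediary table has been loaded into a plain dictionary; `JobConfiguration`, `create_job_configuration`, and `MachineTypeNotAllowed` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


class MachineTypeNotAllowed(Exception):
    """Raised when an organization selects a machine type it has no access to."""


@dataclass(frozen=True)
class JobConfiguration:
    id: str
    organization_id: str
    name: str
    machine_type_id: str  # every job with this configuration runs on this type


def create_job_configuration(config_id, organization_id, name, machine_type_id, org_access):
    """Create a job configuration bound to a machine type.

    org_access maps an organization id to the set of machine type ids it may
    use (an in-memory stand-in for the intermediary table).
    """
    allowed = org_access.get(organization_id, set())
    if machine_type_id not in allowed:
        raise MachineTypeNotAllowed(
            f"organization {organization_id} cannot use machine type {machine_type_id}"
        )
    return JobConfiguration(config_id, organization_id, name, machine_type_id)
```

For example, an organization with access only to a small CPU machine type would be rejected when it tries to bind a configuration to a GPU type.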
Machine Life Cycle Logic and Resource Management
The machine life cycle will be bound to an organization's machine type configuration. This configuration will define the maximum amount of resources an organization can schedule, such as the maximum number of simultaneous jobs or the maximum pool size of machines. For example, if an organization wants to run at most two jobs simultaneously, the maximum pool size will be two machines. If the maximum pool size is one, jobs will run one at a time and the rest will wait in the queue.
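The pool limit can be expressed as a small calculation. This sketch assumes one job per machine, as the example above implies; the function name and parameters are hypothetical.

```python
def machines_to_provision(pending_jobs: int, active_machines: int, max_pool_size: int) -> int:
    """Return how many new machines may be created right now.

    Assumes one job per machine: each pending job needs its own machine,
    but the organization's pool may never exceed max_pool_size.
    """
    if min(pending_jobs, active_machines, max_pool_size) < 0:
        raise ValueError("counts must be non-negative")
    free_slots = max(max_pool_size - active_machines, 0)
    return min(pending_jobs, free_slots)
```

With five pending jobs and a maximum pool size of two, two machines are provisioned and three jobs stay queued. With a pool size of one, a single machine is provisioned and the remaining jobs wait their turn.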
I may need to implement multiple algorithms for the machine life cycle to decide job parallelism. The goal will be to optimize resource utilization, potentially by reducing costs (scheduling the fewest machines possible) or by prioritizing speed (creating more machines so that jobs complete sooner). This will involve considering factors like minimum machine run times. For example, a 30-minute job on a machine with a one-hour minimum run time leaves 30 minutes that are committed either way, so a second queued job could reuse that machine instead of starting a new one.
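To make the cost-versus-speed trade-off concrete, here is a rough sketch of how two such strategies could be compared. It assumes job durations are known in advance, that a machine is charged for at least its minimum run time, and that jobs are assigned greedily (longest first, to the least loaded machine). None of this is a decided design, and `plan` and `choose_machine_count` are hypothetical names.

```python
def plan(job_minutes, machine_count, min_run_minutes):
    """Spread jobs over machine_count machines.

    Returns (makespan, billed_minutes): the time until the last job finishes
    and the total charged time, where each used machine is charged for at
    least min_run_minutes.
    """
    loads = [0] * machine_count
    for minutes in sorted(job_minutes, reverse=True):
        least_loaded = loads.index(min(loads))
        loads[least_loaded] += minutes
    used = [load for load in loads if load > 0]
    makespan = max(used, default=0)
    billed = sum(max(load, min_run_minutes) for load in used)
    return makespan, billed


def choose_machine_count(job_minutes, max_pool_size, min_run_minutes, prefer):
    """Pick a machine count; prefer is "cost" or "speed"."""
    if not job_minutes or max_pool_size < 1:
        return 0

    def key(count):
        makespan, billed = plan(job_minutes, count, min_run_minutes)
        # Cost-first minimizes billed time, speed-first minimizes makespan;
        # the other value only breaks ties.
        return (billed, makespan) if prefer == "cost" else (makespan, billed)

    candidates = range(1, min(max_pool_size, len(job_minutes)) + 1)
    return min(candidates, key=key)
```

For two 30-minute jobs with a one-hour minimum run time and a pool size of two, the cost-first choice is one machine (60 minutes elapsed, 60 minutes charged), while the speed-first choice is two machines (30 minutes elapsed, 120 minutes charged).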
Architectural Considerations
The machine life cycle will be implemented as a saga. The machine type and machine instance will be aggregates.
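One way to picture the saga is as a component that consumes events and emits commands for the aggregates to execute. The sketch below is only an illustration under that assumption: all event and command names are hypothetical, and persistence, the pool limit, and failure handling are left out.

```python
from dataclasses import dataclass, field


# Events the saga reacts to (hypothetical names).
@dataclass(frozen=True)
class JobPending:
    job_id: str
    machine_type_id: str


@dataclass(frozen=True)
class MachineProvisioned:
    job_id: str
    machine_instance_id: str


@dataclass(frozen=True)
class JobFinished:
    job_id: str


# Commands the saga emits (hypothetical names).
@dataclass(frozen=True)
class ProvisionMachine:
    job_id: str
    machine_type_id: str


@dataclass(frozen=True)
class StartJob:
    job_id: str
    machine_instance_id: str


@dataclass(frozen=True)
class DeprovisionMachine:
    machine_instance_id: str


@dataclass
class MachineLifeCycleSaga:
    # Maps job id -> machine instance id for jobs that have a machine.
    instance_by_job: dict = field(default_factory=dict)

    def handle(self, event):
        """Return the list of commands that follow from one event."""
        if isinstance(event, JobPending):
            return [ProvisionMachine(event.job_id, event.machine_type_id)]
        if isinstance(event, MachineProvisioned):
            self.instance_by_job[event.job_id] = event.machine_instance_id
            return [StartJob(event.job_id, event.machine_instance_id)]
        if isinstance(event, JobFinished):
            instance_id = self.instance_by_job.pop(event.job_id, None)
            return [DeprovisionMachine(instance_id)] if instance_id else []
        return []
```

Keeping the saga free of provider calls means the provisioning and de-provisioning logic can stay with the aggregates, while the saga only decides what should happen next.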