From upscaling to image and video generation, generative AI is exploding. According to Gartner, generative AI will account for 10% of all data produced by 2025. Cutting-edge models like DALL-E 2 and Imagen have made incredible advances in recent years, but delivering a commercial generative AI service takes a lot more than implementing the latest model.
One of the most challenging aspects of providing generative AI services is managing and supporting the custom-trained models for each customer. Generation tasks like "generate a blue bird" might be serviceable by a generalized, common model, but applications like "write an email about patient file 1234 like Dr. Smith" or "create a marketing video starring executive A using script Y" require specifically trained models. Depending on the need, a single customer could have hundreds of custom models and need inference run on any one of them at any time. Keeping this many models organized is difficult enough, but on top of that, the inference system needs to coordinate loading and unloading the models on the GPU server whenever the customer requests inference.
trainML Models make it easy to store trained models and use them for new inference tasks. Each customer model can be stored independently and uniquely in the same trainML account. Running an inference task for the customer is as simple as selecting the associated model ID when creating the inference task. The trainML job system automatically loads the trained model on whichever system is available to process the inference task. trainML Projects make organizing models even easier. By using a separate project for each customer, all models and resources can be organized and tracked together for security and billing purposes.
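The bookkeeping described above, one project per customer with each model addressed by its ID, can be sketched on the service provider's side as a small lookup table. This is only an illustration under stated assumptions: ModelRegistry, the project names, and the model IDs are hypothetical and are not part of the trainML SDK.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    """Hypothetical in-memory index: project (customer) -> model name -> model ID."""

    _models: dict = field(default_factory=dict)

    def register(self, project: str, name: str, model_id: str) -> None:
        # Each project keeps its own namespace, so two customers can
        # reuse the same model name without colliding.
        self._models.setdefault(project, {})[name] = model_id

    def lookup(self, project: str, name: str) -> str:
        # Resolve the model ID to attach to a new inference task.
        try:
            return self._models[project][name]
        except KeyError:
            raise KeyError(f"no model {name!r} in project {project!r}") from None


# Made-up customers and IDs, for illustration only.
registry = ModelRegistry()
registry.register("customer-a", "email-writer", "model-001")
registry.register("customer-b", "email-writer", "model-002")
```

When a request arrives, the service resolves the customer's project and model name to an ID and passes that ID along when creating the inference task.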
Generative inference tasks are bursty in nature. Provisioning GPU systems for peak load could bankrupt an early-stage company. On the other hand, keeping too few systems available can lead to congestion, negatively impacting the user experience. Scaling systems up and down based on demand requires a great deal of engineering effort, and isn't particularly fast.
trainML Inference Jobs eliminate the need for server provisioning. When a new customer inference task comes in, all that's needed is to tell trainML which model to use, what data to provide it, and where to send the results. The trainML job engine will automatically reserve a GPU, load the model code and dependencies, execute the task, upload the results, and terminate the job. Whether it's 1 or 100 concurrent tasks, they execute without any additional tooling or configuration.
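To make the shape of such a request concrete, here is a minimal simulation of the three inputs (model, input data, output destination) and the lifecycle steps listed above. It does not call trainML; InferenceTask, run_task, submit_all, and the URIs are hypothetical names chosen for this sketch, and the thread pool merely stands in for the job engine.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceTask:
    """The three things the text says a task needs (field names are assumptions)."""

    model_id: str    # which trained model to load
    input_uri: str   # where the input data lives
    output_uri: str  # where the results should be sent


# Lifecycle steps in the order the text describes them.
LIFECYCLE = ("reserve_gpu", "load_model", "execute", "upload_results", "terminate")


def run_task(task: InferenceTask) -> list:
    # Simulated lifecycle: a real engine would do the work at each step;
    # here we only record the order of steps for the given model.
    return [f"{step}:{task.model_id}" for step in LIFECYCLE]


def submit_all(tasks, max_workers: int = 8) -> list:
    # Submit every task and collect results in submission order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, tasks))


tasks = [
    InferenceTask(f"model-{i}", f"s3://in/{i}", f"s3://out/{i}")
    for i in range(100)
]
results = submit_all(tasks)
```

The caller's code is the same for 1 task or 100; only the length of the task list changes.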
With model training and inference as a direct cost of sales, the high cost of GPU time on major cloud providers can significantly reduce gross margins. Once volume picks up, the best financial decision for the service provider is to run the workloads on their own equipment. Typically, however, the IT expertise and overhead associated with installing, patching, securing, and monitoring servers is prohibitive, and bursty inference demand cannot be limited to a fixed number of owned GPUs.
CloudBender™ provides a zero-IT, on-prem infrastructure solution. There are no clusters to configure or systems to support. As demand rises, the service provider can add more GPU-enabled systems. CloudBender will automatically prioritize running workloads on owned systems. When demand bursts above on-prem capacity, it seamlessly runs the workloads in the cloud or on other trainML systems. Since all jobs use the same trainML SDK/CLI, the inference pipelines and scripts require no changes to utilize the new infrastructure.
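The placement rule described above (owned systems first, overflow to the cloud) can be illustrated with a few lines of arithmetic. This is a simplified sketch of the idea, not CloudBender's actual scheduler; place_jobs and its parameters are hypothetical, and it assumes each job needs exactly one GPU.

```python
def place_jobs(num_jobs: int, on_prem_free_gpus: int) -> dict:
    """Split pending jobs between owned GPUs and cloud burst capacity.

    Assumes one GPU per job. Owned capacity is filled first; anything
    beyond it overflows to the cloud.
    """
    if num_jobs < 0:
        raise ValueError("num_jobs must be non-negative")
    # Treat a negative free count as zero available GPUs.
    on_prem = min(num_jobs, max(on_prem_free_gpus, 0))
    return {"on_prem": on_prem, "cloud": num_jobs - on_prem}


# Quiet period: everything fits on owned hardware.
quiet = place_jobs(num_jobs=3, on_prem_free_gpus=8)
# Burst: 8 jobs run on owned GPUs, the remaining 4 go to the cloud.
burst = place_jobs(num_jobs=12, on_prem_free_gpus=8)
```

Adding owned systems raises on_prem_free_gpus, which shrinks the cloud share without any change to how jobs are submitted.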