Your laptop’s operating system runs tens or hundreds of processes simultaneously. It gives each process the resources it needs (RAM, CPU, I/O), isolates it in its own virtual address space, locks it to a set of predefined permissions, lets processes communicate with one another, and allows you, as a user, to securely monitor and control them. The operating system abstracts the hardware layer (writing to a flash drive looks the same as writing to a hard disk), and it does not matter which programming language or technology stack was used to write these apps – it simply runs them, smoothly and reliably.

As machine learning pervades the enterprise, companies will soon be producing more and more models at a faster clip. Deploying, scaling, monitoring, and auditing these models becomes increasingly difficult and expensive over time. Data scientists in different divisions each have their own preferred technology stacks (R, Python, Julia, TensorFlow, Caffe, deeplearning4j, H2O.ai, etc.), and data center strategies are shifting from cloud-only to hybrid. Executing, scaling, and monitoring heterogeneous models independently of the cloud provider is a task comparable to that of an operating system. That is what we want to talk about.

At Gencliksevdam, we run more than 8,000 algorithms (each with multiple versions, adding up to more than 40,000 unique REST API endpoints). Each API endpoint may be called anywhere from once a day to bursts of more than 1,000 times per second. These algorithms are written in one of the 14 languages we support today, can be CPU- or GPU-based, run in any cloud, can read and write to any data source (S3, Dropbox, etc.), and respond with a latency of ~15 ms on standard hardware.
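
To make the endpoint model concrete, here is a minimal sketch of what calling one of these versioned REST endpoints might look like from Python. The URL, algorithm path, payload fields, and API key are illustrative placeholders, not our actual API.

```python
# Hypothetical call to a versioned algorithm endpoint; URL, path, and key are placeholders.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder credential
ENDPOINT = "https://api.example.com/v1/algo/demo_user/image_classifier/1.0"

payload = {"image_url": "https://example.com/cat.jpg"}
response = requests.post(
    ENDPOINT,
    json=payload,
    headers={"Authorization": f"Simple {API_KEY}"},
    timeout=10,
)
response.raise_for_status()
print(response.json())  # e.g. {"label": "cat", "confidence": 0.98}
```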

Training vs. inference
Machine learning and deep learning consist of two different phases: training and inference. The former is about building the model, and the latter is about running it in production.

[Figure: Training vs. inference]

Training a model is an iterative process that depends heavily on the framework. Some machine learning engineers use TensorFlow on GPUs, others use Scikit-Learn on CPUs, and every training environment is a snowflake. This is analogous to building an app, where a developer has a carefully crafted toolchain and set of libraries with which they constantly compile and test their code. Training requires long compute cycles (hours to days), is usually a fixed-load input (that is, you do not have to scale the number of machines in response to a trigger), and is ideally a stateful process, because the data scientist must save training progress repeatedly, e.g., at checkpoints for neural networks.
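
To illustrate the stateful, checkpointed nature of training, here is a minimal sketch using Keras; the dataset, architecture, and checkpoint path are illustrative choices, not a prescription.

```python
# Minimal stateful training sketch: periodic checkpoints let a long
# (hours-to-days) run be resumed. Dataset and model are placeholders.
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Save weights at the end of every epoch -- the "checkpoints" that make
# training a resumable, stateful process.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "checkpoints/epoch-{epoch:02d}.weights.h5", save_weights_only=True
)
model.fit(x_train, y_train, epochs=5, batch_size=64, callbacks=[checkpoint])
```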

Inference, on the other hand, is about serving that model to many users at scale. When many models run simultaneously, each written in a different framework and language, the task of running them becomes that of an operating system. The operating system is responsible for scheduling jobs, allocating and releasing resources, and monitoring those jobs. A “job” here is an inference transaction and, in contrast to training, requires only a short burst of compute (similar to a SQL query), elastic load (machines must be added or removed in proportion to inference demand), and is stateless, in that the result of one transaction does not affect the next.
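
As a contrast to the training sketch above, here is a minimal sketch of the stateless inference pattern: the model is loaded once, and each request is an independent, short-lived transaction. The web framework (Flask), model file, and feature format are illustrative assumptions.

```python
# Minimal stateless inference service: each request is an independent
# transaction whose result does not depend on any previous request.
from flask import Flask, jsonify, request
import joblib

app = Flask(__name__)
model = joblib.load("model.pkl")  # placeholder: any pre-trained scikit-learn model

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]        # e.g. [5.1, 3.5, 1.4, 0.2]
    prediction = model.predict([features]).tolist()  # short compute burst
    return jsonify({"prediction": prediction})       # no state carried over

if __name__ == "__main__":
    app.run(port=8080)
```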

We will focus on the inference side of the equation.

Serverless FTW
We will use serverless computing for our AI operating system. So let’s take a moment to explain why serverless architecture makes sense for artificial intelligence.

As we explained in the previous section, machine learning inference requires only a short burst of compute. This means that a server exposing a model as a REST API spends most of its time idle. When it receives a request, for example to classify an image, CPU/GPU utilization spikes for a short period, the result is returned, and the server goes back to being idle. This process is similar to a database server that sits idle until a SQL query arrives.
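
That request/response pattern maps naturally onto a serverless function. Below is a minimal sketch in the style of a typical function-as-a-service handler; the handler signature follows the common AWS Lambda convention, and the model-loading and classification logic are placeholders.

```python
# Sketch of a serverless inference handler: compute is consumed only while a
# request is in flight, much like a database that sits idle between queries.
import json

MODEL = None  # loaded once per container ("cold start"), reused across invocations


def load_model():
    # Placeholder for deserializing a trained model from object storage.
    return lambda image_bytes: {"label": "cat", "confidence": 0.98}


def handler(event, context):
    global MODEL
    if MODEL is None:
        MODEL = load_model()
    image_bytes = event.get("image", "")
    result = MODEL(image_bytes)  # short compute burst, then back to idle
    return {"statusCode": 200, "body": json.dumps(result)}
```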