Productizing Machine Learning

The biggest challenge in machine learning is no longer the modelling itself, which has been fairly well explored. Putting models into production remains a challenge across all firms.

Scientists dream about doing great things, engineers do them

James A. Michener

While a bit extreme, this is indeed the case with putting data science models or plans into production. Machine learning is needed when a firm has lots of data and a proven business model or equation to optimise for. This is especially a challenge for early-stage firms, which are better off sticking with basic startup analytics.

For mature organizations and growth-stage startups, productizing machine learning throws up a gamut of execution risks, including hidden costs and plausible engineering constraints that can make complex models prohibitive.

Machine Learning Engineering 101

The basic requirement for getting started on the production side is a specific problem statement along with the model ready:

  • We know the features to use.
  • We know the model to use.
  • We know the metric used to measure the model.
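The three prerequisites above can be made concrete with a minimal sketch. The feature names, the trivial mean-predictor baseline, and the choice of RMSE are all illustrative assumptions, not part of any specific project:

```python
# Minimal sketch of the three prerequisites: features, model, metric.
# All names here (feature list, baseline model, RMSE) are illustrative.

from math import sqrt

# 1. The features we know we will use
FEATURES = ["sessions_last_30d", "avg_order_value", "days_since_signup"]

# 2. The model: a deliberately trivial baseline that predicts the training mean
def fit_mean_model(y_train):
    return sum(y_train) / len(y_train)

def predict(model, n_rows):
    return [model] * n_rows

# 3. The metric: root mean squared error
def rmse(y_true, y_pred):
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_train = [10.0, 12.0, 14.0]
model = fit_mean_model(y_train)   # mean = 12.0
preds = predict(model, len(y_train))
print(rmse(y_train, preds))
```

Once a real model replaces the baseline, the same three artefacts (feature list, fitted model, metric) are what the production pipeline consumes.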

The simplest model in production uses an offline training approach, where the model is retrained at a specific interval. This could be a day, a week, or longer. The interval depends on the business need, which in turn depends on how quickly new data arrives and affects model dynamics.
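The offline approach above can be sketched as a batch job that is re-run once per training interval. The toy slope-fitting "model", the data loader, and the pickle path are all stand-ins for illustration, assuming the real job would query the modelling table instead:

```python
# Sketch of an offline (batch) training job, run once per training interval
# (e.g. nightly via a scheduler). The toy "model" and paths are illustrative.

import os
import pickle
import tempfile
from datetime import datetime, timezone

def load_training_data():
    # In practice this would query the modelling table; here, toy (x, y) pairs.
    return [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def train(rows):
    # Toy model: least-squares slope of y = a * x through the origin.
    num = sum(x * y for x, y in rows)
    den = sum(x * x for x, _ in rows)
    return {"slope": num / den,
            "trained_at": datetime.now(timezone.utc).isoformat()}

def run_job(model_path):
    model = train(load_training_data())
    with open(model_path, "wb") as f:
        pickle.dump(model, f)  # persist so the serving side can load it
    return model

model = run_job(os.path.join(tempfile.gettempdir(), "model.pkl"))
print(model["slope"])  # 2.0 for the toy data
```

The persisted artefact is then loaded by whatever serves predictions, decoupling training frequency from serving.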

  • Databases with data pipelines and scheduled updates.
  • Ensure indexation and keys (primary and foreign).
  • A specific modelling table with the necessary features. This allows us to pull the data into RAM (a Pandas DataFrame) and start modelling.
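A minimal sketch of the list above, using the stdlib sqlite3 module as a stand-in for the real database; the table and column names are illustrative. With pandas installed, the final pull would typically be `pd.read_sql(...)` into a DataFrame:

```python
# Sketch: a modelling table with a primary key and an index, then a full
# pull into RAM. sqlite3 stands in for the real database here.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE modelling_features (
        user_id INTEGER PRIMARY KEY,   -- primary key for indexation
        sessions_30d INTEGER,
        avg_order_value REAL,
        churned INTEGER
    )
""")
conn.execute("CREATE INDEX idx_churned ON modelling_features (churned)")
conn.executemany(
    "INSERT INTO modelling_features VALUES (?, ?, ?, ?)",
    [(1, 12, 30.5, 0), (2, 1, 9.9, 1)],
)

# Pull the whole modelling table into RAM for the modelling step.
# With pandas: df = pd.read_sql("SELECT * FROM modelling_features", conn)
rows = conn.execute("SELECT * FROM modelling_features").fetchall()
print(rows)
```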

In this context, the server is an EC2 instance and the database is RDS (OLAP). The Cookiecutter framework is useful for structuring the workflows; it splits the entire flow into:

  • Data: temp data sets can be saved as CSV files.
  • Features: on-the-fly feature generation.
  • Modelling: the modelling modules go here, including any unsupervised steps such as clustering.
  • Visualization: an extra module to produce graphs and any other outputs from the generated reports.
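The split above maps onto a directory layout roughly like the following, loosely based on the Cookiecutter Data Science template (exact folder names vary by version):

```
project/
├── data/               # temp data sets saved as CSV files
├── notebooks/          # ad-hoc Jupyter analysis
└── src/
    ├── data/           # data loading and pipeline code
    ├── features/       # on-the-fly feature generation
    ├── models/         # supervised and unsupervised modelling modules
    └── visualization/  # graphs and other report outputs
```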

There is another section for Jupyter notebooks, which I prefer to keep running through screen (Linux). One can tunnel into the server from a local machine to run any ad-hoc analysis; the notebooks are always up and fast to use.

The simplest scheduler is a cron job; for DAG-type workflows, one can use Luigi or Airflow. We started with the simplest model, whose output was pushed to a DB with a timestamp and the specific prediction. This pretty much covers all the basic aspects of productizing machine learning.
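The final step described above, pushing each prediction to a DB with a timestamp, can be sketched as follows. The table name, columns, and cron path are illustrative assumptions, and sqlite3 again stands in for the real database:

```python
# Sketch: pushing predictions to a DB with a timestamp. A cron entry such as
# the following (hypothetical path) would run the job nightly at 02:00:
#   0 2 * * * /usr/bin/python3 /opt/ml/predict_job.py

import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE predictions (
        user_id INTEGER,
        predicted_at TEXT,   -- UTC timestamp of the prediction
        prediction REAL
    )
""")

def push_prediction(conn, user_id, prediction):
    conn.execute(
        "INSERT INTO predictions VALUES (?, ?, ?)",
        (user_id, datetime.now(timezone.utc).isoformat(), prediction),
    )

push_prediction(conn, 1, 0.87)
push_prediction(conn, 2, 0.12)
print(conn.execute("SELECT COUNT(*) FROM predictions").fetchone()[0])
```

Storing the timestamp alongside each prediction makes it easy to audit which model run produced which output once the job has been running for a while.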

Machine Learning Operations (MLOps)

This section is a work in progress. More content to arrive soon.