Production-ready ML systems

Data Collection              Data Verification              Configuration

Feature Extraction           ML Algorithms                  Analysis Tools

Monitoring                   Process Management Tools       Machine Resource

                             Evaluation Pipeline

the design steps (framework)

1 - Requirements (Business Objective)
2 - Frame your ML Task
3 - Data Preparation
4 - Model Development
5 - Evaluation
6 - Deployment
7 - Monitoring

1 - Requirements

- business objective -> increase revenue or increase the number of registrations
- features the system needs to support -> interaction data
- data -> what are the sources, how large are the datasets, is the data labeled?
- constraints -> computing power, are you using a cloud-based system, is the model expected to improve automatically over time?
- scale of the system -> how many users do we have?
- performance -> how fast do predictions need to be? what is the priority, accuracy or latency?

2 - Frame your ML Task

- define your ML Objective
- specifying the system's input and output
- selecting the right ML category

- define your ML Objective

business objective                                        ML objective
- (YouTube) increase user engagement                      maximize the time a user spends watching videos
- (Instagram) improve platform safety                     accurately predict whether a piece of content is harmful
- (BookMyShow) increase ticket sales                      maximize the number of event registrations

- specifying the system's input and output

input          algorithm               output
post           harmful content         probability
               detection system

input                      output
user           model       probability
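The input/output framing above can be sketched as a small typed interface. This is only an illustration under stated assumptions: `Post`, `FLAGGED_WORDS`, and `predict_harmful_probability` are hypothetical names, and the keyword ratio is a placeholder standing in for a trained model.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """Input to the system: a single post (hypothetical schema)."""
    post_id: int
    text: str


# Placeholder vocabulary (assumption); a real system would use a trained model.
FLAGGED_WORDS = {"scam", "attack"}


def predict_harmful_probability(post: Post) -> float:
    """Output of the system: a score in [0, 1] for the given post."""
    words = post.text.lower().split()
    if not words:
        return 0.0
    flagged = sum(1 for w in words if w in FLAGGED_WORDS)
    return flagged / len(words)


score = predict_harmful_probability(Post(post_id=1, text="this is a scam"))
```

Fixing the input type and the output range early makes it easier to swap the placeholder for a real model later without changing the callers.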

- selecting the right ML category

                                    ML Categories
Supervised                          Unsupervised                            Reinforcement
Regression                          clustering
Classification                      dimensionality reduction
    - binary
    - multiclass

3 - Data Preparation

data sources -----> data engineering -> feature engineering -----> prepared features
                           data preparation process

data engineering -> designing and building pipelines for collecting, storing, retrieving, and processing data.

data sources ->
- who collected the data
- how clean the data is
- can the source be trusted
- is the data user generated or system generated

data storage ->
- a high-level understanding of how different databases work


Relational     -> PostgreSQL
Key/value      -> Redis, DynamoDB
Column-based   -> Cassandra, HBase
Graph          -> Neo4j
Document       -> MongoDB, CouchDB

ETL -> Extract, Transform, and Load

the ETL process ->

Extract                  ->  Transform  ->  Load
(from data sources)                         (to target destination)
- databases                                 - database
- logs                                      - files
- files                                     - data warehouse
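The three ETL stages can be sketched with the standard library alone. This is a minimal sketch, not a production pipeline: the CSV content, the `watch` table, and the column names are made up for illustration, and an in-memory SQLite database stands in for the target destination.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this comes from a database, log, or file.
RAW_CSV = """user_id,event,watch_seconds
1,play,30
2,play,
1,play,45
"""


def extract(raw: str) -> list[dict]:
    """Extract: read rows from the source."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows: list[dict]) -> list[tuple[int, int]]:
    """Transform: drop rows with a missing value and cast strings to ints."""
    return [
        (int(r["user_id"]), int(r["watch_seconds"]))
        for r in rows
        if r["watch_seconds"]
    ]


def load(rows: list[tuple[int, int]], conn: sqlite3.Connection) -> None:
    """Load: write the cleaned rows to the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS watch (user_id INTEGER, watch_seconds INTEGER)"
    )
    conn.executemany("INSERT INTO watch VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
total = conn.execute(
    "SELECT SUM(watch_seconds) FROM watch WHERE user_id = 1"
).fetchone()[0]
```

Keeping each stage as its own function makes it possible to test the transform logic without touching the source or the destination.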

Data Types in ML

- structured - numerical (discrete and continuous), categorical (ordinal, nominal)
             - predefined schema
             - easy to search
             - relational database, data warehouse

- unstructured - audio, video, image, text
               - no schema
               - difficult to search
               - NoSQL databases, data lakes
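The ordinal vs nominal distinction matters when turning structured data into features. A minimal sketch, assuming a made-up row with a `size` (ordinal), `color` (nominal), and `price` (continuous) field; the category lists are illustrative assumptions:

```python
# Ordinal categories have an order, so map them to increasing integers.
SIZE_ORDER = {"small": 0, "medium": 1, "large": 2}  # assumed ordering

# Nominal categories have no order, so each one gets its own indicator.
COLORS = ["red", "green", "blue"]  # assumed vocabulary


def encode_ordinal(value: str) -> int:
    """Integer encoding that preserves the category order."""
    return SIZE_ORDER[value]


def one_hot(value: str, categories: list[str]) -> list[int]:
    """One-hot encoding: 1 for the matching category, 0 elsewhere."""
    return [1 if value == c else 0 for c in categories]


row = {"size": "medium", "color": "blue", "price": 9.5}
features = [
    encode_ordinal(row["size"]),
    *one_hot(row["color"], COLORS),
    row["price"],  # numerical values can be used directly (or scaled)
]
```

Integer-encoding a nominal field would imply an order that does not exist, which is why the two cases are handled differently here.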

4 - Model Development

- model selection
- start with a simple baseline model
- experiment with simple models
- move to complex models (deep learning models)
- ensemble of models - 3 ways to do ensembling -> bagging, boosting and stacking
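Of the three ensembling approaches, bagging is the easiest to sketch: train each model on a bootstrap sample and combine predictions by majority vote. The base learner below is a toy 1-D threshold classifier chosen only for brevity, and it assumes the positive class has the larger values; all names and data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Toy base learner: threshold at the midpoint between the class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    if not pos or not neg:
        # Degenerate bootstrap sample with one class: always predict it.
        label = 1 if pos else 0
        return lambda x: label
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > threshold else 0


def fit_bagging(xs, ys, n_models=11, seed=0):
    """Train each model on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in xs]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict_bagging(models, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]


xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
ys = [0, 0, 0, 1, 1, 1]
models = fit_bagging(xs, ys)
low, high = predict_bagging(models, 0.5), predict_bagging(models, 9.5)
```

Boosting differs in that models are trained sequentially on the errors of earlier ones, and stacking trains a second-level model on the base models' outputs; neither is shown here.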

keep these points in mind ->

- the amount of data for training
- training speed
- hyperparameters to choose and hyperparameter tuning techniques
- is there any possibility of continual learning
- compute requirements

model training
- constructing the dataset
            - collect the raw data
            - feature engineering
            - sampling strategy
            - address class imbalance

- choose a loss function
- training from scratch vs fine-tuning
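For the "address class imbalance" step, one common option is to weight each class inversely to its frequency so that the loss pays more attention to the rare class. A minimal sketch; the 90/10 label split is made up, and the formula is the "balanced" heuristic n_samples / (n_classes * class_count):

```python
from collections import Counter


def class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


labels = [0] * 90 + [1] * 10  # hypothetical imbalanced dataset
weights = class_weights(labels)
# The rare class (1) gets a larger weight than the common class (0).
```

Resampling (oversampling the minority class or undersampling the majority class) is an alternative when the training code does not accept per-class weights.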

Data Labels
- How should we obtain the labels?
- is the data annotated, and how good are the annotations?
- Natural Labels ->

Model training
- what loss function should we use?
- what regularization should we use?
- backpropagation
- which activation function should we use?
- how to handle an imbalanced dataset?
- what causes overfitting or underfitting?
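As one concrete answer to the loss function question, binary cross-entropy is a standard choice for binary classifiers that output a probability. A minimal pure-Python sketch; the function name and the `eps` clipping value are assumptions for illustration:

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


confident = binary_cross_entropy([1, 0], [0.9, 0.1])
uncertain = binary_cross_entropy([1, 0], [0.6, 0.4])
# Predictions closer to the true labels give a lower loss.
```

The loss grows quickly when the model is confidently wrong, which is the behavior that makes it useful as a training signal.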

5 - Evaluation

offline evaluation - the model is in the development phase
- classification - precision, recall, f1-score, accuracy, ROC-AUC, confusion matrix
- regression - MSE, MAE, RMSE
- image generation - inception score, FID
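Several of the offline metrics above follow directly from their definitions. A minimal sketch for binary classification and regression, with made-up labels and predictions; the function names are hypothetical:

```python
import math


def classification_metrics(y_true, y_pred):
    """Precision, recall, F1, and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


def regression_metrics(y_true, y_pred):
    """MSE, MAE, and RMSE from the per-example errors."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "mae": mae, "rmse": math.sqrt(mse)}


clf = classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
reg = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

ROC-AUC, inception score, and FID need more machinery (score thresholds or a pretrained network) and are not sketched here.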

online evaluation
- ad click prediction - click-through rate, revenue lift
- harmful content detection - valid posts
- video recommendation - total watch time, number of completed videos
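Online metrics are usually compared between a control group and a treatment group. A minimal sketch of click-through rate and relative lift; the click and impression counts are invented for illustration, and the function names are hypothetical:

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """CTR = clicks / impressions (0.0 when nothing was shown)."""
    return clicks / impressions if impressions else 0.0


def relative_lift(treatment: float, control: float) -> float:
    """Relative change of the treatment metric over the control metric."""
    return (treatment - control) / control


# Hypothetical A/B test: the old model serves control, the new one treatment.
control_ctr = click_through_rate(clicks=200, impressions=10_000)
treatment_ctr = click_through_rate(clicks=230, impressions=10_000)
ctr_lift = relative_lift(treatment_ctr, control_ctr)  # about 0.15, i.e. +15%
```

The same lift calculation applies to revenue or total watch time; whether a given lift is statistically significant is a separate question not covered by this sketch.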

6 - Deployment