Production-ready ML systems
- Data Collection
- Data Verification
- Configuration
- Feature Extraction
- ML Algorithms
- Analysis Tools
- Service Infrastructure
- Monitoring
- Process Management Tools
- Machine Resource Management
- Evaluation Pipeline
the design steps (framework)
1 - Requirements (Business Objective)
2 - Frame your ML Task
3 - Data Preparation
4 - Model Development
5 - Evaluation
6 - Deployment
7 - Monitoring
1 - Requirements
- business objective -> e.g. increase revenue or increase the number of registrations
- features the system needs to support -> e.g. interaction data
- data -> what are the sources, how large is the dataset, is the data labeled?
- constraints -> how much computing power is available, is the system cloud-based, is the model expected to improve automatically over time?
- scale of the system -> how many users do we have?
- performance -> how fast must predictions be? which has priority, accuracy or latency?
2 - Frame your ML Task
- define your ML Objective
- specifying the systems input and output
- selecting the right ML category
- define your ML Objective
business objective -> ML objective
- (YouTube) increase user engagement -> maximize the time a user spends watching videos
- (Instagram) improve platform safety -> accurately predict whether a piece of content is harmful
- (BookMyShow) increase ticket sales -> maximize the number of event registrations
- specifying the systems input and output
input: post -> algorithm: harmful content detection system -> output: probability
input: user -> model -> output: probability (events)
- selecting the right ML category
ML Categories
- Supervised
  - Regression
  - Classification
    - binary
    - multiclass
- Unsupervised
  - Clustering
  - Dimensionality reduction
- Reinforcement
3 - Data Preparation
data sources -----> data engineering -> feature engineering -----> prepared features
data preparation process
data engineering -> designing and building pipelines for collecting, storing, retrieving, and processing data.
data sources ->
- who collected the data
- how clean the data is
- can the source be trusted
- is the data user generated or system generated
data storage ->
- a high-level understanding of how different databases work
SQL
Relational Database
- MySQL
- PostgreSQL
NoSQL
Key/value -> Redis, DynamoDB
Column-based -> Cassandra, HBase
Graph -> Neo4j
Document -> MongoDB, CouchDB
ETL -> Extract, Transform, and Load
the ETL process ->
data sources (databases, logs, files) -> Extract -> Transform -> Load -> target destination (database, files, data warehouse)
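A minimal sketch of the three ETL stages, using only the Python standard library. The CSV content, the column names, and the `interactions` table are hypothetical and only illustrate the flow from a source to a target; an in-memory SQLite database stands in for the target destination.

```python
import csv
import io
import sqlite3

# Hypothetical raw log export; in practice this would come from a file or a database.
RAW_CSV = """user_id,event,duration_sec
1,watch,120
2,watch,
1,click,3
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing duration and cast types."""
    clean = []
    for row in rows:
        if row["duration_sec"] == "":
            continue
        clean.append((int(row["user_id"]), row["event"], float(row["duration_sec"])))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS interactions "
        "(user_id INTEGER, event TEXT, duration_sec REAL)"
    )
    conn.executemany("INSERT INTO interactions VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the target destination
load(transform(extract(RAW_CSV)), conn)
row_count = conn.execute("SELECT COUNT(*) FROM interactions").fetchone()[0]
```

Keeping each stage as its own function makes it possible to test the transform step without touching the source or the target.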
Data Types in ML
- structured - numerical(Discrete and continuous), categorical(ordinal, nominal)
- predefined schema
- easy to search
- relational database, data warehouse
- unstructured - audio, video, image, text
- no schema
- difficult to search
- NoSQL databases, data lakes
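The structured types above are usually turned into numbers before training. A small sketch, assuming a hypothetical record with one numerical, one ordinal, and one nominal field (the field names and category lists are made up for illustration):

```python
# Ordinal categories have a meaningful order, so an integer rank preserves it.
SIZE_ORDER = ["small", "medium", "large"]  # hypothetical ordinal feature

def encode_ordinal(value, order):
    """Map an ordinal category to its position in the given order."""
    return order.index(value)

# Nominal categories have no order, so each value gets its own 0/1 column.
COLORS = ["red", "green", "blue"]  # hypothetical nominal feature

def encode_one_hot(value, categories):
    """Map a nominal category to a one-hot vector."""
    return [1 if value == c else 0 for c in categories]

record = {"price": 19.99, "size": "medium", "color": "blue"}

# Numerical values pass through; categorical values are encoded.
features = (
    [record["price"], encode_ordinal(record["size"], SIZE_ORDER)]
    + encode_one_hot(record["color"], COLORS)
)
```

One-hot encoding is used for the nominal field because an integer rank would suggest an order between colors that does not exist.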
4 - Model Development
- model selection
- start with a simple baseline model
- experiment with simple models
- move to complex models (deep learning models)
- use an ensemble of models; 3 ways to do ensembling -> bagging, boosting, and stacking
keep these points in mind ->
- the amount of data for training
- training speed
- hyperparameters to choose and hyperparameter tuning techniques
- is there any possibility of continual learning
- compute requirements
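A rough sketch of the first and last steps of model selection: a majority-class baseline, and bagging as one of the three ensembling approaches (boosting and stacking are not shown). The threshold "stump" learner and the toy 1-D data are assumptions chosen to keep the example in the standard library.

```python
import random
from collections import Counter

def majority_baseline(labels):
    """Simple baseline: always predict the most frequent training label."""
    most_common = Counter(labels).most_common(1)[0][0]
    return lambda x: most_common

def fit_stump(xs, ys):
    """Weak learner: pick the threshold on a 1-D feature with the fewest training errors."""
    best_t, best_err = None, None
    for t in sorted(set(xs)):
        err = sum((x >= t) != bool(y) for x, y in zip(xs, ys))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return lambda x: int(x >= best_t)

def fit_bagging(xs, ys, n_models=15, seed=0):
    """Bagging: train each stump on a bootstrap resample, then take a majority vote."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))

    def predict(x):
        votes = sum(m(x) for m in models)
        return int(votes * 2 > len(models))

    return predict

# Toy data: label is 1 for larger feature values.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.9]
ys = [0, 0, 0, 0, 1, 1, 1]

baseline = majority_baseline(ys)
bagged = fit_bagging(xs, ys)
```

Comparing `bagged` against `baseline` on held-out data shows whether the extra complexity is worth its training and compute cost.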
model training
- constructing the dataset
- collect the raw data
- feature engineering
- sampling strategy
- address class imbalance
- choose a loss function
- training from scratch vs fine-tuning
Data Labels
- How should we obtain the labels?
- is the data annotated, and how good are the annotations?
- Natural Labels ->
Model training
- what loss function should we use?
- what regularization should we use?
- backpropagation
- what activation function should we use?
- how to handle an imbalanced dataset?
- what causes overfitting or underfitting?
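One common way to handle an imbalanced dataset is to weight the loss by class. The sketch below assumes binary labels and uses inverse-frequency weights with a weighted binary cross-entropy; it is one option among several (resampling is another), and the toy label distribution is made up.

```python
import math
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def weighted_bce(y_true, p_pred, weights, eps=1e-12):
    """Binary cross-entropy where each example is scaled by its class weight."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += weights[y] * loss
    return total / len(y_true)

labels = [0] * 9 + [1]           # toy data: 90% negative, 10% positive
weights = class_weights(labels)  # the rare class gets the larger weight
```

With these weights, a mistake on the rare positive class costs more than a mistake on the common negative class, which pushes training away from always predicting the majority label.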
5 - Evaluation
offline evaluation - the model is in the development phase
- classification - precision, recall, f1-score, accuracy, ROC-AUC, confusion matrix
- regression - MSE, MAE, RMSE
- image generation - inception score, FID
- NLP - BLEU, METEOR, ROUGE
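A short sketch of how the classification and regression metrics listed above are computed, assuming binary labels where 1 is the positive class. ROC-AUC, the image generation metrics, and the NLP metrics are not covered here.

```python
import math

def classification_metrics(y_true, y_pred):
    """Precision, recall, F1, and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    # Confusion matrix counts.
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

def regression_metrics(y_true, y_pred):
    """MSE, MAE, and RMSE for numeric predictions."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "mae": mae, "rmse": math.sqrt(mse)}

cls = classification_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
reg = regression_metrics([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
```

On an imbalanced dataset accuracy can look good while recall is poor, so precision, recall, and F1 are usually reported alongside it.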
online evaluation
- ad click prediction - click-through rate, revenue lift
- harmful content detection - valid posts
- video recommendation - total watch time, number of completed videos
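Two of the online metrics above reduce to simple ratios. The sketch assumes click-through rate is clicks divided by impressions and that lift is measured as the relative change of a treatment group over a control group in an A/B test; the numbers are made up.

```python
def click_through_rate(clicks, impressions):
    """CTR: fraction of impressions that led to a click."""
    return clicks / impressions if impressions else 0.0

def relative_lift(treatment, control):
    """Relative change of the treatment metric over the control metric."""
    return (treatment - control) / control

# Hypothetical A/B test numbers.
ctr_control = click_through_rate(clicks=30, impressions=1000)
ctr_treatment = click_through_rate(clicks=36, impressions=1000)
revenue_lift = relative_lift(treatment=110.0, control=100.0)
```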
6 - Deployment