Model Principle Guide

Preprocessing Techniques

1. Remove Columns with a High Percentage of Null Values

If a column mostly contains missing values, discard it. For instance, in a dataset of students, if 90% of the 'Address' column is missing, remove the 'Address' column.
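A minimal sketch of this rule, assuming a table stored as a dictionary of column lists with None marking a missing value. The function name and the 0.5 cutoff are illustrative assumptions; the guide does not state the threshold the system applies.

```python
def drop_sparse_columns(table, max_null_fraction=0.5):
    """Return a copy of `table` without columns that are mostly missing."""
    kept = {}
    for name, values in table.items():
        # Treat an empty column as fully missing.
        null_fraction = sum(v is None for v in values) / len(values) if values else 1.0
        if null_fraction <= max_null_fraction:
            kept[name] = values
    return kept


students = {
    "name": ["Ana", "Ben", "Cy", "Dee", "Eli", "Fay", "Gus", "Hal", "Ida", "Jo"],
    "address": [None] * 9 + ["12 Elm St"],  # 90% missing
}
cleaned = drop_sparse_columns(students)
print(list(cleaned))  # ['name']
```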

2. Parse and Standardize Datetime Columns

Ensure that dates and times are in a consistent format, then convert each datetime column to a numeric timestamp expressed in seconds.
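One way this conversion could look, using only the Python standard library. The list of accepted formats and the choice to read values without a timezone as UTC are assumptions made for the example, not documented behavior.

```python
from datetime import datetime, timezone

# Hypothetical set of input formats; extend it to match the data at hand.
FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%d/%m/%Y")


def to_unix_seconds(text):
    """Parse a date string and return seconds since 1970-01-01 UTC, or None."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next format
        # Assumption: values carry no timezone and are interpreted as UTC.
        return parsed.replace(tzinfo=timezone.utc).timestamp()
    return None


print(to_unix_seconds("1970-01-02"))  # 86400.0
```

Values that match none of the formats come back as None, so they can be handled later together with other missing values.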

3. Remove Text Columns with Too Many Unique Categories

If a text column has so many distinct values that it is hard to analyze, consider removing it. For instance, in a dataset of products, a 'Description' column with thousands of unique descriptions might be removed.
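A possible check for "too many unique categories" is sketched here. Both limits (at least 20 distinct values, and more than half of the non-missing values distinct) are hypothetical; the guide does not say where the system draws the line.

```python
def is_high_cardinality(values, max_unique_ratio=0.5, min_unique=20):
    """Return True when a text column has too many distinct values to be useful."""
    present = [v for v in values if v is not None]
    if not present:
        return False
    unique = len(set(present))
    # Require both many distinct values and a high share of distinct values.
    return unique >= min_unique and unique / len(present) > max_unique_ratio


descriptions = [f"Product description {i}" for i in range(1000)]
categories = ["toys", "books"] * 500

print(is_high_cardinality(descriptions))  # True
print(is_high_cardinality(categories))    # False
```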

4. Handle Mixed Data Types by Inferring and Converting

Convert columns with a mix of numbers and text to a consistent type. For example, if a column contains both age numbers and 'unknown', convert the ages to numbers and replace 'unknown' with a sentinel value that cannot be a real age, such as -1.
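The conversion can be sketched as follows. The sentinel of -1 is an assumption chosen because it cannot be a real age; pick a value that lies outside the valid range of the column being converted.

```python
def coerce_to_number(values, sentinel=-1.0):
    """Convert every entry to float, substituting `sentinel` when that fails."""
    result = []
    for v in values:
        try:
            result.append(float(v))
        except (TypeError, ValueError):
            # Text such as 'unknown' and missing entries end up here.
            result.append(sentinel)
    return result


ages = [34, "27", "unknown", 51.0, None]
print(coerce_to_number(ages))  # [34.0, 27.0, -1.0, 51.0, -1.0]
```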

5. Handle Missing/Null Values by Imputation

If some data is missing, estimate or fill it in. For example, in a dataset of house prices, if the 'Number of Bedrooms' column is missing for some houses, fill in the missing values with a suitable method.
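The guide leaves the "suitable method" open. As one illustration, the sketch below fills missing numbers with the column median, which is a common choice because it is not pulled around by extreme values. It is an example, not necessarily the method the system uses.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to estimate from
    fill = median(present)
    return [fill if v is None else v for v in values]


bedrooms = [3, None, 2, 4, None, 3]
print(impute_median(bedrooms))  # [3, 3.0, 2, 4, 3.0, 3]
```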

6. Detect and Handle Outliers

Identify and handle outlier values that can skew analysis. For example, in a dataset of student exam scores, if one student has a score of 1000 while others range from 0 to 100, consider 1000 as an outlier and handle it accordingly.
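One widely used detection rule is the interquartile range (IQR) fence, sketched below. The rule and the multiplier of 1.5 are assumptions for this example; the guide does not name the detection method.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences; values outside them are treated as outliers."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


scores = [55, 62, 68, 71, 74, 77, 80, 85, 91, 1000]
low, high = iqr_bounds(scores)
outliers = [s for s in scores if s < low or s > high]
kept = [s for s in scores if low <= s <= high]
print(outliers)  # [1000]
```

Here the outlier is dropped. Clipping it to the nearest fence is an alternative when the row should be kept.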

7. Remove Constant Columns with the Same Values

If a column has the same value for every row, it doesn't provide useful information. For example, in a dataset of survey responses, if a column for 'Survey Version' has every entry as 'Version 1', consider removing that column.
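A short sketch of this check, again assuming a table stored as a dictionary of column lists:

```python
def drop_constant_columns(table):
    """Keep only the columns that contain more than one distinct value."""
    return {
        name: values
        for name, values in table.items()
        if len(set(values)) > 1
    }


survey = {
    "survey_version": ["Version 1", "Version 1", "Version 1"],
    "rating": [4, 5, 3],
}
print(list(drop_constant_columns(survey)))  # ['rating']
```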

8. Encode Categorical Columns for Modeling

Convert categorical data (like gender or type of product) into numbers for modeling. For example, if a 'Gender' column has 'Male' and 'Female', encode them as 0 and 1 respectively.
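The encoding in the example above can be sketched as a simple label encoder that assigns codes in order of first appearance. The function name is illustrative.

```python
def label_encode(values):
    """Map each distinct category to an integer code, in order of first appearance."""
    mapping = {}
    codes = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)  # next unused code
        codes.append(mapping[v])
    return codes, mapping


codes, mapping = label_encode(["Male", "Female", "Female", "Male"])
print(codes)    # [0, 1, 1, 0]
print(mapping)  # {'Male': 0, 'Female': 1}
```

Integer codes imply an order between categories. For columns with several unordered categories, one-hot encoding is a common alternative.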

Model Principle

A categorical target column indicates a classification problem, because its values are the classes to be predicted. For example, in a list of fruits, the 'Type' column might contain the categories 'Apple', 'Banana', and 'Orange'. In a classification task, the model learns to assign each new item to one of these categories.

If the target column consists of discrete numbers with a limited number of unique values (also known as classes or categories), it typically indicates a classification problem.
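The two rules above could be combined into a heuristic like the following. The limit of 20 classes is a hypothetical cutoff, since the guide does not say how many unique values still count as "limited".

```python
def infer_task(target, max_classes=20):
    """Guess whether a target column calls for classification or regression."""
    unique = {v for v in target if v is not None}
    # Text labels are categorical by nature.
    if any(isinstance(v, str) for v in unique):
        return "classification"
    # Whole numbers with few distinct values behave like class labels.
    all_whole = all(float(v).is_integer() for v in unique)
    if all_whole and len(unique) <= max_classes:
        return "classification"
    return "regression"


print(infer_task(["Apple", "Banana", "Orange"]))     # classification
print(infer_task([0, 1, 1, 0, 2]))                   # classification
print(infer_task([199999.5, 250000.0, 310500.25]))   # regression
```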

Linear regression is commonly used for predicting continuous values, such as predicting house prices or estimating sales revenue based on advertising spending.
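To make this concrete, here is a small ordinary least squares fit with a single input feature. The advertising and revenue figures are made up for the example and lie exactly on a line so that the result is easy to check.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


ad_spend = [10, 20, 30, 40]   # illustrative data
revenue = [25, 45, 65, 85]    # follows revenue = 2 * ad_spend + 5
slope, intercept = fit_line(ad_spend, revenue)
predicted = slope * 50 + intercept
print(slope, intercept, predicted)  # 2.0 5.0 105.0
```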

The anomaly model is trained on normal data only. From that data it calculates a reconstruction-error threshold, and it classifies new data as anomalous when its reconstruction error exceeds the threshold and as normal otherwise.
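The thresholding idea can be illustrated with a deliberately simple stand-in. In the sketch below the "reconstruction" of any value is just the mean of the normal training data, and the threshold is the mean training error plus three standard deviations. Both choices are assumptions for illustration; the guide does not describe the actual model or the threshold rule.

```python
from statistics import mean, pstdev


class MeanReconstructionDetector:
    """Toy detector: reconstructs every input as the training mean."""

    def fit(self, normal_values):
        self.center = mean(normal_values)
        # Reconstruction error of each normal training point.
        errors = [abs(v - self.center) for v in normal_values]
        # Assumed rule: mean error plus three standard deviations.
        self.threshold = mean(errors) + 3 * pstdev(errors)
        return self

    def is_anomaly(self, value):
        return abs(value - self.center) > self.threshold


detector = MeanReconstructionDetector().fit([20.1, 19.8, 20.4, 20.0, 19.7, 20.3])
print(detector.is_anomaly(20.2))  # False
print(detector.is_anomaly(35.0))  # True
```

A real system would typically replace the mean with a learned model, while the threshold step stays the same in spirit: measure errors on normal data, then flag new points whose error is unusually large.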

Types of Models and When to Use Them

1. Regression Models

  • When to Use: Use for predicting continuous values. For example, predicting house prices based on various features like square footage, number of bedrooms, etc.
  • Example Use Case: Estimating sales revenue based on advertising spending.

2. Classification Models

  • When to Use: Use when the target variable is categorical or discrete. This is ideal for problems where the goal is to assign input data into predefined categories.
  • Example Use Case: Classifying emails as spam or not spam based on certain features.

3. Anomaly Detection Models

  • When to Use: Use when you need to identify unusual or unexpected data points. These models are trained on "normal" data and can identify outliers or anomalies in new data.
  • Example Use Case: Fraud detection in credit card transactions.

4. Clustering Models

  • When to Use: Use when you want to group similar data points together without any predefined labels. Clustering can reveal patterns or groupings in data.
  • Example Use Case: Customer segmentation in marketing to understand different customer groups.

5. Recommendation Models

  • When to Use: Use for suggesting items to users based on past behavior or preferences. Commonly used in e-commerce and content platforms.
  • Example Use Case: Movie recommendations on streaming platforms like Netflix.

6. Time Series Models

  • When to Use: Use when you need to predict future values based on historical data that is ordered in time. Time series models are used for forecasting.
  • Example Use Case: Predicting stock prices or sales performance over time.

Model Selection

  • When choosing a model, it's important to first understand the nature of your target variable (continuous or categorical), as well as the structure of your data (e.g., time series, categorical data).

This guide provides an overview of the key preprocessing techniques and the principles behind model selection, helping you choose the right model for different machine learning tasks.