Data Science

Data Collection | Data Cleaning | Exploratory Data Analysis (EDA) | Model Building | Model Deployment

Extract Load Transform (ELT) | dbt (Data Build Tool) | Orchestration | Modules

Data Science Tutorial by Example


Data Science Steps

Data Scientist Roadmap 2022

 


Data Collection

Read a (local) CSV File | Download from API | Download CSV from API | Save to a DataFrame | Merge Data | Concatenate Data | GroupBy Data | Unique Values | Replace NaN Values |

Read a (local) CSV File

Repeatedly reading a large uncompressed comma separated value (CSV) file is an inefficient way to access big data.   Whenever possible, the file should either be imported into a local database, or saved to a compressed binary format such as TDMS or HDF5 for subsequent reading.  


import pandas as pd
# load the Titanic dataset into a pandas DataFrame (df)
# from: https://github.com/jorisvandenbossche/pandas-tutorial/blob/master/data/titanic.csv
#df = pd.read_csv("https://raw.githubusercontent.com/jorisvandenbossche/pandas-tutorial/master/data/titanic.csv")
# OR
df = pd.read_csv('data/titanic.csv')    # Download the CSV file to a subfolder 'data' under the folder where this script resides.
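
As noted above, for large files it can pay to convert the CSV once to a compressed binary format and re-read that file thereafter.   Below is a minimal sketch using pandas' HDF5 support (it assumes the optional PyTables package is installed and that 'data/titanic.csv' exists locally):


import pandas as pd

# One-time conversion: read the CSV and save it as a compressed HDF5 file.
df = pd.read_csv('data/titanic.csv')
df.to_hdf('data/titanic.h5', key='titanic', mode='w', complevel=9)

# Subsequent reads come from the binary HDF5 file instead of re-parsing the CSV.
df = pd.read_hdf('data/titanic.h5', key='titanic')
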
If a CSV file contains the delimiter character within quoted string/text columns, fast bulk-import methods into a PostgreSQL database (such as COPY / copy_from) cannot be used directly.   The Pandas function df = pd.read_csv() is a more flexible option for these cases.  

Reading a local CSV file with a poor structure:


#
#   Written by:  Mark W Kiehl
#   http://mechatronicsolutionsllc.com/
#   http://www.savvysolutions.info/savvycodesolutions/
#
#   Create a PostgreSQL database table from a CSV file by deriving its structure using Pandas,
#   and then write the CSV contents to the database table (without using Pandas).
#   Thereafter the CSV contents are available in the database table for extraction and analysis. 
#
#   This script assumes the Titanic CSV file is already downloaded locally.
#       Titanic dataset https://github.com/jorisvandenbossche/pandas-tutorial/blob/master/data/titanic.csv
#   THE TITANIC CSV FILE IS NOT FORMATTED WELL.  It contains the delimiter ',' within
#   the text/string columns, limiting the ability to use fast options such as
#   cur.copy_from() to read the CSV file quickly. This script demonstrates a solution
#   to that CSV read problem. 


# Python performance timer
import time
t_start_sec = time.perf_counter()


# pip install memory-profiler
# Use the package memory-profiler to measure memory consumption
# The peak memory is the difference between the starting value of the “Mem usage” column, and the highest value (also known as the “high watermark”).
# IMPORTANT:  See how / where @profile is inserted before a function later in the script.
from memory_profiler import profile


# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import shapiro



import csv

# pip install psycopg2
import psycopg2
import psycopg2.extras  # for execute_batch


conn = psycopg2.connect(
    host="localhost",
    database="pytestdb",
    user="postgres",
    password="your-db-password",
)
conn.autocommit = True  #set autocommit=True so every command we execute will take effect immediately


def pandas_types_to_postgresql_cols(dtypes):
    # Translate Pandas data type in dtypes and return a string for a SQL create table statement, e.g. CREATE UNLOGGED TABLE <table name> (col1 INTEGER, col2 TEXT, ..)
    # NOTE: The string returned may need modification!
    #
    # The following types (or spellings thereof) are specified by SQL: bigint, bit, bit varying, boolean, char, character varying, character, varchar, date, double precision, integer, interval, numeric, decimal, real, smallint, time (with or without time zone), timestamp (with or without time zone), xml.
    # Pandas data types:    string_, unicode_, mixed types
    #                       int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64
    #                       float_, float16, float32, float64
    #                       bool_
    #                       datetime64[ns]
    s = '('
    for key in dtypes:
        #print(key,': ',dtypes[key])
        if dtypes[key] in ('int', 'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'):
            s = s + key + ' ' + 'INTEGER,'
        elif dtypes[key] in ('float', 'float16', 'float32', 'float64'):
            s = s + key + ' ' + 'DECIMAL,'
        elif dtypes[key] == 'bool':
            s = s + key + ' ' + 'BOOLEAN,'
        elif dtypes[key] == 'datetime64[ns]':
            s = s + key + ' ' + 'TIMESTAMPTZ,'    # TIMESTAMP, TIMESTAMPTZ, DATE, ..
        elif dtypes[key] in ('string', 'unicode'):
            s = s + key + ' ' + 'TEXT,'
        else:
            print('WARNING: unexpected dtypes of "{}" in pandas_types_to_postgresql_cols() for {}'.format(dtypes[key], key))
            s = s + key + ':' + 'ERROR'
    s = s[0:-1] + ')'
    return s


def create_db_table(cur, table_name, s):
    """ Connect to the local PostgreSQL database server """
    try:
        # connect to the PostgreSQL server
        print('Connecting to the PostgreSQL database...')
        
        # create a cursor
        cur = conn.cursor()
        
        # execute a SQL statement to get the PostgreSQL version
        cur.execute('SELECT version()')
        # display the PostgreSQL database server version
        db_version = cur.fetchone()
        print(db_version)
        
        # Build the SQL statement to execute
        sql = 'DROP TABLE IF EXISTS ' + table_name + ';' 
        sql = sql + 'CREATE UNLOGGED TABLE ' + table_name + ' ' + s + ';'
        sql = "" + sql + ""
        print(sql)
        
        # Create a new table named table_name
        cur.execute(sql)
        
        # execute a SQL statement to get the number of records in table table_name
        # (throws an error if the table doesn't exist)
        cur.execute('select count(*) from ' + table_name + ';')        
        
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        print('Created db table ' + table_name)


def write_csv_to_db_table(cur, table_name, csv_filepath):
    """ Connect to the local PostgreSQL database server """
    try:
        # connect to the PostgreSQL server
        print('Connecting to the PostgreSQL database...')
        
        # create a cursor
        cur = conn.cursor()
        
        # The CSV file is not formatted well.  It contains the delimiter ',' within
        # the text/string columns, limiting the ability to use fast options such as
        # cur.copy_from() to read the CSV file quickly. 

        # Read some of the CSV data from the file to get the column information
        df = pd.read_csv(csv_filepath, nrows=5)
        
        # Create a comma separated list of the DataFrame column names
        cols = ','.join(list(df.columns))
        #print(cols)     #PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
        
        #Build a string with the '(%s, ... %s)' for VALUES
        v = '('
        for x in range(len(df.columns)):
            v = v + '%s,'
        v = v[0:-1]
        v = v + ')'
        #print(v)    #(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
        
        # The arguments to csv.reader() cause any delimiter between the quotechar to be ignored. 
        # This is not the most efficient / fast way to read a CSV file, but it is necessary given
        # the presence of the delimiter character in the strings/text. 
        with open(csv_filepath, 'r') as f:
            csv_reader = csv.reader(f, quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL)
            next(csv_reader) # Skip the header row.
            for row in csv_reader:
                # Note: row is a list
                # Below is the recommended way to call INSERT using psycopg2:
                cur.execute("INSERT INTO " + table_name + " VALUES " + v, row)
        
    except (Exception, psycopg2.DatabaseError) as error:
        print('ERROR: {}'.format(error))
        cur.close()        
    finally:
        print('done .. write_csv_to_db_table()')


def alter_db_table(cur, table_name, sql):
    """ Connect to the local PostgreSQL database server """
    try:
        # connect to the PostgreSQL server
        print('Connecting to the PostgreSQL database...')
        
        # create a cursor
        cur = conn.cursor()
        
        # execute a SQL statement to get the PostgreSQL version
        cur.execute('SELECT version()')
        # display the PostgreSQL database server version
        db_version = cur.fetchone()
        print(db_version)
        
        # Alter the table named table_name with the statements as specified within string SQL
        cur.execute(sql)
        
        # execute a SQL statement to get the number of records in table table_name
        # (throws an error if the table doesn't exist)
        cur.execute('select count(*) from ' + table_name + ';')
        print('db table {} now has {} rows'.format(table_name, cur.fetchone()[0]))
        
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        print('done .. alter_db_table()')


def get_db_table_struct(cur, table_name):
    """ Connect to the local PostgreSQL database server """
    try:
        # create a cursor
        cur = conn.cursor()
        
        # Query the information_schema.columns catalog to get the structure of table_name
        sql = 'SELECT table_name, column_name, data_type FROM information_schema.columns WHERE table_name = \'' + table_name + '\';'
        sql = "" + sql + ""
        cur.execute(sql)
        
        tbl_struct = cur.fetchall()
        return tbl_struct
        
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)



@profile    # instantiating the decorator for the function/code to be monitored for memory usage by memory-profiler
def main():
    # Get the CSV structure information so it can be used to create a database table. 
    """
        # Pandas data types:    string_, unicode_, mixed types
        #                       int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64
        #                       float_, float16, float32, float64
        #                       bool_
        #                       datetime64[ns]
    """
    
    """
    Contents of file 'titanic.csv':
        PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
        1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
        2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
        3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
    """
    
    # Load the CSV file 'titanic.csv' into a PostgreSQL database table.  
    # The best way to get the CSV file structure is to read the first n rows of the CSV file with
    # Pandas, and then use what Pandas acquired from the file to create the database table structure.
    csv_filepath = 'data/titanic.csv'
    
    # IMPORTANT: If the data contains dates, they may import as data type 'Object' rather than 'datetime64[ns]'.
    # To fix this, add the "parse_dates=['column_name']" option and update 'column_name' with the name of the column with dates.
    # (The Titanic CSV has no date columns, so this is not needed here.)
    df = pd.read_csv(csv_filepath, nrows=50)
    #print(df.info())
    """
     #   Column       Non-Null Count  Dtype
    ---  ------       --------------  -----
     0   PassengerId  50 non-null     int64
     1   Survived     50 non-null     int64
     2   Pclass       50 non-null     int64
     3   Name         50 non-null     object
     4   Sex          50 non-null     object
     5   Age          36 non-null     float64
     6   SibSp        50 non-null     int64
     7   Parch        50 non-null     int64
     8   Ticket       50 non-null     object
     9   Fare         50 non-null     float64
     10  Cabin        9 non-null      object
     11  Embarked     50 non-null     object
    """
    
    # How to manually convert a DataFrame column to data type 'datetime64[ns]': df['column_name'] = pd.to_datetime(df['column_name'], unit='s') 
    
    # Manually convert the columns of data type 'object' to data type 'string'
    df['Name'] = df['Name'].astype({'Name':'string'})
    df['Sex'] = df['Sex'].astype({'Sex':'string'})
    df['Ticket'] = df['Ticket'].astype({'Ticket':'string'})
    df['Cabin'] = df['Cabin'].astype({'Cabin':'string'})
    df['Embarked'] = df['Embarked'].astype({'Embarked':'string'})
    df['Age'] = df['Age'].astype({'Age':'string'})      # Age is actually a float64, but it contains NaN values, so treat it as string (TEXT)
    #print(df.info())
    #print('\n')
    """
     #   Column    Non-Null Count  Dtype
    ---  ------    --------------  -----
     0   PassengerId  50 non-null     int64
     1   Survived     50 non-null     int64
     2   Pclass       50 non-null     int64
     3   Name         50 non-null     string
     4   Sex          50 non-null     string
     5   Age          36 non-null     string
     6   SibSp        50 non-null     int64
     7   Parch        50 non-null     int64
     8   Ticket       50 non-null     string
     9   Fare         50 non-null     float64
     10  Cabin        9 non-null      string
     11  Embarked     50 non-null     string
    """
    
    # Get the data types from the DataFrame and convert them from a Series to a dictionary
    dtypes = df.dtypes.to_dict()
    # Get a string with SQL statements for the column names and data types derived from dtypes 
    s = pandas_types_to_postgresql_cols(dtypes)
    print(s)
    #(PassengerId INTEGER,Survived INTEGER,Pclass INTEGER,Name TEXT,Sex TEXT,Age TEXT,SibSp INTEGER,Parch INTEGER,Ticket TEXT,Fare DECIMAL,Cabin TEXT,Embarked TEXT)
    # Edit the string s if necessary before it is used to create the database table, e.g.:
    #s = '(PassengerId INTEGER,Survived INTEGER,Pclass INTEGER,Name TEXT,Sex TEXT,Age TEXT,SibSp INTEGER,Parch INTEGER,Ticket TEXT,Fare DECIMAL,Cabin TEXT,Embarked TEXT)'
    print('\n')
    
    table_name = 'titanic'
    with conn.cursor() as cur:
        create_db_table(cur, table_name, s)
        print('\n')
        
        # Read the CSV into the db table.
        # NOTE: you will need to modify write_csv_to_db_table() for your specific CSV.
        write_csv_to_db_table(cur, table_name, csv_filepath)
        
        # Example: alter the table if needed, e.g. convert column 'Age' (created as TEXT because of
        # its NaN values) to NUMERIC.  Edit this SQL for your specific CSV before enabling the call below.
        sql = 'ALTER TABLE ' + table_name + " ALTER COLUMN Age TYPE NUMERIC USING NULLIF(Age, '')::numeric;"
        #alter_db_table(cur, table_name, sql)
        
        
        # Print out the structure of table table_name
        print('\nStructure of db table ' + table_name + ':')
        tbl_struct = get_db_table_struct(cur, table_name)
        for line in tbl_struct:
            print(line)
        
        """
        ('test_a', 'datetime', 'timestamp with time zone')
        ('test_a', 'int', 'integer')
        ('test_a', 'float', 'numeric')
        ('test_a', 'bool1', 'boolean')
        ('test_a', 'unix_s', 'timestamp with time zone')
        ('test_a', 'text', 'text')
        """
        
    conn.close()
        
    
    # Report the script execution time
    t_stop_sec = time.perf_counter()
    print('\nElapsed time {:6f} sec'.format(t_stop_sec-t_start_sec))


if __name__ == '__main__':
    main()		#only executes when NOT imported from another script

 

Download from an API
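
A minimal sketch of pulling JSON from a REST API into a DataFrame (the endpoint below is only a placeholder; substitute the actual API URL and any required authentication):


import pandas as pd
import requests

# Placeholder endpoint - replace with the real API URL and any required headers / API key.
url = 'https://api.example.com/v1/records'

response = requests.get(url, timeout=30)
response.raise_for_status()         # raise an exception for HTTP error status codes

data = response.json()              # parse the JSON payload into Python objects
df = pd.json_normalize(data)        # flatten the (possibly nested) JSON records into a DataFrame

print(df.info())
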


 

Download CSV from an API

Pandas makes it very easy to download a CSV file directly from an online source.


# Python performance timer
import time
t_start_sec = time.perf_counter()


# pip install memory-profiler
# Use the package memory-profiler to measure memory consumption
# The peak memory is the difference between the starting value of the “Mem usage” column, and the highest value (also known as the “high watermark”).
# IMPORTANT:  See how / where @profile is inserted before a function later in the script.
from memory_profiler import profile


# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import shapiro
import json


def data_collection():
    # load the Titanic dataset into a pandas DataFrame (df)
    # from: https://github.com/jorisvandenbossche/pandas-tutorial/blob/master/data/titanic.csv
    df = pd.read_csv("https://raw.githubusercontent.com/jorisvandenbossche/pandas-tutorial/master/data/titanic.csv")
    # OR
    #df = pd.read_csv('data/titanic.csv')    # Download the CSV file to a subfolder 'data' under the folder where this script resides.
    return df


def show_df_contents(df):
    # print general information about the DataFrame contents
    print(df.info())
    print('\n')
    """
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 12 columns):
     #   Column       Non-Null Count  Dtype
    ---  ------       --------------  -----
     0   PassengerId  891 non-null    int64
     1   Survived     891 non-null    int64
     2   Pclass       891 non-null    int64
     3   Name         891 non-null    object
     4   Sex          891 non-null    object
     5   Age          714 non-null    float64
     6   SibSp        891 non-null    int64
     7   Parch        891 non-null    int64
     8   Ticket       891 non-null    object
     9   Fare         891 non-null    float64
     10  Cabin        204 non-null    object
     11  Embarked     889 non-null    object
     """


#@profile    # instantiating the decorator for the function/code to be monitored for memory usage by memory-profiler
def main():
    # Load the Titanic dataset
    df = data_collection()
    
    # Show some basic information about the CSV file contents in the DataFrame df
    show_df_contents(df)
        
    # Report the script execution time
    t_stop_sec = time.perf_counter()
    print('\nElapsed time {:6f} sec'.format(t_stop_sec-t_start_sec))


if __name__ == '__main__':
    main()		#only executes when NOT imported from another script

 

Save to a DataFrame


import pandas as pd

# list to DataFrame
li = [[1,2,3],[4,5,6]]
df = pd.DataFrame(li, columns=['A','B','C'])

   A  B  C
0  1  2  3
1  4  5  6


# dictionary of lists to DataFrame
dic = {'A':[1,2,3],'B':[4,5,6]}
df = pd.DataFrame(dic)

   A  B
0  1  4
1  2  5
2  3  6

Merge Data


import pandas as pd
# merge two DataFrames (same as SQL JOIN)
df1 = pd.DataFrame([[1,'A'],[2,'B']], columns=['col1','col2'])
df2 = pd.DataFrame([['A',3],['B',4]], columns=['col2','col3'])
df = pd.merge(df1, df2, on = 'col2', how = 'inner')

   col1 col2  col3
0     1    A     3
1     2    B     4

Concatenate Data


import pandas as pd
# concatenate two DataFrames.  axis = 1 stacks columns together. | axis = 0 stacks rows together, provided the column headers match.
df1 = pd.DataFrame([[1,'A'],[2,'B']], columns=['col1','col2'])
df2 = pd.DataFrame([['A',3],['B',4]], columns=['col2','col3'])
df = pd.concat((df1, df2), axis = 1)

   col1 col2 col2  col3
0     1    A    A     3
1     2    B    B     4

Groupby Sum


import pandas as pd
df = pd.DataFrame([[1,'A'],[2,'B'],[3,'A'],[4,'C']], columns = ['col1','col2'])

print(df.head())
   col1 col2
0     1    A
1     2    B
2     3    A
3     4    C

print(df.groupby('col2').col1.sum())
col2
A    4
B    2
C    4

Unique Values


import pandas as pd
df = pd.DataFrame([[1,'A'],[2,'B'],[3,'A'],[4,'C']], columns = ['col1','col2'])

print(df.head())
   col1 col2
0     1    A
1     2    B
2     3    A
3     4    C

# Get a list of unique values in DataFrame column 'col2'
print(df['col2'].unique())
['A' 'B' 'C']

# Get the count of unique values in DataFrame column 'col2'
print(df['col2'].value_counts())
A	2
B	1
C	1

Replace NaN Values

Pandas is built on top of NumPy.   If a NumPy array contains NaNs, NumPy's standard aggregate functions (np.mean, np.min, np.max, etc.) return NaN.   NumPy also provides NaN-aware counterparts, such as np.nansum, np.nanmin, etc., that ignore NaN values.  
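
For example:


import numpy as np

a = np.array([1.0, 2.0, np.nan, 4.0])
print(np.mean(a))       # nan   - the standard aggregate propagates the NaN
print(np.nanmean(a))    # 2.3333333333333335   - the NaN-aware version ignores it
print(np.nansum(a))     # 7.0
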


import numpy as np
import pandas as pd
df = pd.DataFrame([[1,float('NaN')],[2,'B'],[3,np.nan],[4,'C']], columns = ['col1','col2'])

print(df.head())
   col1 col2
0     1  NaN
1     2    B
2     3  NaN
3     4    C

df['col2'] = df['col2'].fillna('B')
   col1 col2
0     1    B
1     2    B
2     3    B
3     4    C


Related Links

20% of Pandas Functions that Data Scientists Use 80% of the Time

 

Data Cleaning / Preprocessing

Data Validation should be configured prior to building a data model.   How to Do Data Validation on Your Data on Pandas with pytest

Missing Values | Encoding Categorical Fields | Standardizing Numeric Columns | Binning Numeric Columns | Min/Max Scaling | Robust Scaling | Power Transformations | Quantile Transformations | Box-Cox Transformations |

Missing Values

There are various ways to handle missing values, such as filling them with the mean, median, or mode of the column, or dropping rows with missing values.  


import numpy as np
import pandas as pd

df = pd.DataFrame([[1,'A',1.23,10],[2,'B',2.34,9],[3,np.nan,float('NaN'),None],[4,'D',4.45,7]], columns = ['idx','cat','float','int'])
"""
print(df.head(), '\n')
   idx  cat  float   int
0    1    A   1.23  10.0
1    2    B   2.34   9.0
2    3  NaN    NaN   NaN
3    4    D   4.45   7.0
"""

# Get the count of missing values by column
"""
print('DataFrames with missing values:\n',df.isnull().sum(),'\n')
DataFrames with missing values:
idx      0
cat      1
float    1
int      1
"""

# Calculate the count and % of missing values
"""
total = df.isnull().sum().sort_values(ascending=False)
percent_1 = df.isnull().sum()/df.isnull().count()*100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=['Count', '%'])
print('DataFrames with missing values:\n',missing_data.head(5))
DataFrames with missing values:
		Count     %
cat        1  25.0
float      1  25.0
int        1  25.0
idx        0   0.0
"""

# Replace the missing int column values with random values generated based
# on the mean and the standard deviation of the int column.
# (The column must be of data type float and normally distributed)
"""
mean = df['int'].mean()
#print('int column mean  = ', mean)        # int column mean  =  8.666666666666666
std = df['int'].std()
#print('int std = ', std)                  # int std =  1.5275252316519468
is_null = df['int'].isnull().sum()
#print(is_null, 'int values are null')     # 1 int values are null
print('Replacing column int nan values with ', is_null, ' random values between ', mean - std, ' and ', mean + std, ' based on the mean and std of column int')
# np.random.randint(low, high=None, size=None, dtype=int)
rand_int = np.random.randint(mean - std, mean + std, size = is_null)    # rand_int is a numpy array with is_null values
#print('rand_int = ', pd.DataFrame(rand_int).to_string(header=False, index=False))   # 9
# Replace NaN values in int column with the random values generated
int_copy = df['int'].copy()
int_copy[np.isnan(int_copy)] = rand_int     # np.isnan() tests element-wise whether it is NaN or not and returns the result as a boolean array
df['int'] = int_copy
#print(df['int'].isnull().sum(), 'int values are null')  # 0 int values are null
print(df.head(), '\n')
idx  cat  float   int
0    1    A   1.23  10.0
1    2    B   2.34   9.0
2    3  NaN    NaN   9.0
3    4    D   4.45   7.0
"""


# Replace NaN value for column 'int' with value of 8
#df['int'].fillna(8, inplace=True)
"""
print(df.head())
print('\n')
	   idx  cat  float   int
	0    1    A   1.23  10.0
	1    2    B   2.34   9.0
	2    3  NaN    NaN   8.0
	3    4    D   4.45   7.0
"""

# Replace NaN value for column 'cat' with value of A
# (WARNING: Do not use: df['cat'] = df['cat'].fillna('A', inplace=True))
#df['cat'].fillna('A', inplace=True)

# Replace the NaN value for column 'float' with an interpolated value
# DataFrame.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', limit_area=None, downcast=None, **kwargs)
#df['float'] = df['float'].interpolate(method ='linear', limit_direction ='forward')
"""
	   idx cat  float  int
	0    1   A  1.230   10
	1    2   B  2.340    9
	2    3   A  3.395    8
	3    4   D  4.450    7
"""

# Replace NaN values for column 'float' with the median
#df['float'].fillna(df['float'].median(), inplace=True)

# Replace NaN values for column 'float' with the mode
"""
#print(df['float'].isnull().sum(), ' missing values in column float')    # 1  missing values in column float
mode = df['float'].mode()
mode = mode.iat[0]      # Get the scalar mode value from the Series
# OR: mode = df['float'].mode().iat[0]
#print('float column mode = ', mode)        # 1.23
print('Replacing ', df['float'].isnull().sum(), ' missing values in column float with the mode of ', mode)
# Replacing  1  missing values in column float with the mode of  1.23
df['float'].fillna(mode, inplace=True)
print(df['float'].isnull().sum(), ' missing values in column float')    # 0  missing values in column float
"""

# Create a new column from the values in column 'float' using the Pandas .apply() function with a custom function df_prod().
"""
def df_prod(val):
    return val * 3.0

df['float3'] = df['float'].apply(df_prod)
print(df.head())
print('\n')
# Delete the column 'float3'
df.drop(columns=['float3'], inplace=True)
"""

 

Encoding Categorical Fields

Categorical fields are variables that contain a limited number of categories.   One-hot encoding converts the categories into numerical values that can be used by a machine learning (ML) model.   Pandas example.  
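
A minimal sketch of one-hot encoding with pd.get_dummies(), using a small DataFrame like the one in the examples above:


import pandas as pd

df = pd.DataFrame([[1,'A',1.23,10],[2,'B',2.34,9],[3,'A',3.45,8],[4,'D',4.45,7]], columns = ['idx','cat','float','int'])

# One-hot encode the categorical column 'cat' into 0/1 indicator columns.
df = pd.get_dummies(df, columns=['cat'], prefix=['cat'], dtype=int)
print(df.head())
"""
   idx  float  int  cat_A  cat_B  cat_D
0    1   1.23   10      1      0      0
1    2   2.34    9      0      1      0
2    3   3.45    8      1      0      0
3    4   4.45    7      0      0      1
"""
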

 

Standardizing Numeric Columns

Scale the values of a numeric column so that they have zero mean and unit variance.   An ML model can be sensitive to wide variations in scale among the numeric columns.   Frequently numeric columns are normalized to floating point values between 0 and 1.   Pandas example.  
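
A minimal sketch of z-score standardization with plain Pandas (zero mean, unit variance):


import pandas as pd

df = pd.DataFrame({'float': [1.23, 2.34, 3.45, 4.56]})

# Standardize: subtract the mean and divide by the standard deviation.
df['float_std'] = (df['float'] - df['float'].mean()) / df['float'].std()
print(df['float_std'].mean(), df['float_std'].std())    # ~0.0 and 1.0
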

 

Binning Numeric Columns

Allocate the values of a continuous numeric column into bins, converting them into a categorical column.  
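
A minimal sketch using pd.cut() to bin a continuous column into labeled categories:


import pandas as pd

df = pd.DataFrame({'age': [5, 17, 25, 42, 67, 80]})

# Bin the continuous 'age' values into ordered categories.
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 40, 65, 120], labels=['child', 'young', 'middle', 'senior'])
print(df)
"""
   age age_group
0    5     child
1   17     child
2   25     young
3   42    middle
4   67    senior
5   80    senior
"""
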


 

Min/Max Scaling

Scale the values of a numeric column so that they have a minimum value of 0 and a maximum value of 1.   Pandas example.  
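
A minimal sketch with plain Pandas:


import pandas as pd

df = pd.DataFrame({'float': [1.23, 2.34, 3.45, 4.56]})

# Min-max scale to the range [0, 1].
df['float_mm'] = (df['float'] - df['float'].min()) / (df['float'].max() - df['float'].min())
print(df['float_mm'].tolist())    # [0.0, 0.333..., 0.666..., 1.0]
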

 

Robust Scaling

When the data contains outliers, scale the values of a numeric column using the median and interquartile range to minimize the outlier influence.  
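
A minimal sketch using the median and interquartile range (IQR) so that a single outlier does not dominate the scale:


import pandas as pd

df = pd.DataFrame({'float': [1.23, 2.34, 3.45, 4.56, 100.0]})    # 100.0 is an outlier

# Robust scaling: center on the median and scale by the IQR (75th - 25th percentile).
iqr = df['float'].quantile(0.75) - df['float'].quantile(0.25)
df['float_robust'] = (df['float'] - df['float'].median()) / iqr
print(df)
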


 

Power Transformations

Used for correcting the skewness of a distribution, as skewed distributions can cause problems when fitting certain types of models.  
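
A minimal sketch using the Yeo-Johnson power transformation from SciPy (unlike Box-Cox it also accepts zero and negative values):


import pandas as pd
from scipy import stats

df = pd.DataFrame({'float': [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]})    # right-skewed values

# Fit a Yeo-Johnson power transformation; returns the transformed data and the fitted lambda.
transformed, fitted_lambda = stats.yeojohnson(df['float'])
df['float_pt'] = transformed
print(f'Lambda value used for the transformation: {fitted_lambda}')
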


 

Quantile Transformations

Transform the values of a numeric column so that they have a uniform or normal distribution.   Useful for improving the assumptions of certain ML models that assume the predictor variables are normally distributed.  
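
A minimal sketch that maps values to their quantile ranks and then through the inverse CDF of the standard normal distribution, producing an approximately normal column:


import pandas as pd
from scipy.stats import norm

df = pd.DataFrame({'float': [1.23, 2.34, 3.45, 4.56, 100.0]})

# Rank the values (1..n), convert the ranks to quantiles strictly inside (0, 1),
# then map the quantiles through the inverse CDF (ppf) of the standard normal distribution.
quantiles = df['float'].rank(method='average') / (len(df) + 1)
df['float_qt'] = norm.ppf(quantiles)
print(df)
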


 

Box-Cox Transformations

Transform the values of a numeric column so that they are approximately normally distributed.   Useful for improving the assumptions of certain ML models that assume the predictor variables are normally distributed.  
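
A minimal sketch using scipy.stats.boxcox (the input values must be strictly positive); the Scaling subsection under EDA below shows the same call in context:


import pandas as pd
from scipy import stats

df = pd.DataFrame({'float': [1.23, 2.34, 3.45, 4.56]})

# Box-Cox requires strictly positive values; returns the transformed data and the fitted lambda.
fitted_data, fitted_lambda = stats.boxcox(df['float'])
print(fitted_data)
print(f'Lambda value used for the transformation: {fitted_lambda}')
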


 

Related Links

data cleaning using SQL

 

 

Exploratory Data Analysis (EDA)

Distribution | Correlation | Statistics | Scaling | Categorical Values |

Summarize the main characteristics of a data set, identify patterns in the data, and detect outliers and unexpected features of the data.  

DataFrame For The Examples That Follow


# Create a simple Pandas DataFrame (imports shared by the EDA examples that follow)
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
from scipy import stats
df = pd.DataFrame([[1,'A',1.23,10],[2,'B',2.34,9],[3,np.nan,float('NaN'),None],[4,'D',4.45,7]], columns = ['idx','cat','float','int'])

 

Distribution    


# Visualize the distribution of the numeric column 'float' to make sure the values are normally distributed
sns.histplot(df['float'])       # or plt.hist(df['float'])
plt.show()

 


# Check visually if numeric values are normally distributed using a Q-Q plot.
# (quantiles of two variables are plotted against each other)
# Tutorial: https://towardsdatascience.com/significance-of-q-q-plots-6f0c6e31c626
stats.probplot(df['float'], plot=plt)
plt.show()

 


# Visualize the distribution of a categorical column
df['cat'].value_counts().plot(kind='bar')
plt.show()

 


# Create a displot to visualize the distribution of a numeric column (better than a KDE plot alone)
#sns.distplot(df['float'])   # deprecated; replaced by `displot` (a figure-level function with similar flexibility)
sns.displot(df['float'])
plt.show()

 


# Create a KDE  plot to visualize the kernel density estimate of a numeric column
sns.kdeplot(df['float'])
plt.show()

 


# Create a box plot to visualize the distribution of a numeric column
plt.boxplot(df['float'])
plt.ylabel('float')
plt.show()

 


# Create a box plot to visualize the distribution of a numeric column by the categories of a categorical column.
sns.boxplot(x='cat', y='int', data=df)
plt.ylabel('int')
plt.show()

 


# Create a boxenplot to visualize the distribution of a numeric column by the categories of a categorical column
sns.boxenplot(x='cat', y='float', data=df)
plt.ylabel('float')
plt.show()

 


# Create a violin plot to visualize the distribution of a numeric column by the categories of a categorical column
sns.violinplot(x='cat', y='int', data=df)
plt.ylabel('int')
plt.show()

 


# Create a swarm plot to visualize the distribution of a numeric column by the categories of a categorical column
sns.swarmplot(x='cat', y='float', data=df)
plt.ylabel('float')
plt.show()

 


# Create a faceting grid to visualize the distribution of multiple numeric columns by the categories of a categorical column
g = sns.FacetGrid(df, col='cat')
g.map(plt.hist, 'int')
plt.show()

 


# Create a rugplot to visualize the distribution of a numeric column and the individual observations
# (rugplot adds ticks on axis to show the location of individual observations)
sns.set_theme()
sns.kdeplot(data=df, x='float')
sns.rugplot(data=df, x='float')
plt.show()

 

Correlation    


# Create a heatmap to visualize the correlation between multiple numeric columns
sns.heatmap(df[['float','int']].corr(), cmap='RdYlGn', annot=True)
plt.show()

 


# Create a scatter plot to visualize the relationship between two numeric columns
plt.scatter(df['float'], df['int'])
plt.xlabel('float')
plt.ylabel('int')
plt.show()

 


# create a pairplot to visualize the relationships between multiple numeric columns
sns.pairplot(df, vars=['float', 'int'])
plt.show()

 


#Create a jointplot to visualize the relationship between two numeric columns and their distributions
sns.jointplot(x='float', y='int', data=df)
plt.show()

 


# Create a lag plot to check for autocorrelation in a numeric column.
# (autocorrelation is the degree of correlation of a variable's values over time)
from pandas.plotting import lag_plot
lag_plot(df['float'])
plt.title('Lag Plot of Autocorrelation of float')
plt.show()

 


# Create an autocorrelation plot to visualize the autocorrelation in a numeric column
from pandas.plotting import autocorrelation_plot
autocorrelation_plot(df['float'])
plt.title('Autocorrelation of float')
plt.show()

 

Statistics    


# Create a bar plot to visualize the count of a categorical column
df['cat'].value_counts().plot(kind='bar')
plt.ylabel('Count')
plt.show()

 


# Create a bar plot to visualize the mean of a numeric column for each category of a categorical column
df.groupby('cat')['float'].mean().plot(kind='bar')
plt.ylabel('float average by cat')
plt.title('Mean of float by cat')
plt.show()

 


# Create a pivot table to summarize the data in terms of the mean
pivot_table = df.pivot_table(index='cat', columns='float', values='int', aggfunc='mean')
print(pivot_table)
"""
    float  1.230  2.340  3.395  4.450
    cat
    A       10.0    NaN    8.0    NaN
    B        NaN    9.0    NaN    NaN
    D        NaN    NaN    NaN    7.0
"""

# Create a heatmap to visualize the contents of a pivot table
plt.pcolor(pivot_table, cmap='Reds')
plt.colorbar()
plt.title('Pivot Table')
plt.show()

 


# Create a point plot to visualize the mean of a numeric column by the categories of a categorical column
sns.pointplot(x='cat', y='float', data=df)
plt.ylabel('Average float')
plt.show()

 


# Create a pointplot to visualize the mean and confidence interval of a numeric column by the categories of a categorical column
sns.pointplot(x='cat', y='float', data=df, errorbar=('ci', 95))
plt.ylabel('Average float')
plt.title('PointPlot to Visualize Mean & Confidence Interval')
plt.show()

 


# Create a scatter plot matrix to visualize the relationships between multiple numeric columns
from pandas.plotting import scatter_matrix
scatter_matrix(df[['int', 'float']], alpha=0.2, figsize=(6, 6))
plt.show()

 


# Create a regression plot to visualize the relationship between two numeric columns
sns.regplot(x='int', y='float', data=df)
plt.show()

 


# Create a barplot to visualize the mean of a numeric column by the categories of a categorical column
sns.barplot(x='cat', y='int', data=df)
plt.ylabel('Average of int')
plt.show()

 


# Create a lmplot to visualize the relationship between two numeric columns and the categories of a categorical column
sns.lmplot(x='int', y='float', hue='cat', data=df)
plt.show()

 

Scaling    


# Scaling can improve the convergence speed of various ML algorithms, especially with data sets that have a large variation. 
# The Normalization method scales values between 0 and 1.
# The Standardization method rescales values to a mean of 0 and a standard deviation of 1.
# If the distribution is normal, then it should be standardized, otherwise normalized.
# https://medium.com/mlearning-ai/feature-scaling-normalization-or-standardization-74d73ec90366

# Min-Max Normalization:
# Formula: df['column'] = df['column'] - df['column'].min()
# Formula: df['column'] = df['column'] / df['column'].max()
df['float'] = df['float'] - df['float'].min()
df['float'] = df['float'] / df['float'].max()
print(df['float'].head())
"""
0    0.00000
1    0.34472
2    0.67236
3    1.00000
"""

# Standardization:
# Formula: df['column'] = ( df['column'] - df['column'].mean() ) / df['column'].std()
df['float'] = (df['float'] - df['float'].mean() ) / df['float'].std()
print(df['float'].head())
"""
0   -1.173730
1   -0.371365
2    0.391243
3    1.153852
"""

# Apply a box-cox transformation to a numeric column
# (Box-Cox requires strictly positive values, so apply it to the original column, not to the standardized values above)
from scipy import stats
# transform data & save lambda value
fitted_data, fitted_lambda = stats.boxcox(df['float'])
print(fitted_data)
print(f"Lambda value used for Transformation: {fitted_lambda}")
"""
[0.22291806 1.16527003 1.94135308 2.64853539]
Lambda value used for Transformation: 0.7064823317828955
"""

This code automatically scales the data using Standardization or Normalization after examining the data to determine if it is normally distributed.  


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


def scale_df(df):
    """
        If the Series df is Gaussian (normally distributed), then it is 
        standardized (rescaled to a mean of 0 and a standard deviation of 1).

        If the Series df is NOT Gaussian (normally distributed) but tests
        positive as a log-normal distribution, then it is also standardized.

        If the Series df is NOT Gaussian/normally distributed and is not
        a log-normal distribution, then it will be Normalized
        (also known as min-max scaling).
    """
    # Scaling can improve the convergence speed of various ML algorithms, especially with data sets that have a large variation. 
    # The Normalization method scales values between 0 and 1.
    # The Standardization method rescales values to a mean of 0 and a standard deviation of 1.
    # If the distribution is normal, then it should be standardized, otherwise normalized.
    # Normalization is a better option when we are not necessarily dealing with normally distributed 
    # data and you are not worried about standardization along the variance axes (i.e. image processing 
    # or a neural network expecting values between 0 and 1).
    # https://medium.com/mlearning-ai/feature-scaling-normalization-or-standardization-74d73ec90366


    # Check numerically if the numeric column is normally distributed using the Shapiro-Wilk test
    from scipy.stats import shapiro
    stat, p1 = shapiro(df)

    # D'Agostino's K-squared test (scipy.stats.normaltest) is an alternative check, but it
    # requires at least 8 samples, so the Shapiro-Wilk p-value is reused here.
    p2 = p1

    # Check numerically whether the column could follow a (log-)normal distribution
    # using the Kolmogorov-Smirnov test for normality
    from scipy.stats import kstest
    stat, pvalue = kstest(df, 'norm')
    
    if p1 > 0.05 and p2 > 0.05 and pvalue <= 0.05:
        print('Series "', df.name,'" is probably Gaussian (normally distributed) and will be scaled using Standardization.\n')
        df = (df - df.mean() ) / df.std()
    elif pvalue > 0.05:
        print('Series "', df.name, '" is a Log-Normal distribution and will be scaled using Standardization')
        df = (df - df.mean() ) / df.std()
    else:
        print('Series "', df.name,'" is NOT Gaussian (normally distributed) or Log-Normal distribution, so it will be scaled using Normalization (min-max scaling).\n')
        print('stat=%.3f, p1=%0.3f' % (stat,p1)) 
        print('stat=%.3f, p2=%0.3f' % (stat,p2)) 
        print('stat=%.3f, pvalue=%0.3f' % (stat,pvalue)) 
        # Min-Max Normalization:
        # Formula: df['column'] = df['column'] - df['column'].min()
        # Formula: df['column'] = df['column'] / df['column'].max()
        df = df - df.min()
        df = df / df.max()
    return df


df = pd.DataFrame([[1,'A',1.23,10],[2,'B',2.34,9],[3,'C',3.45,8],[4,'D',4.56,7]], columns = ['idx','cat','float','int'])

print('\n', df.head(), '\n')
"""
    idx cat  float  int
0    1   A   1.23   10
1    2   B   2.34    9
2    3   C   3.45    8
3    4   D   4.56    7
"""

df['float'] = scale_df(df['float'])

print('\n', df.head(), '\n')

Categorical Values    


# Encoding categorical variables:
#   https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd

# Encode categorical variables using one-hot encoding
"""    
print(df.head())
df = pd.get_dummies(df, columns=['cat'], prefix=['df'])
print(df.head())
	idx  float  int  df_A  df_B  df_D
	0    1  1.230   10     1     0     0
	1    2  2.340    9     0     1     0
	2    3  3.395    8     1     0     0
	3    4  4.450    7     0     0     1
"""    

# Encode categorical variables using ordinal encoding
"""
print(df.head())
df['cat_numerical'] = pd.factorize(df['cat'])[0] + 1
#df['cat'] = pd.factorize(df['cat'])[0] + 1
print(df.head())
	idx cat  float  int  cat_numerical
	0    1   A  1.230   10              1
	1    2   B  2.340    9              2
	2    3   A  3.395    8              1
	3    4   D  4.450    7              3
"""

# Encode categorical variables using label encoding 
# (assigns an integer code to each category; the disadvantage is that the numeric values can be misinterpreted by algorithms as implying some sort of hierarchy/order)
"""
# Convert 'cat' from data type string to category
df['cat'] = df['cat'].astype('category')
df['cat_enc'] = df['cat'].cat.codes
print(df.head())
"""
"""
	   idx cat  float  int  cat_enc
	0    1   A  1.230   10        0
	1    2   B  2.340    9        1
	2    3   A  3.395    8        0
	3    4   D  4.450    7        2
"""

Related Links

Intro to Descriptive Statistics

How to Find the Best Theoretical Distribution for Your Data

 

Model Building

| | |

Related Links

 

 

Model Deployment

| | |

Related Links

 

 


ELT (Extract, Load, Transform)

| | |

Extract, load, transform is an alternative to extract, transform, load (ETL) used with data lake implementations.   In contrast to ETL, in ELT models the data is not transformed on entry to the data lake, but stored in its original raw format (which enables faster loading).   After loading, the data is transformed into a usable and trusted resource.   Thereafter the data can be accessed by end-users to answer business-related questions.  

 

dbt (Data Build Tool)

dbt handles the T (transform) in ELT/ETL processes by converting SELECT statements into tables and views.  

| | |

 

Orchestration

Data orchestration is the process of taking siloed data from multiple data storage locations, combining and organizing it, and making it available to data analysis tools.  

 

Modules

| | |

Pandas

NumPy

NumPy is a library for numerical mathematics.   It is faster than native Python because many of its operations are implemented in C and use optimized algorithms.  
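
A quick illustration of the speed difference (a rough sketch; timings vary by machine):


import time
import numpy as np

values = np.arange(10_000_000, dtype=np.float64)

t0 = time.perf_counter()
total_py = sum(values.tolist())     # pure-Python loop over a list
t1 = time.perf_counter()
total_np = values.sum()             # vectorized NumPy reduction implemented in C
t2 = time.perf_counter()

print(f'Python sum: {t1 - t0:.3f} sec, NumPy sum: {t2 - t1:.3f} sec')
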

Matplotlib

Matplotlib is a plotting library for Python and its numerical mathematics extension, NumPy.  

Seaborn

Seaborn is a Python data visualization library based on matplotlib.   It provides a high-level interface for drawing attractive and informative statistical graphics.  

SciPy

SciPy is an open source library for scientific computing.  

Function Optimization With SciPy

 

Related Links

EDA: Exploratory Data Analysis Cheatsheet (highly recommended - start with this one first)

Scientific Functions in NumPy and SciPy

Churn Modeling: A detailed step-by-step Guide in Python

EDA: IKEA Product Analysis and Price Prediction Using Linear Regression

Dashboards for Data Analysis and Data Science

Literally Everything You Can Do With Time Series! (yahoofinancials Python wrapper around the Yahoo Finance API)