python packages

but wait, there’s more.

Quick Intro

  • A package or library is a collection of modules

  • A module is a collection of functions and variables that you can import into Python; think of a module as a single-purpose library

  • To import a library from Jupyter (or any other platform), the syntax is import <library name> as <alias>

    • The alias is optional, but here are a few common ones by convention:
      import pandas as pd
      import numpy as np
      import matplotlib.pyplot as plt
      import seaborn as sns
      from scipy import stats

    • The alias shortens the code when calling a function from the library; e.g. pd.DataFrame() vs. pandas.DataFrame()

  • To call a specific function from the library, prefix it with the library name (or its alias) followed by a period; e.g. plt.figure()

    • You can’t call a library-exclusive function before importing the module

    • To import specific functions rather than the entire library, use the syntax from <module> import <function(s), separated by commas>

    • To import all functions within a module, use an asterisk (from <module> import *); this is not recommended because the module may include names that overwrite variables or functions already defined in your notebook
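The import styles above can be sketched with the standard-library math module, so nothing has to be installed. This is only an illustration: the variable names are made up, and exec() with an explicit namespace is used purely so the effect of the star import is easy to inspect.

```python
# Alias import: the module is reachable through the shorter name.
import math as m

# Selective import: only the listed names are brought in, no prefix needed.
from math import pi, sqrt

area = pi * m.pow(2, 2)  # mixes both styles: pi times 2 squared
root = sqrt(16)          # 4.0

# Calling through a name that was never imported raises NameError.
try:
    statistics.mean([1, 2, 3])
    name_error_raised = False
except NameError:
    name_error_raised = True

# A star import can silently replace your own names. Here a variable called
# "e" exists first, then "from math import *" rebinds it to math.e.
namespace = {"e": "my own variable"}
exec("from math import *", namespace)
overwritten = namespace["e"]  # now the float math.e, not the original string

print(root, name_error_raised, overwritten)
```

The last part is the reason the star import is discouraged: the original value of e is gone without any warning.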

  • You can create your own module to import later or share with others; e.g. currency exchange converter

  • You can create the module inside Jupyter itself (open a new .txt file and save it as a .py file) or use an editor such as Sublime Text, a shareware cross-platform source code editor; save the file(s) in the same directory as the program that imports them

  • You can find the standard-library modules in the “Lib” folder of your Python installation

    • Any outside modules installed will be located under “site-packages” folder within “Lib”

  • See list of Python modules available
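As a rough sketch of the "create your own module" idea, the snippet below writes a tiny currency converter to a .py file and imports it. The module name currency_converter, the function usd_to_eur, and the exchange rate are all hypothetical placeholders. A temporary folder is added to sys.path here only to keep the example self-contained; saving the file next to your notebook makes that step unnecessary.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module source; the rate is a made-up placeholder, not real data.
MODULE_SOURCE = '''
"""Tiny currency converter module (illustrative only)."""

USD_TO_EUR = 0.5  # placeholder rate chosen for easy arithmetic


def usd_to_eur(amount):
    """Convert a US dollar amount to euros using the placeholder rate."""
    return amount * USD_TO_EUR
'''

# Save the module as a .py file in a scratch directory.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "currency_converter.py").write_text(MODULE_SOURCE)

# Make the directory importable, then import the module like any other.
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

import currency_converter as cc

print(cc.usd_to_eur(10))  # 10 * 0.5 -> 5.0
```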

Anaconda, Conda, & PIP

  • Anaconda (founded in 2012 and formerly known as Continuum Analytics) is a distribution of the Python and R programming languages. It comes with more than 300 open-source libraries and packages used for data science.

  • Conda is Anaconda’s default package manager and is run from the command line
    With Conda, you can:

    • create virtual environments

    • install and update packages within specific environments

    • share environments with others

  • A virtual environment is an isolated directory where we can install a specific collection of Python packages. We are free to install any mix of packages in any version for specific tasks. Because each environment is separate, changes in one cannot break code that depends on another.

    • When you install Anaconda, a “base” environment is created automatically

    • Command lines for environments:

      • View list of environments with conda env list (active one will have asterisk *)

      • Activate an environment with conda activate <environment name>

      • Create new environment with conda create -n myenv python=3.7 (you can choose any python version)

      • Return to base environment with conda deactivate

      • Delete an environment with conda remove -n <environment name> --all

      • Copy existing environment (to share) with conda create -n <environment name> --clone myenv

      • View all installed packages within the environment with conda list --explicit

      • Save that list to a file with conda list --explicit > filename.txt

      • Create a new environment from that file with conda create -n <environment name> --file filename.txt

      • Create a Jupyter kernel for the new environment with ipython kernel install --name <environment name> --user

      • Delete a Jupyter kernel with jupyter kernelspec uninstall <environment name>
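After activating an environment or switching Jupyter kernels, it can help to confirm from inside Python which environment is actually running. The sketch below uses only the standard library. The helper name describe_environment is made up for illustration, and CONDA_DEFAULT_ENV is the variable Conda sets on activation, so it reads as None when no Conda environment is active.

```python
import os
import sys


def describe_environment():
    """Collect a few facts about the interpreter that is running this code."""
    return {
        "interpreter": sys.executable,  # path to the python binary in use
        "prefix": sys.prefix,           # root folder of the active environment
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),  # None outside Conda
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the prefix does not point at the environment you expected, the notebook is probably using a different kernel than the one you created.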

  • Channels are like an app store or directory of packages to install from; they host and maintain packages

    • Anaconda maintains its own channel, which is the default when you type conda install <package>

    • Conda-forge is another channel, whose packages are maintained by the community rather than by Anaconda

    • You can have a channel collision if multiple channels host the same package; you can avoid this by setting the default channel as highest priority and/or ensuring packages are exclusive to each channel. Read the Conda documentation to learn more about how Conda deals with channel collisions.

    • Command lines for channel configuration:

      • View list of channels with conda config --show channels

      • View which channels are prioritized with conda config --get channels

      • Tell Conda to honor channel priority with conda config --set channel_priority true

      • Add a channel at the bottom of the list (lowest priority) with conda config --append channels <channel name>

      • Remove channel with conda config --remove channels <channel name>

      • Reorder and move to top of configuration list with conda config --add channels <channel name>

  • Packages must be installed before importing modules. This can be done in two ways: Conda or PIP

    • Conda is the default package manager for Anaconda. View packages

      • Make sure to activate the environment that you want the package installed in

      • Simply run conda install <package name>

      • Install from another channel with conda install -c <channel name> <package name>

      • Update a package with conda update <package name>

    • PIP (or Package Installer for Python) is the default package manager for Python. You can download PIP if not installed already.

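      ** There have been elevated security issues with PyPI (the Python Package Index, where PIP downloads packages from), so use with discretion **

      • PIP can install packages directly in Jupyter Notebook with %pip install <package name> (the % magic installs into the environment of the running kernel)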

      • Or run in the command line by entering location of Python’s script directory C:\Users\<name>\AppData\Local\Programs\Python\Python36-32\Scripts>pip install <package name>

      • Uninstall package with pip uninstall <package name>

      • See list of installed packages with pip list
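Since a package must be installed before it can be imported, a quick check from inside Python can save a failed import. The following is a minimal sketch using the standard-library importlib.metadata (Python 3.8+). The helper name installed_version is made up, and the second package name is deliberately fake.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "pandas" comes from the notes above; the second name is intentionally bogus.
for name in ["pandas", "surely-not-a-real-package-name"]:
    version = installed_version(name)
    print(name, "->", version if version else "not installed")
```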

Examples of popular modules & functions
Module (alias): purpose, followed by its functions
pandas (alias: pd): data analysis and manipulation
df = pd.read_csv(file name), df = pd.read_excel(file name) - read a .csv/.xls file into a pandas DataFrame
df.head(rows) - display the first rows of the DataFrame; default is 5
df.tail() - display the last 5 rows of the DataFrame
df.columns - display all the column header names (an attribute, so no parentheses)
df.drop(columns=[column names sep by commas]) - drop unusable columns
df.shape - how many rows and columns (an attribute, so no parentheses)
df.describe() - quick overview/descriptive statistics of the data
df.query(boolean expression) - keep only the rows where the expression is true; e.g. 'column1 > column2' returns rows where column1 is bigger
df.iloc[row index, col index] - i.e. slicing the dataframe; can be a range
df.loc[[row indexes sep by commas],[column names]] - pull specific rows/columns
df.loc[df[column name].str.contains('keyword'), column name].unique() - unique values in the rows that match a keyword; a way of cleaning data
df[column name].replace({'original': 'revised', 'original2': 'revised2'}).value_counts() - replace values, then count them
df.dtypes - get the data types for each column
df.memory_usage(deep=True) - memory usage of each column (in bytes)
df[column name].astype(type) - convert a column when data is not stored in the correct type
pd.to_datetime(df[column name]) - convert to datetime format
df[column name].value_counts() - count of unique values
df[column name].value_counts(normalize=True) - proportion of each unique value (a fraction between 0 and 1)
df.drop_duplicates(inplace=True) - remove duplicate rows
df.sort_values(by=column name, ascending=True).head() - sort by column name
df.groupby(by=column1 name)[column2 name].mean() - group rows by column1 and average column2 within each group
df.merge(df2, on=column name, how='left') - merge another dataframe with common column
df[column name] = df[column name].fillna(default) - fill in NaN with a default
df.rename(columns={'original': 'revised', 'original2': 'revised2'}) - rename columns
pd.pivot_table(data=df, index=[type], columns=[department], aggfunc='mean') - pivot table
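A few of the pandas calls above can be tried on a small hand-made DataFrame. This is only a sketch: the column names and values are invented for illustration, and it assumes pandas is installed.

```python
import pandas as pd

# Made-up data: one missing salary and one duplicated row.
df = pd.DataFrame(
    {
        "department": ["sales", "sales", "ops", "ops", "ops"],
        "salary": [100.0, 200.0, 150.0, None, 150.0],
    }
)

n_rows, n_cols = df.shape        # attribute, not a method -> 5 rows, 2 columns
column_names = list(df.columns)  # attribute, not a method

# Fill the missing value, then drop the now-identical duplicate row.
df["salary"] = df["salary"].fillna(0.0)
deduped = df.drop_duplicates()

# Average salary per department, rows above a threshold, and value shares.
mean_by_dept = deduped.groupby(by="department")["salary"].mean()
high_paid = deduped.query("salary > 100")
dept_share = deduped["department"].value_counts(normalize=True)

# rename returns a new DataFrame; the original keeps its column names.
renamed = deduped.rename(columns={"salary": "pay"})

print(mean_by_dept)
print(high_paid)
```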
numpy (alias: np; Numerical Python): scientific computing
np.min(), np.max() - find the min and max value of a NumPy array
np.mean() - find the mean value of NumPy array
np.std() - find the standard deviation of NumPy array
np.median() - find the median of a NumPy array
np.percentile() - find the percentile of a NumPy array
np.linspace() - get evenly spaced numbers over a specified interval
np.shape() - get the shape of an array
np.reshape() - reshape an array
np.copyto() - copies the values of one array to another
np.transpose() - reverse the axes of an array
np.stack() - join a sequence of arrays along a new axis
np.vstack() - stack a sequence of arrays vertically (row-wise)
np.hstack() - stack a sequence of arrays horizontally (column-wise)
np.sort() - get a sorted array
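The difference between the stacking functions is easiest to see from the resulting shapes. The sketch below assumes NumPy is installed and uses two made-up one-dimensional arrays.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

stacked = np.stack([a, b])             # new leading axis -> shape (2, 3)
stacked_cols = np.stack([a, b], axis=1)  # new second axis -> shape (3, 2)
vertical = np.vstack([a, b])           # rows on top of each other -> shape (2, 3)
horizontal = np.hstack([a, b])         # end to end -> shape (6,)

grid = np.linspace(0, 1, 5)            # 5 evenly spaced values from 0 to 1
median_a = np.percentile(a, 50)        # the 50th percentile is the median

print(stacked.shape, stacked_cols.shape, vertical.shape, horizontal.shape)
print(grid, median_a)
```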
matplotlib (alias: plt, for matplotlib.pyplot): data exploration and visualization
plt.bar() - bar plot
plt.pie() - pie chart
plt.hist() - histogram
plt.scatter() - scatterplot
plt.specgram() - spectrogram
plt.stem() - stem plot
plt.step() - step plot
plt.bar_label() - label a bar plot
plt.figlegend() - add a legend on the figure
plt.xticks(rotation=n) - x-ticks; add rotation if they overlap
plt.xlabel('name') - label for x-axis
plt.ylabel('name') - label for y-axis
plt.title('name') - name for the chart
plt.savefig(file path) - save the plot to a file
plt.show() - display the graph
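Several of the plt calls above usually appear together. The sketch below builds a labelled bar plot from made-up data and saves it to a temporary file. It assumes matplotlib 3.4 or newer (for plt.bar_label) and selects the non-interactive "Agg" backend only so that it runs without a display.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# Made-up data for illustration.
categories = ["a", "b", "c"]
counts = [3, 7, 5]

plt.figure()
bars = plt.bar(categories, counts)  # bar plot
plt.bar_label(bars)                 # write each value above its bar
plt.xticks(rotation=45)             # rotate tick labels in case they overlap
plt.xlabel("category")
plt.ylabel("count")
plt.title("Example bar plot")

# Save the figure to a file, then release the figure's memory.
output_path = Path(tempfile.mkdtemp()) / "bar_plot.png"
plt.savefig(output_path)
plt.close()

print(output_path)
```

In a notebook, you would typically call plt.show() at the end instead of (or after) plt.savefig().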
seaborn (alias: sns; high-level interface for plt): plotting and styling
sns.lmplot(x=name, y=name, data=df) - scatter plot with a fitted regression line
sns.violinplot(x=name, y=name, data=df) - violin plot (alt to box plot)
sns.swarmplot(x=name, y=name, data=df) - swarm plot
sns.heatmap() - heatmap
sns.countplot() - count plot (bar plot)
sns.catplot() - categorical plot (named sns.factorplot() before seaborn 0.9)
sns.kdeplot() - density plot
sns.jointplot() - joint distribution plot
sns.set_style('style') - style themes; default is 'darkgrid'
scikit-learn (import name: sklearn): machine learning
from sklearn import datasets - inbuilt datasets such as the iris, house prices, diabetes, etc.
from sklearn.model_selection import train_test_split - split the dataset for training and testing
from sklearn.linear_model import LinearRegression - creates an object of linear regression
from sklearn.linear_model import LogisticRegression - supervised classification algorithm; output is categorical
from sklearn.tree import DecisionTreeClassifier - decision tree; model to make decisions and predict output
from sklearn.tree import DecisionTreeRegressor - decision tree used for regression
from sklearn.ensemble import RandomForestClassifier - random forest for classification (use RandomForestRegressor for regression)
from xgboost import XGBClassifier - extreme gradient boosting; gradient boosted decision trees
from sklearn import svm - Support Vector Machine
from sklearn.metrics import confusion_matrix - describe the performance of classification models
from sklearn.metrics import classification_report - analyze the predictions of the classification algorithm
from sklearn.cluster import KMeans - unsupervised ML algorithm used for clustering
from sklearn.cluster import DBSCAN - unsupervised clustering algorithm
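The scikit-learn imports above typically combine into a short train-and-evaluate loop. The sketch below uses the built-in iris dataset and assumes scikit-learn is installed. The split ratio, random_state, and max_iter values are arbitrary choices for illustration, not recommendations.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Load a built-in dataset: 150 flowers, 4 features, 3 classes.
iris = datasets.load_iris()

# Hold out a quarter of the rows for testing, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0, stratify=iris.target
)

# Fit a classifier on the training rows and predict the held-out rows.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows of the matrix are true classes, columns are predicted classes.
matrix = confusion_matrix(y_test, y_pred)
accuracy = model.score(X_test, y_test)

print(matrix)
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```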
plotly express (alias: px): visualization for a variety of data types
px.scatter() - scatter plot
px.scatter_3d() - 3D scatter plot
px.histogram() - histogram
px.line() - line plot
px.box() - box plot
import plotly.io as pio - change overall theme of graphs
px.density_heatmap() - heatmap
px.data - built-in datasets
px.sunburst() - sunburst chart
px.funnel() - funnel chart
tensorflow (alias: tf): implementing neural networks
tf.random.normal() - generates random values of the given shape, which follow normal distribution
tf.reshape() - reshape the tensor

gradio: build and deploy web apps for ML models
streamlit: build and deploy web apps for ML models
NLTK: Natural Language Toolkit
keras: deep learning models, neural networks
SciPy: scientific and mathematical functions; built on NumPy
Statsmodels: statistical models and tests
LIBLINEAR: linear classification, regression, outlier detection