python packages
but wait, there’s more.
Quick Intro
A package or library is a collection of modules
A module is a collection of functions and variables that you can import into Python; think of a module as a single-purpose library
To import a library from Jupyter (or any other platform), the syntax is import <library name> as <alias>
The alias is optional, but here are a few common ones by convention:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
The alias shortens what you type when calling a function from the library; e.g. pd.DataFrame() vs. pandas.DataFrame()
To call a specific function from the library, prefix it with the library name (or alias) followed by a period; e.g. plt.figure()
You can't call a library-exclusive function before importing the module
To import specific functions and not the entire library, use the syntax from <module> import <function(s), separated by commas>
To import all functions within a module, use an asterisk (from <module> import *); this is not recommended, as the module may include names that overwrite variables or functions already defined in your Python notebook
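The import forms above can be tried with a module that ships with Python. This is a minimal sketch, assuming the built-in math module as a stand-in for a third-party library such as pandas (the syntax is identical). The namespace dictionary and the variable e are made up here to show the overwrite risk of a wildcard import without touching your own notebook variables.

```python
# 1. Import a whole module under an alias, then call with the alias prefix.
import math as m
print(m.sqrt(16))

# 2. Import only specific names; no prefix is needed afterwards.
from math import pi, floor
print(floor(pi))

# 3. Why "from <module> import *" is risky: it can replace existing names.
# The wildcard import runs inside a separate namespace (a plain dict)
# so the effect is visible without clobbering anything else.
namespace = {"e": "my own variable"}
exec("from math import *", namespace)
print(namespace["e"])  # no longer the string: math.e has replaced it
```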
You can create your own module to import later or share with others; e.g. currency exchange converter
You can create the module inside Jupyter itself (open a new .txt file and save it as a .py file) or use a program like Sublime Text, a shareware cross-platform source code editor; save the file(s) in the same directory as the program that will import them
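As an illustration of the currency converter idea, here is a minimal sketch of what such a module file could contain. The file name currency_converter.py, the function name convert, and the exchange rates are all hypothetical placeholders, not real market data.

```python
# currency_converter.py  (hypothetical file name)
"""Convert amounts between currencies using fixed example rates."""

# Placeholder rates per 1 USD, invented for illustration only.
RATES_PER_USD = {"USD": 1.0, "EUR": 0.9, "JPY": 150.0}


def convert(amount, from_currency, to_currency):
    """Convert amount from one currency to another, going through USD."""
    try:
        amount_in_usd = amount / RATES_PER_USD[from_currency]
        return amount_in_usd * RATES_PER_USD[to_currency]
    except KeyError as err:
        raise ValueError(f"Unknown currency code: {err.args[0]}") from err


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    print(convert(100, "USD", "EUR"))
```

Once the file is saved next to your notebook, import currency_converter as cc followed by cc.convert(100, "USD", "JPY") should work like any other module call.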
You can see the built-in modules in the “Lib” folder of your Python installation
Any outside modules you install are placed in the “site-packages” folder within “Lib”
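Rather than browsing for these folders by hand, you can ask Python where they are. A minimal sketch using the standard-library sysconfig module; the exact paths printed depend on your operating system and on which environment is active.

```python
import json
import sysconfig

paths = sysconfig.get_paths()

# Folder holding the built-in (standard library) modules.
stdlib_dir = paths["stdlib"]

# Folder where third-party packages are installed (site-packages).
site_packages_dir = paths["purelib"]

print("Standard library:", stdlib_dir)
print("Site-packages:   ", site_packages_dir)

# Any imported module can also report the file it was loaded from.
print("json was loaded from:", json.__file__)
```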
See list of Python modules available
Anaconda, Conda, & PIP
Anaconda (founded in 2012 and formerly known as Continuum Analytics) is a distribution of the Python and R programming languages. It comes with more than 300 open-source libraries and packages used for data science.
Conda is Anaconda's default package manager, and it is run from the command line
With Conda, you can:
create virtual environments
install and update packages within specific environments
share environments with others
A virtual environment is an isolated directory where we can install a specific collection of Python packages. We are free to install any mix of packages, in any version, for a specific task. Because each environment is separate, changes made in one do not risk breaking code that depends on another.
When you install Anaconda, a “base” environment is created automatically
Command lines for environments:
View list of environments with conda env list (active one will have asterisk *)
Activate an environment with conda activate <environment name>
Create new environment with conda create -n myenv python=3.7 (you can choose any python version)
Return to base environment with conda deactivate
Delete an environment with conda remove -n <environment name> --all
Copy existing environment (to share) with conda create -n <environment name> --clone myenv
View all installed packages within the active environment with conda list --explicit
Pipe into a file with conda list --explicit > filename.txt
Create new environment with file conda create -n <environment name> --file filename.txt
Create a Jupyter kernel for the new environment with ipython kernel install --name <environment name> --user
Delete a Jupyter kernel with jupyter kernelspec uninstall <environment name>
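After activating an environment or switching Jupyter kernels, it helps to confirm which environment your code is actually running in. A minimal sketch, run from Python or a notebook cell: the helper name describe_environment is made up for this example, and CONDA_DEFAULT_ENV is the environment variable that conda activate sets (it will be None when Conda has not activated anything).

```python
import os
import sys


def describe_environment():
    """Collect a few values that identify the running Python environment."""
    return {
        # Full path of the Python interpreter running this code.
        "python_executable": sys.executable,
        # Root folder of the environment that interpreter belongs to.
        "prefix": sys.prefix,
        # Name of the active conda environment, if conda activated one.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


env_info = describe_environment()
for key, value in env_info.items():
    print(f"{key}: {value}")
```

If the prefix shown is not the environment you expected, the notebook is probably using a different kernel than the one you created.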
Channels are like an app store or directory of packages to install from; they host and maintain packages
Anaconda maintains its own channel (defaults), which is used when you type conda install <package>
Conda-forge is another channel, whose packages are maintained by the community rather than by Anaconda
You can have a channel collision if multiple channels host the same package; you can avoid this by setting the default channel as highest priority and/or ensuring packages are exclusive to each channel. Read the Conda documentation to learn more about how Conda deals with channel collisions.
Command lines for channel configuration:
View list of channels with conda config --show channels
View which channels are prioritized with conda config --get channels
Tell Conda to honor channel priority with conda config --set channel_priority true
Add channel to the bottom of the list (lowest priority) with conda config --append channels <channel name>
Remove channel with conda config --remove channels <channel name>
Reorder and move to top of configuration list with conda config --add channels <channel name>
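To make the idea of channel priority concrete, here is a toy model in Python. It is only a simplified sketch of the "highest-priority channel wins" rule, not how Conda's real solver works, and the function name pick_channel and the package listings are invented for illustration.

```python
def pick_channel(package, channels, hosted):
    """Return the first channel, in priority order, that hosts the package."""
    for channel in channels:
        if package in hosted.get(channel, set()):
            return channel
    return None  # no configured channel hosts this package


# Channels listed from highest to lowest priority.
channels = ["defaults", "conda-forge"]

# Made-up listings; both channels host numpy, which is a collision.
hosted = {
    "defaults": {"numpy", "pandas"},
    "conda-forge": {"numpy", "plotly"},
}

print(pick_channel("numpy", channels, hosted))   # collision: top channel wins
print(pick_channel("plotly", channels, hosted))  # only hosted lower down
```

Reordering the channels list changes which channel wins the collision, which is what moving a channel to the top of the configuration does in this simplified picture.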
Packages must be installed before importing modules. This can be done in two ways: Conda or PIP
Conda is the default package manager for Anaconda. View packages
Make sure to activate the environment that you want the package installed in
Simply run conda install <package name>
Install from another channel with conda install -c <channel name> <package name>
Update a package with conda update <package name>
PIP (or Package Installer for Python) is the default package manager for Python. You can download PIP if not installed already.
** There have been elevated security issues with PyPI (the package index PIP installs from), so use with discretion **
PIP can install packages directly from a Jupyter Notebook cell with %pip install <package name>
Or run it from the command line inside Python's Scripts directory, e.g. C:\Users\<name>\AppData\Local\Programs\Python\Python36-32\Scripts>pip install <package name>
Uninstall package with pip uninstall <package name>
See list of installed packages with pip list
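You can also check from inside Python whether a package is installed before trying to import it. A minimal sketch using the standard-library importlib.metadata module, assuming Python 3.8 or newer; the helper name installed_version and the last package name in the list are made up for this example.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


for name in ["pandas", "numpy", "a-package-that-is-not-installed"]:
    version = installed_version(name)
    print(name, "->", version if version else "not installed")
```

This reports on the environment the code is running in, so it answers the same question as pip list for a single package.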
Module | Purpose | Functions |
---|---|---|
pandas (alias: pd) | data analysis and manipulation | df = pd.read_csv(file name), df = pd.read_excel(file name) - read a .csv/.xls file into a pandas DataFrame<br>df.head(rows) - display the first rows of the DataFrame; default is 5<br>df.tail() - display the last 5 rows of the DataFrame<br>df.info() - column names, non-null counts and data types<br>df.columns - display all the column names (attribute, no parentheses)<br>df.drop(columns=[column names sep by commas]) - drop unusable columns<br>df.shape - number of rows and columns (attribute, no parentheses)<br>df.describe() - quick overview/descriptive statistics of the data<br>df.query(boolean expression) - e.g. 'column1 > column2' returns the rows where column1 is bigger<br>df.iloc[row index, col index] - slice the DataFrame by position; can be a range<br>df.loc[[row labels sep by commas], [column names]] - pull specific rows/columns by label<br>df.loc[df[column name].str.contains('keyword'), column name].unique() - list the variants of a value; a way of cleaning data<br>df[column name].replace({'original': 'revised', 'original2': 'revised2'}).value_counts() - replace values, then count them<br>df.dtypes - get the data type of each column<br>df.memory_usage(deep=True) - memory usage of each column (in bytes)<br>df[column name].astype(type) - convert a column when data is not stored in the correct type<br>pd.to_datetime(df[column name]) - convert to datetime format<br>df[column name].value_counts() - count of each unique value<br>df[column name].value_counts(normalize=True) - the above as proportions<br>df.drop_duplicates(inplace=True) - remove duplicate rows<br>df.sort_values(by=column name, ascending=True).head() - sort by column name<br>df.groupby(by=column1 name)[column2 name].mean() - average of column2 for each group in column1<br>df.merge(df2, on=column name, how='left') - merge another DataFrame on a common column<br>df[column name] = df[column name].fillna(default) - fill in NaN with a default<br>df.rename(columns={'original': 'revised', 'original2': 'revised2'}) - rename columns<br>pd.pivot_table(data=df, index=[type], columns=[department], aggfunc='mean') - pivot table |
numpy (alias: np) (Numerical Python) | scientific computing | np.min(), np.max() - find the min and max value of a NumPy array<br>np.mean() - find the mean value of a NumPy array<br>np.std() - find the standard deviation of a NumPy array<br>np.median() - find the median of a NumPy array<br>np.percentile() - find a percentile of a NumPy array<br>np.linspace() - get evenly spaced numbers over a specified interval<br>np.shape() - get the shape of an array<br>np.reshape() - reshape an array<br>np.copyto() - copy the values of one array into another<br>np.transpose() - reverse or permute the axes of an array<br>np.stack() - join a sequence of arrays along a new axis<br>np.vstack() - stack a sequence of arrays vertically (row-wise)<br>np.hstack() - stack a sequence of arrays horizontally (column-wise)<br>np.sort() - get a sorted copy of an array |
matplotlib (alias: plt) (view more details) | data exploration and visualization | plt.bar() - bar plot<br>plt.pie() - pie chart<br>plt.hist() - histogram<br>plt.scatter() - scatterplot<br>plt.specgram() - spectrogram<br>plt.stem() - stem plot<br>plt.step() - step plot<br>plt.bar_label() - label a bar plot<br>plt.figlegend() - add a legend to the figure<br>plt.xticks(rotation=n) - x-ticks; add rotation if they overlap<br>plt.xlabel('name') - label for x-axis<br>plt.ylabel('name') - label for y-axis<br>plt.title('name') - name for the chart<br>plt.savefig(file path) - save the plot to a file<br>plt.show() - display the graph |
seaborn (alias: sns) (high level interface for plt) | plotting and styling | sns.lmplot(x=name, y=name, data=df) - scatter plot with a fitted regression line<br>sns.violinplot(x=name, y=name, data=df) - violin plot (alt to box plot)<br>sns.swarmplot(x=name, y=name, data=df) - swarm plot<br>sns.heatmap() - heatmap<br>sns.countplot() - count plot (bar plot)<br>sns.catplot() - categorical plot (named factorplot in older versions)<br>sns.kdeplot() - density plot<br>sns.jointplot() - joint distribution plot<br>sns.set_style('style') - style themes; default is 'darkgrid' |
scikit-learn (alias: sklearn) | machine learning | from sklearn import datasets - built-in datasets such as the iris, house prices, diabetes, etc.<br>from sklearn.model_selection import train_test_split - split the dataset for training and testing<br>from sklearn.linear_model import LinearRegression - linear regression model<br>from sklearn.linear_model import LogisticRegression - supervised classification algorithm; output is categorical<br>from sklearn.tree import DecisionTreeClassifier - decision tree used for classification<br>from sklearn.tree import DecisionTreeRegressor - decision tree used for regression<br>from sklearn.ensemble import RandomForestClassifier - random forest for classification (RandomForestRegressor for regression)<br>from xgboost import XGBClassifier - extreme gradient boosting; gradient boosted decision trees (from the separate xgboost package)<br>from sklearn import svm - Support Vector Machines<br>from sklearn.metrics import confusion_matrix - describe the performance of classification models<br>from sklearn.metrics import classification_report - analyze the predictions of a classification algorithm<br>from sklearn.cluster import KMeans - unsupervised ML algorithm used for clustering<br>from sklearn.cluster import DBSCAN - unsupervised clustering algorithm |
plotly express (alias: px) | visualization for variety of data types | px.scatter() - scatter plot<br>px.scatter_3d() - 3D scatter plot<br>px.histogram() - histogram<br>px.line() - line plot<br>px.box() - box plot<br>import plotly.io as pio - change overall theme of graphs<br>px.density_heatmap() - heatmap<br>px.data - built-in datasets; view them here<br>px.sunburst() - sunburst chart<br>px.funnel() - funnel chart |
tensorflow (alias: tf) (view more here) | implementing neural networks | tf.random.normal() - generates random values of the given shape, which follow a normal distribution<br>tf.reshape() - reshape the tensor |
gradio | build and deploy web apps for ML models | |
streamlit | build and deploy web apps for ML models | |
NLTK | natural language toolkit | |
keras | deep learning models, neural networks | |
SciPy | scientific and mathematical functions built on NumPy | |
Statsmodels | statistical models and tests | |
LIBLINEAR | linear classification, regression, outlier detection | |
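
A few of the pandas entries from the table can be tried on a tiny DataFrame. This is a minimal sketch, assuming pandas is installed; the column names and sales figures are hypothetical and invented only for illustration.

```python
import pandas as pd

# Hypothetical sales data, made up for this example.
df = pd.DataFrame({
    "department": ["toys", "toys", "books", "books", "books"],
    "type": ["online", "store", "online", "store", "store"],
    "sales": [100, 150, 80, 120, 120],
})

print(df.shape)          # attribute: (rows, columns)
print(list(df.columns))  # attribute: the column names

# The last two rows are identical, so one of them is dropped.
df = df.drop_duplicates()

# Average sales for each department.
avg_sales = df.groupby(by="department")["sales"].mean()
print(avg_sales)

# Rows where sales are above 100.
high = df.query("sales > 100")
print(high)

# Pivot table: one row per type, one column per department.
pivot = pd.pivot_table(
    data=df, values="sales", index="type",
    columns="department", aggfunc="mean",
)
print(pivot)
```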