Seaborn Sample Project

In [1]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
In [2]:
# Load built-in iris dataset
iris = sns.load_dataset("iris")
iris.head()
Out[2]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Describe()

describe() is a very useful pandas method: it generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values. By default it covers only the numeric columns, which is why species does not appear in the output.

In [3]:
iris.describe()
Out[3]:
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
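
The NaN handling and the numeric-only default of describe() can be checked on a small frame. This is a minimal sketch: the `toy` DataFrame and its column names are made up for illustration and are not part of the notebook's datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption for illustration only).
toy = pd.DataFrame({
    "length": [1.0, 2.0, np.nan, 4.0],
    "label": ["a", "b", "b", "c"],
})

# Default call: numeric columns only, NaN values are skipped.
numeric_summary = toy.describe()

# include="all" also summarizes non-numeric columns (count, unique, top, freq).
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```

The count row reports non-missing values only, so a count lower than the number of rows is a quick hint that a column contains NaN.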

Swarm Plot

In [4]:
sns.set()
%matplotlib inline

sns.swarmplot(x="species", y="petal_length", data=iris)
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a581fa18d0>

Load fatal police shootings data

In [5]:
df = pd.read_csv("fatal-police-shootings-data.csv", encoding="windows-1252")
df.head(10)
Out[5]:
id name date manner_of_death armed age gender race city state signs_of_mental_illness threat_level flee body_camera
0 3 Tim Elliot 2015-01-02 shot gun 53.0 M A Shelton WA True attack Not fleeing False
1 4 Lewis Lee Lembke 2015-01-02 shot gun 47.0 M W Aloha OR False attack Not fleeing False
2 5 John Paul Quintero 2015-01-03 shot and Tasered unarmed 23.0 M H Wichita KS False other Not fleeing False
3 8 Matthew Hoffman 2015-01-04 shot toy weapon 32.0 M W San Francisco CA True attack Not fleeing False
4 9 Michael Rodriguez 2015-01-04 shot nail gun 39.0 M H Evans CO False attack Not fleeing False
5 11 Kenneth Joe Brown 2015-01-04 shot gun 18.0 M W Guthrie OK False attack Not fleeing False
6 13 Kenneth Arnold Buck 2015-01-05 shot gun 22.0 M H Chandler AZ False attack Car False
7 15 Brock Nichols 2015-01-06 shot gun 35.0 M W Assaria KS False attack Not fleeing False
8 16 Autumn Steele 2015-01-06 shot unarmed 34.0 F W Burlington IA False other Not fleeing True
9 17 Leslie Sapp III 2015-01-06 shot toy weapon 47.0 M B Knoxville PA False attack Not fleeing False
In [6]:
df.describe()
Out[6]:
id age
count 4022.000000 3870.000000
mean 2252.446295 36.884496
std 1259.731835 13.126454
min 3.000000 6.000000
25% 1162.250000 27.000000
50% 2241.500000 35.000000
75% 3346.750000 45.000000
max 4432.000000 91.000000

Strip Plot

This is known as a strip plot: a scatter plot in which one of the variables is categorical, which makes it well suited to comparing values across categories.
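
Before the real data, here is a minimal sketch of the same kind of plot on made-up values. The `toy` frame and its `group`/`value` columns are assumptions for illustration; `jitter` is a standard `stripplot` parameter that spreads points sideways within each category so they overlap less.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical toy data: one categorical column and one numeric column.
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "value": [1.0, 1.2, 0.9, 2.1, 2.0, 2.4, 3.3, 3.1, 3.0],
})

# The categorical variable goes on x, the numeric one on y.
ax = sns.stripplot(x="group", y="value", data=toy, jitter=0.2)
```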

In [7]:
sns.stripplot(x="armed", y="age", data=df)
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5820bba20>
In [8]:
tips = sns.load_dataset("tips")
tips.head(10)
Out[8]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
5 25.29 4.71 Male No Sun Dinner 4
6 8.77 2.00 Male No Sun Dinner 2
7 26.88 3.12 Male No Sun Dinner 4
8 15.04 1.96 Male No Sun Dinner 2
9 14.78 3.23 Male No Sun Dinner 2
In [9]:
tips.describe()
Out[9]:
total_bill tip size
count 244.000000 244.000000 244.000000
mean 19.785943 2.998279 2.569672
std 8.902412 1.383638 0.951100
min 3.070000 1.000000 1.000000
25% 13.347500 2.000000 2.000000
50% 17.795000 2.900000 2.000000
75% 24.127500 3.562500 3.000000
max 50.810000 10.000000 6.000000

Bar Plot

In [10]:
sns.barplot(x="day", y="total_bill", data=tips)
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a583698c50>

Styling with Seaborn

Seaborn splits Matplotlib parameters into two independent groups: the first sets the aesthetic style of the plot, and the second scales various elements of the figure so that it can be easily incorporated into different contexts. Seaborn does not replace Matplotlib; it adds some nice default aesthetics and built-in plots that complement, and sometimes replace, the complicated Matplotlib code that would otherwise have to be written by hand. Facet plots and regression plots are examples of that.

In [11]:
sns.set_style("whitegrid")
sns.boxplot(x="day", y="total_bill", data=tips)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5837585f8>
In [12]:
sns.set_style("ticks")
sns.boxplot(x="day", y="total_bill", data=tips)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a582156940>
In [13]:
sns.set_style("white")
sns.boxplot(x="day", y="total_bill", data=tips)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a583863c50>
In [14]:
sns.set_style("dark")
sns.boxplot(x="day", y="total_bill", data=tips)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5838e8eb8>
In [15]:
sns.set_style("ticks")
sns.boxplot(x="day", y="total_bill", data=tips)
sns.despine()
In [16]:
sns.set_style("ticks")
sns.boxplot(x="day", y="total_bill", data=tips)
sns.despine(left=True)
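
The two parameter groups described in this section can be inspected directly. A small sketch, assuming only that seaborn is installed: `axes_style()` returns the style parameters and `plotting_context()` returns the scaling parameters, each as a dictionary of Matplotlib rc settings.

```python
import seaborn as sns

# Style group: colors, grid on/off, tick direction, and similar aesthetics.
style_params = sns.axes_style("whitegrid")

# Context group: font sizes, line widths, and other element sizes.
context_params = sns.plotting_context("paper")

print(sorted(style_params))
print(sorted(context_params))
```

Because the groups are separate, set_style() can be changed without affecting element sizes, and set_context() without affecting colors or grids.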

Visualize two types of background in a single plot

In [17]:
# This function will help us plot some offset sine waves
def sinplot(flip=1):
    x = np.linspace(0, 14, 100)
    for i in range(1, 7):
        plt.plot(x, np.sin(x + i * 0.5) * (7 - i) * flip)
        
with sns.axes_style("darkgrid"):
    plt.subplot(211)
    sinplot()
plt.subplot(212)
sinplot(-1)
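
The cell above works because axes_style() used in a with block applies the style only temporarily. A minimal sketch that checks this through Matplotlib's rcParams (the variable names are illustrative):

```python
import matplotlib as mpl
mpl.use("Agg")  # headless backend so the sketch also runs outside a notebook
import seaborn as sns

sns.set_style("white")                    # global style without a grid
grid_before = mpl.rcParams["axes.grid"]

with sns.axes_style("darkgrid"):          # temporary style with a grid
    grid_inside = mpl.rcParams["axes.grid"]

grid_after = mpl.rcParams["axes.grid"]    # previous style is restored on exit

print(grid_before, grid_inside, grid_after)
```

Only axes created inside the with block pick up the temporary style, which is why the first subplot is created there and the second one outside.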

Scaling of plot elements

In [18]:
sns.set()
sns.set_context("paper")
sns.set_style("whitegrid")
sns.boxplot(x="day", y="total_bill", data=tips)
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a583b05828>

You may be thinking that this plot is not scaled at all, since it looks similar to our previous outputs. Let me clarify that right away. First, set_context() scales plot elements such as labels and lines rather than the overall figure size, and the "paper" context is only slightly smaller than the default "notebook" context. Second, Jupyter Notebook scales down large images in the cell output. For exploratory analysis, we prefer to iterate quickly over a number of different analyses, so it is more useful to have facets of similar size than overall figures of the same size in a particular context. Needing a figure of one exact overall size is the less common case, and it works best when we:

  • Know precisely what we want and
  • Can afford to take off some time and work through the calculations
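
When an exact overall size is needed, one option is to set it explicitly on the Matplotlib figure. This is a minimal sketch using plain Matplotlib; the 10 by 4 inch size is an arbitrary choice for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np

# figsize is given in inches as (width, height).
fig, ax = plt.subplots(figsize=(10, 4))

x = np.linspace(0, 14, 100)
ax.plot(x, np.sin(x))

# The figure keeps this size in memory regardless of how a notebook displays it.
width, height = fig.get_size_inches()
print(width, height)
```

Seaborn's axes-level functions such as boxplot accept an `ax` argument, so they can draw onto an axes created this way.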

That said, if we plot the same figure from an editor such as Spyder or JetBrains’ PyCharm or IntelliJ, we can view it at its original size. The takeaway from this scaling segment is that one extra line of code can adjust the size of the plot elements to our requirements, and we can experiment from there. In practice, we can also pass a dictionary of parameters through rc for finer control over the aesthetics. Let me show you an example with the same sinplot function we defined earlier:

In [19]:
sns.set(style="whitegrid", rc={"grid.linewidth": 1.5})
sns.set_context("poster", font_scale=2.5, rc={"lines.linewidth": 5.0})
sinplot()

Though our notebook did not display an enlarged (scaled) plot, the figure was created in memory according to our instructions. The lines in the plot are now thick because I set linewidth to 5, and the axis labels are larger because of font_scale. We generally don't use anything more than that during data analysis, although exceptional scenarios may demand a few more parameters, which we will cover in the next article of this series.
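
One way to confirm what the context call changed, without relying on how the notebook renders the image, is to compare the parameter dictionaries returned by plotting_context(). A small sketch using the same arguments as the cell above:

```python
import seaborn as sns

# Default scaling used by seaborn.
default_ctx = sns.plotting_context("notebook")

# Same arguments as the set_context() call in the cell above.
poster_ctx = sns.plotting_context(
    "poster", font_scale=2.5, rc={"lines.linewidth": 5.0}
)

# Compare a few of the scaled parameters side by side.
for key in ("axes.labelsize", "lines.linewidth"):
    print(key, default_ctx[key], poster_ctx[key])
```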