Handling overplotting on a scatter plot: stat="sum"

The "sum" stat counts the number of observations at each (x, y) location, so overlapping points can be drawn as a single point whose size reflects that count.

Computed variables:

  • ..n.. - number of observations at location
  • ..prop.. - value in range 0..1 : share of observations at location
  • ..proppct.. - value in range 0..100 : % of observations at location
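
To make the three computed variables concrete, here is a minimal sketch in plain Python that derives them for a handful of points. It only illustrates the definitions above and is not the Lets-Plot implementation; the toy `points` list and the helper `sum_stat` are hypothetical, and the sketch assumes that shares are taken over all observations when no grouping applies.

```python
from collections import Counter

# Hypothetical toy data: (class, drv) pairs, not taken from mpg.csv.
points = [("compact", "f"), ("compact", "f"), ("compact", "4"), ("suv", "4")]

def sum_stat(points):
    """Return {location: (n, prop, proppct)} for a list of (x, y) pairs."""
    counts = Counter(points)        # n: observations per location
    total = sum(counts.values())    # all observations
    return {loc: (n, n / total, 100 * n / total) for loc, n in counts.items()}

stats = sum_stat(points)
for loc, (n, prop, proppct) in stats.items():
    print(loc, n, prop, proppct)
```

Each distinct location appears once in the result, which is why the stat reduces overplotting: many coincident points collapse into one record carrying their count.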
In [1]:
from lets_plot import *
from lets_plot.mapping import *
import pandas as pd
In [2]:
LetsPlot.setup_html() 
In [3]:
mpg_df = pd.read_csv("https://raw.githubusercontent.com/JetBrains/lets-plot-docs/master/data/mpg.csv")
mpg_df.head()
Out[3]:
   Unnamed: 0  manufacturer  model  displ  year  cyl       trans  drv  cty  hwy  fl    class
0           1          audi     a4    1.8  1999    4    auto(l5)    f   18   29   p  compact
1           2          audi     a4    1.8  1999    4  manual(m5)    f   21   29   p  compact
2           3          audi     a4    2.0  2008    4  manual(m6)    f   20   31   p  compact
3           4          audi     a4    2.0  2008    4    auto(av)    f   21   30   p  compact
4           5          audi     a4    2.8  1999    6    auto(l5)    f   16   26   p  compact
In [4]:
p = ggplot(mpg_df, aes(x=as_discrete('class', order=1), y=as_discrete('drv', order=1)))

1. Plot the Observation Count by Location

In [5]:
p + geom_point(stat='sum')
Out[5]:

2. Plot the Share of Observations by Location

In [6]:
p + geom_point(aes(size='..prop..'), stat='sum')
Out[6]:

3. Plot the Share of Observations by Drivetrain Type within Each Vehicle "class"

Note: grouping by "class" makes ..prop.. the share within each vehicle class rather than across the whole dataset.

In [7]:
p + geom_point(aes(size='..prop..', group='class'), stat='sum')
Out[7]:
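
The effect of grouping on the share can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Lets-Plot implementation: the toy `points` list and the helper `grouped_prop` are hypothetical, and the sketch assumes the share is normalized by the number of observations in each group (here, each "class" value).

```python
from collections import Counter

# Hypothetical toy data: (class, drv) pairs, not taken from mpg.csv.
points = [("compact", "f"), ("compact", "f"), ("compact", "4"), ("suv", "4")]

def grouped_prop(points):
    """Return {(x, y): share of y within group x} for a list of (x, y) pairs."""
    counts = Counter(points)                 # observations per location
    totals = Counter(x for x, _ in points)   # observations per group (x value)
    return {(x, y): n / totals[x] for (x, y), n in counts.items()}

shares = grouped_prop(points)
for loc, share in shares.items():
    print(loc, share)
```

Under this assumption the shares within one class add up to 1, so point sizes are comparable within a class column but not across columns.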