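A worked example of using storytelling to turn data into a chart that communicates.

Data overview and initial visualization

The following libraries are used: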

# Data Handling
import pandas as pd

# Data visualization Libraries
import seaborn as sns
import plotly.express as px
import plotly.io as pio

The dataset is mpg (car fuel-economy data) from seaborn:

# Load in data:
mpg = sns.load_dataset('mpg')

# Use groupby to compute the average MPG for each model year:
mpg = mpg.groupby(['model_year'])['mpg'].mean().to_frame().reset_index()

# Rename columns:
mpg = mpg.rename(columns={'model_year': 'Year',
                          'mpg': 'Average MPG'})
mpg.head()
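
To make the groupby step easier to follow, here is a minimal sketch of the same aggregation on a tiny hand-made table. The rows and the names cars and yearly are made up for illustration and are not values from the seaborn dataset.

```python
import pandas as pd

# Made-up rows for illustration only (not values from the seaborn dataset)
cars = pd.DataFrame({
    'model_year': [70, 70, 71],
    'mpg': [10.0, 20.0, 30.0],
})

# One row per model year, holding the mean of that year's mpg values
yearly = (
    cars.groupby('model_year')['mpg']
        .mean()
        .to_frame()
        .reset_index()
        .rename(columns={'model_year': 'Year', 'mpg': 'Average MPG'})
)

print(yearly)
```

The two rows for model year 70 collapse into a single row whose value is their mean, which is the shape px.bar needs for one bar per year.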

Draw the initial bar chart:

px.bar(mpg, x='Year', y='Average MPG')

[Figure: bar chart]

On its own, though, a chart like this is not ready to go into a report.

Colors

Update the chart's color scheme:

# Generate base plot:
plot = px.bar(mpg, x='Year', y='Average MPG', color='Average MPG',
              color_continuous_scale=px.colors.diverging.RdYlGn)

# Remove colorbar:
plot.update_coloraxes(showscale=False)

# Update plotly style:
plot.update_layout(template='plotly_white')

plot.show()

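[Figure: updated colors]

Here the color of each bar is bound to its average MPG: low-efficiency years are red and high-efficiency years are green. More Plotly color scales: https://plotly.com/python/builtin-colorscales/

Because the bar height and the color encode the same variable, the color legend is redundant, so it is removed here. The Plotly theme is also changed to 'plotly_white'. More themes: https://plotly.com/python/templates/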

Labels

The next step is to label the axes:

# Label axes; setting dtick=1 on the x axis puts a tick label under every bar
plot.update_xaxes(title='Model Year', dtick=1)
plot.update_yaxes(title='Average Miles Per Gallon (MPG)')

By default Plotly uses the column names as axis labels; update_xaxes() and update_yaxes() let you refine them.

Update the chart title:

# Update plot layout:
plot.update_layout(
    title=dict(
        text=('<b>Average Miles Per Gallon of Cars Over Time</b>'
              '<br><i><sup>A Visualization of Improvements in '
              'Fuel Efficiency During the Energy Crisis Era</sup></i>'),
        x=0.085,
        y=0.95,
        font=dict(
            family='Helvetica',
            size=25,
            color='#272b4f'
        )))

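update_layout accepts HTML-style tags in the title text to control its styling: <b> for bold, <br> for a line break, <i> for italics, and <sup> (superscript), which is used here to set the subtitle in smaller type.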

Then use add_annotation() to credit the data source on the chart.

# Add annotation on data source:
plot.add_annotation(x=0,
                    y=-0.15,
                    showarrow=False,
                    text=("<i>Fuel mileage data courtesy of "
                          "Python Seaborn Library</i>"),
                    textangle=0,
                    xanchor='left',
                    xref="paper",
                    yref="paper",
                    font_color='#a6aeba')

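The result:
[Figure: chart with labels added]

Adding annotations

To give the chart more context, the code below draws a horizontal line for the average MPG from 1970 through 1982 (computed as the mean of the yearly averages) and adds an annotation box describing the line: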

# Add average MPG across era:
plot.add_hline(y=mpg['Average MPG'].mean())

# Add explanation of line:
plot.add_annotation(x=.05,
                    y=0.67,
                    text="Average MPG, 1970 through 1982",
                    textangle=0,
                    xanchor='left',
                    xref="paper",
                    yref="paper",
                    font_color='black',
                    bordercolor='black',
                    borderpad=5,
                    showarrow=True,
                    arrowhead=2,
                    bgcolor='white',
                    arrowside='end'
                    )

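add_annotation creates a bordered text box with an arrow; x and y set its position. Because xref and yref are "paper", those values are fractions of the plotting area rather than data values.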

To emphasize the clear improvement between 1975 and 1979, add a highlight box:

# Add highlight box (numeric bounds, since the x axis is numeric):
plot.add_vrect(x0=74.5,
               x1=79.5,
               fillcolor="lightgray",
               opacity=0.3,
               line_width=0)

# Add explanation of highlight box:
plot.add_annotation(x=.45,
                    y=0.9,
                    text=("Period of Consistent Improvement"
                          "<br>until Breakthrough in 1980's"),
                    textangle=0,
                    xanchor='left',
                    xref="paper",
                    yref="paper",
                    font_color='black',
                    showarrow=False,
                    )

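add_vrect adds a rectangle that highlights a specific region. Because this code runs after add_hline, the rectangle is drawn on top of the reference line.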

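Suppose the chart is part of a study which found that smaller engines directly contributed to the improvement in MPG. Fortunately, the Seaborn MPG dataset includes engine displacement. With a little data preparation and one more annotation, the chart gets its final piece.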

# Data prep:
displacement = sns.load_dataset('mpg')
seventies = round(
    displacement[displacement['model_year'] < 80]['displacement'].mean(), 2)
eighties = round(
    displacement[displacement['model_year'] >= 80]['displacement'].mean(), 2)

# Create text string:
explanation = ("<b>Why the Improvement in MPG?</b> <br>"
               "In the 70's, average engine size was {} <br>"
               "cubic inches versus {} from 1980 to 1982.<br>"
               "Larger engines are usually less efficient."
               ).format(seventies, eighties)

# Add explanation for trends:
plot.add_annotation(x=.615,
                    y=0.02,
                    text=explanation,
                    textangle=0,
                    xanchor='left',
                    xref="paper",
                    yref="paper",
                    font_color='black',
                    bordercolor='black',
                    borderpad=5,
                    bgcolor='white',
                    showarrow=False
                    )

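This computes the average engine displacement for the 1970s (model years 70 through 79) and for the 1980s (80, 81, and 82), inserts both numbers into a text string, and displays that string through add_annotation().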
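
The split-and-average step can be checked by hand on a tiny table. The rows below are made up for illustration (they are not the real displacement figures), and the names engines and is_seventies are assumptions.

```python
import pandas as pd

# Made-up rows for illustration only (not values from the seaborn dataset)
engines = pd.DataFrame({
    'model_year': [70, 75, 80, 82],
    'displacement': [300.0, 200.0, 100.0, 150.0],
})

# Boolean mask: True for model years before 80
is_seventies = engines['model_year'] < 80

# Mean displacement on each side of the split, rounded to 2 decimals
seventies = round(engines.loc[is_seventies, 'displacement'].mean(), 2)
eighties = round(engines.loc[~is_seventies, 'displacement'].mean(), 2)

summary = "70s: {} cubic inches; 1980 to 1982: {}".format(seventies, eighties)
print(summary)
```

Here the 1970s rows average (300 + 200) / 2 = 250 and the 1980s rows average (100 + 150) / 2 = 125, matching what the mask-based means return.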

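[Figure: chart with explanation added]

Summary

  • Less is more: if information can be conveyed by the context around the chart, do not add it to the chart itself.
  • A standalone infographic needs more information, but charts in oral presentations and written reports do not need as much.
  • Color scheme, font choice, and font size all affect readability.
  • If the audience cannot understand the chart, that is not the audience's fault.

Complete code: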

# Load in Libraries:

# Data Handling
import pandas as pd

# Data visualization Libraries
import seaborn as sns
import plotly.express as px

# Load in data:
mpg = sns.load_dataset('mpg')
mpg.head()

# Get dataset showing average MPG per year by using groupby:
mpg = mpg.groupby(['model_year'])['mpg'].mean().to_frame().reset_index()

# Rename columns:
mpg = mpg.rename(columns={'model_year': 'Year',
                          'mpg': 'Average MPG'})

# Generate base plot:
plot = px.bar(mpg, x='Year', y='Average MPG', color='Average MPG',
              color_continuous_scale=px.colors.diverging.RdYlGn)

# Remove colorbar:
plot.update_coloraxes(showscale=False)

# Update plotly style:
plot.update_layout(template='plotly_white')

# Label axes:
plot.update_xaxes(title='Model Year',
                  dtick=1)
plot.update_yaxes(title='Average Miles Per Gallon (MPG)')

# Add labels and source:

# Update plot layout:
plot.update_layout(
    title=dict(
        text=('<b>Average Miles Per Gallon of Cars Over Time</b>'
              '<br><i><sup>A Visualization of Improvements in '
              'Fuel Efficiency During the Energy Crisis Era</sup></i>'),
        x=0.085,
        y=0.95,
        font=dict(
            family='Helvetica',
            size=25,
            color='#272b4f'
        )))

# Add annotation on data source:
plot.add_annotation(x=0,
                    y=-0.15,
                    showarrow=False,
                    text=("<i>Fuel mileage data courtesy of "
                          "Python Seaborn Library</i>"),
                    textangle=0,
                    xanchor='left',
                    xref="paper",
                    yref="paper",
                    font_color='#a6aeba')

# Add highlight box (numeric bounds, since the x axis is numeric):
plot.add_vrect(x0=74.5,
               x1=79.5,
               fillcolor="lightgray",
               opacity=0.3,
               line_width=0)

# Add explanation of highlight box:
plot.add_annotation(x=.45,
                    y=0.9,
                    text=("Period of Consistent Improvement"
                          "<br>until Breakthrough in 1980's"),
                    textangle=0,
                    xanchor='left',
                    xref="paper",
                    yref="paper",
                    font_color='black',
                    showarrow=False,
                    )

# Add average MPG across era

# Create Line:
plot.add_hline(y=mpg['Average MPG'].mean())

# Add explanation of line:
plot.add_annotation(x=.05,
                    y=0.67,
                    text="Average MPG, 1970 through 1982",
                    textangle=0,
                    xanchor='left',
                    xref="paper",
                    yref="paper",
                    font_color='black',
                    bordercolor='black',
                    borderpad=5,
                    showarrow=True,
                    arrowhead=2,
                    bgcolor='white',
                    arrowside='end'
                    )

# Add a box to explain the trends

# Data prep:
displacement = sns.load_dataset('mpg')
seventies = round(
    displacement[displacement['model_year'] < 80]['displacement'].mean(), 2)
eighties = round(
    displacement[displacement['model_year'] >= 80]['displacement'].mean(), 2)

# Create text string:
explanation = ("<b>Why the Improvement in MPG?</b> <br>"
               "In the 70's, average engine size was {} <br>"
               "cubic inches versus {} from 1980 to 1982.<br>"
               "Larger engines are usually less efficient."
               ).format(seventies, eighties)

# Add explanation for trends:
plot.add_annotation(x=.615,
                    y=0.02,
                    text=explanation,
                    textangle=0,
                    xanchor='left',
                    xref="paper",
                    yref="paper",
                    font_color='black',
                    bordercolor='black',
                    borderpad=5,
                    bgcolor='white',
                    showarrow=False
                    )

plot.show()

Original article: https://towardsdatascience.com/charts-that-tell-a-story-turn-a-plotly-visualization-into-something-more-a723e427d5aa