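As the saying goes, a picture is worth a thousand words. Drawing graphs is one of the most basic and most important data analysis skills, so much so that the whole activity has acquired a more professional name: data visualization. As a powerful and flexible data analysis tool, Stata can produce a wide variety of graphs. This series, [Stata Graphics], aims to build a broad and deep command of Stata's graphing features. To that end, a sequence of articles will follow, each trying to lay out both the structure and the details of graph making.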
This article is Part 1 of the #Drawing a Graph with Stata# series. It gives an overview of Stata's graphing features and graph types.
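Features. Stata's graphing features are delivered mainly through its graph syntax and its Graph Editor. The graph syntax begins with the graph command and comprises graph commands, which draw the various kinds of graphs, and graph management commands, which delete, reload, or combine graphs once they are drawn. The table below briefly lists both kinds.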
Graph command | Description | Graph management command | Description |
---|---|---|---|
graph twoway | scatterplots, line plots, etc. | graph save | save graph to disk |
graph matrix | scatterplot matrices | graph use | redisplay graph stored on disk |
graph bar | bar charts | graph display | redisplay graph stored in memory |
graph dot | dot charts | graph combine | combine multiple graphs |
graph box | box-and-whisker plots | graph replay | redisplay graphs stored in memory and on disk |
graph pie | pie charts | graph export | export .gph file to PostScript, etc. |
Other graphics commands | More commands to draw statistical graphs: distributional diagnostic plots; smoothing and densities; regression diagnostics; time series; vector autoregressive (VAR, SVAR, VECM) models; longitudinal data/panel data | ...... | Commands that print a graph, that deal with the graphs currently stored in memory, that describe available schemes and let you identify and set the default scheme, that list available styles, that set options for printing and exporting graphs, and that let you draw graphs without displaying them |
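Note: all of the commands above are official commands that ship with Stata. Many practical graphing commands are user-written, and they will be covered in detail in later articles of this series.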
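The graph management commands in the table can be chained into a small workflow. The following is a minimal sketch, assuming Stata's bundled auto dataset; the graph names (g_scatter, g_hist, g_both) and the file names are hypothetical and are written to the current working directory.

```stata
* Minimal sketch: graph names and file names below are illustrative assumptions
sysuse auto, clear

* Draw two graphs and keep them in memory under explicit names
twoway scatter mpg weight, name(g_scatter, replace)
histogram mpg, name(g_hist, replace)

* Redisplay a graph stored in memory
graph display g_scatter

* Combine the two graphs side by side
graph combine g_scatter g_hist, rows(1) name(g_both, replace)

* Save the combined graph to disk as a .gph file, then export it as an image
graph save g_both "g_both.gph", replace
graph export "g_both.png", name(g_both) replace

* Redisplay the graph stored on disk
graph use "g_both.gph"
```

Naming graphs with name() keeps several of them in memory at once, so graph combine can refer to them without writing intermediate files.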
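Graph types. As the graph commands in the table above show, Stata graphics come down to drawing a handful of common graph types. The graph commands can be further divided, by what they draw, into descriptive graphs and inferential graphs. The former show the distribution and association patterns of the data themselves; the latter display the results of a statistical analysis in graphical form. The key difference is whether the data being plotted come from a statistical model. This article covers the former. Descriptive graphs visualize the cleaned data or simple summaries of them. They are an important step in empirical work and reflect the author's technique, taste, and thinking. Graphs based on inferential statistics will be introduced in detail alongside the specific research methods. The figure below shows the contents of the Graphics menu in the Stata interface (Figure 1).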
The figure below shows the structure of Stata's graph commands and the graph types (Figure 2).
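Drawing graphs with commands. To draw a graph in Stata you can point and click through the Graphics menu shown above, which is convenient. But as your skills grow and the demand for customized graphs rises, drawing with commands is both more efficient and better practice. Because the graph commands are very large, it is worth collecting graph code from many sources as you learn and work. It also pays to use the Graph Editor to polish local details of a graph, since nobody can remember every option of every graph command.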
Stata graph code has four main parts: (1) graph commands; (2) options; (3) styles; (4) graph management commands. The first three are the basic elements for drawing a graph from existing data. Take the common graph twoway as an example. twoway is a family of plots, all of which fit on numeric y and x scales. Its syntax is:
[graph] twoway plot [if] [in] [, twoway_options]
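Here, plot stands for a specific plot type. The figure below lists all members of the twoway family (Figure 3); Figure 2 shows only some commonly used ones. Square brackets [ ] mark parts of the code that may be omitted. Although they can be omitted, these parts are the core of mastering the graph commands. Choosing a suitable plottype only ensures that the graph is the right one; it does not ensure that the graph is a good one. The focus of learning Stata graphics therefore falls on how the graph is presented.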
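To make the syntax diagram concrete, here is a minimal sketch that fills in each slot. It assumes Stata's bundled auto dataset; the graph names tw_full and tw_min are hypothetical.

```stata
* Minimal sketch: dataset choice and graph names are illustrative assumptions
sysuse auto, clear

* Full form: [graph] twoway plot [if] [in] [, twoway_options]
*   plot    = scatter mpg weight
*   if      = keep domestic cars only
*   in      = restrict to the first 60 observations
*   options = everything after the comma
graph twoway scatter mpg weight if foreign == 0 in 1/60, ///
    title("Mileage versus weight") ///
    name(tw_full, replace)

* Minimal form: the "graph" prefix, if, in, and all cosmetic options are omitted
twoway scatter mpg weight, name(tw_min, replace)
```

Both commands draw a valid scatterplot. Only the first one controls which observations appear and how the graph is titled, which is where the bracketed parts earn their keep.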
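Figure 4 shows what twoway_options contains. With these options we can refine and improve the appearance of graphs drawn with twoway, for example by adding reference lines at specified values on the x or y axis (added_line_options). The options follow a regular pattern of use: they appear after the comma that follows the graph command, any number of them can be combined, and their order does not matter.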
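A short sketch of both points, added reference lines and order-independent options, assuming Stata's bundled auto dataset. The reference values 20 and 3000 and the graph names are arbitrary choices for illustration.

```stata
* Minimal sketch: reference values and graph names are illustrative assumptions
sysuse auto, clear

* yline() and xline() are added_line_options: they draw reference lines
twoway scatter mpg weight, ///
    yline(20) xline(3000) ///
    title("Mileage versus weight") ///
    name(opt_a, replace)

* The same options in a different order describe the same graph
twoway scatter mpg weight, ///
    title("Mileage versus weight") ///
    xline(3000) yline(20) ///
    name(opt_b, replace)
```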
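The discussion above used graph twoway as an example to explain Stata's graphing logic. As Figure 2 shows, many other kinds of graphs can be drawn, and their syntax follows the same structure. Once the structure of the graph syntax is familiar, we can move quickly and elegantly from data to visualization. Next, we use a group of easily confused examples to demonstrate Stata graphics and to clarify Stata's features and graph types.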
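graph bar draws vertical or horizontal bar charts.
In a vertical bar chart the y axis is a numeric variable and the x axis is a categorical variable; in a horizontal bar chart the roles are reversed.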
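*** Bar chart syntax examples ***
graph bar (mean) numeric_var, over(cat_var) // (mean) is the statistic computed for numeric_var; mean is also the default when the statistic is omitted, and percent is the default only when no variable is listed
graph hbar (mean) numeric_var, over(cat_var) // hbar draws horizontal bar charts
*** Options after the comma refine the look of the graph ***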
**group_options
over(varname [, over subopts]): specifies a categorical variable over which the yvars are to be repeated;
nofill: specifies that missing subcategories be omitted;
allcategories: specifies that all categories in the entire dataset be retained for the over() variables;
**yvar_options
graph bar y1 y2 y3, ascategory whatever_other_options //ascategory is a useful option
graph bar y, over(group) asyvars whatever_other_options
graph bar (mean) inc_male inc_female, over(region) percentage stack
graph bar (mean) wage, over(sex) over(region) asyvars percentage stack
...
**lookofbar_options
bargap(#):specifies the gap to be left between yvar bars as a percentage-of-bar-width units, and the
default is bargap(0)
intensity(#): specify the intensity of the color used to fill the inside of the bar
...
**legending_options
If more than one yvar is specified, a legend is produced.
**axis_options
axis_scale_options: specify how the numerical y axis is scaled and how it looks;
axis_label_options: specify how the numerical y axis is to be labeled;
ytitle(): overrides the default title for the numerical y axis;
**title_and_other_options
text(): adds text to a specified location on the graph;
yline(): adds horizontal (bar) or vertical (hbar) lines at specified y values;
aspect_option: allows you to control the relationship between the height and width of a graph’s plot region
std_options: Options for use with graph construction commands, which allow you to add titles, control the graph size, save the graph on disk, and much more;
by(varlist, . . . ) : draws separate plots within one graph;
**Suboptions for use with over( ) and yvaroptions( )
relabel(# "text" . . . ) : specifies text to override the default category labeling;
gap(#) and gap(*#): specify the gap between the bars in this over() group;gap(#) is specified in
percentage-of-bar-width units, so gap(67) means two-thirds the width of a bar. gap(*#) allows
modifying the default gap. gap(*1.2) would increase the gap by 20%, and gap(*.8) would
decrease the gap by 20%.
sort(varname), sort(#), and sort((stat) varname): control how bars are ordered;
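The options listed above can be tried on a small dataset. The following is a minimal sketch, assuming Stata's bundled auto dataset; the axis title, the relabel() text, and the graph names are illustrative choices.

```stata
* Minimal sketch: titles, labels, and graph names are illustrative assumptions
sysuse auto, clear

* Vertical bars: mean mileage for each repair-record category
graph bar (mean) mpg, over(rep78) ///
    ytitle("Mean mileage (mpg)") ///
    name(bar_v, replace)

* Horizontal bars with over() suboptions:
*   sort(1) descending  orders bars by the height of the first yvar, largest first
*   gap(*.5)            halves the default gap between bars
*   relabel()           overrides the default category labels
*   nofill              omits subcategories with no observations
graph hbar (mean) mpg, ///
    over(rep78, sort(1) descending gap(*.5)) ///
    over(foreign, relabel(1 "Domestic cars" 2 "Foreign cars")) ///
    nofill ///
    ytitle("Mean mileage (mpg)") ///
    name(bar_h, replace)
```

Note that ytitle() labels the numeric axis in both cases, even though that axis is drawn horizontally by graph hbar.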
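This article gives bar chart examples that combine bar orientation with the over() option. Below we use a dataset of U.S. city temperatures with 956 observations (City temperature data) to show the approach to drawing bar charts and the use of the various options.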
use https://www.stata-press.com/data/r17/citytemp,clear
describe
/*
Observations: 956 City temperature data
Variables: 6 3 Mar 2020 19:17
----------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
----------------------------------------------------------------------------
division int %16.0g division Census division
region int %13.0g region Census region
heatdd int %8.0g Heating degree days
cooldd int %8.0g Cooling degree days
tempjan float %9.0g Average January temperature
tempjuly float %9.0g Average July temperature
----------------------------------------------------------------------------
Sorted by: region */
graph bar (mean) tempjuly tempjan, over(region) ///
bargap(-30) ///*bargap(#) specifies the gap to be left between yvar bars as a percentage-of-bar-width units. The default is bargap(0), meaning that bars touch.
legend( label(1 "July") label(2 "January") ) ///
ytitle("Temperature (Fahrenheit)") ///
title("Average July and January temperatures") ///
subtitle("by regions of the United States") ///
note("Source: U.S. Census Bureau, U.S. Dept. of Commerce") ///
graphregion(fcolor(white)) plotregion(fcolor(white))
use https://www.stata-press.com/data/r17/citytemp, clear
graph hbar (mean) tempjan, over(division) over(region) nofill ///*nofill specifies that missing subcategories be omitted
ytitle("Degrees Fahrenheit") ///
title("Average January temperature") ///
subtitle("by region and division of the United States") ///
note("Source: U.S. Census Bureau, U.S. Dept. of Commerce") ///
graphregion(fcolor(white)) plotregion(fcolor(white))
use https://www.stata-press.com/data/r17/nlsw88, clear // load a new dataset
describe
/*
Observations: 2,246 NLSW, 1988 extract
Variables: 17 1 May 2020 22:52
(_dta has notes)
------------------------------------------------------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------------------------------------------------------
idcode int %8.0g NLS ID
age byte %8.0g Age in current year
race byte %8.0g racelbl Race
married byte %8.0g marlbl Married
never_married byte %16.0g nev_mar Never married
grade byte %8.0g Current grade completed
collgrad byte %16.0g gradlbl College graduate
south byte %9.0g southlbl Lives in the south
smsa byte %9.0g smsalbl Lives in SMSA
c_city byte %16.0g ccitylbl Lives in a central city
industry byte %23.0g indlbl Industry
occupation byte %22.0g occlbl Occupation
union byte %8.0g unionlbl Union worker
wage float %9.0g Hourly wage
hours byte %8.0g Usual hours worked
ttl_exp float %9.0g Total work experience (years)
tenure float %9.0g Job tenure (years)
------------------------------------------------------------------------------------------------------------------------------------------
Sorted by: idcode */
notes
/*_dta:
1. 1988 data, extracted from National Longitudinal of Young Woman who were ages 14-24 in 1968 (NLSW).
2. This dataset is the result of extraction and processing by various people at various times.
3. For more information on the NLS, see http://www.bls.gov/nls/. */
graph bar (mean) wage, over(smsa) over(married) over(collgrad) /// * note the order of the three over() options: the first is the innermost grouping, the last is the outermost
title("Average Hourly Wage, 1988, Women Aged 34-46") ///
subtitle("by College Graduation, Marital Status, and SMSA residence") ///
note("Source: 1988 data from NLS, U.S. Dept. of Labor, Bureau of Labor Statistics") ///
graphregion(fcolor(white)) plotregion(fcolor(white))
use https://www.stata-press.com/data/r17/educ99gdp, clear
generate total = private + public
graph hbar (asis) public private, ///
over(country, sort(total) descending) stack ///
title( "Spending on tertiary education as % of GDP, 1999", span pos(11) ) ///
subtitle(" ") ///
legend(region(lcolor(white))) ///
note("Source: OECD", span) ///
graphregion(fcolor(white)) plotregion(fcolor(white))
use https://www.stata-press.com/data/r17/educ99gdp, clear
generate frac = private/(private + public)
graph hbar (asis) public private, ///
over(country, sort(frac) descending) stack percent ///
title("Public and private spending on tertiary education, 1999", span pos(11) ) ///
subtitle(" ") ///
legend(region(lcolor(white))) ///
note("Source: OECD", span) ///
graphregion(fcolor(white)) plotregion(fcolor(white))
In twoway bar, both x and y are numeric.
use https://www.stata-press.com/data/r17/sp500, clear
describe
/* Observations: 248 S&P 500
Variables: 7 22 Apr 2020 10:52
(_dta has notes)
-----------------------------------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-----------------------------------------------------------------------------------------------------------------------
date int %td Date
open float %9.0g Opening price
high float %9.0g High price
low float %9.0g Low price
close float %9.0g Closing price
volume double %12.0gc Volume (thousands)
change float %9.0g Closing price change
-----------------------------------------------------------------------------------------------------------------------
Sorted by: date */
list date close change in 1/5
/* +---------------------------------+
| date close change |
|---------------------------------|
1. | 02jan2001 1283.27 . |
2. | 03jan2001 1347.56 64.29004 |
3. | 04jan2001 1333.34 -14.22009 |
4. | 05jan2001 1298.35 -34.98999 |
5. | 08jan2001 1295.86 -2.48999 |
+---------------------------------+ */
graph twoway bar change date in 1/50, ///
graphregion(fcolor(white)) plotregion(fcolor(white))
The advantage of twoway bar is that it can be combined with other plot types of the twoway family.
use https://www.stata-press.com/data/r17/sp500, clear
graph twoway line close date, yaxis(1) || bar change date, yaxis(2) || in 1/50, ///
yscale(axis(1) r(1000 1400)) ylab(1200(100)1400, axis(1)) ///
ytick(1200(100)1400, axis(1) grid) ///
yscale(axis(2) r(-50 300)) ylab(-50(50)50, axis(2)) ///
ytick(-50(50)50, axis(2) grid) ///
legend(off) ///
xtitle("Date") ///
title("S&P 500") ///
subtitle("January - March 2001") ///
note("Source: Yahoo!Finance and Commodity Systems, Inc.", span) ///
yline(1150, axis(1) lstyle(foreground)) ///
graphregion(fcolor(white)) plotregion(fcolor(white))
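histogram draws a histogram of a variable (varname). Unless the discrete option is specified, varname is assumed to be continuous. According to Beniger and Robyn (1978), although A. M. Guerry published a histogram in 1833, the word "histogram" was first used by Karl Pearson in 1895.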
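use https://www.stata-press.com/data/r17/sp500, clear
* $figures is a global macro holding the output folder; it must be defined before these lines are run
histogram volume
graph save "$figures\histo_01", replace
histogram volume, fraction
graph save "$figures\histo_02", replace
graph combine "$figures\histo_01" "$figures\histo_02", row(1)
graph save "$figures\histo_0102", replace // graph save, not save: save would write the dataset instead of the graph
graph export "$figures\histo_0102.png", replace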
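Each bar of a histogram is called a bin. By default the number of bins (k) is determined by a rule:
$$k = \min\{\sqrt{N},\ 10\ln(N)/\ln(10)\}$$
where N is the number of nonmissing observations of the variable. How can we make better use of a continuous variable's summary statistics? Building on the baseline graphs above, the commands below add standard deviation information to the graph. This is the recommended way to draw a histogram, and it is suitable for papers and research reports.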
use "https://www.stata-press.com/data/r17/sp500", clear
sum volume // summarize reports the variable's summary statistics
/*
Variable | Obs Mean Std. dev. Min Max
-------------+-----------------------------------------------------
volume | 248 12320.68 2585.929 4103 23308.3 */
return list // view the computed statistics, which are stored as scalars in r()
/*scalars:
r(N) = 248
r(sum_w) = 248
r(mean) = 12320.67661290323
r(Var) = 6687027.906981193
r(sd) = 2585.928828676689
r(min) = 4103
r(max) = 23308.3
r(sum) = 3055527.8 */
* Using the mean r(mean) and the standard deviation r(sd), compute the values that lie a given number of standard deviations from the mean
display r(mean)+r(sd) //14906.605
display r(mean)+2*r(sd) //17492.534
display r(mean)+3*r(sd) //20078.463
display r(mean)+4*r(sd) //22664.392
display r(mean)-r(sd) // 9734.7478
display r(mean)-2*r(sd) //7148.819
* Draw the graph
histogram volume, freq normal kdensity ///
xaxis(1 2) ///
ylabel(0(10)80, grid) ///
xlabel(12320.68 "mean" ///* Mean=12320.68
9734.7478 "-1 s.d." ///
14906.605 "+1 s.d." ///
7148.819 "-2 s.d." ///
17492.534 "+2 s.d." ///
20078.463 "+3 s.d." ///
22664.392 "+4 s.d.", axis(2) grid gmax) ///
xtitle("", axis(2)) ///
subtitle("S&P 500, January 2001 - December 2001") ///
note("Source: Yahoo! Finance and Commodity Systems, Inc.") ///
graphregion(fcolor(white)) plotregion(fcolor(white))
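The tick positions above were copied by hand from the display output. A variant that avoids retyping numbers is sketched below: it stores the statistics in local macros and lets Stata substitute them into xlabel(). The macro names (m, s, m1, p1, m2, p2) are hypothetical, and only the labels within two standard deviations are drawn to keep the sketch short.

```stata
* Minimal sketch: local macro names are illustrative assumptions
use https://www.stata-press.com/data/r17/sp500, clear

* Store the mean and standard deviation returned by summarize
quietly summarize volume
local m  = r(mean)
local s  = r(sd)

* Positions one and two standard deviations away from the mean
local m1 = `m' - `s'
local p1 = `m' + `s'
local m2 = `m' - 2*`s'
local p2 = `m' + 2*`s'

* The macros expand to numbers before histogram parses xlabel()
histogram volume, freq normal ///
    xaxis(1 2) ///
    xlabel(`m2' "-2 s.d." `m1' "-1 s.d." `m' "mean" ///
           `p1' "+1 s.d." `p2' "+2 s.d.", axis(2) grid gmax) ///
    xtitle("", axis(2)) ///
    name(h_sd, replace)
```

Because the positions are computed rather than typed, the same code can be reused for another variable by changing only the variable name.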
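With the discrete option, the variable is treated as discrete rather than continuous, even if the variable itself may be continuous. Each unique value of the variable then gets its own bin, so there are more bars, and the height of each bar shows the density, frequency, percentage, or fraction for that value.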
use https://www.stata-press.com/data/r17/auto, clear
histogram mpg //mpg would be treated as continuous and categorized into eight bins by the default number-of-bins calculation (here N=74)
graph save "$figures\histo_discrete01", replace
histogram mpg, discrete //Adding the discrete option makes a histogram with a bin for each of the 21 unique values
graph save "$figures\histo_discrete02", replace
histogram mpg, discrete freq addlabels ylabel(,grid) xlabel(10(10)40) xtick(10(5)40,grid)
graph save "$figures\histo_discrete03", replace
graph combine "$figures\histo_discrete01" "$figures\histo_discrete02" "$figures\histo_discrete03", row(1)
graph export "$figures\histo_discrete010203.png", replace
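The eight bins mentioned in the comment above can be checked by hand. The sketch below assumes the default rule documented for histogram, k = min(sqrt(N), 10*ln(N)/ln(10)), and further assumes that the result is truncated to an integer, which is consistent with the eight bins quoted for N = 74. The local macro names and the graph name are hypothetical.

```stata
* Minimal sketch: truncation via floor() is an assumption consistent with the 8 bins quoted above
sysuse auto, clear

* N = number of nonmissing observations of mpg
quietly count if !missing(mpg)
local N = r(N)

* Default number of bins under the assumed rule
local k = floor(min(sqrt(`N'), 10*ln(`N')/ln(10)))
display "N = `N', k = `k'"

* Request the same number of bins explicitly
histogram mpg, bin(`k') name(h_bins, replace)
```

For N = 74, sqrt(N) is about 8.6 and 10*ln(N)/ln(10) is about 18.7, so the smaller value, truncated, gives 8.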
use https://www.stata-press.com/data/r17/voter, clear
describe
/*Observations: 15 1992 U.S. presidential voters
Variables: 5 3 Mar 2020 14:27
(_dta has notes)
-----------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-----------------------------------------------------------------------------
candidat int %8.0g candidat Candidate voted for, 1992
inc int %8.0g inc2 Family income
frac float %9.0g
pfrac double %10.0g
pop double %10.0g
-----------------------------------------------------------------------------
Sorted by: inc */
label list candidat
/*candidat:
2 Clinton
3 Bush
4 Perot */
histogram candi [fweight=pop], discrete fraction by(inc, total) /// *frequency weights
barwidth(1) gap(40) xlabel(2 3 4, valuelabel) /// *place a gap between the bars by reducing bar width by #%
graphregion(fcolor(white)) plotregion(fcolor(white))
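Note that the example above can also be produced with bar charts, although what is being plotted changes: histogram computes the fractions from candidat weighted by pop, whereas graph bar and twoway bar plot the precomputed variable frac.
This group of examples helps clarify the differences among the three commands.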
use https://www.stata-press.com/data/r17/voter, clear
graph bar frac, over(candidat) by(inc, total)
graph save "$figures/histogram_bar",replace
graph twoway bar frac candidat, by(inc, total) xlabel(2 3 4, valuelabel) yscale(r(0 100))
graph save "$figures/histogram_2waybar",replace
graph combine "$figures/histogram_bar" "$figures/histogram_2waybar", row(2)
graph save "$figures/histogram_bar & 2waybar",replace
graph export "$figures/histogram_bar & 2waybar.png",replace
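twoway histogram is almost identical to the histogram command shown above, but the latter can overlay a normal density or a kernel density estimate on the histogram, which is its advantage. In practice, histogram is therefore recommended.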
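A minimal sketch of the comparison, assuming Stata's bundled auto dataset; the graph names are hypothetical. With histogram, the overlays are single options. With twoway histogram, a kernel density has to be added as a second plot.

```stata
* Minimal sketch: dataset choice and graph names are illustrative assumptions
sysuse auto, clear

* histogram: normal density and kernel density overlaid through options
histogram mpg, normal kdensity name(h_full, replace)

* twoway histogram: the kernel density is added as a separate twoway plot
twoway histogram mpg || kdensity mpg, name(h_two, replace)
```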
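That concludes this article. The essentials of graphing are: (1) decide what kind of graph to draw from the data at hand (a mental image or other people's work can serve as inspiration); (2) choose a suitable graph command (for example, graph bar versus twoway bar); (3) use the graph options to make the graph more attractive and more self-explanatory. Later articles will work through the main graph types one by one. Let the data speak through graphs!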
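Original-content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
In case of infringement, contact cloudcommunity@tencent.com for removal.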