Stata's Graphing Features and Graph Types

Original

直立行走

Last modified 2022-02-23 12:46:57

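As the saying goes, a picture is worth a thousand words. Drawing graphs is one of the most basic and most important data analysis skills, so much so that the whole activity has been given a more professional name: data visualization. As a powerful and flexible data analysis tool, Stata can produce a wide variety of graphs. The [Graphing with Stata] series aims to help readers master Stata's graphing features broadly and in depth. To that end, a series of articles will be published that lay out both the structure and the details of graphing.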

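1. Introduction

This article is the first installment of the #Drawing a Graph with Stata# series. It gives an overview of Stata's graphing features and graph types.

Features. Stata's graphing features are provided mainly through its graph syntax and the Graph Editor. The graph syntax starts with the graph command. It covers graph commands, which draw the various kinds of graphs, and graph management commands, which are used after a graph has been drawn to delete, load, or combine graphs. The table below briefly lists both kinds of command.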

| Graph commands | Description | Graph management commands | Description |
| --- | --- | --- | --- |
| `graph twoway` | scatterplots, line plots, etc. | `graph save` | save graph to disk |
| `graph matrix` | scatterplot matrices | `graph use` | redisplay graph stored on disk |
| `graph bar` | bar charts | `graph display` | redisplay graph stored in memory |
| `graph dot` | dot charts | `graph combine` | combine multiple graphs |
| `graph box` | box-and-whisker plots | `graph replay` | redisplay graphs stored in memory and on disk |
| `graph pie` | pie charts | `graph export` | export .gph file to PostScript, etc. |
| Other graphics commands | Commands that draw statistical graphs: distributional diagnostic plots; smoothing and densities; regression diagnostics; time series; vector autoregressive (VAR, SVAR, VECM) models; longitudinal data/panel data | ... | Commands for printing a graph; for working with the graphs currently stored in memory; for describing available schemes and identifying and setting the default scheme; for listing available styles; for setting options for printing and exporting graphs; and for drawing graphs without displaying them |

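Note: All of the commands above are official commands that ship with Stata. Many useful graphing commands are also user-written; these will be covered in detail in later articles in this series.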
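The division of labor between the two kinds of command can be illustrated with a minimal sketch. It assumes the auto dataset that ships with Stata; the graph name g_scatter and the file names scatter_demo.gph and scatter_demo.png are hypothetical and chosen only for this illustration.

```stata
* Minimal sketch: one graph command followed by graph management commands.
* g_scatter and scatter_demo.* are hypothetical names used for illustration.
sysuse auto, clear

* Graph command: draw a scatterplot and keep it in memory under a name
graph twoway scatter mpg weight, name(g_scatter, replace)

* Graph management commands
graph save g_scatter "scatter_demo.gph", replace   // write the graph to disk
graph use "scatter_demo.gph"                       // redisplay the graph stored on disk
graph display g_scatter                            // redisplay the graph stored in memory
graph export "scatter_demo.png", replace           // export the current graph as PNG
```

Both files are written to the current working directory, so it may be worth checking it with pwd before running the sketch.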

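Graph types. As the graph commands in the table above suggest, graphing in Stata comes down to drawing a handful of common graph types. Graph commands can be further divided, by what they plot, into descriptive graphs and inferential graphs. The former give a direct view of the distribution of the data and the patterns of association in it; the latter present the results of a statistical analysis in graphical form. The key difference between the two is whether the data being plotted come from a statistical model. This article covers the former, descriptive graphs. Visualizing cleaned data is an important step in empirical analysis and reflects the author's technique, taste, and thinking. Graphs based on inferential statistics will be covered in detail alongside the specific research methods. Figure 1 shows the contents of the "Graphics" menu in the Stata interface.

2. Graph types based on descriptive statistics

Figure 2 shows the structure of Stata's graph commands and the graph types.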

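Drawing graphs with commands. A graph can be drawn in Stata by pointing and clicking through the "Graphics" menu shown above, which is convenient. As skills improve and the demand for customized graphs grows, however, drawing with commands is not only more efficient but also steadily builds hands-on skill. Because the set of graph commands is very large, it is worth collecting graph code from various sources while learning and working. It also pays to make good use of the Graph Editor to fine-tune local details of a graph; after all, nobody can remember every option of every graph command.

Stata graph code consists of four main parts: (1) graph commands; (2) options; (3) styles; and (4) graph management commands. The first three are the basic elements for drawing a graph from existing data. Take the common graph twoway as an example: twoway is a family of plots, all of which fit on numeric y and x scales. Its syntax is: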

[graph] twoway plot [if] [in] [,  twoway_options]

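Here, plot stands for a specific plot type. Figure 3 lists all members of the twoway family; Figure 2 shows only some commonly used ones. Square brackets [ ] mark parts of the command that can be omitted. Although optional, these parts are the core of mastering graph commands. Choosing a suitable plottype only ensures that the graph is "right"; it does not ensure that the graph is "good". The focus in learning Stata graphics is therefore on how the graph is presented.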
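A minimal sketch of how the pieces of this syntax fit together, assuming the auto dataset that ships with Stata; the titles and legend labels are illustrative choices, not part of the syntax.

```stata
* Minimal sketch of [graph] twoway plot [if] [in] [, twoway_options]
sysuse auto, clear

* plot = scatter, with an if qualifier and two twoway_options
twoway scatter mpg weight if foreign == 1, ///
       title("Foreign cars") ytitle("Miles per gallon")

* plot = scatter, with an in qualifier (first 20 observations only)
twoway scatter mpg weight in 1/20

* two plots from the twoway family overlaid with ||
twoway scatter mpg weight || lfit mpg weight, ///
       legend(order(1 "Observed" 2 "Linear fit"))
```

The leading graph is optional, so twoway scatter and graph twoway scatter are equivalent.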

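Figure 4 shows what twoway_options contains. With these options, the appearance of any graph drawn with twoway can be improved and refined, for example by adding reference lines at specific x or y values (added_line_options). The options follow a regular pattern: they come after the comma that follows the graph command, any number of them can be combined, and in general their order does not matter.

The above used graph twoway as an example to explain Stata's graphing logic. As Figure 2 shows, many other kinds of graphs can be drawn, and their syntax follows the same structure. Once the structure of the graph syntax is familiar, data can be visualized quickly and elegantly. Below, a set of easily confused examples illustrates Stata's features and graph types.

3. Graph examples

3.1 Bar charts

graph bar draws vertical or horizontal bar charts.

In a vertical bar chart, the y axis shows a numeric variable and the x axis a categorical variable; in a horizontal bar chart, the roles are reversed.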
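As a small illustration of how options are combined, the sketch below draws the same scatterplot twice with the options in a different order. It assumes the auto dataset that ships with Stata; the reference values 20 and 3000 and the graph names order_a and order_b are arbitrary choices for this example.

```stata
* Minimal sketch: options follow the comma and can be listed in any order.
sysuse auto, clear

* Reference lines first, title last
twoway scatter mpg weight, yline(20) xline(3000) ///
       title("Mileage vs. weight") name(order_a, replace)

* Same options in a different order; the resulting graph should look the same
twoway scatter mpg weight, title("Mileage vs. weight") ///
       xline(3000) yline(20) name(order_b, replace)
```

Here yline() and xline() are the added_line_options mentioned above, while title() and name() belong to other groups of twoway_options.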

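*** Bar chart syntax examples ***  
    graph bar (mean) numeric_var, over(cat_var) // (mean) is the statistic computed for numeric_var; mean is the default when a variable is given, and percent is the default when no variable is given
    graph hbar (mean) numeric_var, over(cat_var) // hbar draws horizontal bar charts

*** Options after the "," refine the appearance of the graph ***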
    **group_options
      over(varname [, over subopts]): specifies a categorical variable over which the yvars are to be repeated;
      nofill: specifies that missing subcategories be omitted;
      allcategories: specifies that all categories in the entire dataset be retained for the over() variables;
      
    **yvar_options
      graph bar y1 y2 y3, ascategory whatever_other_options //ascategory is a useful option
      graph bar y, over(group) asyvars whatever_other_options
      graph bar (mean) inc_male inc_female, over(region) percentage stack
      graph bar (mean) wage, over(sex) over(region) asyvars percentage stack
      ...
      
    **lookofbar_options
      bargap(#): specifies the gap to be left between yvar bars in percentage-of-bar-width units;
                 the default is bargap(0)
      intensity(#): specifies the intensity of the color used to fill the inside of the bar
      ...

    **legending_options
      If more than one yvar is specified, a legend is produced.
      
    **axis_options
      axis_scale_options: specify how the numerical y axis is scaled and how it looks;
      axis_label_options: specify how the numerical y axis is to be labeled;
      ytitle(): overrides the default title for the numerical y axis;
      
    **title_and_other_options
      text(): adds text to a specified location on the graph;
      yline(): adds horizontal (bar) or vertical (hbar) lines at specified y values;
      aspect_option: allows you to control the relationship between the height and width of a graph’s plot region
      std_options: Options for use with graph construction commands, which allow you to add titles, control the graph size, save the graph on disk, and much more;
      by(varlist, . . . ) : draws separate plots within one graph;
    
    **Suboptions for use with over( ) and yvaroptions( )
      relabel(# "text" . . . ) : specifies text to override the default category labeling;
      gap(#) and gap(*#): specify the gap between the bars in this over() group;gap(#) is specified in
                          percentage-of-bar-width units, so gap(67) means two-thirds the width of a bar. gap(*#) allows
                          modifying the default gap. gap(*1.2) would increase the gap by 20%, and gap(*.8) would
                          decrease the gap by 20%.
      sort(varname), sort(#), and sort((stat) varname): control how bars are ordered;
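
A minimal sketch of a few of these options in use, assuming the auto dataset that ships with Stata. The variables, labels, and numeric values are illustrative choices only; mpg and trunk are paired simply to have two yvars, not because they share a meaningful scale.

```stata
* Minimal sketch: over() suboptions and lookofbar_options on the auto data.
sysuse auto, clear

* Two yvars, so bargap() has something to separate and a legend is produced;
* relabel() overrides the category labels and gap(*0.5) halves the default gap
graph bar (mean) mpg trunk, ///
      over(foreign, relabel(1 "Domestic cars" 2 "Imported cars") gap(*0.5)) ///
      bargap(10) intensity(60) ///
      ytitle("Mean value")

* sort(1) orders the bars by the height of the first yvar; descending reverses the order
graph hbar (mean) price, over(rep78, sort(1) descending) ///
      ytitle("Mean price")
```

Observations with a missing rep78 are left out of the second graph by default.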

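This article gives bar chart examples for combinations of bar orientation and the over() option. The examples below use City temperature data, a dataset of 956 observations on U.S. cities, to show how to approach a bar chart and how to use the various options.

(1) Vertical bar chart + 1 over()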

use https://www.stata-press.com/data/r17/citytemp,clear
describe
		/*
		 Observations:           956                  City temperature data
		    Variables:             6                  3 Mar 2020 19:17
		----------------------------------------------------------------------------
		Variable      Storage   Display    Value
		    name         type    format    label      Variable label
		----------------------------------------------------------------------------
		division        int     %16.0g     division   Census division
		region          int     %13.0g     region     Census region
		heatdd          int     %8.0g                 Heating degree days
		cooldd          int     %8.0g                 Cooling degree days
		tempjan         float   %9.0g                 Average January temperature
		tempjuly        float   %9.0g                 Average July temperature
		----------------------------------------------------------------------------
		Sorted by: region  */
		
graph bar (mean) tempjuly tempjan, over(region) ///
			  bargap(-30) /// bargap(#) sets the gap between yvar bars in percentage-of-bar-width units; the default bargap(0) means the bars touch, and a negative value makes them overlap
			  legend( label(1 "July") label(2 "January") ) ///
			  ytitle("Temperature (Fahrenheit)") ///
			  title("Average July and January temperatures") ///
			  subtitle("by regions of the United States") ///
			  note("Source: U.S. Census Bureau, U.S. Dept. of Commerce") ///
			  graphregion(fcolor(white))  plotregion(fcolor(white))

(2) Horizontal bar chart + 2 over()

use https://www.stata-press.com/data/r17/citytemp, clear
graph hbar (mean) tempjan, over(division) over(region) nofill ///*nofill specifies that missing subcategories be omitted
		      ytitle("Degrees Fahrenheit") ///
			  title("Average January temperature") ///
			  subtitle("by region and division of the United States") ///
		      note("Source: U.S. Census Bureau, U.S. Dept. of Commerce") ///
		      graphregion(fcolor(white))  plotregion(fcolor(white))

(3) Vertical bar chart + 3 over()

use https://www.stata-press.com/data/r17/nlsw88, clear  // load a new dataset
describe
		/*
		Observations:         2,246                  NLSW, 1988 extract
	       Variables:            17                  1 May 2020 22:52
		                                             (_dta has notes)
		------------------------------------------------------------------------------------------------------------------------------------------
		Variable      Storage   Display    Value
		    name         type    format    label      Variable label
		------------------------------------------------------------------------------------------------------------------------------------------
		idcode          int     %8.0g                 NLS ID
		age             byte    %8.0g                 Age in current year
		race            byte    %8.0g      racelbl    Race
		married         byte    %8.0g      marlbl     Married
		never_married   byte    %16.0g     nev_mar    Never married
		grade           byte    %8.0g                 Current grade completed
		collgrad        byte    %16.0g     gradlbl    College graduate
		south           byte    %9.0g      southlbl   Lives in the south
		smsa            byte    %9.0g      smsalbl    Lives in SMSA
		c_city          byte    %16.0g     ccitylbl   Lives in a central city
		industry        byte    %23.0g     indlbl     Industry
		occupation      byte    %22.0g     occlbl     Occupation
		union           byte    %8.0g      unionlbl   Union worker
		wage            float   %9.0g                 Hourly wage
		hours           byte    %8.0g                 Usual hours worked
		ttl_exp         float   %9.0g                 Total work experience (years)
		tenure          float   %9.0g                 Job tenure (years)
		------------------------------------------------------------------------------------------------------------------------------------------
		Sorted by: idcode  */
notes
		  /*_dta:
		  1.  1988 data, extracted from National Longitudinal of Young Woman who were ages 14-24 in 1968 (NLSW).
		  2.  This dataset is the result of extraction and processing by various people at various times.
		  3.  For more information on the NLS, see http://www.bls.gov/nls/.  */
		
graph bar (mean) wage, over(smsa) over(married) over(collgrad) /// note the order of the three over() options
			  title("Average Hourly Wage, 1988, Women Aged 34-46") ///
			  subtitle("by College Graduation, Marital Status, and SMSA residence") ///
			  note("Source: 1988 data from NLS, U.S. Dept. of Labor, Bureau of Labor Statistics") ///
			  graphregion(fcolor(white))  plotregion(fcolor(white))

(4) Horizontal bar chart + stacked (amounts)

use https://www.stata-press.com/data/r17/educ99gdp, clear
generate total = private + public
graph hbar (asis) public private, ///
			  over(country, sort(total) descending) stack  ///
			  title( "Spending on tertiary education as % of GDP, 1999", span pos(11) ) ///
			  subtitle(" ") ///
			  legend(region(lcolor(white))) ///
			  note("Source: OECD", span) ///
			  graphregion(fcolor(white))  plotregion(fcolor(white))

(5) Horizontal bar chart + stacked (shares)

use https://www.stata-press.com/data/r17/educ99gdp, clear
generate frac = private/(private + public) 
graph hbar (asis) public private, ///
			  over(country, sort(frac) descending) stack percent ///
			  title("Public and private spending on tertiary education, 1999", span pos(11) ) ///
			  subtitle(" ") ///
			  legend(region(lcolor(white))) ///
			  note("Source: OECD", span) ///
			  graphregion(fcolor(white))  plotregion(fcolor(white))

3.2 Twoway bar plots

In twoway bar, both x and y are numeric.

use https://www.stata-press.com/data/r17/sp500, clear
describe
	    /* Observations:           248                  S&P 500
		      Variables:             7                  22 Apr 2020 10:52
		                                              (_dta has notes)
		-----------------------------------------------------------------------------------------------------------------------
		Variable      Storage   Display    Value
		    name         type    format    label      Variable label
		-----------------------------------------------------------------------------------------------------------------------
		date            int     %td                   Date
		open            float   %9.0g                 Opening price
		high            float   %9.0g                 High price
		low             float   %9.0g                 Low price
		close           float   %9.0g                 Closing price
		volume          double  %12.0gc               Volume (thousands)
		change          float   %9.0g                 Closing price change
		-----------------------------------------------------------------------------------------------------------------------
		Sorted by: date */
list date close change in 1/5	    
		/*   +---------------------------------+
		     |      date     close      change |
		     |---------------------------------|
		  1. | 02jan2001   1283.27           . |
		  2. | 03jan2001   1347.56    64.29004 |
		  3. | 04jan2001   1333.34   -14.22009 |
		  4. | 05jan2001   1298.35   -34.98999 |
		  5. | 08jan2001   1295.86    -2.48999 |
		     +---------------------------------+ */
graph twoway bar change date in 1/50, ///
	    	  graphregion(fcolor(white))  plotregion(fcolor(white))

Its advantage is that it can be combined with other plot types in the twoway family.

use https://www.stata-press.com/data/r17/sp500, clear
graph twoway line close date, yaxis(1) || bar change date, yaxis(2) || in 1/50, ///
			  yscale(axis(1) r(1000 1400)) ylab(1200(100)1400, axis(1)) ///
			  ytick(1200(100)1400, axis(1) grid) ///
			  yscale(axis(2) r(-50 300)) ylab(-50(50)50, axis(2)) ///
			  ytick(-50(50)50, axis(2) grid) ///
			  legend(off) ///
			  xtitle("Date") ///
			  title("S&P 500") ///
			  subtitle("January - March 2001") ///
			  note("Source: Yahoo!Finance and Commodity Systems, Inc.", span) ///
			  yline(1150, axis(1) lstyle(foreground))  ///
			  graphregion(fcolor(white))  plotregion(fcolor(white))

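3.3 Histograms

histogram draws a histogram of varname, which is assumed to be continuous unless the discrete option is specified. According to Beniger and Robyn (1978), although A. M. Guerry published a histogram in 1833, the word "histogram" was first used by Karl Pearson in 1895.

(1) Histogram of a continuous variable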

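* $figures is a global macro that must already hold the path of an existing output folder
use https://www.stata-press.com/data/r17/sp500, clear
histogram volume
graph save "$figures/histo_01", replace
histogram volume, fraction
graph save "$figures/histo_02", replace
graph combine "$figures/histo_01" "$figures/histo_02", row(1)
graph save "$figures/histo_0102", replace  // graph save, not save: save would write the dataset instead of the graph
graph export "$figures/histo_0102.png", replace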

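The bars of a histogram are called bins. By default, their number (k) is determined by the following rule:

$$ k = \min\left(\sqrt{N},\; 10 \times \frac{\ln N}{\ln 10}\right) $$

Here, N is the number of nonmissing observations of the variable. How can the statistical features of a continuous variable be put to better use? Building on the basic histograms above, the commands below add standard deviation information to the graph. This is the more recommended way to draw a histogram and is suitable for papers and research reports.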
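The default bin rule can be checked by hand for a small case. The sketch below assumes the auto dataset that ships with Stata, in which mpg has 74 nonmissing observations, and assumes that the result is truncated to an integer, which is consistent with the eight bins Stata uses for mpg by default. The local macro name N is used only for illustration.

```stata
* Minimal sketch: evaluate the default number-of-bins rule for mpg.
* Assumption: the value of k is truncated to an integer.
sysuse auto, clear
quietly count if !missing(mpg)
local N = r(N)

display sqrt(`N')                                   // first candidate
display 10*ln(`N')/ln(10)                           // second candidate
display floor(min(sqrt(`N'), 10*ln(`N')/ln(10)))    // displays 8
```

For samples of this size the square-root term is the smaller of the two, so it determines k.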

use "https://www.stata-press.com/data/r17/sp500", clear
sum volume // summarize reports summary statistics for the variable
		/*
	    Variable |        Obs        Mean    Std. dev.       Min        Max
		-------------+-----------------------------------------------------
	      volume |        248    12320.68    2585.929       4103    23308.3     */

return list // show the computed statistics, which are stored as scalars in r()
	    /*scalars:
                  r(N) =  248
              r(sum_w) =  248
               r(mean) =  12320.67661290323
                r(Var) =  6687027.906981193
                 r(sd) =  2585.928828676689
                r(min) =  4103
                r(max) =  23308.3
                r(sum) =  3055527.8 */

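* Take the mean r(mean) and the standard deviation r(sd), and compute the values that lie a given number of standard deviations from the mean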
display r(mean)+r(sd) //14906.605
display r(mean)+2*r(sd) //17492.534
display r(mean)+3*r(sd) //20078.463
display r(mean)+4*r(sd) //22664.392
display r(mean)-r(sd) // 9734.7478
display r(mean)-2*r(sd) //7148.819

* Draw the graph
histogram volume, freq normal kdensity ///
				  xaxis(1 2) ///
				  ylabel(0(10)80, grid) ///
				  xlabel(12320.68 "mean" ///* Mean=12320.68
				  9734.7478 "-1 s.d." /// 
				  14906.605 "+1 s.d." ///
				  7148.819 "-2 s.d." ///
				  17492.534 "+2 s.d." ///
				  20078.463 "+3 s.d." ///
				  22664.392 "+4 s.d.", axis(2) grid gmax) ///
				  xtitle("", axis(2)) ///
				  subtitle("S&P 500, January 2001 - December 2001") ///
				  note("Source: Yahoo! Finance and Commodity Systems, Inc.") ///
				  graphregion(fcolor(white))  plotregion(fcolor(white))
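
As a variation on the command above, the tick positions can be taken from the stored results instead of being typed in by hand, so the labels stay correct if the data change. This is a minimal sketch rather than the original author's code; the local macro names m, lo1, hi1, lo2, and hi2 are hypothetical.

```stata
* Minimal sketch: build the second x axis from r(mean) and r(sd).
use https://www.stata-press.com/data/r17/sp500, clear
quietly summarize volume

* Store the positions in local macros before another command replaces r()
local m   = r(mean)
local lo1 = r(mean) - r(sd)
local hi1 = r(mean) + r(sd)
local lo2 = r(mean) - 2*r(sd)
local hi2 = r(mean) + 2*r(sd)

histogram volume, freq normal ///
          xaxis(1 2) ///
          xlabel(`m' "mean" `lo1' "-1 s.d." `hi1' "+1 s.d." ///
                 `lo2' "-2 s.d." `hi2' "+2 s.d.", axis(2) grid gmax) ///
          xtitle("", axis(2))
```

Both the local definitions and the histogram command need to be run together, for example from one do-file, because local macros do not persist between separate runs.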

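(2) Histogram of a discrete variable

With the discrete option, the variable is treated as discrete rather than continuous, even if it is in fact continuous. Each unique value of the variable then gets its own bin, so there are usually more bars, and the height of each bar shows the density, frequency, percentage, or fraction for that value.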

use https://www.stata-press.com/data/r17/auto, clear
histogram mpg //mpg would be treated as continuous and categorized into eight bins by the default number-of-bins calculation (here N=74)
graph save "$figures/histo_discrete01", replace
histogram mpg, discrete //Adding the discrete option makes a histogram with a bin for each of the 21 unique values
graph save "$figures/histo_discrete02", replace
histogram mpg, discrete freq addlabels ylabel(,grid) xlabel(10(10)40) xtick(10(5)40,grid)
graph save "$figures/histo_discrete03", replace
graph combine "$figures/histo_discrete01" "$figures/histo_discrete02" "$figures/histo_discrete03", row(1)
graph export "$figures/histo_discrete010203.png", replace

(3) Histogram using weights

use https://www.stata-press.com/data/r17/voter, clear

describe    
		/*Observations:            15                  1992 U.S. presidential voters
		     Variables:             5                  3 Mar 2020 14:27
		                                               (_dta has notes)
		-----------------------------------------------------------------------------
		Variable      Storage   Display    Value
		    name         type    format    label      Variable label
		-----------------------------------------------------------------------------
		candidat        int     %8.0g      candidat   Candidate voted for, 1992
		inc             int     %8.0g      inc2       Family income
		frac            float   %9.0g                 
		pfrac           double  %10.0g                
		pop             double  %10.0g                
		-----------------------------------------------------------------------------
		Sorted by: inc  */
label list candidat
		/*candidat:
           2 Clinton
           3 Bush
           4 Perot */
histogram candi [fweight=pop], discrete fraction by(inc, total) /// *frequency weights
 				   barwidth(1) gap(40) xlabel(2 3 4, valuelabel) /// *place a gap between the bars by reducing bar width by #%
 				   graphregion(fcolor(white)) plotregion(fcolor(white))

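Note that the example above can also be reproduced with bar charts, although what is plotted changes: histogram computes the fractions from candidat using the frequency weights, whereas graph bar and twoway bar plot the variable frac directly.

This set of examples helps clarify the differences among the three commands: histogram, graph bar, and twoway bar.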

use https://www.stata-press.com/data/r17/voter, clear
graph bar frac, over(candidat) by(inc, total)
graph save "$figures/histogram_bar",replace
graph twoway bar frac candidat, by(inc, total) xlabel(2 3 4, valuelabel) yscale(r(0 100))
graph save "$figures/histogram_2waybar",replace
graph combine "$figures/histogram_bar" "$figures/histogram_2waybar", row(2)
graph save "$figures/histogram_bar & 2waybar",replace
graph export "$figures/histogram_bar & 2waybar.png",replace

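3.4 Twoway histogram plots

twoway histogram differs little from the histogram command shown above, but histogram can also overlay a normal density or a kernel density estimate on the histogram, which is its advantage. In practice, histogram is therefore the recommended choice.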
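
The comparison can be made side by side with a minimal sketch, assuming the auto dataset that ships with Stata; the graph names h_twoway and h_hist are hypothetical.

```stata
* Minimal sketch: twoway histogram versus histogram on the same variable.
sysuse auto, clear

* Plain histogram from the twoway family
twoway histogram mpg, frequency name(h_twoway, replace)

* histogram with a normal density and a kernel density estimate overlaid
histogram mpg, frequency normal kdensity name(h_hist, replace)

* Show the two graphs next to each other
graph combine h_twoway h_hist, rows(1)
```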

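That concludes this article. The essentials of graphing are: (1) decide what kind of graph to draw with the data at hand (visual imagery or other people's work can serve as inspiration); (2) choose a suitable graph command (for example, graph bar or twoway bar); (3) use graph options to make the graph more attractive and more self-explanatory. Later articles will work through the main graph types one by one and let graphs make the data speak!

References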

StataCorp. (2021). [G] Stata Graphics Reference Manual, Stata: Release 17. Statistical Software. College Station, TX: StataCorp LLC.
Mitchell, M. N. (2012). A Visual Guide to Stata Graphics (3rd ed.). College Station, TX: Stata Press.

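Original content statement: This article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.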
