$$ H_0 : {\mu} = {\mu_0} \text{ vs. } H_1 : {\mu} \neq{\mu_0} $$
$$ \text{test statistic }=\frac{{\bar X}-{\mu_0}}{s/{\sqrt n}} \sim t(n-1) $$
. use data7_3, clear
. su inc_male
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
inc_male | 19 2728 2218.659 0 9200
. mean inc_male
Mean estimation Number of obs = 19
--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
inc_male | 2728 508.9954 1658.64 3797.36
--------------------------------------------------------------
. ttest inc_male=3000
One-sample t test
------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
inc_male | 19 2728 508.9954 2218.659 1658.64 3797.36
------------------------------------------------------------------------------
mean = mean(inc_male) t = -0.5344
Ho: mean = 3000 degrees of freedom = 18
Ha: mean < 3000 Ha: mean != 3000 Ha: mean > 3000
Pr(T < t) = 0.2998 Pr(|T| > |t|) = 0.5996 Pr(T > t) = 0.7002
Q) How should the one-sided and two-sided test results be interpreted?
Q) Which command computes the p-value for the t test?
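One way to check the reported p-values by hand is the ttail() function, which returns the upper-tail probability of the t distribution. A minimal sketch using the results saved by ttest (r(t) and r(df_t)):
. qui ttest inc_male=3000
. di "two-sided p-value = " 2*ttail(r(df_t), abs(r(t)))    // matches Pr(|T| > |t|) = 0.5996
. di "lower-tail p-value = " 1 - ttail(r(df_t), r(t))       // matches Pr(T < t) = 0.2998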
$$ H_0 : {\mu_1} = {\mu_2} \text{ vs. } H_1 : {\mu_1} \neq{\mu_2} $$
$$ \text{test statistic }=\frac{\bar X_1-\bar X_2}{\sqrt{s^2_p\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}} \sim t({n_1}+{n_2}-2) \text { : equal variance assumption } $$
. use data7_3, clear
. ttest inc_female=inc_male, unpaired
Two-sample t test with equal variances
------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
inc_fe~e | 14 2272.714 674.6932 2524.471 815.1283 3730.3
inc_male | 19 2728 508.9954 2218.659 1658.64 3797.36
---------+--------------------------------------------------------------------
combined | 33 2534.848 404.8982 2325.963 1710.098 3359.599
---------+--------------------------------------------------------------------
diff | -455.2857 828.3372 -2144.691 1234.119
------------------------------------------------------------------------------
diff = mean(inc_female) - mean(inc_male) t = -0.5496
Ho: diff = 0 degrees of freedom = 31
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.2933 Pr(|T| > |t|) = 0.5865 Pr(T > t) = 0.7067
. stack inc_female inc_male, into(income) clear
. ttest income, by(_stack)
Two-sample t test with equal variances
------------------------------------------------------------------------------
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
1 | 14 2272.714 674.6932 2524.471 815.1283 3730.3
2 | 19 2728 508.9954 2218.659 1658.64 3797.36
---------+--------------------------------------------------------------------
combined | 33 2534.848 404.8982 2325.963 1710.098 3359.599
---------+--------------------------------------------------------------------
diff | -455.2857 828.3372 -2144.691 1234.119
------------------------------------------------------------------------------
diff = mean(1) - mean(2) t = -0.5496
Ho: diff = 0 degrees of freedom = 31
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.2933 Pr(|T| > |t|) = 0.5865 Pr(T > t) = 0.7067
Q) Which command tests the equality of means under the unequal variance assumption?
Q) What is the key assumption in the test of equality of means?
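For reference, the unequal option of ttest drops the equal-variance assumption and uses Satterthwaite's degrees-of-freedom approximation (adding the welch option uses Welch's approximation instead). A sketch on the stacked data:
. ttest income, by(_stack) unequal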
The test statistic for the difference in proportions follows the standard normal distribution.
$$ H_0 : {\pi_1} = {\pi_2} \text{ vs. } H_1 : {\pi_1} \neq{\pi_2} $$
$$ \text{test statistic }=\frac{p_1-p_2}{\sqrt{\bar p(1-\bar p)\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}} \sim N(0,1) $$ $$ \text{ where } {\bar p= } \text{ ?? } $$
. use R_data2_2, clear
. stack treat control, into(cure) clear
. prtest cure, by(_stack)
Two-sample test of proportions 1: Number of obs = 30
2: Number of obs = 30
------------------------------------------------------------------------------
Group | Mean Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1 | .6 .0894427 .4246955 .7753045
2 | .3333333 .0860663 .1646465 .5020202
-------------+----------------------------------------------------------------
diff | .2666667 .1241266 .023383 .5099503
| under Ho: .1288122 2.07 0.038
------------------------------------------------------------------------------
diff = prop(1) - prop(2) z = 2.0702
Ho: diff = 0
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(Z < z) = 0.9808 Pr(|Z| > |z|) = 0.0384 Pr(Z > z) = 0.0192
Q) Under what conditions does the test statistic follow the normal distribution?
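The reported two-sided p-value can be checked with the standard normal CDF normal(); a minimal sketch using the z statistic from the output above:
. di "two-sided p-value = " 2*(1 - normal(abs(2.0702)))    // matches Pr(|Z| > |z|) = 0.0384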
Correlation coefficient: measures the $\color{red}{\text {linear relationship}}$ between two random variables.
$$ corr(X,Y)=\frac{cov(X,Y)}{\sqrt{var(X)var(Y)}} $$ $$ \text{note that } -1 \leq corr(X,Y) \leq 1 $$
. use R_data4_1, clear
. twoway (scatter write math, msymbol(sh) mcolor(red%50))
Q) How can a Stata graph be copied into HWP or Word?
Q) Which command saves a graph as a *.png file?
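A sketch for exporting the current graph to a PNG file (the file name is illustrative); graph export also supports other formats, such as PDF, that paste cleanly into HWP or Word:
. twoway (scatter write math, msymbol(sh) mcolor(red%50))
. graph export scatter_write_math.png, width(1200) replace    // file name is illustrative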
In Stata, the sample correlation coefficient is computed with the corr and pwcorr commands.
. corr read write
(obs=200)
| read write
-------------+------------------
read | 1.0000
write | 0.5968 1.0000
. corr read write math
(obs=199)
| read write math
-------------+---------------------------
read | 1.0000
write | 0.5941 1.0000
math | 0.6600 0.6141 1.0000
Q) Which command computes the covariance $cov(X,Y)$?
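The covariance option of corr reports the covariance matrix instead of the correlation matrix; a minimal sketch:
. corr read write, covariance
. di "cov(read, write) = " r(cov_12)    // saved result from correlate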
. pwcorr read write
| read write
-------------+------------------
read | 1.0000
write | 0.5968 1.0000
. pwcorr read write math , listwise sig star(0.05)
| read write math
-------------+---------------------------
read | 1.0000
|
|
write | 0.5941* 1.0000
| 0.0000
|
math | 0.6600* 0.6141* 1.0000
| 0.0000 0.0000
|
Q) What is the difference between the corr and pwcorr commands?
Q) How can return list be used?
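As a sketch of return list: corr leaves its results in r(), so the saved correlation can be reused directly in later calculations:
. qui corr read write
. return list
. di "r = " r(rho) ",  r^2 = " r(rho)^2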
We specify a model for the linear relationship between the dependent variable y and the explanatory variable x.
$$ y_i=\alpha+\beta x_i+e_i $$
Applying the ordinary least squares (OLS) estimator to the sample data on x and y yields the coefficient estimates. Under a standard set of assumptions, the OLS estimator is BLUE (Best Linear Unbiased Estimator).
. use R_data8_1, clear
. reg food_exp income
Source | SS df MS Number of obs = 40
-------------+---------------------------------- F(1, 38) = 23.79
Model | 190626.98 1 190626.98 Prob > F = 0.0000
Residual | 304505.173 38 8013.29403 R-squared = 0.3850
-------------+---------------------------------- Adj R-squared = 0.3688
Total | 495132.153 39 12695.6962 Root MSE = 89.517
------------------------------------------------------------------------------
food_exp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
income | 10.20964 2.093263 4.88 0.000 5.972052 14.44723
_cons | 83.41601 43.41016 1.92 0.062 -4.463272 171.2953
------------------------------------------------------------------------------
. ereturn list
scalars:
e(N) = 40
e(df_m) = 1
e(df_r) = 38
e(F) = 23.788841278548
e(r2) = .3850022234808774
e(rmse) = 89.5170041283277
e(mss) = 190626.97975307
e(rss) = 304505.1730682194
e(r2_a) = .3688180714672162
e(ll) = -235.5088193402595
e(ll_0) = -245.2315518722239
e(rank) = 2
macros:
e(cmdline) : "regress food_exp income"
e(title) : "Linear regression"
e(marginsok) : "XB default"
e(vce) : "ols"
e(depvar) : "food_exp"
e(cmd) : "regress"
e(properties) : "b V"
e(predict) : "regres_p"
e(model) : "ols"
e(estat_cmd) : "regress_estat"
matrices:
e(b) : 1 x 2
e(V) : 2 x 2
functions:
e(sample)
Q) How can ereturn list be used?
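A minimal sketch of reusing the e() results: the residual mean square can be recovered from e(rss) and e(df_r), and it equals the squared Root MSE:
. qui reg food_exp income
. di "MSE = " e(rss)/e(df_r) "  = Root MSE squared = " e(rmse)^2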
To estimate a model without a constant term, use the noconstant option.
. reg food_exp income, noconstant cformat(%9.3f)
Source | SS df MS Number of obs = 40
-------------+---------------------------------- F(1, 39) = 394.28
Model | 3377595.38 1 3377595.38 Prob > F = 0.0000
Residual | 334093.953 39 8566.51161 R-squared = 0.9100
-------------+---------------------------------- Adj R-squared = 0.9077
Total | 3711689.33 40 92792.2333 Root MSE = 92.555
------------------------------------------------------------------------------
food_exp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
income | 14.012 0.706 19.86 0.000 12.585 15.440
------------------------------------------------------------------------------
$ R^2 $ measures how much of the variation in the y variable is explained by the x variable and is interpreted as the goodness of fit of the model.
$$ R^2=\frac{SSR}{SST}=1-\frac{SSE}{SST} $$ $$ \text{ where } SSE= ?? , SSR=?? , SST=?? $$
. reg food_exp income
Source | SS df MS Number of obs = 40
-------------+---------------------------------- F(1, 38) = 23.79
Model | 190626.98 1 190626.98 Prob > F = 0.0000
Residual | 304505.173 38 8013.29403 R-squared = 0.3850
-------------+---------------------------------- Adj R-squared = 0.3688
Total | 495132.153 39 12695.6962 Root MSE = 89.517
------------------------------------------------------------------------------
food_exp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
income | 10.20964 2.093263 4.88 0.000 5.972052 14.44723
_cons | 83.41601 43.41016 1.92 0.062 -4.463272 171.2953
------------------------------------------------------------------------------
. di e(r2)
.38500222
. corr food_exp income
(obs=40)
| food_exp income
-------------+------------------
food_exp | 1.0000
income | 0.6205 1.0000
. di r(rho)^2
.38500222
Q) In the simple linear regression model, what is the relationship between $ R^2 $ and $ corr(x,y) $?
The estimation results can be used to compute the fitted values of the y variable.
$$ \hat y_i=\hat \alpha+\hat \beta x_i $$
The predicted line can be drawn on a graph with the lfit command.
. reg food_exp income
. twoway (scatter food income,mcolor(red%50) msymbol(dh)) (lfit food income)
We now specify a linear regression model with two or more explanatory x variables. In the equation below there are $ k+1 $ explanatory variables (including the constant term).
$$ y_i=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}+\cdots + \beta_k x_{ki}+e_i $$
. use R_data8_3, clear
(Housing price data for Boston-area communities)
. reg price nox crime
Source | SS df MS Number of obs = 506
-------------+---------------------------------- F(2, 503) = 76.98
Model | 1.0036e+10 2 5.0181e+09 Prob > F = 0.0000
Residual | 3.2789e+10 503 65187676.7 R-squared = 0.2343
-------------+---------------------------------- Adj R-squared = 0.2313
Total | 4.2826e+10 505 84803032 Root MSE = 8073.9
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nox | -2538.31 341.962 -7.42 0.000 -3210.159 -1866.46
crime | -271.6976 46.11359 -5.89 0.000 -362.2966 -181.0986
_cons | 37579.82 1868.701 20.11 0.000 33908.4 41251.24
------------------------------------------------------------------------------
. reg price nox crime, beta
Source | SS df MS Number of obs = 506
-------------+---------------------------------- F(2, 503) = 76.98
Model | 1.0036e+10 2 5.0181e+09 Prob > F = 0.0000
Residual | 3.2789e+10 503 65187676.7 R-squared = 0.2343
-------------+---------------------------------- Adj R-squared = 0.2313
Total | 4.2826e+10 505 84803032 Root MSE = 8073.9
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
nox | -2538.31 341.962 -7.42 0.000 -.3192976
crime | -271.6976 46.11359 -5.89 0.000 -.2534462
_cons | 37579.82 1868.701 20.11 0.000 .
------------------------------------------------------------------------------
Q) How are the ** Beta coefficients ** interpreted?
Q) Which command generates standardized variables?
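One way to obtain standardized variables is egen with the std() function; regressing the standardized y on the standardized x variables reproduces the Beta coefficients (the z_ variable names below are illustrative):
. egen z_price = std(price)
. egen z_nox   = std(nox)
. egen z_crime = std(crime)
. reg z_price z_nox z_crime    // coefficients match the Beta column above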
After the reg command is run, the estimation results are stored under specific names and can be reused.
. qui reg price nox crime
. di _b[nox]
-2538.3095
. di _se[nox]
341.96202
. di "t-value = " _b[nox]/_se[nox]
t-value = -7.4227821
. ereturn list
scalars:
e(N) = 506
e(df_m) = 2
e(df_r) = 503
e(F) = 76.97873479111324
e(r2) = .2343492184967946
e(rmse) = 8073.888574956597
e(mss) = 10036129755.88084
e(rss) = 32789401390.56978
e(r2_a) = .2313048813934021
e(ll) = -5268.652026932938
e(ll_0) = -5336.210392272958
e(rank) = 3
macros:
e(cmdline) : "regress price nox crime"
e(title) : "Linear regression"
e(marginsok) : "XB default"
e(vce) : "ols"
e(depvar) : "price"
e(cmd) : "regress"
e(properties) : "b V"
e(predict) : "regres_p"
e(model) : "ols"
e(estat_cmd) : "regress_estat"
matrices:
e(b) : 1 x 3
e(V) : 3 x 3
functions:
e(sample)
. mat list e(b)
e(b)[1,3]
nox crime _cons
y1 -2538.3095 -271.69761 37579.821
When the log of the y variable is used as the dependent variable, the estimated coefficient is interpreted as a percentage change: a one-unit increase in x changes y by approximately $ \hat \beta \times 100 \text {%} $.
. reg lprice nox crime
Source | SS df MS Number of obs = 506
-------------+---------------------------------- F(2, 503) = 152.91
Model | 31.9812631 2 15.9906315 Prob > F = 0.0000
Residual | 52.6010079 503 .104574568 R-squared = 0.3781
-------------+---------------------------------- Adj R-squared = 0.3756
Total | 84.5822709 505 .167489645 Root MSE = .32338
------------------------------------------------------------------------------
lprice | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nox | -.1230908 .0136965 -8.99 0.000 -.1500001 -.0961815
crime | -.0181402 .001847 -9.82 0.000 -.0217689 -.0145115
_cons | 10.6897 .0748463 142.82 0.000 10.54265 10.83675
------------------------------------------------------------------------------
. reg lprice lnox crime
Source | SS df MS Number of obs = 506
-------------+---------------------------------- F(2, 503) = 153.55
Model | 32.0636239 2 16.031812 Prob > F = 0.0000
Residual | 52.518647 503 .104410829 R-squared = 0.3791
-------------+---------------------------------- Adj R-squared = 0.3766
Total | 84.5822709 505 .167489645 Root MSE = .32313
------------------------------------------------------------------------------
lprice | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnox | -.7145337 .0790602 -9.04 0.000 -.8698626 -.5592049
crime | -.017933 .0018537 -9.67 0.000 -.0215749 -.0142911
_cons | 11.21559 .1319038 85.03 0.000 10.95644 11.47474
------------------------------------------------------------------------------
$$ \text {elasticity of } x = \frac {\frac {\triangle y}{y}}{\frac {\triangle x}{x}} = \frac {\triangle log(y)}{\triangle log(x)} $$
Q) How can the elasticity of nox be obtained from the commands above?
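One possibility, assuming the log-level specification reg lprice nox crime: because the dependent variable is already in logs, the dyex() option of margins reports d ln(price)/d ln(nox), i.e. the elasticity evaluated at the means; in the log-log specification the coefficient on lnox is itself the elasticity.
. qui reg lprice nox crime
. margins, dyex(nox) atmeans    // elasticity of price with respect to nox at the means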
Individual tests and joint tests of the coefficients are carried out with the test or testparm command.
$$ \text {Joint test statistic } F=\frac {(RSSE-USSE)/J}{USSE/(n-k-1)} \sim F(J, n-k-1) $$ $$ \text { where } J= ?? $$
Q) Which command computes the p-value from the F distribution?
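The Ftail() function returns the upper-tail probability of the F distribution; a sketch with illustrative values matching the joint test output below ($J=2$, $n-k-1=501$, $F=21.25$):
. di "p-value = " Ftail(2, 501, 21.25)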
. reg lprice lnox ldist lproptax crime
Source | SS df MS Number of obs = 506
-------------+---------------------------------- F(4, 501) = 93.46
Model | 36.143062 4 9.0357655 Prob > F = 0.0000
Residual | 48.4392089 501 .096685048 R-squared = 0.4273
-------------+---------------------------------- Adj R-squared = 0.4227
Total | 84.5822709 505 .167489645 Root MSE = .31094
------------------------------------------------------------------------------
lprice | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnox | -.938093 .1459205 -6.43 0.000 -1.224784 -.6514015
ldist | -.2203829 .0514951 -4.28 0.000 -.3215559 -.1192099
lproptax | -.2456268 .0509896 -4.82 0.000 -.3458066 -.145447
crime | -.0158604 .0019787 -8.02 0.000 -.019748 -.0119728
_cons | 13.30539 .349259 38.10 0.000 12.6192 13.99159
------------------------------------------------------------------------------
. test lnox
( 1) lnox = 0
F( 1, 501) = 41.33
Prob > F = 0.0000
. test lnox=1
( 1) lnox = 1
F( 1, 501) = 176.41
Prob > F = 0.0000
. test lnox ldist
( 1) lnox = 0
( 2) ldist = 0
F( 2, 501) = 21.25
Prob > F = 0.0000
. test (lnox+ldist=1) (lproptax=0)
( 1) lnox + ldist = 1
( 2) lproptax = 0
F( 2, 501) = 102.27
Prob > F = 0.0000
. testparm l*
( 1) lnox = 0
( 2) ldist = 0
( 3) lproptax = 0
F( 3, 501) = 43.47
Prob > F = 0.0000
Q) What is the difference between the test and testparm commands?
The estimated coefficients can be used to compute the fitted values and residuals as follows. $$ \hat y_i=\hat \beta_0+\hat \beta_1 x_{1i}+\hat \beta_2 x_{2i}+\cdots + \hat \beta_k x_{ki} $$ $$ \hat e_i = y_i-\hat y_i $$
After running the reg command, the predict command generates the fitted-value and residual variables.
. use R_data8_4, clear
(NLSW, 1988 extract)
. reg wage age ttl_exp tenure
Source | SS df MS Number of obs = 200
-------------+---------------------------------- F(3, 196) = 17.08
Model | 729.23361 3 243.07787 Prob > F = 0.0000
Residual | 2788.81983 196 14.2286726 R-squared = 0.2073
-------------+---------------------------------- Adj R-squared = 0.1951
Total | 3518.05344 199 17.6786605 Root MSE = 3.7721
------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0588066 .0882782 0.67 0.506 -.1152906 .2329037
ttl_exp | .2677731 .0733165 3.65 0.000 .1231827 .4123636
tenure | .1381316 .0590716 2.34 0.020 .021634 .2546292
_cons | 1.002511 3.41283 0.29 0.769 -5.728073 7.733095
------------------------------------------------------------------------------
. predict yhat, xb
. predict ehat, residual
Holding every variable other than the variable of interest (tenure) at its mean, we can draw the prediction graph of wage against tenure.
. reg wage age ttl_exp tenure
. margins , at(tenure=(0(1)25)) atmeans noatlegend
. marginsplot, noci recast(line)
Q) What is the slope of the line above?
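Because the model is linear and the other covariates are fixed at their means, the slope of the plotted line is simply the tenure coefficient; a quick check:
. di _b[tenure]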
One of the assumptions for OLS estimation is that no perfect linear relationship exists among the explanatory variables (including the constant term). If a perfect linear relationship exists among the x variables, the coefficients of the affected x variables are not identified.
$$ \text {Model 1: } HRS=\beta_0+\beta_1 AGE + \beta_2 NEIN +\beta_3 ASSET + e $$ $$ \text {Model 2: } HRS=\beta_0+\beta_1 AGE +\beta_2 ASSET + v $$ $$ \text { where NEIN: non-labor income and ASSET: the amount of asset } $$ In Model (1), $ SE(\hat \beta_2) $ depends on the linear relationship among the $ AGE $, $ NEIN $, and $ ASSET $ variables. The stronger this linear relationship, the larger $ SE(\hat \beta_2) $ becomes, and as a result $ \hat \beta_2 $ turns out to be statistically insignificant.
$$ \text {1/(variance inflation factor) } = 1/VIF = 1-R^2_j $$ $$ \text { where } R^2_j = ?? $$
Multicollinearity can be checked with the estat vif command. A VIF value greater than 5 or 10 raises the suspicion of multicollinearity.
. use R_data10_3, clear
. reg HRS AGE NEIN ASSET
Source | SS df MS Number of obs = 39
-------------+---------------------------------- F(3, 35) = 25.83
Model | 107317.64 3 35772.5467 Prob > F = 0.0000
Residual | 48465.5908 35 1384.73117 R-squared = 0.6889
-------------+---------------------------------- Adj R-squared = 0.6622
Total | 155783.231 38 4099.5587 Root MSE = 37.212
------------------------------------------------------------------------------
HRS | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
AGE | -8.007181 1.88844 -4.24 0.000 -11.84092 -4.173443
NEIN | .3338277 .337171 0.99 0.329 -.3506658 1.018321
ASSET | .0044232 .015516 0.29 0.777 -.027076 .0359223
_cons | 2314.054 63.22636 36.60 0.000 2185.698 2442.411
------------------------------------------------------------------------------
. estat vif
Variable | VIF 1/VIF
-------------+----------------------
NEIN | 60.84 0.016436
ASSET | 56.07 0.017836
AGE | 1.74 0.573178
-------------+----------------------
Mean VIF | 39.55
. reg HRS AGE ASSET
Source | SS df MS Number of obs = 39
-------------+---------------------------------- F(2, 36) = 38.28
Model | 105960.234 2 52980.1169 Prob > F = 0.0000
Residual | 49822.9969 36 1383.97214 R-squared = 0.6802
-------------+---------------------------------- Adj R-squared = 0.6624
Total | 155783.231 38 4099.5587 Root MSE = 37.202
------------------------------------------------------------------------------
HRS | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
AGE | -6.952868 1.559142 -4.46 0.000 -10.11495 -3.790782
ASSET | .0196214 .0022597 8.68 0.000 .0150384 .0242044
_cons | 2288.056 57.49982 39.79 0.000 2171.441 2404.671
------------------------------------------------------------------------------
. estat vif
Variable | VIF 1/VIF
-------------+----------------------
AGE | 1.19 0.840402
ASSET | 1.19 0.840402
-------------+----------------------
Mean VIF | 1.19
. reg NEIN AGE ASSET
Source | SS df MS Number of obs = 39
-------------+---------------------------------- F(2, 36) = 1077.14
Model | 728896.478 2 364448.239 Prob > F = 0.0000
Residual | 12180.4963 36 338.347121 R-squared = 0.9836
-------------+---------------------------------- Adj R-squared = 0.9827
Total | 741076.974 38 19502.0256 Root MSE = 18.394
------------------------------------------------------------------------------
NEIN | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
AGE | 3.158253 .770909 4.10 0.000 1.594777 4.721729
ASSET | .0455272 .0011173 40.75 0.000 .0432612 .0477933
_cons | -77.88006 28.43047 -2.74 0.010 -135.5397 -20.22039
------------------------------------------------------------------------------
. di "1/VIF =", 1-e(r2)
1/VIF = .01643621
A dummy variable (also called an indicator variable) is an explanatory variable that represents a qualitative characteristic in a regression model. Dummy variables can be generated with the xi or tab commands.
. use R_data9_1, clear
(National Longitudinal Survey. Young Women 14-26 years of age in 1968)
. tab race, gen(race_dum)
1=white, |
2=black, |
3=other | Freq. Percent Cum.
------------+-----------------------------------
1 | 1,657 72.93 72.93
2 | 589 25.92 98.86
3 | 26 1.14 100.00
------------+-----------------------------------
Total | 2,272 100.00
. xi i.race , prefix(dum)
i.race dumrace_1-3 (naturally coded; dumrace_1 omitted)
Q) In the xi command, which option creates dummy variables for all categories?
A dummy variable shifts the constant term. Of the three categories of the race variable, only two dummy variables are included in the model.
$$ y_i=\beta_0+\beta_1 Black_i+\beta_2 Other_i+\gamma x_i+e_i $$ $$ E(y_{Black})=\beta_0+\beta_1+\gamma x_i $$ $$ E(y_{Other})=\beta_0+\beta_2+\gamma x_i $$ $$ E(y_{White})=\beta_0+\gamma x_i $$
The i. operator marks a variable as categorical: the first category is automatically dropped and dummy variables are created for the remaining categories.
. reg ln_wage i.race ttl_exp
Source | SS df MS Number of obs = 2,272
-------------+---------------------------------- F(3, 2268) = 132.44
Model | 120.253715 3 40.0845716 Prob > F = 0.0000
Residual | 686.454941 2,268 .302669727 R-squared = 0.1491
-------------+---------------------------------- Adj R-squared = 0.1479
Total | 806.708656 2,271 .355221777 Root MSE = .55015
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
race |
2 | -.1827163 .0263983 -6.92 0.000 -.2344836 -.1309489
3 | .0405064 .1087378 0.37 0.710 -.1727296 .2537425
|
ttl_exp | .0471353 .0025044 18.82 0.000 .0422242 .0520463
_cons | 1.336392 .0340166 39.29 0.000 1.269686 1.403099
------------------------------------------------------------------------------
. reg ln_wage b1.race ttl_exp
Source | SS df MS Number of obs = 2,272
-------------+---------------------------------- F(3, 2268) = 132.44
Model | 120.253715 3 40.0845716 Prob > F = 0.0000
Residual | 686.454941 2,268 .302669727 R-squared = 0.1491
-------------+---------------------------------- Adj R-squared = 0.1479
Total | 806.708656 2,271 .355221777 Root MSE = .55015
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
race |
2 | -.1827163 .0263983 -6.92 0.000 -.2344836 -.1309489
3 | .0405064 .1087378 0.37 0.710 -.1727296 .2537425
|
ttl_exp | .0471353 .0025044 18.82 0.000 .0422242 .0520463
_cons | 1.336392 .0340166 39.29 0.000 1.269686 1.403099
------------------------------------------------------------------------------
Q) For a variable such as gender that is already coded 0/1, must the i. operator still be used?
Q) Which command tests whether the race variable is statistically significant?
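With factor-variable notation, testparm accepts the i. term directly, so the joint significance of the race dummies can be tested; a sketch:
. qui reg ln_wage i.race ttl_exp
. testparm i.race    // joint test that both race dummies are zero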
The margins and marginsplot commands can be used to draw a prediction graph.
. reg ln_wage i.race ttl_exp
. margins race, at(ttl_exp=(0(1)25)) noatlegend
. marginsplot , noci recast(line)
Q) Explain the moderation effect with a path diagram.
The most common specification includes an interaction between a continuous variable and a categorical variable. It assumes that $ \frac {\triangle y}{\triangle x} $ differs across categories.
$$ y_i=\beta_0 + \beta_1 D_{2i}+ \beta_2 D_{3i}+\gamma x_i+\beta_3 D_{2i}x_i+\beta_4 D_{3i}x_i+e_i $$ $$ E(y_{Black})=\beta_0+\beta_1+(\gamma+\beta_3) x_i $$ $$ E(y_{Other})=\beta_0+\beta_2+(\gamma+\beta_4) x_i $$ $$ E(y_{White})=\beta_0+\gamma x_i $$
. reg ln_wage i.race ttl_exp i.race#c.ttl_exp
Source | SS df MS Number of obs = 2,272
-------------+---------------------------------- F(5, 2266) = 79.43
Model | 120.297066 5 24.0594131 Prob > F = 0.0000
Residual | 686.41159 2,266 .302917736 R-squared = 0.1491
-------------+---------------------------------- Adj R-squared = 0.1472
Total | 806.708656 2,271 .355221777 Root MSE = .55038
--------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------------+----------------------------------------------------------------
race |
2 | -.1550527 .0778385 -1.99 0.046 -.3076949 -.0024106
3 | .0425614 .2724185 0.16 0.876 -.4916545 .5767773
|
ttl_exp | .0476876 .0029271 16.29 0.000 .0419474 .0534277
|
race#c.ttl_exp |
2 | -.0021888 .0057937 -0.38 0.706 -.0135503 .0091728
3 | -.000169 .0198287 -0.01 0.993 -.0390533 .0387153
|
_cons | 1.329508 .038911 34.17 0.000 1.253203 1.405812
--------------------------------------------------------------------------------
. reg ln_wage i.race##c.ttl_exp
. reg ln_wage b1.race##c.ttl_exp
Q) Taking into account that the dependent variable is log-transformed, compute $ \frac {\triangle wage}{\triangle ttlexp} \mid_{White} $.
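A sketch, assuming the interaction model above: for White (the base category) the interaction terms drop out, so the slope in log wage is the ttl_exp main effect, and multiplying by 100 gives the approximate percent change in wage per additional year of experience:
. qui reg ln_wage b1.race##c.ttl_exp
. lincom ttl_exp
. di "approx. % change in wage per year (White) = " 100*_b[ttl_exp]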
As before, the margins and marginsplot commands can be used to draw a $ \color{red}{\text {prediction graph with a different slope for each category}} $.
. use R_data9_3,clear
. gen ln_wage=log(wage)
. reg ln_wage i.union ttl_exp i.union#c.ttl_exp
. margins union, at(ttl_exp=(0(1)25)) atmeans noatlegend
. marginsplot, noci recast(line)
Q) How can the nonunion prediction graph in the figure above be drawn with a dashed line?
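One way, assuming the nonunion group is the first plotted series, is the plot1opts() option of marginsplot:
. marginsplot, noci recast(line) plot1opts(lpattern(dash))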
Q) What is the difference between a linear regression model and a nonlinear regression model?
$$ \text {Quadratic model: } y_i=\beta_0 + \beta_1 x_i +\beta_2 x^2_i +e_i $$ $$ \text {Cubic model: } y_i=\beta_0 + \beta_1 x_i +\beta_2 x^2_i +\beta_3 x^3_i+e_i $$
$$ \text {marginal effect: } \frac {\partial y}{\partial x} = ?? $$
. use R_data15_1, clear
(NLSW, 1988 extract)
. gen lwage=ln(wage)
. reg lwage tenure c.tenure#c.tenure
Source | SS df MS Number of obs = 2,231
-------------+---------------------------------- F(2, 2228) = 121.85
Model | 72.193832 2 36.096916 Prob > F = 0.0000
Residual | 660.023236 2,228 .296240232 R-squared = 0.0986
-------------+---------------------------------- Adj R-squared = 0.0978
Total | 732.217068 2,230 .328348461 Root MSE = .54428
-----------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
------------------+----------------------------------------------------------------
tenure | .0610446 .0068122 8.96 0.000 .0476856 .0744035
|
c.tenure#c.tenure | -.0016852 .0003661 -4.60 0.000 -.0024031 -.0009673
|
_cons | 1.619121 .0223875 72.32 0.000 1.575218 1.663023
-----------------------------------------------------------------------------------
. reg lwage tenure c.tenure#c.tenure c.tenure#c.tenure#c.tenure
Source | SS df MS Number of obs = 2,231
-------------+---------------------------------- F(3, 2227) = 84.50
Model | 74.8319576 3 24.9439859 Prob > F = 0.0000
Residual | 657.38511 2,227 .295188644 R-squared = 0.1022
-------------+---------------------------------- Adj R-squared = 0.1010
Total | 732.217068 2,230 .328348461 Root MSE = .54331
--------------------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------------------------+----------------------------------------------------------------
tenure | .0999755 .0146911 6.81 0.000 .0711658 .1287851
|
c.tenure#c.tenure | -.0069834 .0018096 -3.86 0.000 -.0105321 -.0034348
|
c.tenure#c.tenure#c.tenure | .0001791 .0000599 2.99 0.003 .0000616 .0002966
|
_cons | 1.569498 .027838 56.38 0.000 1.514907 1.624089
--------------------------------------------------------------------------------------------
Q) Which command tests the hypothesis of a curvilinear relationship?
Q) In the quadratic model, at what value of tenure is the wage maximized?
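A sketch, assuming the quadratic model above: the wage-maximizing tenure solves $ \hat\beta_1 + 2\hat\beta_2 \, tenure = 0 $, and nlcom computes the turning point with a standard error:
. qui reg lwage tenure c.tenure#c.tenure
. nlcom turning: -_b[tenure] / (2*_b[c.tenure#c.tenure])    // = 0.0610446/(2*0.0016852), about 18.1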
Holding the other x variables at their means, the prediction graph can be drawn.
. reg lwage tenure c.tenure#c.tenure
. margins, at(tenure=(0(1)25)) atmeans noatlegend
. marginsplot , legend(off) noci recast(line) ///
> addplot((function y=_b[_cons]+_b[tenure]*x+_b[c.tenure#c.tenure]*x^2, recast(area) range(18.05 18.1)))