Skillmodels Quickstart¶
[1]:
from skillmodels.likelihood_function import get_maximization_inputs
from skillmodels.config import TEST_DIR
import pandas as pd
import numpy as np
import yaml
from time import time
from estimagic import maximize
Loading Model Specification and Data¶
Model specifications are Python dictionaries that can be saved in yaml or json files. For the moment, just assume you know how to write a model specification and have a skillmodels-compatible dataset. Both are explained in separate tutorials.
Next we load the model specification and the dataset.
[2]:
with open(TEST_DIR / "model2.yaml") as y:
model_dict = yaml.load(y, Loader=yaml.SafeLoader)
[3]:
data = pd.read_stata(TEST_DIR / "model2_simulated_data.dta")
data.set_index(["caseid", "period"], inplace=True)
Getting the inputs for estimagic.maximize¶
Skillmodels basically has just one public function, called get_maximization_inputs. When called with a model specification and a dataset, it returns a dictionary with everything you need to maximize the likelihood function using estimagic.
By everything you need I mean everything model specific. You should still use the optional arguments of maximize to tune the optimization.
[4]:
max_inputs = get_maximization_inputs(model_dict, data)
/home/janos/anaconda3/envs/skillmodels/lib/python3.7/site-packages/jax/lib/xla_bridge.py:130: UserWarning: No GPU/TPU found, falling back to CPU.
warnings.warn('No GPU/TPU found, falling back to CPU.')
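For orientation, you can inspect the keys of the returned dictionary. The entries used in this notebook are params_template, constraints, debug_loglike, loglike, gradient and loglike_and_gradient; your version of skillmodels may include more:
sorted(max_inputs)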
Filling the Params Template¶
Often you can greatly reduce estimation time by choosing good start parameters. What good start parameters are depends strongly on the model specification, the scaling of your variables and the normalizations you make.
If you have serious difficulties picking good start values, you probably want to think again about the interpretability of your model parameters and possibly change the normalizations and scaling of your measurements.
As a rule of thumb: if all measurements are standardized, all fixed loadings are 1 and all fixed intercepts are 0, then 1 is a good start value for all free loadings and 0 is a good start value for all free intercepts. Measurement and shock standard deviations are best started slightly larger than you expect them to be (see the sketch after the params template below).
Below I just load start parameters for the CHS example model that I filled out manually.
[5]:
params_template = max_inputs["params_template"]
params_template.head()
[5]:
                                value lower_bound
category period name1 name2
controls 0      y1    constant    NaN        -inf
                      x1          NaN        -inf
                y2    constant    NaN        -inf
                      x1          NaN        -inf
                y3    constant    NaN        -inf
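For illustration, here is a minimal sketch of how the template could be filled according to the rule of thumb above. The category names "loadings", "meas_sds" and "shock_sds" are assumptions about the template's index; check your own params_template before relying on them.
# Minimal sketch; assumes the index categories "loadings", "meas_sds"
# and "shock_sds" exist in your template. Adapt the names as needed.
naive_params = params_template.copy()
naive_params["value"] = 0.0                   # 0 as default, e.g. for free intercepts
naive_params.loc["loadings", "value"] = 1.0   # 1 for all free loadings
naive_params.loc["meas_sds", "value"] = 1.0   # start SDs slightly larger than expected
naive_params.loc["shock_sds", "value"] = 1.0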
[6]:
index_cols = ["category", "period", "name1", "name2"]
chs_path = TEST_DIR / "regression_vault" / "chs_results.csv"
chs_values = pd.read_csv(chs_path)
chs_values.set_index(index_cols, inplace=True)
chs_values = chs_values[["chs_value", "good_start_value", "bad_start_value"]]
chs_values.head()
[6]:
                                chs_value  good_start_value  bad_start_value
category period name1 name2
controls 0      y1    constant   1.001618               1.0              0.0
                      x1         1.005455               1.0              0.0
                y2    constant   1.031439               1.0              0.0
                      x1         0.975992               1.0              0.0
                y3    constant   0.994091               1.0              0.0
[7]:
params = params_template.copy()
params["value"] = chs_values["chs_value"]
params.head()
[7]:
                                   value lower_bound
category period name1 name2
controls 0      y1    constant  1.001618        -inf
                      x1        1.005455        -inf
                y2    constant  1.031439        -inf
                      x1        0.975992        -inf
                y3    constant  0.994091        -inf
Timing compilation and evaluation speed¶
Skillmodels uses jax to just-in-time compile the numerical code and to get a gradient of the likelihood function by automatic differentiation.
This has the downside that it does not work on Windows and that there is a relatively long compile time before the first evaluation of a model. If you just want to check quickly whether your start parameters produce a valid likelihood, use the debug_loglike described below, which avoids the long compilation.
[8]:
debug_loglike = max_inputs["debug_loglike"]
loglike = max_inputs["loglike"]
gradient = max_inputs["gradient"]
loglike_and_gradient = max_inputs["loglike_and_gradient"]
[9]:
start = time()
debug_loglike_value = debug_loglike(params)
print(time() - start)
debug_loglike_value
8.9623384475708
[9]:
{'pre_update_states': fac1 fac2 fac3 mixture period measurement id
0 0.000000 0.000000 0.000000 0 0 y1 0
1 0.000000 0.000000 0.000000 0 0 y1 1
2 0.000000 0.000000 0.000000 0 0 y1 2
3 0.000000 0.000000 0.000000 0 0 y1 3
4 0.000000 0.000000 0.000000 0 0 y1 4
... ... ... ... ... ... ... ...
3995 -0.393568 -0.260489 -0.304120 0 7 Q1_fac1 3995
3996 0.111399 0.356767 0.299648 0 7 Q1_fac1 3996
3997 -0.321585 -0.549056 0.511860 0 7 Q1_fac1 3997
3998 0.293930 0.524977 -0.274992 0 7 Q1_fac1 3998
3999 -0.216362 0.023973 0.063563 0 7 Q1_fac1 3999
[236000 rows x 7 columns],
'post_update_states': fac1 fac2 fac3 mixture period id measurement
0 0.111871 0.003323 0.003591 0 0 0 y1
1 0.376414 0.011179 0.012084 0 0 1 y1
2 0.174989 0.005197 0.005618 0 0 2 y1
3 0.095859 0.002847 0.003077 0 0 3 y1
4 -0.311033 -0.009238 -0.009985 0 0 4 y1
... ... ... ... ... ... ... ...
3995 -0.015864 -0.255050 -0.288638 0 7 3995 Q1_fac1
3996 0.247573 0.359324 0.305756 0 7 3996 Q1_fac1
3997 -0.591241 -0.553675 0.501790 0 7 3997 Q1_fac1
3998 0.360258 0.525979 -0.271412 0 7 3998 Q1_fac1
3999 -0.288590 0.022965 0.060775 0 7 3999 Q1_fac1
[236000 rows x 7 columns],
'filtered_states': fac1 fac2 fac3 mixture period id
0 0.186376 0.091633 -0.106732 0 0 0
1 0.245693 -0.280525 0.285942 0 0 1
2 0.494929 0.037534 -0.458473 0 0 2
3 0.558594 0.067168 0.170920 0 0 3
4 0.084291 0.250471 0.466898 0 0 4
... ... ... ... ... ... ...
3995 -0.393568 -0.260489 -0.304120 0 7 3995
3996 0.111399 0.356767 0.299648 0 7 3996
3997 -0.321585 -0.549056 0.511860 0 7 3997
3998 0.293930 0.524977 -0.274992 0 7 3998
3999 -0.216362 0.023973 0.063563 0 7 3999
[32000 rows x 6 columns],
'state_ranges': {'fac1': minimum maximum
period
0 -1.286306 1.281440
1 -1.747732 1.704974
2 -1.956092 1.691804
3 -1.608079 1.783263
4 -1.835516 1.710544
5 -1.806694 1.614255
6 -1.728938 1.469292
7 -1.735790 1.567366, 'fac2': minimum maximum
period
0 -1.008363 1.138972
1 -1.308280 1.440339
2 -1.625487 1.482808
3 -1.539565 1.346799
4 -1.514808 1.541269
5 -1.377056 1.651247
6 -1.640329 1.301055
7 -1.422660 1.472981, 'fac3': minimum maximum
period
0 -1.243904 1.258522
1 -1.253434 1.257631
2 -1.265532 1.261948
3 -1.264254 1.254055
4 -1.264228 1.277353
5 -1.277315 1.293874
6 -1.288693 1.284524
7 -1.297715 1.306730},
'residuals': residual mixture period id measurement
0 0.431990 0 0 0 y1
1 1.453522 0 0 1 y1
2 0.675720 0 0 2 y1
3 0.370159 0 0 3 y1
4 -1.201056 0 0 4 y1
... ... ... ... ... ...
3995 2.461561 0 7 3995 Q1_fac1
3996 0.889661 0 7 3996 Q1_fac1
3997 -1.758525 0 7 3997 Q1_fac1
3998 0.433416 0 7 3998 Q1_fac1
3999 -0.470317 0 7 3999 Q1_fac1
[236000 rows x 5 columns],
'residual_sds': residual mixture period id measurement
0 0.825500 0 0 0 y1
1 0.825500 0 0 1 y1
2 0.825500 0 0 2 y1
3 0.825500 0 0 3 y1
4 0.825500 0 0 4 y1
... ... ... ... ... ...
3995 0.797613 0 7 3995 Q1_fac1
3996 0.797394 0 7 3996 Q1_fac1
3997 0.797556 0 7 3997 Q1_fac1
3998 0.797379 0 7 3998 Q1_fac1
3999 0.797690 0 7 3999 Q1_fac1
[236000 rows x 5 columns],
'all_contributions': contribution measurement period id
0 -0.864098 y1 0 0
1 -2.277343 y1 0 1
2 -1.062191 y1 0 2
3 -0.827706 y1 0 3
4 -1.785603 y1 0 4
... ... ... ... ...
3995 -5.454991 Q1_fac1 7 3995
3996 -1.314937 Q1_fac1 7 3996
3997 -3.123508 Q1_fac1 7 3997
3998 -0.840237 Q1_fac1 7 3998
3999 -0.866717 Q1_fac1 7 3999
[236000 rows x 4 columns],
'value': -251177.3089520689,
'contributions': array([-65.88419161, -57.35209259, -63.66290665, ..., -62.21729575,
-64.62474922, -61.60588466])}
[10]:
start = time()
loglike_value = loglike(params)
print(time() - start)
loglike_value
54.89086103439331
[10]:
{'contributions': array([-65.88419161, -57.35209259, -63.66290665, ..., -62.21729575,
-64.62474922, -61.60588466]), 'value': -251177.3089520689}
[11]:
%timeit loglike(params)
173 ms ± 3.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
[12]:
start = time()
gradient_value = gradient(params)
print(time() - start)
1000.3803389072418
[13]:
%timeit gradient(params)
614 ms ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[14]:
start = time()
loglike_and_gradient_value = loglike_and_gradient(params)
print(time() - start)
0.6180074214935303
[15]:
%timeit loglike_and_gradient(params)
618 ms ± 19.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Insights¶
The debug_loglike
takes a few seconds for one evaluation. The first and all subsequent evaluations have the same speed.
You should use this if you want to quickly find out if your start parameters produce a valid likelihood or want to debug why it does not. As the name suggests, you can step through the debug_loglike
with a debugger. Moreover, the result of debug_loglike
containts intermediate results like the filtered states, Kalman residuals and others. All intermediate results are tidy DataFrames that make it easy to plot things. Besides debugging purposes, you can use the debug_loglike
to create
inputs for plotting functions.
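For example, since the filtered states come as a tidy DataFrame, plotting their means per period takes only a few lines. A minimal sketch using matplotlib, which is not imported above:
import matplotlib.pyplot as plt

filtered = debug_loglike_value["filtered_states"]
# Average filtered value of each factor per period.
filtered.groupby("period")[["fac1", "fac2", "fac3"]].mean().plot()
plt.show()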
The jitted loglike takes long for the first evaluation (on my machine about 50 seconds), but under 200 milliseconds for all subsequent evaluations.
The gradient or loglike_and_gradient (whichever you call first) takes about 15 minutes to compile, but under 700 milliseconds for all subsequent evaluations. In particular, evaluating just the gradient and evaluating loglike_and_gradient take exactly the same time! If you have an optimizer that evaluates criterion and gradient simultaneously, use it, as we do below by passing loglike_and_gradient as criterion_and_derivative.
A few additional constraints¶
To get the same values as CHS, we have to do a little more work. The reason is that, on top of the many constraints skillmodels generates automatically from the model specification, CHS impose three more constraints:
- All but the self-productivity parameter in the linear transition equation are fixed to zero.
- The initial mean of the states is not estimated but assumed to be zero.
- The anchoring parameters (intercepts, controls, loadings and measurement error SDs) are pairwise equal across periods.
Fortunately, estimagic makes it easy to express such constraints:
[16]:
constraints = max_inputs["constraints"]
additional_constraints = [
{"query": "category == 'transition' & name1 == 'fac2' & name2 != 'fac2'",
"type": "fixed", "value": 0},
{"loc": "initial_states", "type": "fixed", "value": 0},
{"queries": [f"period == {i} & name1 == 'Q1_fac1'" for i in range(8)],
"type": "pairwise_equality"}
]
[17]:
constraints = constraints + additional_constraints
Generating a group column for better dashboard output¶
[18]:
from estimagic.optimization.process_constraints import process_constraints
pc, pp = process_constraints(constraints, params)
params["group"] = params.index.get_level_values("category")
params.loc["controls", "group"] = params.loc["controls"].index.get_level_values("name2")
params["group"] = params["group"].astype(str) + "_" + params.index.get_level_values("period").astype(str)
params["group"] = params["group"].str.replace("_", "-")
params["group"] = params["group"].astype("O")
params.loc[~pp["_internal_free"], "group"] = None
params
/home/janos/anaconda3/envs/skillmodels/lib/python3.7/site-packages/IPython/core/interactiveshell.py:2848: PerformanceWarning: indexing past lexsort depth may impact performance.
raw_cell, store_history, silent, shell_futures)
[18]:
                                      value lower_bound       group
category   period name1 name2
controls   0      y1    constant   1.001618        -inf  constant-0
                        x1         1.005455        -inf        x1-0
                  y2    constant   1.031439        -inf  constant-0
                        x1         0.975992        -inf        x1-0
                  y3    constant   0.994091        -inf  constant-0
...                                     ...         ...         ...
transition 6      fac1  phi       -0.407018        -inf        None
                  fac2  fac1       0.000000        -inf        None
                        fac2       0.608871        -inf        None
                        fac3       0.000000        -inf        None
                        constant   0.000000        -inf        None

441 rows × 3 columns
Estimating the model¶
[19]:
params["value"] = chs_values["good_start_value"]
loc = params.query("category == 'shock_sds' & name1 == 'fac3'").index
params.loc[loc, "lower_bound"] = 0.00
loglike(params)
/home/janos/anaconda3/envs/skillmodels/lib/python3.7/site-packages/IPython/core/interactiveshell.py:2848: PerformanceWarning: indexing past lexsort depth may impact performance.
raw_cell, store_history, silent, shell_futures)
[19]:
{'contributions': array([-65.78459637, -56.84181997, -64.07640541, ..., -61.3151648 ,
-64.35146984, -61.33764354]), 'value': -249617.99170565547}
[32]:
res = maximize(
criterion=loglike,
params=params,
algorithm="scipy_lbfgsb",
criterion_and_derivative=loglike_and_gradient,
logging="model2.db",
constraints=constraints,
log_options={"if_exists": "replace"},
algo_options={"relative_criterion_tolerance": 1e-9}
)
[21]:
res["message"]
[21]:
b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
Inference Using Estimagic¶
[22]:
from estimagic.inference.likelihood_inference import do_likelihood_inference
[24]:
inference_df, free_cov = do_likelihood_inference(
loglike=loglike,
params=res["solution_params"],
constraints=constraints,
)
[25]:
inference_df = inference_df.loc[free_cov.index]
[30]:
inference_df["correct"] = chs_values.loc[free_cov.index, "chs_value"]
[31]:
inference_df
[31]:
                                                value  standard_error        p_value  ci_lower  ci_upper stars   correct
category         period name1     name2
controls         0      y1        constant   1.002213        0.026971  3.050083e-302  0.949350  1.055075   ***  1.001618
                                  x1         1.005312        0.046886  5.360805e-102  0.913416  1.097209   ***  1.005455
                        y2        constant   1.032298        0.028347  2.284859e-290  0.976738  1.087858   ***  1.031439
                                  x1         0.975610        0.048971   2.562509e-88  0.879627  1.071593   ***  0.975992
                        y3        constant   0.995045        0.031394  1.742298e-220  0.933513  1.056576   ***  0.994091
...                                               ...             ...            ...       ...       ...   ...       ...
initial_cholcovs 0      mixture_0 fac3-fac3  0.450412        0.016153  4.031445e-171  0.418752  0.482071   ***  0.481416
transition       0      fac1      fac1       0.587785        0.009386   0.000000e+00  0.569389  0.606181   ***  0.659788
                                  fac2       0.218174        0.007976  9.757881e-165  0.202540  0.233807   ***  0.174038
                                  phi       -0.374843        0.097642   1.210932e-04 -0.566221 -0.183464   *** -0.407018
                        fac2      fac2       0.612401        0.011861   0.000000e+00  0.589153  0.635649   ***  0.608871

203 rows × 7 columns
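As a quick sanity check (not part of the original comparison), you can look at the largest absolute deviation between the estimates and the CHS values:
# Maximum absolute deviation between our estimates and the CHS results.
(inference_df["value"] - inference_df["correct"]).abs().max()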