Skillmodels Quickstart

[1]:
from skillmodels.likelihood_function import get_maximization_inputs
from skillmodels.config import TEST_DIR
import pandas as pd
import numpy as np
import yaml
from time import time
from estimagic import maximize

Loading Model Specification and Data

Model specifications are Python dictionaries that can be saved in yaml or json files. For now, just assume you know how to write a model specification and have a skillmodels compatible dataset. Both are explained in separate tutorials.

Next we load the model specification and the dataset.

[2]:
with open(TEST_DIR / "model2.yaml") as y:
    model_dict = yaml.load(y, Loader=yaml.SafeLoader)
[3]:
data = pd.read_stata(TEST_DIR / "model2_simulated_data.dta")
data.set_index(["caseid", "period"], inplace=True)
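
Since model specifications can also be stored as json, loading one works analogously. A minimal sketch, using a hypothetical file name:

import json

with open(TEST_DIR / "my_model.json") as j:  # hypothetical file name
    model_dict = json.load(j)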

Getting the inputs for estimagic.maximize

Skillmodels has basically just one public function, called get_maximization_inputs. When called with a model specification and a dataset, it returns a dictionary with everything you need to maximize the likelihood function using estimagic.

By everything you need I mean everything model-specific. You should still use the optional arguments of maximize to tune the optimization.

[4]:
max_inputs = get_maximization_inputs(model_dict, data)
/home/janos/anaconda3/envs/skillmodels/lib/python3.7/site-packages/jax/lib/xla_bridge.py:130: UserWarning: No GPU/TPU found, falling back to CPU.
  warnings.warn('No GPU/TPU found, falling back to CPU.')

Filling the Params Template

Often you can greatly reduce estimation time by choosing good start parameters. What good start parameters are depends strongly on the model specification, the scaling of your variables and the normalizations you make.

If you have serious difficulties picking good start values, you probably want to think again about the interpretability of your model parameters and possibly change the normalizations and the scaling of your measurements.

As a rule of thumb: if all measurements are standardized, all fixed loadings are 1 and all fixed intercepts are 0, then 1 is a good start value for all free loadings and 0 is a good start value for all free intercepts.

Measurement and shock standard deviations are better started slightly larger than you would expect them to be (see the sketch after the template preview below).

Below I just load start parameters for the CHS example model that I filled out manually.

[5]:
params_template = max_inputs["params_template"]
params_template.head()
[5]:
value lower_bound
category period name1 name2
controls 0 y1 constant NaN -inf
x1 NaN -inf
y2 constant NaN -inf
x1 NaN -inf
y3 constant NaN -inf
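
If you wanted to apply the rule of thumb from above programmatically, a sketch could look like the following. The category names "loadings" and "meas_sds" are assumptions about what a typical template contains; check the index of your own params_template before relying on them.

params = params_template.copy()
# Hypothetical category names; verify against your params_template index.
params.loc["loadings", "value"] = 1.0    # free loadings start at 1
params.loc["controls", "value"] = 0.0    # free intercepts start at 0
params.loc["meas_sds", "value"] = 1.5    # SDs slightly larger than expected
params.loc["shock_sds", "value"] = 1.5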
[6]:
index_cols = ["category", "period", "name1", "name2"]
chs_path = TEST_DIR / "regression_vault" / "chs_results.csv"
chs_values = pd.read_csv(chs_path)
chs_values.set_index(index_cols, inplace=True)
chs_values = chs_values[["chs_value", "good_start_value", "bad_start_value"]]
chs_values.head()
[6]:
chs_value good_start_value bad_start_value
category period name1 name2
controls 0 y1 constant 1.001618 1.0 0.0
x1 1.005455 1.0 0.0
y2 constant 1.031439 1.0 0.0
x1 0.975992 1.0 0.0
y3 constant 0.994091 1.0 0.0
[7]:
params = params_template.copy()
params["value"] = chs_values["chs_value"]
params.head()
[7]:
value lower_bound
category period name1 name2
controls 0 y1 constant 1.001618 -inf
x1 1.005455 -inf
y2 constant 1.031439 -inf
x1 0.975992 -inf
y3 constant 0.994091 -inf

Timing compilation and evaluation speed

Skillmodels uses jax to just-in-time compile the numerical code and get a gradient of the likelihood function by automatic differentiation.

This has the downside that it does not work on Windows and that there is a relatively long compile time before the first evaluation of each function. If you just want to check quickly whether your model produces a valid likelihood, use the debug_loglike function, which avoids compilation (see the insights below).

[8]:
debug_loglike = max_inputs["debug_loglike"]
loglike = max_inputs["loglike"]
gradient = max_inputs["gradient"]
loglike_and_gradient = max_inputs["loglike_and_gradient"]
[9]:
start = time()
debug_loglike_value = debug_loglike(params)
print(time() - start)
debug_loglike_value
8.9623384475708
[9]:
{'pre_update_states':           fac1      fac2      fac3  mixture  period measurement    id
 0     0.000000  0.000000  0.000000        0       0          y1     0
 1     0.000000  0.000000  0.000000        0       0          y1     1
 2     0.000000  0.000000  0.000000        0       0          y1     2
 3     0.000000  0.000000  0.000000        0       0          y1     3
 4     0.000000  0.000000  0.000000        0       0          y1     4
 ...        ...       ...       ...      ...     ...         ...   ...
 3995 -0.393568 -0.260489 -0.304120        0       7     Q1_fac1  3995
 3996  0.111399  0.356767  0.299648        0       7     Q1_fac1  3996
 3997 -0.321585 -0.549056  0.511860        0       7     Q1_fac1  3997
 3998  0.293930  0.524977 -0.274992        0       7     Q1_fac1  3998
 3999 -0.216362  0.023973  0.063563        0       7     Q1_fac1  3999

 [236000 rows x 7 columns],
 'post_update_states':           fac1      fac2      fac3  mixture  period    id measurement
 0     0.111871  0.003323  0.003591        0       0     0          y1
 1     0.376414  0.011179  0.012084        0       0     1          y1
 2     0.174989  0.005197  0.005618        0       0     2          y1
 3     0.095859  0.002847  0.003077        0       0     3          y1
 4    -0.311033 -0.009238 -0.009985        0       0     4          y1
 ...        ...       ...       ...      ...     ...   ...         ...
 3995 -0.015864 -0.255050 -0.288638        0       7  3995     Q1_fac1
 3996  0.247573  0.359324  0.305756        0       7  3996     Q1_fac1
 3997 -0.591241 -0.553675  0.501790        0       7  3997     Q1_fac1
 3998  0.360258  0.525979 -0.271412        0       7  3998     Q1_fac1
 3999 -0.288590  0.022965  0.060775        0       7  3999     Q1_fac1

 [236000 rows x 7 columns],
 'filtered_states':           fac1      fac2      fac3  mixture  period    id
 0     0.186376  0.091633 -0.106732        0       0     0
 1     0.245693 -0.280525  0.285942        0       0     1
 2     0.494929  0.037534 -0.458473        0       0     2
 3     0.558594  0.067168  0.170920        0       0     3
 4     0.084291  0.250471  0.466898        0       0     4
 ...        ...       ...       ...      ...     ...   ...
 3995 -0.393568 -0.260489 -0.304120        0       7  3995
 3996  0.111399  0.356767  0.299648        0       7  3996
 3997 -0.321585 -0.549056  0.511860        0       7  3997
 3998  0.293930  0.524977 -0.274992        0       7  3998
 3999 -0.216362  0.023973  0.063563        0       7  3999

 [32000 rows x 6 columns],
 'state_ranges': {'fac1':          minimum   maximum
  period
  0      -1.286306  1.281440
  1      -1.747732  1.704974
  2      -1.956092  1.691804
  3      -1.608079  1.783263
  4      -1.835516  1.710544
  5      -1.806694  1.614255
  6      -1.728938  1.469292
  7      -1.735790  1.567366, 'fac2':          minimum   maximum
  period
  0      -1.008363  1.138972
  1      -1.308280  1.440339
  2      -1.625487  1.482808
  3      -1.539565  1.346799
  4      -1.514808  1.541269
  5      -1.377056  1.651247
  6      -1.640329  1.301055
  7      -1.422660  1.472981, 'fac3':          minimum   maximum
  period
  0      -1.243904  1.258522
  1      -1.253434  1.257631
  2      -1.265532  1.261948
  3      -1.264254  1.254055
  4      -1.264228  1.277353
  5      -1.277315  1.293874
  6      -1.288693  1.284524
  7      -1.297715  1.306730},
 'residuals':       residual  mixture  period    id measurement
 0     0.431990        0       0     0          y1
 1     1.453522        0       0     1          y1
 2     0.675720        0       0     2          y1
 3     0.370159        0       0     3          y1
 4    -1.201056        0       0     4          y1
 ...        ...      ...     ...   ...         ...
 3995  2.461561        0       7  3995     Q1_fac1
 3996  0.889661        0       7  3996     Q1_fac1
 3997 -1.758525        0       7  3997     Q1_fac1
 3998  0.433416        0       7  3998     Q1_fac1
 3999 -0.470317        0       7  3999     Q1_fac1

 [236000 rows x 5 columns],
 'residual_sds':       residual  mixture  period    id measurement
 0     0.825500        0       0     0          y1
 1     0.825500        0       0     1          y1
 2     0.825500        0       0     2          y1
 3     0.825500        0       0     3          y1
 4     0.825500        0       0     4          y1
 ...        ...      ...     ...   ...         ...
 3995  0.797613        0       7  3995     Q1_fac1
 3996  0.797394        0       7  3996     Q1_fac1
 3997  0.797556        0       7  3997     Q1_fac1
 3998  0.797379        0       7  3998     Q1_fac1
 3999  0.797690        0       7  3999     Q1_fac1

 [236000 rows x 5 columns],
 'all_contributions':       contribution measurement  period    id
 0        -0.864098          y1       0     0
 1        -2.277343          y1       0     1
 2        -1.062191          y1       0     2
 3        -0.827706          y1       0     3
 4        -1.785603          y1       0     4
 ...            ...         ...     ...   ...
 3995     -5.454991     Q1_fac1       7  3995
 3996     -1.314937     Q1_fac1       7  3996
 3997     -3.123508     Q1_fac1       7  3997
 3998     -0.840237     Q1_fac1       7  3998
 3999     -0.866717     Q1_fac1       7  3999

 [236000 rows x 4 columns],
 'value': -251177.3089520689,
 'contributions': array([-65.88419161, -57.35209259, -63.66290665, ..., -62.21729575,
        -64.62474922, -61.60588466])}
[10]:
start = time()
loglike_value = loglike(params)
print(time() - start)
loglike_value
54.89086103439331
[10]:
{'contributions': array([-65.88419161, -57.35209259, -63.66290665, ..., -62.21729575,
        -64.62474922, -61.60588466]), 'value': -251177.3089520689}
[11]:
%timeit loglike(params)
173 ms ± 3.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
[12]:
start = time()
gradient_value = gradient(params)
print(time() - start)
1000.3803389072418
[13]:
%timeit gradient(params)
614 ms ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[14]:
start = time()
loglike_and_gradient_value = loglike_and_gradient(params)
print(time() - start)
0.6180074214935303
[15]:
%timeit loglike_and_gradient(params)
618 ms ± 19.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Insights

The debug_loglike takes a few seconds for one evaluation. The first and all subsequent evaluations have the same speed.

You should use this if you want to quickly find out whether your start parameters produce a valid likelihood, or to debug why they do not. As the name suggests, you can step through debug_loglike with a debugger. Moreover, the result of debug_loglike contains intermediate results like the filtered states, Kalman residuals and others. All intermediate results are tidy DataFrames that make it easy to plot things. Besides debugging, you can use debug_loglike to create inputs for plotting functions.
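
For example, a minimal sketch (assuming matplotlib is installed) that shows the distribution of the filtered fac1 states by period:

import matplotlib.pyplot as plt

filtered = debug_loglike_value["filtered_states"]
# One boxplot of the filtered fac1 states per period.
filtered.boxplot(column="fac1", by="period")
plt.show()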

The jitted loglike takes long for the first evaluation (on my machine about 55 seconds), but under 200 milliseconds for all subsequent evaluations.

The gradient and loglike_and_gradient (whichever you call first) take long to compile (on my machine about 17 minutes), but under 700 milliseconds for all subsequent evaluations. In particular, evaluating just the gradient and evaluating loglike_and_gradient take almost exactly the same time. If you have an optimizer that evaluates criterion and gradient simultaneously, use it; the maximize call below does exactly that via criterion_and_derivative.

A few additional constraints

To get the same values as CHS we have to do a little more work. The reason is that, on top of the many constraints skillmodels generates automatically from the model specification, CHS impose three more constraints:

  1. All but the self-productivity parameter in the linear transition equation of fac2 are fixed to zero.

  2. The initial mean of the states is not estimated but assumed to be zero.

  3. The anchoring parameters (intercepts, control variables, loadings and SDs of measurement error) are pairwise equal across periods.

Fortunately, estimagic makes it easy to express such constraints:

[16]:
constraints = max_inputs["constraints"]

additional_constraints = [
    {"query": "category == 'transition' & name1 == 'fac2' & name2 != 'fac2'",
     "type": "fixed", "value": 0},
    {"loc": "initial_states", "type": "fixed", "value": 0},
    {"queries": [f"period == {i} & name1 == 'Q1_fac1'" for i in range(8)],
     "type": "pairwise_equality"}
]

[17]:
constraints = constraints + additional_constraints

Generating a group column for better dashboard output

[18]:
from estimagic.optimization.process_constraints import process_constraints

# Process the constraints to find out which parameters are internally free.
pc, pp = process_constraints(constraints, params)

# Group parameters by category; for control variables, use the name of the
# control instead of the category.
params["group"] = params.index.get_level_values("category")
params.loc["controls", "group"] = params.loc["controls"].index.get_level_values("name2")

# Append the period and use "-" as separator everywhere.
params["group"] = params["group"].astype(str) + "_" + params.index.get_level_values("period").astype(str)
params["group"] = params["group"].str.replace("_", "-")
params["group"] = params["group"].astype("O")

# Parameters that are not free get no group, so they do not show up in plots.
params.loc[~pp["_internal_free"], "group"] = None
params
/home/janos/anaconda3/envs/skillmodels/lib/python3.7/site-packages/IPython/core/interactiveshell.py:2848: PerformanceWarning: indexing past lexsort depth may impact performance.
  raw_cell, store_history, silent, shell_futures)
[18]:
value lower_bound group
category period name1 name2
controls 0 y1 constant 1.001618 -inf constant-0
x1 1.005455 -inf x1-0
y2 constant 1.031439 -inf constant-0
x1 0.975992 -inf x1-0
y3 constant 0.994091 -inf constant-0
... ... ... ... ... ... ...
transition 6 fac1 phi -0.407018 -inf None
fac2 fac1 0.000000 -inf None
fac2 0.608871 -inf None
fac3 0.000000 -inf None
constant 0.000000 -inf None

441 rows × 3 columns

Estimating the model

[19]:
params["value"] = chs_values["good_start_value"]
loc = params.query("category == 'shock_sds' & name1 == 'fac3'").index
params.loc[loc, "lower_bound"] = 0.00
loglike(params)
/home/janos/anaconda3/envs/skillmodels/lib/python3.7/site-packages/IPython/core/interactiveshell.py:2848: PerformanceWarning: indexing past lexsort depth may impact performance.
  raw_cell, store_history, silent, shell_futures)
[19]:
{'contributions': array([-65.78459637, -56.84181997, -64.07640541, ..., -61.3151648 ,
        -64.35146984, -61.33764354]), 'value': -249617.99170565547}
[32]:
res = maximize(
    criterion=loglike,
    params=params,
    algorithm="scipy_lbfgsb",
    criterion_and_derivative=loglike_and_gradient,
    logging="model2.db",
    constraints=constraints,
    log_options={"if_exists": "replace"},
    algo_options={"relative_criterion_tolerance": 1e-9}
)
[21]:
res["message"]
[21]:
b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
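
Because we passed logging="model2.db" to maximize, the optimization history is stored in that database. Depending on your estimagic version, you can monitor the optimization in real time by running estimagic dashboard model2.db in a terminal; the group column we created above determines how the parameters are grouped into plots.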

Inference Using Estimagic

[22]:
from estimagic.inference.likelihood_inference import do_likelihood_inference
[24]:
inference_df, free_cov = do_likelihood_inference(
    loglike=loglike,
    params=res["solution_params"],
    constraints=constraints,
)
[25]:
inference_df = inference_df.loc[free_cov.index]
[30]:
inference_df["correct"] = chs_values.loc[free_cov.index, "chs_value"]
[31]:
inference_df
[31]:
value standard_error p_value ci_lower ci_upper stars correct
category period name1 name2
controls 0 y1 constant 1.002213 0.026971 3.050083e-302 0.949350 1.055075 *** 1.001618
x1 1.005312 0.046886 5.360805e-102 0.913416 1.097209 *** 1.005455
y2 constant 1.032298 0.028347 2.284859e-290 0.976738 1.087858 *** 1.031439
x1 0.975610 0.048971 2.562509e-88 0.879627 1.071593 *** 0.975992
y3 constant 0.995045 0.031394 1.742298e-220 0.933513 1.056576 *** 0.994091
... ... ... ... ... ... ... ... ... ... ...
initial_cholcovs 0 mixture_0 fac3-fac3 0.450412 0.016153 4.031445e-171 0.418752 0.482071 *** 0.481416
transition 0 fac1 fac1 0.587785 0.009386 0.000000e+00 0.569389 0.606181 *** 0.659788
fac2 0.218174 0.007976 9.757881e-165 0.202540 0.233807 *** 0.174038
phi -0.374843 0.097642 1.210932e-04 -0.566221 -0.183464 *** -0.407018
fac2 fac2 0.612401 0.011861 0.000000e+00 0.589153 0.635649 *** 0.608871

203 rows × 7 columns
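
As a quick sanity check, you can look at the largest absolute deviations between the estimates and the CHS values. A minimal sketch:

deviations = (inference_df["value"] - inference_df["correct"]).abs()
deviations.sort_values(ascending=False).head()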
