Norming model
\[ \newcommand{\expr}[3]{\begin{array}{c} #1 \\ \bbox[lightblue,5px]{#2} \end{array} ⊢ #3} \newcommand{\ct}[1]{\bbox[font-size: 0.8em]{\mathsf{#1}}} \newcommand{\updct}[1]{\ct{upd\_#1}} \newcommand{\abbr}[1]{\bbox[transform: scale(0.95)]{\mathtt{#1}}} \newcommand{\pure}[1]{\bbox[border: 1px solid orange]{\bbox[border: 4px solid transparent]{#1}}} \newcommand{\return}[1]{\bbox[border: 1px solid black]{\bbox[border: 4px solid transparent]{#1}}} \def\P{\mathtt{P}} \def\Q{\mathtt{Q}} \def\True{\ct{T}} \def\False{\ct{F}} \def\ite{\ct{if\_then\_else}} \def\Do{\abbr{do}} \]
Our first model addresses a fundamental question: how do we infer the “shape” of people’s prior beliefs about the degrees that gradable adjectives operate on? The norming study provides a clean test case where participants directly report degrees on scales.
We’ll start with a realistic model of the norming data, the kind one might design for analyzing this dataset directly. We’ll build the model up block by block, explaining each line. Then we’ll turn to how we might analyze the experiment using PDS, showing which components of the model correspond to the PDS kernel model and which are extensions added by the analyst.
Understanding the experimental setup
Before diving into the Stan code, let’s consider how we’ll represent the norming data, since this is important for understanding how we design Stan code. Here’s a sample of the data:
participant | item | item_number | adjective | adjective_number | condition | condition_number | scale_type | scale_type_number | response |
---|---|---|---|---|---|---|---|---|---|
1 | closed_mid | 6 | closed | 2 | mid | 3 | absolute | 1 | 0.66 |
1 | old_mid | 24 | old | 8 | mid | 3 | relative | 2 | 0.51 |
1 | expensive_mid | 15 | expensive | 5 | mid | 3 | relative | 2 | 0.62 |
1 | full_high | 16 | full | 6 | high | 1 | absolute | 1 | 1 |
1 | deep_low | 8 | deep | 3 | low | 2 | relative | 2 | 0.22 |
Each row represents one judgment:

- participant: Which person made this judgment (participant 1, 2, etc.)
- item: A unique identifier combining adjective and condition (e.g., “tall_high”)
- item_number: Numeric ID for the item (used in Stan)
- adjective: The gradable adjective being tested
- condition: Whether this is a high/mid/low standard context
- response: The participant’s slider response (0-1)
Our simplest model asks: what degree does each item have on its scale, and how do participants map these degrees to slider responses?
The structure of a Stan program
Every Stan program follows a particular architecture with blocks that appear in a specific order. Each block serves a specific purpose in defining our statistical model. Let’s build up our norming model block by block to understand how Stan works. This structure parallels the modular architecture of PDS itself—each block handles a distinct aspect of the modeling problem.
The data
block
Every Stan program begins with a data
block that tells Stan what information will be provided from the outside world—our experimental observations. Let’s build this up piece by piece:
data {
int<lower=1> N_item; // number of items
int<lower=1> N_participant; // number of participants
int<lower=1> N_data; // number of data points
These first lines declare basic counts. The syntax breaks down as:
- int: This will be an integer (whole number)
- <lower=1>: This integer must be at least 1 (no negative counts!)
- N_item: The variable name (we’ll use this throughout our program)
- // number of items: A comment explaining what this represents
Why do we need these constraints? Stan uses them to:
- Catch data errors early (if we accidentally pass 0 items, Stan will complain)
- Optimize its algorithms (knowing bounds helps the sampler work efficiently)
Next, we handle a subtle but important issue—boundary responses:
int<lower=1> N_0; // number of 0s
int<lower=1> N_1; // number of 1s
Why separate these out? Slider responses of exactly 0 or 1 are “censored”—they might represent even more extreme judgments that the scale can’t capture. We’ll handle these in a second.
For the actual response data:
vector<lower=0, upper=1>[N_data] y; // response in (0, 1)
This declares a vector (like an array) of length N_data, where each element must be between 0 and 1. Notice that y contains only the responses strictly between 0 and 1; the boundary responses (exact 0s and 1s) are handled separately.
Finally, we need to map responses to items and participants:
array[N_data] int<lower=1, upper=N_item> item; // which item for each response
array[N_0] int<lower=1, upper=N_item> item_0; // which item for each 0
array[N_1] int<lower=1, upper=N_item> item_1; // which item for each 1
array[N_data] int<lower=1, upper=N_participant> participant;
array[N_0] int<lower=1, upper=N_participant> participant_0;
array[N_1] int<lower=1, upper=N_participant> participant_1;
}
These arrays work like lookup tables. If item[5] = 3
, then the 5th response in our data was about item #3. This indexing structure connects our flat data file to the hierarchical structure of our experiment.
Looking back at our CSV data, when we prepare it for Stan, we:

1. Count unique items → N_item (e.g., 36 if we have 12 adjectives × 3 conditions)
2. Count unique participants → N_participant
3. Extract all responses strictly between 0 and 1 → the y vector
4. Build index arrays mapping each response to its item and participant
The parameters block: What we want to learn
After declaring our data, we declare the parameters—the unknown quantities we want to infer. This is where semantic theory meets statistical inference:
parameters {
// Fixed effects
vector[N_item] mu_guess;
This declares a vector of “guesses” (degrees) for each item. Why mu_guess
? In statistics, μ (mu) traditionally denotes a mean or central tendency. These means can be understood as representing our best guess, as researchers, about each item’s true degree on its scale—the theoretical degrees that the semantic analysis posits. Crucially, they can also be viewed as representing subjects’ uncertainty about these degrees—what unresolved uncertainty do they maintain when they make these guesses?
But people differ! We need random effects to capture individual variation:
// Random effects
real<lower=0> sigma_epsilon_guess; // how much people vary
vector[N_participant] z_epsilon_guess; // each person's deviation
This uses a clever trick called “non-centered parameterization”:

- sigma_epsilon_guess: The overall amount of person-to-person variation
- z_epsilon_guess: Standardized (z-score) deviations for each person
We’ll combine these later to get each person’s actual adjustment. Why not just use vector[N_participant] epsilon_guess
directly? This separation often helps Stan’s algorithms converge much faster—a practical consideration that doesn’t affect the semantic theory but matters for implementation.
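Why this helps is easier to see side by side. Below is a minimal, schematic sketch (fragments only, assuming N_participant and sigma_epsilon_guess are declared as above) of the two equivalent formulations; both imply that each person’s adjustment is distributed as normal(0, sigma_epsilon_guess), but the second keeps the sampled quantities on a standardized scale, which tends to give Stan’s sampler better-behaved geometry in hierarchical models:

// Centered parameterization: declare the adjustments directly.
parameters {
  vector[N_participant] epsilon_guess;
}
model {
  epsilon_guess ~ normal(0, sigma_epsilon_guess);
}

// Non-centered parameterization (used here): declare z-scores, then rescale.
parameters {
  vector[N_participant] z_epsilon_guess;
}
transformed parameters {
  vector[N_participant] epsilon_guess = sigma_epsilon_guess * z_epsilon_guess;
}
model {
  z_epsilon_guess ~ std_normal();
}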
Next, measurement noise:
real<lower=0,upper=1> sigma_e; // response variability
Even if two people agree on an item’s degree, their slider responses might differ slightly. This parameter captures that noise.
Finally, those boundary responses:
// Censored data
array[N_0] real<upper=0> y_0; // true values for observed 0s
array[N_1] real<lower=1> y_1; // true values for observed 1s
}
This is subtle but important. When someone gives a 0 response, their “true” judgment might be -0.1 or -0.5—we just can’t see below 0. These parameters let Stan infer what those true values might have been.
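To make the censoring idea concrete, here is a sketch that anticipates the normal response model introduced in the model block below, writing \(y^{*}\) for the latent judgment and \(\Phi\) for the standard normal CDF:

\[
\begin{aligned}
y^{*} &\sim \mathcal{N}(\mathtt{guess}, \sigma_e) \\
P(\text{respond } 0) &= P(y^{*} \leq 0) = \Phi\!\left(\frac{0 - \mathtt{guess}}{\sigma_e}\right) \\
P(\text{respond } 1) &= P(y^{*} \geq 1) = 1 - \Phi\!\left(\frac{1 - \mathtt{guess}}{\sigma_e}\right)
\end{aligned}
\]

Rather than working with these probabilities directly, the model below samples the latent \(y^{*}\) values for boundary responses (the y_0 and y_1 parameters), which amounts to the same thing for the remaining parameters.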
The transformed parameters block: Building predictions
Now we combine our basic parameters to build what we actually need. This block serves as a bridge between abstract parameters and concrete predictions:
transformed parameters {
vector[N_participant] epsilon_guess;
vector[N_data] guess;
vector[N_0] guess_0;
vector[N_1] guess_1;
First, we convert those z-scores to actual participant adjustments:
// Non-centered parameterization
epsilon_guess = sigma_epsilon_guess * z_epsilon_guess;
If sigma_epsilon_guess = 0.2
and participant 3 has z_epsilon_guess[3] = 1.5
, then participant 3 tends to give responses 0.3 units higher than average.
Now we can compute predicted responses:
for (i in 1:N_data) {
  guess[i] = mu_guess[item[i]] + epsilon_guess[participant[i]];
}
Let’s trace through one prediction:

- Response i is about item 5 by participant 3
- item[i] = 5, so we look up mu_guess[5] (say it’s 0.7)
- participant[i] = 3, so we add epsilon_guess[3] (say it’s 0.1)
- guess[i] = 0.7 + 0.1 = 0.8
We repeat this for the boundary responses:
for (i in 1:N_0) {
  guess_0[i] = mu_guess[item_0[i]] + epsilon_guess[participant_0[i]];
}
for (i in 1:N_1) {
  guess_1[i] = mu_guess[item_1[i]] + epsilon_guess[participant_1[i]];
}
}
The model
block
The model block is where we specify our statistical assumptions—both our prior beliefs and how the data was generated. This is where most of the action is in terms of how PDS relates to the data.
model {
// Priors on random effects
sigma_epsilon_guess ~ exponential(1);
z_epsilon_guess ~ std_normal();
These priors encode mild assumptions:

- exponential(1): We expect person-to-person variation to be moderate (not huge)
- std_normal(): By construction, z-scores have a standard normal distribution
Notice we don’t specify priors for mu_guess
—Stan treats this as an implicit uniform prior over the real numbers. Since our responses are bounded, the data will naturally constrain these values.
Now the likelihood—how data relates to parameters:
// Likelihood
for (i in 1:N_data) {
  y[i] ~ normal(guess[i], sigma_e);
}
This says: each response is drawn from a normal distribution centered at our prediction with standard deviation sigma_e
. The ~
symbol means “is distributed as.”
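As an aside that will matter later: a sampling statement like the one above is, up to an additive constant, equivalent to incrementing Stan’s log density (target) directly, which is the style the PDS kernel output shown further below uses. A minimal sketch of the equivalence:

// Writing the likelihood with a sampling statement ...
y[i] ~ normal(guess[i], sigma_e);

// ... is equivalent (up to an additive constant) to incrementing
// the log density directly:
target += normal_lpdf(y[i] | guess[i], sigma_e);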
For boundary responses, we use the latent values:
for (i in 1:N_0) {
  y_0[i] ~ normal(guess_0[i], sigma_e);
}
for (i in 1:N_1) {
  y_1[i] ~ normal(guess_1[i], sigma_e);
}
}
Remember: we’re inferring y_0
and y_1
as parameters! Stan will sample plausible values that are consistent with both the model and the fact that we observed 0s and 1s.
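An alternative the analyst could consider (not what this model does) is to integrate the censored latent values out analytically using the normal CDF rather than sampling them. A sketch, reusing the same guess_0, guess_1, and sigma_e:

// Alternative: marginalize out the censored latent responses.
for (i in 1:N_0) {
  target += normal_lcdf(0 | guess_0[i], sigma_e);   // log P(latent value <= 0)
}
for (i in 1:N_1) {
  target += normal_lccdf(1 | guess_1[i], sigma_e);  // log P(latent value >= 1)
}

With this formulation, the y_0 and y_1 parameter arrays would be dropped entirely; the approach used here instead keeps them as explicit latent quantities.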
The generated quantities
block
Finally, we compute quantities that help us understand and evaluate our model:
generated quantities {
  vector[N_data] ll; // log-likelihoods

  for (i in 1:N_data) {
    if (y[i] >= 0 && y[i] <= 1)
      ll[i] = normal_lpdf(y[i] | guess[i], sigma_e);
    else
      ll[i] = negative_infinity();
  }
}
The log-likelihood tells us how probable each observation is under our model. We’ll use these for model comparison—models that assign higher probability to the actual data are better.
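Spelled out, for each interior response the stored quantity is just the log of the normal density used in the likelihood:

\[
\mathrm{ll}_i = \log p(y_i \mid \theta) = \log \mathcal{N}\!\left(y_i \mid \mathtt{guess}_i, \sigma_e\right)
\]

Model-comparison tools aggregate these pointwise values across posterior draws (for instance into estimates of expected log predictive density), so a model whose ll values are higher overall is preferred.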
The complete model
Here’s our complete model with consistent naming:
data {
  int<lower=1> N_item;                  // number of items
  int<lower=1> N_participant;           // number of participants
  int<lower=1> N_data;                  // number of data points in (0, 1)
  int<lower=1> N_0;                     // number of 0s
  int<lower=1> N_1;                     // number of 1s

  vector<lower=0, upper=1>[N_data] y;   // response in (0, 1)

  array[N_data] int<lower=1, upper=N_item> item;
  array[N_0] int<lower=1, upper=N_item> item_0;
  array[N_1] int<lower=1, upper=N_item> item_1;

  array[N_data] int<lower=1, upper=N_participant> participant;
  array[N_0] int<lower=1, upper=N_participant> participant_0;
  array[N_1] int<lower=1, upper=N_participant> participant_1;
}

parameters {
  vector[N_item] mu_guess;
  real<lower=0> sigma_epsilon_guess;
  vector[N_participant] z_epsilon_guess;
  real<lower=0, upper=1> sigma_e;
  array[N_0] real<upper=0> y_0;
  array[N_1] real<lower=1> y_1;
}

transformed parameters {
  vector[N_participant] epsilon_guess = sigma_epsilon_guess * z_epsilon_guess;
  vector[N_data] guess;
  vector[N_0] guess_0;
  vector[N_1] guess_1;

  for (i in 1:N_data) {
    guess[i] = mu_guess[item[i]] + epsilon_guess[participant[i]];
  }
  for (i in 1:N_0) {
    guess_0[i] = mu_guess[item_0[i]] + epsilon_guess[participant_0[i]];
  }
  for (i in 1:N_1) {
    guess_1[i] = mu_guess[item_1[i]] + epsilon_guess[participant_1[i]];
  }
}

model {
  sigma_epsilon_guess ~ exponential(1);
  z_epsilon_guess ~ std_normal();

  for (i in 1:N_data) {
    y[i] ~ normal(guess[i], sigma_e);
  }
  for (i in 1:N_0) {
    y_0[i] ~ normal(guess_0[i], sigma_e);
  }
  for (i in 1:N_1) {
    y_1[i] ~ normal(guess_1[i], sigma_e);
  }
}

generated quantities {
  vector[N_data] ll;

  for (i in 1:N_data) {
    if (y[i] >= 0 && y[i] <= 1)
      ll[i] = normal_lpdf(y[i] | guess[i], sigma_e);
    else
      ll[i] = negative_infinity();
  }
}
This baseline model treats each item as having an inherent degree along the relevant scale, with participants providing noisy measurements of these degrees. The censoring approach handles the common issue of responses at the boundaries (0 and 1) of the slider scale.
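In generative terms, the assumptions just described can be summarized as follows (a compact restatement of the model above, with \(p\) ranging over participants and \(i\) over trials, and with \(\mu\), \(\epsilon\), \(\sigma_{\epsilon}\), and \(\sigma_e\) corresponding to mu_guess, epsilon_guess, sigma_epsilon_guess, and sigma_e):

\[
\begin{aligned}
\epsilon_p &\sim \mathcal{N}(0, \sigma_{\epsilon}) \\
y^{*}_i &\sim \mathcal{N}\!\left(\mu_{\mathrm{item}[i]} + \epsilon_{\mathrm{participant}[i]},\ \sigma_e\right) \\
y_i &= \min\!\left(\max\!\left(y^{*}_i, 0\right), 1\right)
\end{aligned}
\]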
PDS-to-Stan
So what components of the above model are derived from PDS? To answer this, we need to define our PDS model of the norming task itself. Here it is:
s1' = termOf $ getSemantics @Adjectives 1 ["jo", "is", "a", "soccer player"]
q1' = termOf $ getSemantics @Adjectives 0 ["how", "tall", "jo", "is"]

discourse' = ty tau $ assert s1' >>> ask q1'

scaleNormingExample = asTyped tau (betaDeltaNormal deltaRules . adjectivesRespond scaleNormingPrior) discourse'
This code:
- Asserts that Jo is a soccer player (establishing context)
- Asks “how tall is Jo?” using the degree-argument version of the adjective
- Applies beta- and delta-reduction rules via betaDeltaNormal
- Uses scaleNormingPrior to generate prior distributions
- Applies adjectivesRespond to specify the response function
Note the use of certain convenience functions. For example, getSemantics retrieves one of the meanings (in the λ-calculus) of the expression it is given as a list of strings, using the parser implemented at Grammar.Parser. The other functions, termOf, ty, and tau, serve as basic plumbing and can be found in the documentation.
Working through degree questions
Degree questions like “how tall is Jo?” use a special lexical entry for adjectives that exposes the degree argument. From Grammar.Lexica.SynSem.Adjectives
:
instance Interpretation Adjectives SynSem where
  combineR = Convenience.combineR
  combineL = Convenience.combineL

  lexica = [lex]
    where lex = \case
            ...
            "tall" -> [ SynSem {
                          syn = AP :\: Deg,
                          sem = ty tau (purePP (lam d (lam x (lam i (sCon "(≥)" @@ (sCon "height" @@ i @@ x) @@ d)))))
                        }
                        ...
                      ]
            ...
            "how" -> [ SynSem {
                          syn = Qdeg :/: (S :/: AP) :/: (AP :\: Deg),
                          sem = ty tau (purePP (lam x (lam y (lam z (y @@ (x @@ z))))))
                        }
                        ...
                      ]
            ...
delta-rules and semantic computation
PDS applies delta-rules to simplify these complex λ-terms. As discussed in the implementation section, delta-rules enable different semantic computations while preserving semantic equivalence. The formalism is strongly normalizing and confluent, so the order of rule application doesn’t affect the final result—a crucial property that ensures our semantic theory remains consistent.
Key delta-rules for adjectives include:

- Arithmetic operations: Simplifying comparisons like \(\ct{(≥)}\) when applied to constants (illustrated below)
- State/index extraction: Rules for \(\ct{height}\), \(\ct{d\_tall}\), etc.
- Beta reduction: Standard λ-calculus reduction
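For instance, once both arguments of a comparison are numeric constants, an arithmetic rule can reduce the comparison to a truth value; schematically (the concrete rules live in Lambda.Delta):

\[
\ct{(≥)}(0.7)(0.5) \rightsquigarrow \True
\]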
These rules transform the complex compositional semantics into simpler forms suitable for compilation to Stan. The transformation preserves the semantic content while making it computationally tractable.
Working through delta-reductions
Having seen the PDS code for degree questions, we now trace through how delta-rules transform these complex λ-terms into forms suitable for Stan compilation. delta-rules, as introduced in Lambda.Delta
, are partial functions from terms to terms that implement semantic computations.
For degree questions like how tall is Jo?, the compositional semantics produces:
\[ λd, i.\ct{height}(i)(j) ≥ d \]
When \(\abbr{respond}\) comes into the picture, some index \(i^{\prime}\) is sampled from the common ground, and the maximal answer to the question is determined to be:
\[
\ct{max}(λd.\ct{height}(i^{\prime})(j) ≥ d)
\]

This term undergoes several delta-reductions. First, the indices rule extracts the height value from whatever actual index is sampled from the common ground of the current discourse state. From Lambda.Delta:
indices :: DeltaRule
indices = \case
  ...
  Height (UpdHeight p _) -> Just p
  Height (UpdSocPla _ i) -> Just (Height i)
  ...
  _ -> Nothing
Calling this height value \(h\), this rule yields:
\[ \ct{max}(λd.h ≥ d) \]
where \(h\) represents Jo’s actual height at the sampled index \(i^{\prime}\). The \(\ct{max}\) operator then extracts this value using the following delta-rule:
maxes :: DeltaRule
maxes = \case
  Max (Lam y (GE x (Var y'))) | y' == y -> Just x
  _ -> Nothing
This gives us \(h\).
This final form directly corresponds to the Stan parameter we need to infer—the degree on the height scale.
From lambda terms to Stan parameters
The challenge is translating abstract semantic computations into Stan’s parameter space. This translation embodies (some of) our linking hypothesis between semantic competence and performance.
- Degree extraction becomes parameter inference:
  - \(\ct{max}(λd.\ct{height}(i)(j) ≥ d)\) → Infer the parameter height_jo
  - The unique degree satisfying the equation becomes a parameter to estimate
- Functions become arrays:
  - \(\ct{height} : \iota \to e \to r\) → Array height[person]
  - Function application → Array indexing
- Propositions become probabilities:
  - Truth values → Real numbers in [0,1]
  - Logical operations → Probabilistic operations
- The monad becomes Stan’s target: the \(\Do\)-notation structures sequential computation,

  \(\begin{array}[t]{l} x ∼ \ct{normal}(0, 1) \\ y ∼ \ct{normal}(x, 1) \\ \pure{y} \end{array}\)

  and this sequencing determines Stan’s log probability (see the sketch below).
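As a sketch of what that correspondence looks like for the small \(\Do\)-block above: the two sampling statements jointly determine the log density

\[
\log p(x, y) = \log \mathcal{N}(x \mid 0, 1) + \log \mathcal{N}(y \mid x, 1),
\]

which is exactly the kind of additive accumulation that target += statements perform in Stan.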
This translation embodies our linking hypothesis: semantic computations generate behavioral data through a noisy measurement process captured by adjectivesRespond
.
Input PDS:
discourse' = ty tau $ assert s1' >>> ask q1'

scaleNormingExample = asTyped tau (betaDeltaNormal deltaRules . adjectivesRespond scaleNormingPrior) discourse'
delta-reductions:
- Parse the answer to “how tall jo is” → \(\ct{max}(λd.\ct{height}(i)(j) ≥ d)\)
- Apply indices rule → \(\ct{max}(λd.h ≥ d)\)
- Apply \(\ct{max}\) extraction → \(h\)
- Monadic structure maps to Stan parameter inference
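Putting these steps together (using \(h\) for the sampled height value, as above), the reduction chain is:

\[
\ct{max}(λd.\ct{height}(i^{\prime})(j) ≥ d) \;\rightsquigarrow\; \ct{max}(λd.h ≥ d) \;\rightsquigarrow\; h
\]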
Kernel output:

model {
  // FIXED EFFECTS
  w ~ normal(0.0, 1.0);

  // LIKELIHOOD
  target += normal_lpdf(y | w, sigma);
}
The PDS kernel model
The PDS system outputs the following kernel model:

model {
  // FIXED EFFECTS
  w ~ normal(0.0, 1.0);

  // LIKELIHOOD
  target += normal_lpdf(y | w, sigma);
}
This is the semantic core—it captures the essential degree-based semantics where w
represents the degree on the height scale. But reality is more complicated: we need random effects, the ability to model censored data, and proper indexing for multiple items and participants. Bridging the gap between the kernel model (the PDS output) and the full statistical implementation (the model we actually fit) remains ongoing research.
The full model with analyst augmentations looks like:
model {
  // PRIORS (analyst-added)
  sigma_epsilon_guess ~ exponential(1);
  sigma_e ~ beta(2, 10);

  // FIXED EFFECTS (PDS kernel)
  mu_guess ~ normal(0.0, 1.0);

  // RANDOM EFFECTS (analyst-added)
  z_epsilon_guess ~ std_normal();

  // LIKELIHOOD (PDS kernel with modifications)
  y[i] ~ normal(mu_guess[item[i]] + epsilon_guess[participant[i]], sigma_e);
}
The statements marked “PDS kernel” come from the kernel model; the remaining portions add statistical machinery for real data: hierarchical priors, random effects, and indexed parameters. The kernel captures the core semantic computation (degrees on scales), while the augmentations handle the realities of experimental data.
This baseline model establishes how PDS transforms degree questions into parameter inference. Next, we’ll see how this extends to modeling the vagueness inherent in gradable adjective judgments.