GLMs and response variable Y

Copen

Hi,

For GLMs, I know that the response variable Y must come from a member of the exponential family of distributions, but can someone explain how you would actually make it follow a Poisson distribution, for example (if you were building a GLM)?

Are the chosen link function and error distribution independent of the distribution of Y? So I guess making the error distribution Poisson doesn't cause Y to be Poisson.

I think my confusion comes partly from a misunderstanding of GLMs in general, so any help would be much appreciated!
 
It's the error term that is assumed to follow a particular distribution, not Y. So if the error term is assumed to be Poisson and we choose the canonical link function, then \(\ln(Y)\) is Poisson, and Y wouldn't be Poisson.
 
I'm not a GLM specialist but I don't see that this answer makes much sense. If we say that \(\ln(Y)\) follows a Poisson distribution then \(Y\) takes non-integer values, which is not helpful when modelling claims frequency. Also the error term cannot have a Poisson distribution because it has to have an expected value of zero.

My understanding was that \(Y\) has a Poisson distribution. Then, if we assume a log-link function, we have \(\ln({\mathbb E}(Y))=\) some linear combination of the explanatory variables.
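To make this concrete, here is a minimal Python sketch of that setup. It assumes a single explanatory variable and made-up coefficients; the names `sample_poisson`, `simulate`, `beta0` and `beta1` are purely illustrative and do not come from the course notes.

```python
import math
import random

def sample_poisson(mu, rng):
    """Draw one Poisson(mu) value using Knuth's multiplication method."""
    threshold = math.exp(-mu)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def simulate(beta0, beta1, x, n, seed=0):
    """Simulate n values of Y ~ Poisson(mu) where ln(mu) = beta0 + beta1 * x."""
    rng = random.Random(seed)
    mu = math.exp(beta0 + beta1 * x)  # inverse of the log link
    return mu, [sample_poisson(mu, rng) for _ in range(n)]

# Hypothetical coefficients, chosen only for illustration.
mu, ys = simulate(beta0=-1.0, beta1=0.5, x=1.0, n=20000)
mean_residual = sum(y - mu for y in ys) / len(ys)
zero_share = sum(1 for y in ys if y == 0) / len(ys)

print(f"mu = {mu:.4f}")
print(f"mean of Y - mu = {mean_residual:.4f}")        # close to zero
print(f"share of Y equal to 0 = {zero_share:.3f}")    # ln(Y) is undefined here
```

Every simulated \(Y\) is a non-negative integer, \(Y-\mu\) averages out close to zero, and a sizeable share of the draws are exactly zero, where \(\ln(Y)\) is not even defined.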

I think the confusion arises from the fact that the distribution used is very often referred to as the "error distribution" and, when this is Gaussian, both \(Y\) and the error term have a Gaussian distribution so the distinction is not important.
 
There's a very good paper which discusses this in more detail - just search for 'duncan anderson glm' and you'll find it. The core reading was an extract from this paper, I think.
 
Hi,

Katherine is correct in that it is the error term that is assumed to be Poisson (not Y). Y is a function of the covariates and the parameters (plus the error term) so it therefore doesn't follow that Y will be Poisson.

By using a Poisson error term, we are effectively saying that the variance of the error is proportional to the mean. [Similarly, a gamma error term says that the variance is proportional to the mean squared.]
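The gamma half of that statement can be checked with a quick simulation. This is only a sketch: the shape parameter and the three mean values are made up for illustration, and `sample_gamma_with_mean` is a hypothetical helper name.

```python
import random

def sample_gamma_with_mean(mu, shape, n, rng):
    """Draw n gamma values with mean mu and a fixed shape parameter."""
    scale = mu / shape  # gamma mean = shape * scale
    return [rng.gammavariate(shape, scale) for _ in range(n)]

def mean_and_variance(values):
    """Return the sample mean and the unbiased sample variance."""
    m = sum(values) / len(values)
    v = sum((val - m) ** 2 for val in values) / (len(values) - 1)
    return m, v

rng = random.Random(1)
shape = 4.0  # hypothetical value, chosen only for illustration
ratios = {}
for mu in [100.0, 500.0, 2000.0]:
    m, v = mean_and_variance(sample_gamma_with_mean(mu, shape, 50000, rng))
    ratios[mu] = v / m ** 2  # variance divided by mean squared
    print(f"mean={m:.1f}  variance/mean^2={ratios[mu]:.3f}")
```

The ratio of the variance to the squared mean stays roughly constant (near 1/shape) as the mean changes, which is what "variance proportional to the mean squared" amounts to. For a Poisson variable the analogous constant ratio is variance divided by mean.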

In general linear modelling (where the error term is assumed to be normally distributed), the expected value of the error would indeed be zero, as td290 suggests. However, this is not the case for generalised linear modelling (GLMs).

The fit of the GLM model would be tested using residuals; if a model with a Poisson error function showed a poor fit, the chosen error function distribution would need to be tweaked - perhaps by making the variance proportional to the mean to the power 1.5.

Copen - in terms of your original question as to how we would make this distribution Poisson - well, when you set up the model specification in the GLM software, this is one of the things you would need to define (along with your choice of link function etc.)

Coralie
 
Coralie/Katherine/anyone!

May I ask then how you are defining the "error term" of a GLM? I would have defined it as \(Y-{\mathbb E}(Y)\) (from rearranging \(Y={\mathbb E}(Y)+\varepsilon\)), from which it would follow that the expected value of the error term is zero.

Thanks,
td290
 
Katherine is correct in that it is the error term that is assumed to be Poisson (not Y). Y is a function of the covariates and the parameters (plus the error term) so it therefore doesn't follow that Y will be Poisson.

Thanks all,

So if it doesn't follow that Y will be Poisson, can we say what distribution Y is?

I think it says in the notes that "the response variable is assumed to be a member of the exponential family of distributions".
 
td290, the structure you have given here is for a linear model. The GLM has the added complication of the link function - often a log link. So the structure effectively becomes \(Y=g^{-1}\left[{\mathbb E}(Y)\right]+\text{error}\).

Copen, you're right. We can't say what distribution Y will have except that the GLM model structure specifies that it must come from the exponential family (eg normal, gamma, Poisson, binomial, exponential etc).

Coralie
 
Coralie,

So if I've got this right, if we denote the error term by \(\varepsilon\), you are saying that:\[Y=g^{-1}\left[{\mathbb E}(Y)\right]+\varepsilon\]You are also saying that, in the case of a Poisson GLM with a log link function, \(\varepsilon\) has a Poisson distribution and \(g^{-1}(x)=e^x\). So:\[Y=e^{{\mathbb E}(Y)}+\varepsilon\]Taking expectations of both sides gives:\[{\mathbb E}(Y)=e^{{\mathbb E}(Y)}+{\mathbb E}(\varepsilon)\]and so\[{\mathbb E}(\varepsilon)={\mathbb E}(Y)-e^{{\mathbb E}(Y)}\]But \(e^x>x\) for all values of \(x\), and therefore \(x-e^x<0\). Thus \({\mathbb E}(Y)-e^{{\mathbb E}(Y)}<0\) and therefore \({\mathbb E}(\varepsilon)<0\).

So you are effectively saying that \(\varepsilon\) has a Poisson distribution with a negative expected value, which is clearly impossible.
 
I'm not sure it works that way though, td290.

As I understand it, the thing that the GLM models is:

\(\ln {\mathbb E}(Y_i)=\beta_0+\beta_1 X_1+\ldots\)

when a log link function is used.

You would then need to "un-log" the resulting beta values by exponentiating them (raising e to the power of each value). Specific GLM software would normally do this for you but, in the old days, we had to do this using a calculator.
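As a rough illustration of what the software does behind the scenes, the sketch below fits a one-covariate log-link Poisson model by iteratively reweighted least squares (a standard way of fitting GLMs) and then exponentiates the fitted betas. The data are made up and `fit_poisson_log_glm` is a hypothetical name; this is not how any particular package is implemented.

```python
import math

def fit_poisson_log_glm(x, y, n_iter=25):
    """Fit ln E(Y) = b0 + b1 * x by iteratively reweighted least squares."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # intercept-only starting guess
    for _ in range(n_iter):
        # Accumulate the weighted normal equations for the two parameters.
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi
            mu = math.exp(eta)
            w = mu                      # IRLS weight for Poisson with log link
            z = eta + (yi - mu) / mu    # working response
            s_w += w
            s_wx += w * xi
            s_wxx += w * xi * xi
            s_wz += w * z
            s_wxz += w * xi * z
        det = s_w * s_wxx - s_wx * s_wx
        b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        b1 = (s_w * s_wxz - s_wx * s_wz) / det
    return b0, b1

# Made-up claim counts against a single rating factor, for illustration only.
x_data = [0, 1, 2, 3, 4]
y_data = [1, 2, 4, 6, 12]

b0, b1 = fit_poisson_log_glm(x_data, y_data)
print(f"beta0 = {b0:.4f}, beta1 = {b1:.4f}")
print(f"exp(beta0) = {math.exp(b0):.4f}  (fitted mean when x = 0)")
print(f"exp(beta1) = {math.exp(b1):.4f}  (multiplier per unit increase in x)")
```

The betas themselves live on the log scale. Exponentiating them gives a base level and a multiplicative factor, which is the "un-logging" step described above.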

[If you want lots of detail on the maths of this then I suggest you have a look at the paper Pede suggested.]
 
What I've said is correct. If you say that \(Y=g^{-1}\left[{\mathbb E}(Y)\right]+\varepsilon\) and \(g(x)=\ln(x)\) then \(\varepsilon\) would have to have a negative expected value. It makes no difference how you define \({\mathbb E}(Y)\) in terms of \(\beta\) and \(\mathbf X\).

I'm afraid the reason we got into this mess is because a lot of what's been said so far isn't true. To begin with, \(Y=g^{-1}\left[{\mathbb E}(Y)\right]+\varepsilon\) isn't right.

In reality, \({\mathbb E}(Y)=\mu=g^{-1}(\eta)\) and \(\eta=\beta_0+\beta_1 X_1+\ldots\). Furthermore, it is \(Y\) that's assumed to have a Poisson distribution with the given expected value. It has to be, otherwise the relation \(\operatorname{var}(Y_i)=\frac{\phi\mu_i}{\omega_i}\) does not hold.

I would encourage readers to look at the following notes from the stats department of Columbia University:

http://www.stat.columbia.edu/~madigan/W2025/notes/GLM.pdf

On page 14, it explicitly states that the distribution assumption relates to the distribution of the response variable. On page 17 it even more explicitly states that for a Poisson-log model, \(Y_i\sim P(\mu_i)\).
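For reference, the Poisson-log specification can be written out in one place. This is a sketch in the notation used in this thread; the residual \(e_i\) and the double subscript on \(X\) are labels introduced here for illustration and are not taken from the notes linked above.

```latex
\begin{align*}
Y_i &\sim \text{Poisson}(\mu_i), \\
\ln(\mu_i) &= \eta_i = \beta_0 + \beta_1 X_{i1} + \ldots, \\
e_i &= Y_i - \mu_i
  \quad\Rightarrow\quad
  {\mathbb E}(e_i) = 0, \qquad \operatorname{var}(e_i) = \mu_i .
\end{align*}
```

Defined this way, the residual has mean zero and can be negative, so it cannot itself be Poisson, while its variance still equals the mean of \(Y_i\).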
 
Even better, I've just tracked down a copy of the all-time authoritative text on GLMs by McCullagh and Nelder. They could not be clearer that the distributional assumption, Poisson or otherwise, refers to the response variable \(Y\).
 
It is true that Y is assumed to have a distribution from the exponential family of distributions. This may not be Poisson, even for a frequency model and, in fact, we don't actually need to know what this distribution is.

But I've worked in GLMs for many years now and one of the first things you learn is that the distributional assumption you need to make when you start fitting a model relates to the error term. That assumption defines the relationship between the mean and the variance. So, as Coralie said below, making a Poisson assumption is really just saying that the variance will be proportional to the mean. We don't need to specify what that value will be.

Sorry - no time to get into the maths/formulae, but not sure if it helps to say that GLMs work by assuming that the log of the response is predicted to vary linearly - I think maybe td290 was assuming that the exponential was varying linearly??
 
interested,

Thanks for this. Absolutely we speak of GLMs having a certain "error structure", e.g. Poisson, signifying a relationship between the mean and variance, specifically the mean and variance of \(Y\). So a Poisson error structure means that \(V(\mu)=\mu\), where \(\operatorname{var}(Y_i)=\frac{\phi V(\mu_i)}{\omega_i}\). It is certainly common to assume a Poisson error structure without pinning down the actual distribution of the response variable. But the reason it's called a Poisson error structure is that it can be achieved by assuming that the response variable (not the error term) has a Poisson distribution.
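The variance relation is simple enough to write down as code. The sketch below assumes a power variance function \(V(\mu)=\mu^p\), with \(p=1\) for a Poisson error structure and \(p=2\) for gamma; the function names and the example values of \(\mu\), \(\phi\) and \(\omega\) are made up for illustration.

```python
def variance_function(mu, power):
    """Power variance function V(mu) = mu ** power (1 = Poisson, 2 = gamma)."""
    return mu ** power

def response_variance(mu, power, phi=1.0, weight=1.0):
    """var(Y_i) = phi * V(mu_i) / omega_i for the chosen error structure."""
    return phi * variance_function(mu, power) / weight

# Illustrative values only: the same mean under different error structures.
for label, power in [("Poisson", 1.0), ("power 1.5", 1.5), ("gamma", 2.0)]:
    var_y = response_variance(mu=4.0, power=power, phi=1.0, weight=1.0)
    print(f"{label}: var(Y) = {var_y}")
```

Nothing here requires knowing the full distribution of \(Y\); the error structure only fixes how the variance of \(Y\) moves with its mean.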

Incidentally, I'm not even sure that the phrase "error term" is a good one to use here. I don't know which GLM software you use, but if you look at the Emblem User Guide, you'll see that the phrase is never used. Certainly if we accept Coralie's definition then assuming that the error term has a Poisson distribution does not yield a Poisson error structure whereas assuming that \(Y\) has a Poisson distribution does.
 