
20. Perspective Projection

Dated: 14-05-2025

How do we map a 3D point in space onto a pixel?
To make the task easier, we will impose the following restrictions:

  • The point of view (POV) must lie on the \(z\) axis.
  • The screen plane must be parallel to the \(xy\) plane.
  • Left and right edges of the screen should be parallel to the \(y\) axis.
  • Top and bottom edges of the screen should be parallel to the \(x\) axis.
  • The \(z\) axis should pass through the middle of the screen.

There are 2 approaches (using the left-handed coordinate system):

  1. The screen center lies at the origin and POV lies at \((0, 0, -z)\).
  2. The screen center lies at \((0, 0, z)\) and POV lies at the origin.

The 2nd approach is more convenient when we later add features that let the POV move around the 3D world or let objects move around in the world.

(Figure cs602_e_20_1.svg: similar triangles \(\Delta ABS\) and \(\Delta ACP\))

\(\Delta ABS \sim \Delta ACP\)

\[\frac{|\overline{SB}|}{|\overline{AB}|} = \frac{|\overline{PC}|}{|\overline{AC}|}\]
\[|\overline{SB}| = \frac{|\overline{PC}|}{|\overline{AC}|} \times |\overline{AB}|\]
\[|\overline{SB}| = \frac{|\overline{PC}|}{|\overline{AC}|} \times \text{Screen.z}\]

\(|\overline{AB}|\) is also called the scaling factor.
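For example (with made-up lengths), if \(|\overline{PC}| = 3\), \(|\overline{AC}| = 6\) and the screen sits at \(\text{Screen.z} = 2\), then \(|\overline{SB}| = \frac{3}{6} \times 2 = 1\).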

Now the task is to map the world coordinates \((x, y, z)\) onto the screen pixels \((x^\prime, y^\prime)\).

\[x^\prime = s \times \frac x z + \frac{\text{screen.width}} 2\]
\[y^\prime = \text{screen.height} - \left(s \times \frac y z + \frac{\text{screen.height}} 2\right)\]

Here \(s = |\overline{AB}|\).

Because both \(x^\prime\) and \(y^\prime\) use the same scaling factor \(s\), these equations are limited to square screens.
The divide-by-\(z\) step above is known as the perspective divide (or z-divide).
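A minimal sketch of this mapping in Python; the function name, the 400 × 400 screen and the sample point are made-up values for illustration only.

```python
def project(x, y, z, s, width, height):
    # Scale by s, divide by depth z, then shift the origin to the screen centre.
    px = s * x / z + width / 2
    # Flip y because pixel rows grow downward on the screen.
    py = height - (s * y / z + height / 2)
    return px, py

# A point at (1, 2, 5) on a square 400 x 400 screen with s = 200.
print(project(1.0, 2.0, 5.0, s=200.0, width=400, height=400))  # (240.0, 120.0)
```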

The Perspective Projection Matrix [1]

Instead of using \(s\), we will use the horizontal field of view (fov).
Together with the aspect ratio, this lets us scale \(x\) and \(y\) separately, removing the square-screen limitation.

\[\text{aspect} = \frac{\text{height}}{\text{width}}\]
\[h = \frac{\cos\left(\frac{\text{fov}}{2}\right)}{\sin\left(\frac{\text{fov}}{2}\right)} = \cot\left(\frac{\text{fov}}{2}\right)\]
\[w = \text{aspect} \times h\]
\[q = \frac{z_\text{far}}{z_\text{far} - z_\text{near}}\]

The \(z_\text{near}\) and \(z_\text{far}\) planes let us clip objects that fall outside the range \([z_\text{near}, z_\text{far}]\); the same range later bounds the values stored in the z buffer.

With these parameters, the following projection matrix [1] can be made.

\[ \textbf M = \begin{bmatrix} w & 0 & 0 & 0\\ 0 & h & 0 & 0\\ 0 & 0 & q & -q(z_\text{near})\\ 0 & 0 & 1 & 0\\ \end{bmatrix} \]
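A sketch of how this matrix could be built in code, assuming the formulas above; the helper name make_projection and the sample parameters (90° fov, an 800 × 600 screen, z_near = 0.1, z_far = 1000) are illustrative choices, not a fixed API.

```python
import math

def make_projection(fov_deg, aspect, z_near, z_far):
    fov = math.radians(fov_deg)
    h = math.cos(fov / 2) / math.sin(fov / 2)  # cot(fov / 2)
    w = aspect * h                             # aspect = height / width
    q = z_far / (z_far - z_near)
    # Column-vector layout: the last row copies z into the homogeneous component.
    return [
        [w,   0.0, 0.0, 0.0],
        [0.0, h,   0.0, 0.0],
        [0.0, 0.0, q,   -q * z_near],
        [0.0, 0.0, 1.0, 0.0],
    ]

M = make_projection(90.0, aspect=600 / 800, z_near=0.1, z_far=1000.0)
```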

Let's perform a sanity check.

\[\textbf M \cdot \vec P = \vec P^\prime\]
\[ \begin{bmatrix} w & 0 & 0 & 0\\ 0 & h & 0 & 0\\ 0 & 0 & q & -q(z_\text{near})\\ 0 & 0 & 1 & 0\\ \end{bmatrix} \cdot \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} wx \\ hy \\ qz - q(z_\text{near}) \\ z \end{bmatrix} \]

To read \((x, y, z)\) back from \(\vec P^\prime\), the fourth (homogeneous) component of \(\vec P^\prime\) should be \(1\), but here it equals \(z\).
To normalize it, we divide by \(z\):

\[\vec P^{\prime\prime} = \frac{\vec P^\prime}{z}\]
\[ \begin{bmatrix} \frac {wx} z\\ \frac {hy} z \\ q \left(1 - \frac{z_\text{near}} z\right) \\ 1 \end{bmatrix} \]
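Continuing the sketch above (reusing make_projection and M), the same check can be replayed numerically: multiply by \(\textbf M\), then divide by the homogeneous component.

```python
def multiply(M, v):
    # Column-vector convention: each output component is a row of M dotted with v.
    return [sum(M[i][j] * v[j] for j in range(4)) for i in range(4)]

def project_point(M, x, y, z):
    xp, yp, zp, wp = multiply(M, [x, y, z, 1.0])
    if wp != 0.0:                      # wp equals the original z here
        xp, yp, zp = xp / wp, yp / wp, zp / wp
    return xp, yp, zp

# Normalized depth is 0 at z_near and 1 at z_far.
print(project_point(M, 1.0, 1.0, 0.1))
print(project_point(M, 1.0, 1.0, 1000.0))
```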

Now the last piece of the puzzle is the rendering pipeline.
To render a scene, you set up:

  1. World Matrix [1] - Responsible for transforming the local coordinates of an object into the global coordinates of the world.
  2. View Matrix [1] - Responsible for transforming the global coordinates of the world into coordinates relative to the viewer.
  3. Projection Matrix [1] - Responsible for transforming view-space coordinates into 2D screen pixel coordinates.

\[\vec v_\text{world} = \textbf M_\text{world} \cdot \vec v_\text{local}\]
\[\vec v_\text{view} = \textbf M_\text{view} \cdot \vec v_\text{world}\]
\[\vec v_\text{screen} = \textbf M_\text{projection} \cdot \vec v_\text{view}\]

We can use composition here as well.

\[\vec v_\text{screen} = (\textbf M_\text{projection} \cdot \textbf M_\text{view} \cdot \textbf M_\text{world}) \cdot \vec v_\text{local}\]
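A sketch of that composition, reusing the helpers from the previous snippets; the world and view matrices are left as identities purely to show the order in which the matrices combine.

```python
def mat_mul(A, B):
    # (A . B) applied to a column vector v equals A . (B . v).
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

# Identity placeholders; a real scene would fill these with its own transforms.
I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
M_world, M_view = I4, I4
M_proj = make_projection(90.0, aspect=600 / 800, z_near=0.1, z_far=1000.0)

M_combined = mat_mul(M_proj, mat_mul(M_view, M_world))
print(project_point(M_combined, 1.0, 1.0, 3.0))
```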

The Perspective Projection Matrix Used by Microsoft Direct3D

The projection matrix [1] is typically a scale and perspective projection.
The projection transformation converts the viewing frustum into a cuboid shape.
Because the near end of the viewing frustum is smaller than the far end, this has the effect of expanding objects that are near to the camera.

The Viewing Frustum

(Figure cs602_i_20_1.png: the viewing frustum)

(Figure: cs602_e_20_2.svg)

Here \(D\) is the distance from the POV to the projection plane. Direct3D uses row vectors, so a point is multiplied on the left: \(\vec v^\prime = \vec v \cdot \textbf T \cdot \textbf P\). \(\textbf T\) translates the scene by \(-D\) along the \(z\) axis, and \(\textbf P\) then writes the scaled depth into the homogeneous component to produce the perspective divide.

\[ \textbf P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & \frac 1 D \\ 0 & 0 & 0 & 1 \\ \end{bmatrix} \]
\[ \textbf T = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & -D & 1 \\ \end{bmatrix} \]

The composition looks like

\[ \textbf T \cdot \textbf P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & -D & 1 \\ \end{bmatrix} \cdot \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & \frac 1 D \\ 0 & 0 & 0 & 1 \\ \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & \frac 1 D \\ 0 & 0 & -D & 0 \\ \end{bmatrix} \]
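As a quick check of the composed matrix (keeping the row-vector convention), a point \((x, y, z)\) transforms as

\[\begin{bmatrix} x & y & z & 1 \end{bmatrix} \cdot (\textbf T \cdot \textbf P) = \begin{bmatrix} x & y & z - D & \frac z D \end{bmatrix}\]

and dividing by the homogeneous component \(\frac z D\) yields \(\left(\frac{Dx}{z}, \frac{Dy}{z}, \frac{D(z - D)}{z}\right)\), i.e. perspective projection onto the plane \(z = D\).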

We can redefine the following parameters as

\[w = \cot\left(\frac{\text{fov}_w} 2\right)\]
\[h = \cot\left(\frac{\text{fov}_h} 2\right)\]

\(\text{fov}_w\) and \(\text{fov}_h\) represent the viewport's horizontal and vertical fields of view.

\[q = \frac{z_f}{z_f - z_n}\]

Equivalently, \(w\) and \(h\) can be expressed in terms of the viewport dimensions:

\[w = 2 \times \frac {z_n} {V_w}\]
\[h = 2 \times \frac {z_n} {V_h}\]

Here \(z_n\) and \(z_f\) represent the \(z\) positions of the near and far clipping planes.
\(V_w\) and \(V_h\) represent the width and height of the viewport respectively, in camera space.
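A sketch of this Direct3D-style matrix in code, laid out for row vectors (i.e. the transpose of the column-vector matrix \(\textbf M\) used earlier); the function name, parameter names and the sample angles are illustrative choices.

```python
import math

def d3d_perspective(fov_w, fov_h, z_n, z_f):
    w = 1.0 / math.tan(fov_w / 2)   # cot(fov_w / 2)
    h = 1.0 / math.tan(fov_h / 2)   # cot(fov_h / 2)
    q = z_f / (z_f - z_n)
    # Row-vector layout: the 1 in the third row of the last column copies z
    # into the homogeneous component.
    return [[w,   0.0, 0.0,      0.0],
            [0.0, h,   0.0,      0.0],
            [0.0, 0.0, q,        1.0],
            [0.0, 0.0, -q * z_n, 0.0]]

# Angles are in radians: about 90 degrees horizontally and 66 degrees vertically.
M_d3d = d3d_perspective(math.pi / 2, 1.15, 0.1, 1000.0)
```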

References


  1. Read more about matrices