
# Hello World: Testing the Pipeline

A demo post that exercises every feature of the markdown pipeline — math, code, tables, and more.

#reinforcement-learning #meta

This is a test post to verify the full rendering pipeline. If you can see properly formatted math, syntax-highlighted code, and a table below, everything is working.

## Inline and Display Math

Einstein's famous mass-energy equivalence is $E = mc^2$, and the gradient of a scalar field is $\nabla f = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right)$.

The Bellman optimality equation for the state-value function $V^*(s)$ under a discounted infinite-horizon MDP is:

$$
V^*(s) = \max_{a \in \mathcal{A}} \left[ R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \, V^*(s') \right]
$$

And the corresponding action-value form:

$$
Q^*(s, a) = R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \max_{a'} Q^*(s', a')
$$
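The two are linked by the standard identity $V^*(s) = \max_{a \in \mathcal{A}} Q^*(s, a)$, which also serves as a handy consistency check between the two display blocks above.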

## Code Blocks

Here's a simple value iteration implementation in Python:

```python
# value_iteration.py
import numpy as np

def value_iteration(P, R, gamma=0.99, theta=1e-8):
    """Compute V* and a greedy policy for a finite MDP.

    P: transitions, shape (S, A, S), with P[s, a, s'] = P(s' | s, a)
    R: rewards, shape (S, A)
    """
    V = np.zeros(P.shape[0])
    while True:
        # Bellman optimality backup: best action value under the current V
        V_new = np.max(R + gamma * P @ V, axis=1)
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < theta:
            break
    # Extract the greedy policy from the converged values
    policy = np.argmax(R + gamma * P @ V, axis=1)
    return V, policy
```
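
As a quick smoke test, here is a hypothetical two-state, two-action MDP (all numbers invented purely for illustration) run through the function above:

```python
import numpy as np

# Hypothetical MDP, purely illustrative: P[s, a, s'] and R[s, a].
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # from state 0: action 0, action 1
    [[0.0, 1.0], [0.5, 0.5]],  # from state 1: action 0, action 1
])
R = np.array([
    [1.0, 0.0],  # rewards in state 0
    [0.0, 2.0],  # rewards in state 1
])

V, policy = value_iteration(P, R, gamma=0.9)
print(V, policy)  # optimal values and the greedy action for each state
```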

And a TypeScript utility for reading frontmatter:

```ts
// parse-post.ts
import matter from "gray-matter";
import fs from "fs";

interface PostMeta {
  title: string;
  description?: string;
  tags: string[];
}

export function parsePost(filePath: string) {
  const raw = fs.readFileSync(filePath, "utf8");
  // gray-matter splits the file into YAML frontmatter (data) and body (content)
  const { data, content } = matter(raw);
  return { meta: data as PostMeta, content };
}
```
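
Worth noting: `matter()` returns the parsed frontmatter as a plain untyped object, so the `as PostMeta` cast is only a compile-time assertion. A post missing `title` or `tags` will still parse; if that matters, a runtime schema check (with a library such as zod) would be the place to catch it.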

## A Blockquote

> The reward hypothesis: all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
>
> — Rich Sutton

## Tables

| Algorithm  | On/Off-Policy | Model-Free | Continuous Actions |
|------------|---------------|------------|--------------------|
| Q-Learning | Off           | Yes        | No                 |
| SARSA      | On            | Yes        | No                 |
| PPO        | On            | Yes        | Yes                |
| SAC        | Off           | Yes        | Yes                |
| DDPG       | Off           | Yes        | Yes                |
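
The On/Off-Policy column comes down to which action value each method bootstraps from. A minimal sketch of the two tabular updates (the `Q` array, learning rate `alpha`, and transition variables here are assumptions for illustration, not part of the post above):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the greedy action in s_next,
    # regardless of what the behavior policy actually does next.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from a_next, the action the current
    # policy actually selected in s_next.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
```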

## Lists

Key components of a typical RL agent:

- A policy $\pi(a \mid s)$ mapping states to actions
- A value function $V^\pi(s)$ or $Q^\pi(s, a)$
- Optionally, a model of the environment dynamics
  - Transition function $P(s' \mid s, a)$
  - Reward function $R(s, a)$

Steps in the policy gradient derivation (a code sketch follows the list):

1. Define the objective $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} [R(\tau)]$
2. Apply the log-derivative trick
3. Estimate the gradient with Monte Carlo samples
4. Add a baseline to reduce variance
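
A minimal sketch of steps 2 through 4, assuming a softmax policy on a one-step bandit so that $R(\tau)$ is a single reward; the action count, reward means, and running-mean baseline are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3                                    # illustrative: 3 discrete actions
theta = np.zeros(K)                      # policy parameters, one logit per action
mean_reward = np.array([1.0, 0.5, 0.2])  # hypothetical expected rewards

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

samples = []
baseline = 0.0
for t in range(1, 2001):
    pi = softmax(theta)
    a = rng.choice(K, p=pi)                       # sample a ~ pi_theta
    r = mean_reward[a] + rng.normal(0.0, 0.1)     # noisy one-step return R(tau)
    grad_log_pi = -pi                             # step 2: for a softmax policy,
    grad_log_pi[a] += 1.0                         # grad log pi(a) = e_a - pi
    samples.append(grad_log_pi * (r - baseline))  # step 3: one Monte Carlo sample
    baseline += (r - baseline) / t                # step 4: running-mean baseline

grad_J = np.mean(samples, axis=0)                 # estimates grad_theta J(theta)
```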

That's everything. If the math renders, the code highlights, and the table aligns, you're good to go.