
# Hello World: Testing the Pipeline

A demo post that exercises every feature of the markdown pipeline — math, code, tables, and more.

#reinforcement-learning #meta

This is a test post to verify the full rendering pipeline. If you can see properly formatted math, syntax-highlighted code, and a table below, everything is working.

## Inline and Display Math

Einstein's famous mass-energy equivalence is $E = mc^2$, and the gradient of a scalar field is $\nabla f = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right)$.

The Bellman optimality equation for the state-value function $V^*(s)$ under a discounted infinite-horizon MDP is:

$$
V^*(s) = \max_{a \in \mathcal{A}} \left[ R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \, V^*(s') \right]
$$

And the corresponding action-value form:

$$
Q^*(s, a) = R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \max_{a'} Q^*(s', a')
$$
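The two are linked by the standard identity $V^*(s) = \max_{a \in \mathcal{A}} Q^*(s, a)$, which also serves as a handy consistency check between the two display blocks above.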

## Code Blocks

Here's a simple value iteration implementation in Python:

```python
# value_iteration.py
import numpy as np

def value_iteration(P, R, gamma=0.99, theta=1e-8):
    """Compute V* and a greedy policy for a finite MDP.

    P: transitions, shape (S, A, S), with P[s, a, s'] = P(s' | s, a)
    R: rewards, shape (S, A)
    """
    V = np.zeros(P.shape[0])
    while True:
        # Bellman optimality backup: best action value under the current V
        V_new = np.max(R + gamma * P @ V, axis=1)
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < theta:
            break
    # Extract the greedy policy from the converged values
    policy = np.argmax(R + gamma * P @ V, axis=1)
    return V, policy
```
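
As a quick smoke test, here is a hypothetical two-state, two-action MDP (all numbers invented purely for illustration) run through the function above:

```python
import numpy as np

# Hypothetical MDP, purely illustrative: P[s, a, s'] and R[s, a].
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # from state 0: action 0, action 1
    [[0.0, 1.0], [0.5, 0.5]],  # from state 1: action 0, action 1
])
R = np.array([
    [1.0, 0.0],  # rewards in state 0
    [0.0, 2.0],  # rewards in state 1
])

V, policy = value_iteration(P, R, gamma=0.9)
print(V, policy)  # optimal values and the greedy action for each state
```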

And a TypeScript utility for reading frontmatter:

```ts
// parse-post.ts
import matter from "gray-matter";
import fs from "fs";

interface PostMeta {
  title: string;
  description?: string;
  tags: string[];
}

export function parsePost(filePath: string) {
  const raw = fs.readFileSync(filePath, "utf8");
  // gray-matter splits the file into YAML frontmatter (data) and body (content)
  const { data, content } = matter(raw);
  return { meta: data as PostMeta, content };
}
```
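
Worth noting: `matter()` returns the parsed frontmatter as a plain untyped object, so the `as PostMeta` cast is only a compile-time assertion. A post missing `title` or `tags` will still parse; if that matters, a runtime schema check (with a library such as zod) would be the place to catch it.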

## A Blockquote

> The reward hypothesis: all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
>
> — Rich Sutton

## Tables

| Algorithm  | On/Off-Policy | Model-Free | Continuous Actions |
|------------|---------------|------------|--------------------|
| Q-Learning | Off           | Yes        | No                 |
| SARSA      | On            | Yes        | No                 |
| PPO        | On            | Yes        | Yes                |
| SAC        | Off           | Yes        | Yes                |
| DDPG       | Off           | Yes        | Yes                |
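
The On/Off-Policy column comes down to which action value each method bootstraps from. A minimal sketch of the two tabular updates (the `Q` array, learning rate `alpha`, and transition variables here are assumptions for illustration, not part of the post above):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the greedy action in s_next,
    # regardless of what the behavior policy actually does next.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from a_next, the action the current
    # policy actually selected in s_next.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
```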

## Lists

Key components of a typical RL agent:

- A policy $\pi(a \mid s)$ mapping states to actions
- A value function $V^\pi(s)$ or $Q^\pi(s, a)$
- Optionally, a model of the environment dynamics
  - Transition function $P(s' \mid s, a)$
  - Reward function $R(s, a)$

Steps in the policy gradient derivation (a code sketch follows the list):

1. Define the objective $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} [R(\tau)]$
2. Apply the log-derivative trick
3. Estimate the gradient with Monte Carlo samples
4. Add a baseline to reduce variance
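
A minimal sketch of steps 2 through 4, assuming a softmax policy on a one-step bandit so that $R(\tau)$ is a single reward; the action count, reward means, and running-mean baseline are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3                                    # illustrative: 3 discrete actions
theta = np.zeros(K)                      # policy parameters, one logit per action
mean_reward = np.array([1.0, 0.5, 0.2])  # hypothetical expected rewards

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

samples = []
baseline = 0.0
for t in range(1, 2001):
    pi = softmax(theta)
    a = rng.choice(K, p=pi)                       # sample a ~ pi_theta
    r = mean_reward[a] + rng.normal(0.0, 0.1)     # noisy one-step return R(tau)
    grad_log_pi = -pi                             # step 2: for a softmax policy,
    grad_log_pi[a] += 1.0                         # grad log pi(a) = e_a - pi
    samples.append(grad_log_pi * (r - baseline))  # step 3: one Monte Carlo sample
    baseline += (r - baseline) / t                # step 4: running-mean baseline

grad_J = np.mean(samples, axis=0)                 # estimates grad_theta J(theta)
```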

That's everything. If the math renders, the code highlights, and the table aligns, you're good to go.