<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://piotrtrochim.net/feed.xml" rel="self" type="application/atom+xml" /><link href="https://piotrtrochim.net/" rel="alternate" type="text/html" /><updated>2026-05-06T11:22:56+00:00</updated><id>https://piotrtrochim.net/feed.xml</id><title type="html">Piotr Trochim</title><subtitle>Thoughts and results I wish to persist.
</subtitle><author><name>Piotr Trochim</name></author><entry><title type="html">Coding is not Software Engineering</title><link href="https://piotrtrochim.net/2026/03/26/coding-is-not-software-engineering/" rel="alternate" type="text/html" title="Coding is not Software Engineering" /><published>2026-03-26T10:20:56+00:00</published><updated>2026-03-26T10:20:56+00:00</updated><id>https://piotrtrochim.net/2026/03/26/coding-is-not-software-engineering</id><content type="html" xml:base="https://piotrtrochim.net/2026/03/26/coding-is-not-software-engineering/"><![CDATA[<p>This topic has been on my mind for many years now.</p>

<p>The software industry has been inventing novel “Engineering” titles, but it really seems to be peaking now, with “Research Engineers”, “ML Engineer”, “Data Engineers” etc. On the other hand, we start seeing the proliferation of “Scientist” titles, surrounded with similar adjectives.</p>

<p>But when you meet and work with those folks - at FAANG companies or at startups - what strikes you is their utter unfamiliarity with software engineering. At best they can be considered Hackers - not the kind that breaks into systems, but the kind that slaps two pieces of code together hoping it will do what they want.</p>

<p>The rise of generative AI and its applications to coding makes things even more confusing. I interview candidates who don’t even know how to code - folks I refer to as Wannabes.</p>

<h2 id="recruitment">Recruitment</h2>

<p>This becomes a problem when it affects the social contract. I’m actively recruiting for my teams, and I’m spending a significant chunk of my time interviewing. I look for Engineers, but the recruiters keep sending me either Hackers or Wannabes.</p>

<p>The reason? All of them have extensive resumes plastered with Engineering titles.</p>

<p>The situation is so ridiculous that one can’t really trust resumes anymore and has to spend a lot of time figuring out people’s actual expertise.</p>

<h2 id="incentives">Incentives</h2>

<p>The other element that adds to the confusion is the set of adjectives appended to the title “Engineer”, and their reflection of current industry trends.</p>

<p>These days, no one wants a “Software Engineering” role. But replace the word “Software” with “ML”, and applicants will swarm to it.</p>

<p>If you ask them about the difference, they will start describing their ambitions to work with AI and their fears of being rendered obsolete on the job market.</p>

<p>But not a single applicant focuses on the “Engineering” aspect of that job, and what it entails in both cases.</p>

<h2 id="coding-is-not-software-engineering">Coding is not Software Engineering</h2>

<p>So what is the difference, really?</p>

<p>Engineering is the process of building machines that don’t break. An engineer knows their creation through and through - they thoroughly understand the mechanisms that underlie it, and they know its operating specifications and safety margins. They have the skills and tools required to modify and repair those mechanisms without sacrificing the complete understanding of the creation.</p>

<p>The term “complexity management” is often used to describe what engineers spend a significant portion of their time doing. When a need to extend a mechanism arises, the engineer spends a lot of time rethinking the existing one in order to best accommodate the new mechanism and to comprehensively chart the new operating regime.</p>

<p>In software, these translate to refactoring and testing.</p>

<p>Coding on the other hand is just a tool an engineer has under their belt - and it’s one that is used at the lowest level of abstraction.</p>

<p>Engineers operate at 3 layers of abstraction:</p>

<ul>
  <li>
    <p>the lowest one is writing code. This amounts to writing very small bricks the larger mechanism will be composed of.</p>
  </li>
  <li>
    <p>above it sits mechanism design - the skill of composing virtual machines that operate in a loop, and of guaranteeing the stability of that loop - that it will not stop, break or go out of whack. In order to construct that, the engineer measures how the bricks “sit together”, how the larger construct works, and what might make it break. Testing is the main tool used at this level.</p>
  </li>
  <li>
    <p>the highest level of abstraction is thinking about the creation as a whole - planning how smaller mechanisms (a.k.a. components) will come together to fulfil user requirements. The main tool here is refactoring, which supports “complexity management”.</p>
  </li>
</ul>

<p>Given this ontology, we can now define the other terminology:</p>

<ul>
  <li>
    <p>Coding - a skill employed at the lowest level of engineering abstraction</p>
  </li>
  <li>
    <p>Hackers - folks operating only at the lowest level of abstraction - slapping pieces of code together and hoping for the best, without employing any heuristic that guarantees directionality of their efforts</p>
  </li>
  <li>
    <p>Wannabes - folks who rely on external tools (such as Gen AI) to do the lowest rung work for them - they operate completely outside this framework.</p>
  </li>
</ul>

<h2 id="hussle-culture-and-the-rise-of-fake-it-till-you-make-it">Hussle culture and the rise of “fake it till you make it”</h2>

<p>Corporate values such as “Move fast and break things” and the culture that prioritizes shipping code over understanding what is being built gave rise to this very confusing landscape.</p>

<p>I feel surrounded by people who keep slapping on titles and cheating their way through careers.</p>

<p>Meanwhile, the years meticulously spent on honing the craft of Software Engineering seem to be impairing my ability to grow my career. I keep encountering a very strange response to my skills - folks are afraid I will “slow the company down”, because “I want to write tests”, all the while they keep scoring me using “Engineering Excellence metrics”.</p>

<p>Now that I’m an entrepreneur, I’m making sure not to let such confusing cultural artifacts take root in my companies - but I still pity good engineers who suffer elsewhere, trying to make sense of these very confusing rules.</p>

<p>And I fear for the future in which Software Engineering will disappear in all, but its name. I fear it, because if the same trend took place in construction of aviation - I would start fearing for my life.</p>]]></content><author><name>Piotr Trochim</name></author><category term="engineering" /><summary type="html"><![CDATA[This topic has been on my mind for many years now.]]></summary></entry><entry><title type="html">PPO vs GRPO</title><link href="https://piotrtrochim.net/2026/02/04/ppo-vs-grpo/" rel="alternate" type="text/html" title="PPO vs GRPO" /><published>2026-02-04T12:43:14+00:00</published><updated>2026-02-04T12:43:14+00:00</updated><id>https://piotrtrochim.net/2026/02/04/ppo-vs-grpo</id><content type="html" xml:base="https://piotrtrochim.net/2026/02/04/ppo-vs-grpo/"><![CDATA[<blockquote>
  <p>A Deep Dive into Two Policy Gradient Implementations</p>
</blockquote>

<p>This post compares two policy optimization algorithms implemented in our RL codebase: Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO). We’ll examine their architectural differences, loss functions, and training loops from an engineering perspective.</p>

<h2 id="shared-abstractions-the-policy-interface">Shared Abstractions: The Policy Interface</h2>

<p>Both PPO and GRPO optimize a stochastic policy—a neural network that outputs a probability distribution over actions given a state. Despite their algorithmic differences, both algorithms need the same core operations from a policy:</p>

<ol>
  <li>
    <p><strong>Sample actions</strong> during environment rollout</p>
  </li>
  <li>
    <p><strong>Evaluate action probabilities</strong> during training (for importance sampling)</p>
  </li>
</ol>

<p>This shared requirement allows us to define a single <code class="language-plaintext highlighter-rouge">StochasticPolicy</code> interface that both algorithms consume:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">StochasticPolicy</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">get_action</span><span class="p">(</span><span class="n">states</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="p">(</span><span class="n">action</span><span class="p">,</span> <span class="n">log_prob</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">evaluate_actions</span><span class="p">(</span><span class="n">states</span><span class="p">,</span> <span class="n">actions</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="p">(</span><span class="n">log_probs</span><span class="p">,</span> <span class="n">entropy</span><span class="p">)</span>
</code></pre></div></div>

<p>The benefit is that policy architectures become algorithm-agnostic. Whether using a simple MLP, a continuous Gaussian policy, or a LoRA-finetuned LLM backbone, the same network can be trained with either PPO or GRPO without modification.</p>
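
<p>To make the interface concrete, here is a minimal sketch of a discrete MLP policy that satisfies it - the class name and layer sizes are illustrative, not the ones from the actual codebase:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.distributions as dist
import torch.nn as nn


class MLPDiscretePolicy(nn.Module):
    """Hypothetical minimal StochasticPolicy for discrete action spaces."""

    def __init__(self, state_dim: int, action_dim: int) -&gt; None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim)
        )

    def get_action(self, states: torch.Tensor) -&gt; tuple[torch.Tensor, torch.Tensor]:
        # Sample from the categorical distribution defined by the logits.
        actions_dist = dist.Categorical(logits=self.net(states))
        actions = actions_dist.sample()
        return actions, actions_dist.log_prob(actions)

    def evaluate_actions(
        self, states: torch.Tensor, actions: torch.Tensor
    ) -&gt; tuple[torch.Tensor, torch.Tensor]:
        # Re-score previously taken actions under the current policy.
        actions_dist = dist.Categorical(logits=self.net(states))
        return actions_dist.log_prob(actions), actions_dist.entropy()
</code></pre></div></div>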

<h2 id="the-core-architectural-difference-value-functions">The Core Architectural Difference: Value Functions</h2>

<p>The fundamental distinction between PPO and GRPO lies in how they estimate the baseline for variance reduction.</p>

<ul>
  <li>
    <p><strong>PPO</strong> trains a value network alongside the policy. This network learns to predict expected returns from each state, providing a per-timestep baseline for advantage computation.</p>
  </li>
  <li>
    <p><strong>GRPO</strong> eliminates the value network entirely. Instead, it collects multiple trajectories per update and normalizes rewards <em>across the group</em>. The baseline is simply the mean reward of the batch—no learned parameters required.</p>
  </li>
</ul>

<p>This trade-off is significant - PPO has more model complexity but needs fewer environment samples per update. GRPO has a simpler architecture but requires collecting <code class="language-plaintext highlighter-rouge">group_dim</code> trajectories before each optimization step.</p>

<h2 id="loss-function-comparison">Loss Function Comparison</h2>

<p>Both algorithms use the clipped surrogate objective from PPO, but differ in advantage computation and regularization.</p>

<h3 id="the-clipped-objective-shared">The Clipped Objective (Shared)</h3>

<p>Both algorithms compute the probability ratio between current and old policies, then clip it to prevent destructive updates:</p>

<p>\[ L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right] \]</p>

<p>where</p>

<p>\[ r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \]</p>

<p>is the importance sampling ratio.</p>
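
<p>In code, the clipped objective can be sketched roughly like this - tensor and function names are illustrative, not the ones from the codebase:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch


def clipped_surrogate_loss(
    new_log_probs: torch.Tensor,
    old_log_probs: torch.Tensor,
    advantages: torch.Tensor,
    clip_eps: float = 0.2,
) -&gt; torch.Tensor:
    # r_t(theta), computed in log-space for numerical stability.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # We maximize the surrogate, so the loss is its negation.
    return -torch.min(unclipped, clipped).mean()
</code></pre></div></div>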

<h3 id="advantage-estimaqtion">Advantage estimaqtion</h3>

<p>PPO uses Generalized Advantage Estimation (GAE):</p>

<p>\[ \hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l} \]</p>

<p>where</p>

<p>\[ \delta_t = r_t + \gamma\, V(s_{t+1}) - V(s_t). \]</p>

<p>Meanwhile, GRPO uses group-relative normalization (which gives rise to its name):</p>

<p>\[ \hat{A}_i = \frac{R_i - \mathrm{mean}(\{R_1, \dots, R_G\})}{\mathrm{std}(\{R_1, \dots, R_G\})} \]</p>

<p>where the mean and standard deviation are computed across multiple collected trajectories for the same episode rollout.</p>
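
<p>For reference, a minimal sketch of GAE as described above might look like the following - the function name echoes the pseudocode later in this post, but the exact signature is an assumption:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch


def compute_gae(
    rewards: torch.Tensor,  # shape (T,)
    values: torch.Tensor,   # shape (T + 1,), includes the bootstrap value
    dones: torch.Tensor,    # shape (T,), 1.0 where the episode ended
    gamma: float = 0.99,
    lam: float = 0.95,
) -&gt; torch.Tensor:
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    # Walk the trajectory backwards, accumulating discounted TD errors.
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages
</code></pre></div></div>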

<h3 id="value-loss-and-regularization">Value loss and regularization</h3>

<p>GRPO drops PPO’s value loss term, because it’s not using a value function.</p>

<p>Regularization also differs - the baseline PPO algorithm used an entropy bonus, while some later extensions migrated to a KL divergence penalty that minimizes the changes to the policy distribution during a training step. A KL divergence penalty is also what GRPO uses.</p>

<p>The GRPO paper also introduced a more stable way of approximating the KL divergence.</p>
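
<p>A sketch of that estimator - the unbiased, always non-negative approximation popularized by John Schulman and adopted in the GRPO paper - could look like this (variable names are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch


def approx_kl(policy_log_probs: torch.Tensor, ref_log_probs: torch.Tensor) -&gt; torch.Tensor:
    # KL(pi_theta || pi_ref) ~= exp(log pi_ref - log pi_theta)
    #                           - (log pi_ref - log pi_theta) - 1
    log_ratio = ref_log_probs - policy_log_probs
    return (torch.exp(log_ratio) - log_ratio - 1.0).mean()
</code></pre></div></div>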

<h3 id="final-loss">Final loss</h3>

<p>For PPO, it’s:</p>

<p>\[ L^{\mathrm{PPO}}(\theta) = \mathbb{E}_t\!\left[ L^{\mathrm{CLIP}}_t(\theta) - c_1\, L^{\mathrm{VF}}_t(\theta) + c_2\, S[\pi_\theta](s_t) \right] \]</p>

<p>while for GRPO it’s:</p>

<p>\[ L^{\mathrm{GRPO}}(\theta) = \mathbb{E}_t\!\left[ L^{\mathrm{CLIP}}_t(\theta) \right] - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right] \]</p>

<h3 id="key-insight-temporal-vs-batch-normalization">Key Insight: Temporal vs. Batch Normalization</h3>

<p>The advantage computation is where these algorithms fundamentally diverge:</p>

<ul>
  <li>
    <p><strong>PPO’s GAE</strong> performs <em>temporal</em> credit assignment—it propagates information backward through time using the learned value function to determine which actions led to good outcomes.</p>
  </li>
  <li>
    <p><strong>GRPO’s group normalization</strong> performs <em>batch</em> comparison—it asks “how did this trajectory’s reward compare to other trajectories from the same policy?” This sidesteps temporal credit assignment entirely.</p>
  </li>
</ul>

<h2 id="training-loop-comparison">Training Loop Comparison</h2>

<p>Because GRPO trains on multiple trajectories for the same initial condition (or, in LLM terms - multiple completions to the same input query), several aspects of the training loop change:</p>

<p>PPO’s tensor shapes for States, Actions, Rewards etc. take the form <code class="language-plaintext highlighter-rouge">(T, …)</code>, where T is the trajectory length. GRPO introduces an additional dimension G - the group dimension - changing the shape to <code class="language-plaintext highlighter-rouge">(T, G, …)</code>.</p>

<p>Because of this additional Group dimension, which can be thought of as a Batch dimension, trajectories for GRPO need to be <strong>padded</strong> and <strong>masked</strong>.</p>

<p>Lastly, because GRPO doesn’t use a value function, only the policy module is trained.</p>

<h3 id="the-masking-requirement-in-grpo">The Masking Requirement in GRPO</h3>

<p>Since GRPO stacks multiple trajectories of different lengths, it must:</p>

<ol>
  <li>
    <p>Pad shorter trajectories to match the longest</p>
  </li>
  <li>
    <p>Create a mask tensor indicating valid (non-padded) positions</p>
  </li>
  <li>
    <p>Apply the mask during:</p>

    <ol>
      <li>
        <p>Advantage normalization (compute mean/std only over valid rewards)</p>
      </li>
      <li>
        <p>Loss computation (exclude padded timesteps from gradients)</p>
      </li>
    </ol>
  </li>
</ol>

<p>This adds implementation complexity but is unavoidable when batching variable-length sequences.</p>
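
<p>As an illustration, masked group normalization might be sketched as follows - the function name and the (T, G) layout follow the shapes described above and are assumptions about the actual codebase:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch


def group_normalize(rewards: torch.Tensor, mask: torch.Tensor, eps: float = 1e-8) -&gt; torch.Tensor:
    """Normalize rewards across the whole group, ignoring padded positions.

    rewards, mask: shape (T, G); mask holds 1.0 for valid timesteps, 0.0 for padding.
    """
    valid_rewards = rewards[mask.bool()]
    mean = valid_rewards.mean()
    std = valid_rewards.std()
    advantages = (rewards - mean) / (std + eps)
    return advantages * mask  # zero-out contributions from padded timesteps
</code></pre></div></div>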

<h3 id="pseudocode-comparison">Pseudocode Comparison</h3>

<p>PPO Training Loop:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="n">epochs</span><span class="p">:</span>
    <span class="n">trajectory</span> <span class="o">=</span> <span class="n">collect_episodes</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span> <span class="n">env</span><span class="p">)</span>          <span class="c1"># Shape: (T, ...)
</span>
    <span class="n">advantages</span> <span class="o">=</span> <span class="n">compute_gae</span><span class="p">(</span><span class="n">trajectory</span><span class="p">,</span> <span class="n">value_fn</span><span class="p">)</span>      <span class="c1"># Temporal credit assignment
</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">ppo_loss</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span> <span class="n">value_fn</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">,</span> <span class="n">advantages</span><span class="p">)</span>

    <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">loss</span><span class="p">)</span>                                <span class="c1"># Updates policy AND value_fn
</span></code></pre></div></div>

<p>GRPO Training Loop:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="n">epochs</span><span class="p">:</span>
    <span class="n">trajectories</span> <span class="o">=</span> <span class="p">[</span><span class="n">collect_episodes</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span> <span class="n">env</span><span class="p">)</span>
                    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">group_dim</span><span class="p">)]</span>          <span class="c1"># List of G trajectories
</span>
    <span class="n">batched</span> <span class="o">=</span> <span class="n">stack_and_pad</span><span class="p">(</span><span class="n">trajectories</span><span class="p">)</span>               <span class="c1"># Shape: (T, G, ...)
</span>    <span class="n">mask</span> <span class="o">=</span> <span class="n">compute_mask</span><span class="p">(</span><span class="n">batched</span><span class="p">)</span>                        <span class="c1"># Valid position indicator
</span>    <span class="n">advantages</span> <span class="o">=</span> <span class="n">group_normalize</span><span class="p">(</span><span class="n">batched</span><span class="p">.</span><span class="n">rewards</span><span class="p">,</span> <span class="n">mask</span><span class="p">)</span> <span class="c1"># Batch comparison
</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">grpo_loss</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span> <span class="n">batched</span><span class="p">,</span> <span class="n">advantages</span><span class="p">,</span> <span class="n">mask</span><span class="p">)</span>

    <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">loss</span><span class="p">)</span>                                <span class="c1"># Updates policy only
</span></code></pre></div></div>

<h2 id="when-to-use-each-algorithm">When to Use Each Algorithm</h2>

<p>The obvious difference is that PPO is a general purpose policy optimization algorithm that can handle both LLM tuning as well as other RL use cases.</p>

<p>In addition, it has the edge when dealing with limited environment samples - the value function increases sample efficiency and enables learning from fewer trajectories.</p>

<p>GRPO on the other hand has been designed specifically for LLM tuning. I took the liberty of unleashing it on a regular RL environment; however, it’s not well suited to that - it’s much less sample efficient and assumes a strong policy baseline.</p>

<h2 id="conclusion">Conclusion</h2>

<p>PPO and GRPO share the same clipped surrogate objective but differ fundamentally in how they estimate advantages:</p>

<ul>
  <li>
    <p><strong>PPO</strong> learns a value function for temporal credit assignment, adding model complexity but reducing sample requirements</p>
  </li>
  <li>
    <p><strong>GRPO</strong> uses group-relative normalization, trading sample efficiency for architectural simplicity</p>
  </li>
</ul>

<p>The shared policy interface in our implementation means you can benchmark both algorithms with identical network architectures—only the training loop and loss computation differ.</p>

<p>Thank you for reading !</p>]]></content><author><name>Piotr Trochim</name></author><category term="ml" /><category term="rl" /><summary type="html"><![CDATA[A Deep Dive into Two Policy Gradient Implementations]]></summary></entry><entry><title type="html">Teaching Claude to Create Pixel Art</title><link href="https://piotrtrochim.net/2026/01/25/teaching-claude-to-create-pixel-art/" rel="alternate" type="text/html" title="Teaching Claude to Create Pixel Art" /><published>2026-01-25T23:27:28+00:00</published><updated>2026-01-25T23:27:28+00:00</updated><id>https://piotrtrochim.net/2026/01/25/teaching-claude-to-create-pixel-art</id><content type="html" xml:base="https://piotrtrochim.net/2026/01/25/teaching-claude-to-create-pixel-art/"><![CDATA[<blockquote>
  <p>Are Agentic Frameworks just slow, imperfect Finetuning frameworks?</p>
</blockquote>

<h2 id="the-challenge">The Challenge</h2>

<p>Like many developers working with AI assistants, I’ve noticed that Claude is quite good at understanding concepts, analyzing patterns, and explaining techniques—but there’s a significant gap between knowledge and capability.</p>

<p>With the recent conversations about the ever improving LLM capabilities, and the agentic frameworks approaching singularity, I thought I’d put Claude to the test.</p>

<h3 id="generating-a-pixel-art-character-using-claude-opus-45">Generating a pixel art character using Claude Opus 4.5</h3>

<p>The challenge I selected was something that lies slightly outside the Claude models’ area of expertise. I wanted to see how the model itself, and the agentic framework built around it - Claude Code - can handle such a task and potentially self-improve in order to execute it.</p>

<p>I started with this image <a href="https://kagi.com/proxy/pixel-art-haunted-house-spooky-night-with-full-moon-bats-graveyard-perfect-halloween-designs-games-illustrations_612609-2851.jpg?c=q4aTKI-vwxsHxiY44K7HIgHTp64x0XTwqKl3Cxm8W3LDmC7NOcmY0r8tCG1ekRkl94xBUm0FRiqU4oE6OkopSEbe3rxflH2pdmBpLkukAQhk52OMltSr5qqsp-KuolhmtFYlaZXBB_9lcsroaxtpWgabN_1yb6w-5pL7UTyNmkkThNe5cYHC8yB3cGTcftHuiHd7-ppIGqURzsaNewxkDgfeVqUyM1MMwWmGqkKSWfUYJQqtx1A2NgIJbPKnRDxK">downloaded from the internet</a> and this prompt:</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!pRrA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4731ebf-2516-43b6-8811-511ee085abb1_626x626.jpeg"><img src="https://substackcdn.com/image/fetch/$s_!pRrA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4731ebf-2516-43b6-8811-511ee085abb1_626x626.jpeg" alt="" /></a></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Here's a reference image. Can you generate a 64x64 pixel art image of a character, a detective in a coat with the extracted style and palette.
</code></pre></div></div>

<p>The result I got was:</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!9TS8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3064cd2-ff18-4418-9d2a-86a17c15a100_1016x1016.png"><img src="https://substackcdn.com/image/fetch/$s_!9TS8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3064cd2-ff18-4418-9d2a-86a17c15a100_1016x1016.png" alt="" /></a></p>

<p>This is not bad, however it’s far from some of the pixel art characters one can see in video games. Here’s an example:</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!k9_C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e5a89a-ee72-4f90-bd51-58fa74e668d2_665x240.png"><img src="https://substackcdn.com/image/fetch/$s_!k9_C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e5a89a-ee72-4f90-bd51-58fa74e668d2_665x240.png" alt="" /></a></p>

<h3 id="claude-code--skills">Claude Code + Skills</h3>

<p>I started with the following prompt</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You are not great at creating pixel art graphics. 

I want you to create a skill that would allow you to: analyze reference pictures, understand the style/composition/color palette/how to represent game characters/buildings/scenery/mood, and render pixel art images based on prompts.=
</code></pre></div></div>

<p>Leveraging Claude Skills, the model generated three skills that, given a reference image, would analyze the style &amp; color palette and render a pixel art image.</p>

<p>Claude successfully analyzed the style (“Dark atmospheric pixel art with warm lighting accents”), extracted a 64-color palette organized by function (sky gradients, house structure, warm lighting), and identified key techniques like dithering and color banding - so far so good.</p>

<p>But then came the rendering test. I prompted Claude to create:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A character, a detective in a coat with the extracted style and palette, 32x32
</code></pre></div></div>

<p>… and another with a slightly higher resolution. The result was… disappointing.</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!WuOD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92ead74-0351-45b8-a49b-cfee8388fc32_1740x878.png"><img src="https://substackcdn.com/image/fetch/$s_!WuOD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92ead74-0351-45b8-a49b-cfee8388fc32_1740x878.png" alt="" /></a></p>

<p>Strange shapes, flat colors, and a character that looked nothing like a detective. The results remained poor. The approach was fundamentally flawed.</p>

<h3 id="the-retrospective">The Retrospective</h3>

<p>I prompted Claude to analyze its own failure:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The results are disappointing - flat colors, strange shapes, bad design (not to mention very strange look of the character). Analyze your approach and explain why the skill you generated doesn’t give you ability to generate good pixel art
</code></pre></div></div>

<p>Some of Claude’s self-analysis was remarkably insightful. The response it generated explained that it had extensive knowledge about pixel art theory, but knowing these principles doesn’t translate to being able to apply them.</p>

<p>…Some, however, were outright hallucinations. It claimed the core problem was no visual feedback. As I mentioned at the beginning, it’s a multi-modal model that takes images as input. One might potentially consider the claim to be true - after all, an .svg document is XML - however it is also a widely recognized vector graphics image format, and I would assume that if the authors of the model went to the lengths of making it the output format of choice, they would also take pains to train the model with this format as input.</p>

<h3 id="pivoting-to-blender-3d">Pivoting to Blender 3D</h3>

<p>I decided to give the approach one more go - this time by pivoting to <a href="https://www.blender.org/">Blender 3D</a>. There were a few reasons:</p>

<ul>
  <li>
    <p>there are MCP servers for Blender 3D on GitHub</p>
  </li>
  <li>
    <p>there is <strong>ample</strong> documentation on Blender 3D API</p>
  </li>
  <li>
    <p>… Claude generated that suggestion itself!</p>
  </li>
</ul>

<p>That last part is particularly important - but I don’t want to focus on it now, so let’s put a pin in it.</p>

<p>Claude generated the new skills and started rendering my character. This is the result:</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!VgEr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa691d3b0-e54e-4ea9-b2b9-e32431351162_1004x1200.png"><img src="https://substackcdn.com/image/fetch/$s_!VgEr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa691d3b0-e54e-4ea9-b2b9-e32431351162_1004x1200.png" alt="" /></a></p>

<h2 id="what-was-this-experiment-really-about">What was this experiment really about</h2>

<p><a href="https://substackcdn.com/image/fetch/$s_!jSJe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58fdf988-ebff-465e-b39b-46885a11a37a_1892x892.png"><img src="https://substackcdn.com/image/fetch/$s_!jSJe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58fdf988-ebff-465e-b39b-46885a11a37a_1892x892.png" alt="" /></a></p>

<p>This was an example of In-Context Learning - the key technique Agentic Frameworks like Claude Code use to perform tasks and improve over time.</p>

<p>The agent was tasked with exploring an unknown domain, learning how to solve tasks in that domain, and ultimately demonstrating that newly acquired ability.</p>

<p>To achieve that it had at its disposal:</p>

<ul>
  <li>
    <p>one of the most powerful LLMs (Anthropic Claude Opus 4.5) capable of reasoning, planning and tool execution</p>
  </li>
  <li>
    <p>access to all of the built-in tools Claude Code comes with</p>
  </li>
  <li>
    <p>unlimited thinking time</p>
  </li>
</ul>

<h3 id="finetuning-perspective">Finetuning perspective</h3>

<p>The agent went ahead and built up a training dataset (skills, human feedback, reference images). It then used that information as input to an autoregressive model, effectively changing the model’s original output token distribution given the same prompt.</p>

<p>If we denoted the prompt to generate the character as X, all of the prompts that led to the model building skills as Y, the generated image as Z and the ideal (expected) image as Z’, then we would expect that:</p>

<p>\[ P(Z' \mid X, Y) > P(Z' \mid X) \]</p>

<p>and thus that the model would learn to generate outputs closer to our expectations.</p>

<p>But the results we obtained do not indicate any learning whatsoever. Quite the contrary - each step introduced a significant degradation wrt. the baseline.</p>

<h3 id="is-this-form-of-training-effective">Is this form of training effective?</h3>

<p>The key observation while working with Claude Code was that it didn’t actually attempt to learn:</p>

<ul>
  <li>
    <p>It generated the skills - as in hallucinated them. It didn’t search for any online resources.</p>
  </li>
  <li>
    <p>It didn’t attempt to test the skills it generated - it didn’t generate any images of its own volition, as a part of the process. I needed to do it myself, and then provide feedback.</p>
  </li>
</ul>

<p>It therefore didn’t follow any learning process in the classical sense of the word - create a train set, modify its beliefs, evaluate them on the test set, lather, rinse and repeat.</p>

<p>The actions it took also didn’t make it easy to understand what kind of feedback would be most useful to inject:</p>

<ul>
  <li>
    <p>how to “refactor” the skills it generated to make them better.</p>
  </li>
  <li>
    <p>what data to inject</p>
  </li>
</ul>

<p>In addition, the framework used context window compression which degraded the quality of the information it learned. The compression ran just before the Blender 3D skill refactor.</p>

<h3 id="skills-as-latent-knowledge-representation">Skills as latent knowledge representation</h3>

<p>The Skill definitions themselves are quite an interesting thing. It’s completely unclear how to write them or what to change in order to make them better.</p>

<p>Or more specifically - if a random word or sentence was removed from or added to a Skill definition, to what degree would it affect the generation result?</p>

<h2 id="conclusion">Conclusion</h2>

<p>This experiment in my opinion shows three large drawbacks of agentic frameworks:</p>

<ul>
  <li>
    <p>they are learning systems, but the learning techniques that apply to them are not well documented</p>
  </li>
  <li>
    <p>they build on top of LLMs, and it looks like they can only achieve results as good as the underlying LLM</p>
  </li>
  <li>
    <p>The systems themselves do not readily help the developer build a working training framework to improve on a particular skill - this still seems to belong to the esoteric realm of “prompt engineering”, with little science explaining how to author prompts to achieve desired effects.</p>
  </li>
</ul>

<p>I’m convinced that, given enough time and effort, one can coin sets of Skills that will aid an LLM in performing an arbitrary task - but is it worth it, given the powerful finetuning techniques we have at our disposal?</p>

<p>Thank you for reading !</p>]]></content><author><name>Piotr Trochim</name></author><category term="ml" /><category term="art" /><summary type="html"><![CDATA[Are Agentic Frameworks just slow, imperfect Finetuning frameworks ?]]></summary></entry><entry><title type="html">Where RL training algorithm ends and a Policy starts</title><link href="https://piotrtrochim.net/2026/01/20/where-rl-training-algorithm-ends/" rel="alternate" type="text/html" title="Where RL training algorithm ends and a Policy starts" /><published>2026-01-20T14:34:11+00:00</published><updated>2026-01-20T14:34:11+00:00</updated><id>https://piotrtrochim.net/2026/01/20/where-rl-training-algorithm-ends</id><content type="html" xml:base="https://piotrtrochim.net/2026/01/20/where-rl-training-algorithm-ends/"><![CDATA[<blockquote>
  <p>From MLP to LLM: Refactoring PPO Policy implementations and introducing modular policy design</p>
</blockquote>

<p><a href="https://substack.com/home/post/p-183151495">Previous post in the series</a></p>

<p>In the previous post we implemented the vanilla PPO policy optimization algorithm. In this one we’re going to play around with one crucial component of that vanilla implementation - the policy definition.</p>

<p>Specifically, we’re going to swap out the MLP based policy for one based around one of popular, open-source LLMs (we’ll use <a href="https://huggingface.co/Qwen/Qwen2-0.5B">Qwen2-0.5B</a> baseline model).</p>

<p>The main goal however is to gain a better understanding of the areas of responsibility of the components that make up a vanilla RL algorithm like this one. I admit to often coming away from reading a whitepaper with the impression that the description of such an algorithm conflates a lot of components. This post is an attempt to segregate them into exchangeable parts.</p>

<p>This will be a pure software engineering post - we’ll do a fair bit of refactoring and design analysis.</p>

<h3 id="vanilla-ppo-architecture-refresher">Vanilla PPO architecture refresher</h3>

<p><a href="https://substackcdn.com/image/fetch/$s_!Jh08!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81d8fc93-9f65-4dcb-bbb6-1eb07a07c8a4_2578x1190.png"><img src="https://substackcdn.com/image/fetch/$s_!Jh08!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81d8fc93-9f65-4dcb-bbb6-1eb07a07c8a4_2578x1190.png" alt="" /></a></p>

<p>Figure 1. Vanilla PPO architecture in two contexts - when a policy is trained using PPO, and when a trained policy is used in an application.</p>

<p>After coming away from reading the whitepaper and the code, I have to admit that the concept of PPO in my mind spanned three separate entities - the training algorithm, the trained policy and the value function. All three were written to complement and aid the training process.</p>

<p>But if we take a step back and consider other contexts - such as when the policy is used after training - we notice that we can jettison all of those entities but the policy itself.</p>

<p>The natural question then is how much of what PPOPolicy represents is related to PPO itself? Could we turn it into an entity that’s unrelated to PPO, one that could be trained with other approaches? We’ll complete this exploration in the next post, when we train it with GRPO, but here we’ll try nudging this concept from a different direction.</p>

<h3 id="llm-based-policy-implementation">LLM based policy implementation</h3>

<p><a href="https://substackcdn.com/image/fetch/$s_!IDEd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7d8a44c-10d2-45e1-ac9b-9328c846d312_1392x1652.png"><img src="https://substackcdn.com/image/fetch/$s_!IDEd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7d8a44c-10d2-45e1-ac9b-9328c846d312_1392x1652.png" alt="" /></a></p>

<p>Figure 2. Policy architecture that uses QWEN2-0.5B base model</p>

<p>Our current policy implementations are simple MLPs. Let’s say we wanted to replace them with a foundational language model - what would need to change?</p>

<p>There’s a few questions we need to answer:</p>

<p><strong>Q: Which model to pick - should it be one of large online LLMs, or something small?</strong></p>

<p>The key factors are our ability to train such a model, speed of execution, and cost. I chose QWEN2-0.5B, because I can host it locally and it’s fairly fast for this little project.</p>

<p><strong>Q: How to even train such a model?</strong></p>

<p>I consider <a href="https://huggingface.co/">HuggingFace</a> my main model repository. It saves me the trouble of adapting models that were written using different coding best practices.</p>

<p>That convenience hides an important aspect of a model though - access to its parameters. The other confusing aspect is the need to use a tokenizer to generate the input. A tokenizer takes strings as input - but we operate on states that are vectors of floating point values rather than strings.</p>

<p>The other issue is the training method - should we train the entire model, or append a few layers to it and train only those, freezing the weights of the underlying model (known as Parameter Efficient Fine Tuning - or PEFT for short)?</p>

<p>Generally speaking, training the entire model can lead to undesired effects, such as loss of generalization. But more importantly - even for a small model like QWEN2-0.5B, it would be very slow and very memory inefficient.</p>

<p>To solve both problems we will therefore use PEFT training method. And lucky for us, HuggingFace offers a neat PEFT implementation in its <a href="https://huggingface.co/docs/peft/en/index">peft</a> library. This library solves our first issue - access to parameters.</p>

<p><strong>Q: How to bypass the tokenizer?</strong></p>

<p><a href="https://substackcdn.com/image/fetch/$s_!NaFB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F096f36d0-0ff5-4e55-b115-8e5fb557363c_1942x1742.png"><img src="https://substackcdn.com/image/fetch/$s_!NaFB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F096f36d0-0ff5-4e55-b115-8e5fb557363c_1942x1742.png" alt="" /></a></p>

<p>Figure 3. part “a” shows the regular setup in which Tokenizer is used as a string encoder/decoder. part “b” shows the bespoke encoder and decoder we need to introduce to encode the state tensor and decode the action tensor.</p>

<p>Tokenizer acts as an input encoder and an output decoder. It is either jointly trained with the model, or a pretrained tokenizer is used to train an LLM. Either way, the LLM depends on how its tokenizer works.</p>

<p>We will therefore need to train our own encoder and decoder. This way, we’ll train those new layers to represent the inputs and decode the outputs in a way that best aligns with the LLM.</p>

<p>The final construct looks like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">peft</span>
<span class="kn">import</span> <span class="nn">transformers</span> <span class="k">as</span> <span class="n">tr</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>


<span class="n">llm_id</span> <span class="o">=</span> <span class="s">"Qwen/Qwen2-0.5B"</span>
<span class="n">llm</span> <span class="o">=</span> <span class="n">tr</span><span class="p">.</span><span class="n">AutoModelForCausalLM</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">llm_id</span><span class="p">)</span>
<span class="n">peft_config</span> <span class="o">=</span> <span class="n">peft</span><span class="p">.</span><span class="n">LoraConfig</span><span class="p">(</span>
    <span class="n">r</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span>  <span class="c1"># rank
</span>    <span class="n">lora_alpha</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>  <span class="c1"># scaling factor
</span>    <span class="n">target_modules</span><span class="o">=</span><span class="p">[</span>
        <span class="s">"q_proj"</span><span class="p">,</span>
        <span class="s">"k_proj"</span><span class="p">,</span>
        <span class="s">"v_proj"</span><span class="p">,</span>
        <span class="s">"o_proj"</span><span class="p">,</span>
        <span class="s">"gate_proj"</span><span class="p">,</span>
        <span class="s">"up_proj"</span><span class="p">,</span>
        <span class="s">"down_proj"</span><span class="p">,</span>
    <span class="p">],</span>
    <span class="n">lora_dropout</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">llm</span> <span class="o">=</span> <span class="n">peft</span><span class="p">.</span><span class="n">get_peft_model</span><span class="p">(</span><span class="n">llm</span><span class="p">,</span> <span class="n">peft_config</span><span class="p">)</span>

<span class="n">llm_hidden_dim</span> <span class="o">=</span> <span class="mi">896</span>
<span class="n">state_encoder</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
    <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="p">,</span> <span class="mi">256</span><span class="p">),</span> 
    <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span> 
    <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">256</span><span class="p">,</span> <span class="n">llm_hidden_dim</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">action_head</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
    <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">llm_hidden_dim</span><span class="p">,</span> <span class="mi">128</span><span class="p">),</span>
    <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
    <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">action_hidden_dim</span><span class="p">),</span>
<span class="p">)</span>
</code></pre></div></div>

<p>At the end of this code we have 3 entities:</p>

<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">state_encoder</code> which converts our state to input embeddings to be passed to the LLM</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">llm</code> which is a LORA wrapper around the base QWEN2-0.5B model</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">action_head</code> which converts the LLM output to action logits</p>
  </li>
</ul>

<p>The following is the code that executes that conversion:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">states_embeddings</span> <span class="o">=</span> <span class="n">state_encoder</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
<span class="n">states_embeddings</span> <span class="o">=</span> <span class="n">states_embeddings</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">llm_outputs</span> <span class="o">=</span> <span class="n">llm</span><span class="p">(</span>
    <span class="n">inputs_embeds</span><span class="o">=</span><span class="n">states_embeddings</span><span class="p">,</span> 
    <span class="n">output_hidden_states</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
<span class="n">last_hidden</span> <span class="o">=</span> <span class="n">llm_outputs</span><span class="p">.</span><span class="n">hidden_states</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">][:,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">:]</span>
<span class="n">action_logits</span> <span class="o">=</span> <span class="n">action_head</span><span class="p">(</span><span class="n">last_hidden</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="mlp-based-policy-refresher">MLP based policy refresher</h3>

<p>We could express the code above with this pseudocode:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_action_logits</span><span class="p">(</span><span class="n">state</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
</code></pre></div></div>

<p>Let’s see how this compares to our MLP policy implementation:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">PPODiscretePolicy</span><span class="p">(</span><span class="n">PPOPolicy</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">(</span>
             <span class="c1"># ...
</span>        <span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">policy</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
             <span class="c1"># ...
</span>        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">_get_action_distribution</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">dist</span><span class="p">.</span><span class="n">Distribution</span><span class="p">:</span>
        <span class="n">action_logits</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">policy</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>
        <span class="n">action_logits_positive</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">functional</span><span class="p">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">action_logits</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">torch_utils</span><span class="p">.</span><span class="n">CategoricalUnsqueezed</span><span class="p">(</span><span class="n">action_logits_positive</span><span class="p">)</span>

   <span class="k">def</span> <span class="nf">get_action</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
     <span class="n">actions_dist</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_action_distribution</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
     <span class="c1"># ...
</span>
   <span class="k">def</span> <span class="nf">evaluate_actions</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">actions</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
   <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
     <span class="n">actions_dist</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_action_distribution</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
     <span class="c1"># ...
</span></code></pre></div></div>

<p>Notice the patterns in this code:</p>

<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">__init__</code> is where we initialize our network</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">_get_action_distribution</code> calculates the logits and then wraps them in a distribution</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">get_action</code> and <code class="language-plaintext highlighter-rouge">evaluate_actions</code> use <code class="language-plaintext highlighter-rouge">_get_action_distribution</code></p>
  </li>
</ul>

<p>So for our purpose, we would need to replace the contents of <code class="language-plaintext highlighter-rouge">__init__</code> and the contents of <code class="language-plaintext highlighter-rouge">_get_action_distribution</code>.</p>

<p>Let’s go one step further though - we have another version of this class that represents a continuous policy (<a href="https://piotrtrochim.substack.com/i/183151495/complete-code">link to full code from the previous post</a>) - how does it differ from this code?</p>

<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">log_probs</code> tensor shape is different, so the continuous policy needs to reduce that shape</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">torch.distributions.Normal</code> is used instead of <code class="language-plaintext highlighter-rouge">torch.distributions.Categorical</code></p>
  </li>
</ul>
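
<p>Putting those two differences together, a hypothetical continuous counterpart might look like the sketch below - the class name, layer sizes and the state-independent <code class="language-plaintext highlighter-rouge">log_std</code> parameter are illustrative assumptions, not the code from the previous post:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.distributions as dist
import torch.nn as nn


class ContinuousPolicyNetwork(nn.Module):
    """Hypothetical continuous counterpart of the discrete policy above."""

    def __init__(self, state_dim: int, action_dim: int) -&gt; None:
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim)
        )
        # Learned, state-independent log standard deviation.
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def _get_action_distribution(self, state: torch.Tensor) -&gt; dist.Distribution:
        action_mean = self.policy(state)
        # Normal replaces Categorical; log_probs later need a sum over the action dim.
        return dist.Normal(action_mean, torch.exp(self.log_std))
</code></pre></div></div>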

<h3 id="policy-classes-refactored">Policy classes refactored</h3>

<p><a href="https://substackcdn.com/image/fetch/$s_!ZnpB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44d8c0d-5242-4deb-9c73-300630e69cae_2332x948.png"><img src="https://substackcdn.com/image/fetch/$s_!ZnpB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44d8c0d-5242-4deb-9c73-300630e69cae_2332x948.png" alt="" /></a></p>

<p>Figure 4. Refactored Policy is now a final class, strategized with a PolicyNetwork implementation.</p>

<p>What changes is network creation &amp; distribution calculation. What stays the same is <code class="language-plaintext highlighter-rouge">get_action</code> and <code class="language-plaintext highlighter-rouge">evaluate_actions</code>.</p>

<p>This leads to the refactoring shown in Figure 4 - where the <code class="language-plaintext highlighter-rouge">Policy</code> class will be closed and finalized, owning the methods for getting an action and for evaluating actions coming from other policies.</p>

<p>The responsibility for calculating the distribution of actions given input states will fall to the different implementations of <code class="language-plaintext highlighter-rouge">PolicyNetwork</code>.</p>

<p>Let’s consider for a moment what Policy class now represents though. It works with distributions of actions, and it samples those distributions to generate actions. It is a <code class="language-plaintext highlighter-rouge">StochasticPolicy</code>, as opposed to a deterministic policy that would employ a non-probabilistic mechanism for generating actions.</p>
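
<p>To make Figure 4 concrete, a minimal sketch of the refactored classes might look like the following - the class and method names reflect my reading of the figure and are not necessarily the exact ones in the codebase:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.distributions as dist
import torch.nn as nn


class PolicyNetwork(nn.Module):
    """Strategy interface: map a batch of states to an action distribution."""

    def get_action_distribution(self, states: torch.Tensor) -&gt; dist.Distribution:
        raise NotImplementedError


class StochasticPolicy:
    """Final, algorithm-agnostic policy that delegates to a PolicyNetwork."""

    def __init__(self, network: PolicyNetwork) -&gt; None:
        self.network = network

    def get_action(self, states: torch.Tensor) -&gt; tuple[torch.Tensor, torch.Tensor]:
        actions_dist = self.network.get_action_distribution(states)
        actions = actions_dist.sample()
        return actions, actions_dist.log_prob(actions)

    def evaluate_actions(
        self, states: torch.Tensor, actions: torch.Tensor
    ) -&gt; tuple[torch.Tensor, torch.Tensor]:
        actions_dist = self.network.get_action_distribution(states)
        return actions_dist.log_prob(actions), actions_dist.entropy()
</code></pre></div></div>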

<h3 id="summary">Summary</h3>

<p>As Figure 4 also shows, we have now, to a degree, isolated the concept of Policy from PPO.</p>

<p>Is that separation complete? That depends on:</p>

<ul>
  <li>
    <p>whether PPO would work with deterministic policies</p>
  </li>
  <li>
    <p>whether our Stochastic policy could be trained using algorithms other than PPO (e.g. MPO, TRPO, GRPO etc.)</p>
  </li>
</ul>

<p>In the next posts I’ll try to answer some of those questions. Thank you for reading !</p>]]></content><author><name>Piotr Trochim</name></author><category term="ml" /><category term="rl" /><summary type="html"><![CDATA[From MLP to LLM: Refactoring PPO Policy implementations and introducing modular policy design]]></summary></entry><entry><title type="html">Implementing an RL algorithm (PPO) from a whitepaper</title><link href="https://piotrtrochim.net/2026/01/07/implementing-an-rl-algorithm-ppo/" rel="alternate" type="text/html" title="Implementing an RL algorithm (PPO) from a whitepaper" /><published>2026-01-07T14:54:45+00:00</published><updated>2026-01-07T14:54:45+00:00</updated><id>https://piotrtrochim.net/2026/01/07/implementing-an-rl-algorithm-ppo</id><content type="html" xml:base="https://piotrtrochim.net/2026/01/07/implementing-an-rl-algorithm-ppo/"><![CDATA[<blockquote>
  <p>How difficult is it really?</p>
</blockquote>

<p>Deep Learning whitepapers are not straightforward to implement. They require multiple readings, reading related literature, multiple iterations and lots of trial and error.</p>

<p>It makes the task of implementing and engineering research breakthroughs an arduous one, especially for less experienced engineers.</p>

<p>I would like to show what that might look like on a random paper I selected.</p>

<h2 id="proximal-policy-optimization-algorithms-ppo">Proximal Policy Optimization Algorithms (PPO)</h2>

<p>…is an on-policy policy gradient reinforcement learning algorithm <a href="https://arxiv.org/abs/1707.06347">arXiv:1707.06347</a></p>

<p>I chose this one for a few reasons - it has stood the test of time, is widely used and generic, and several baseline implementations exist (e.g. <a href="https://docs.cleanrl.dev/rl-algorithms/ppo/">CleanRL PPO</a>) - these qualities make it a very worthwhile piece of code to implement and understand.</p>

<p>But most important of all - despite being a high quality paper, I found the experience of reading it very similar to many other Deep Learning whitepapers. So it’s not an outlier, but a good representative of the class of problems I want to focus on.</p>

<h2 id="first-read-of-the-paper">First read of the paper</h2>

<p>I want to take you, my dear reader, on a journey. And that means you need to get your hands dirty to fully appreciate the experience.</p>

<p>So please click on the link below and read the original whitepaper. Approach it with the sense of curiosity you will need, knowing that in a moment you will have to implement it. Please note your observations somewhere.</p>

<p>You may or may not be familiar with Reinforcement Learning. No matter your experience, I will assume you have no clue about it and that this is your first encounter with it. I will however assume that you are a software engineer - you have a working knowledge of a programming language of your choice (we’ll be using Python here), basic data structures and algorithms, code complexity management, and testing your code.</p>

<p>Go ahead - <a href="https://arxiv.org/abs/1707.06347">arXiv:1707.06347</a> - come back here when you’re done.</p>

<h2 id="engineering-approach-to-implementation">Engineering approach to implementation</h2>

<p>The first order questions I will attempt to answer are:</p>

<ul>
  <li>
    <p>what exactly am I implementing</p>
  </li>
  <li>
    <p>how should I validate my implementation</p>
  </li>
</ul>

<p>Note that we are translating a paper to an implementation, rather than designing our own algorithm/system - therefore we don’t need to worry about defining and fulfilling functional and non-functional requirements.</p>

<h3 id="evaluation">Evaluation</h3>

<p>In terms of evaluation, we can find references to the RL environments the original paper was evaluated on in the sections <strong>Experiments</strong> and <strong>Appendix B</strong>:</p>

<ul>
  <li>
    <p>environments that have been moved to <a href="https://gymnasium.farama.org/environments/mujoco/">Gymnasium</a> (the Farama Foundation’s continuation of OpenAI Gym) since the publication of this paper</p>
  </li>
  <li>
    <p><a href="https://github.com/openai/gym">Roboschool</a> has been deprecated, and the link to PyBullet mentioned in the deprecation note no longer works</p>
  </li>
</ul>

<p>Gymnasium is well documented and has a <strong>standardized environment API</strong>, which allows us to interact with any environment using the same code.</p>

<p>Cumulative score from each environment is used as the <strong>metric</strong>. Notice that the scores are specific to each environment (different Y axis scales):</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!lVy5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3364bf2d-0e60-4d69-8a39-f93eee3faead_2402x1182.png"><img src="https://substackcdn.com/image/fetch/$s_!lVy5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3364bf2d-0e60-4d69-8a39-f93eee3faead_2402x1182.png" alt="" /></a></p>

<p>Figure 1. Example PPO evaluation plots</p>

<p>The X axis on the plots shows the number of training steps, and the Y axis shows the evaluation score obtained after that many steps.</p>

<p>Based on this information, we can develop the following framework:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">logging</span>
<span class="kn">import</span> <span class="nn">gymnasium</span> <span class="k">as</span> <span class="n">gym</span>

<span class="k">def</span> <span class="nf">evaluate</span><span class="p">(</span><span class="o">**</span><span class="n">evaluated_algo</span><span class="p">:</span> <span class="err">???</span><span class="o">**</span> <span class="p">,</span> <span class="n">training_step</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
  <span class="n">eval_environment_ids</span> <span class="o">=</span> <span class="p">[</span><span class="s">"HalfCheetah-v1"</span><span class="p">,</span> <span class="s">"Hopper-v1"</span><span class="p">,</span> <span class="s">"Swimmer-v1"</span><span class="p">,</span> <span class="p">]</span> <span class="c1"># ... add others
</span>
  <span class="k">for</span> <span class="n">env_id</span> <span class="ow">in</span> <span class="n">eval_environment_ids</span><span class="p">:</span>
    <span class="n">env</span> <span class="o">=</span> <span class="n">gym</span><span class="p">.</span><span class="n">make</span><span class="p">(</span><span class="n">env_id</span><span class="p">)</span>

    <span class="c1"># Use the standardized OpenAI Gumnasium API to run the environments
</span>    <span class="n">observation</span> <span class="o">=</span> <span class="n">env</span><span class="p">.</span><span class="n">reset</span><span class="p">()</span>
    <span class="n">episode_finished</span> <span class="o">=</span> <span class="bp">False</span>
    <span class="n">score</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">while</span> <span class="ow">not</span> <span class="n">episode_finished</span><span class="p">:</span>
       <span class="o">**</span><span class="n">action</span> <span class="o">=</span> <span class="n">evaluated_algo</span><span class="p">(</span><span class="n">observation</span><span class="p">)</span><span class="o">**</span>       <span class="n">observation</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">episode_finished</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">env</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">action</span><span class="p">)</span>
       <span class="n">score</span> <span class="o">+=</span> <span class="n">reward</span>

    <span class="n">logging</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Score for %s after training step %d is %f"</span><span class="p">,</span> <span class="n">env_id</span><span class="p">,</span> <span class="n">training_step</span><span class="p">,</span> <span class="n">score</span><span class="p">)</span>
</code></pre></div></div>

<p>I highlighted the two open questions:</p>

<ul>
  <li>
    <p>what exactly the implemented algorithm is remains unclear</p>
  </li>
  <li>
    <p>we can technically figure out that the implemented algorithm is supposed to convert observations returned by the environment into actions that will be fed back to the environment - however, this information is NOT spelled out in the paper.</p>
  </li>
</ul>

<h3 id="arcane-knowledge">Arcane knowledge</h3>

<p>The second point specifically requires the reader to climb the ladder of references and build an understanding of what the essence of Reinforcement Learning algorithms is.</p>

<p>Today, with ample resources available, RL is not the mystery it used to be in 2017, when this paper was originally published. Back then however, this was arcane knowledge, with few materials and fewer available example implementations. What was worse - the knowledge that did exist introduced such vast terminology that it was very difficult to piece together a cohesive understanding of these algorithms.</p>

<p>This led (in my case, and perhaps in yours too) to <strong>losing the forest for the trees</strong> - instead of being able to implement e.g. PPO, one first had to understand “on-policy”, “off-policy”, “policy gradients”, “policy functions”, “value functions”, “Bellman equation”, “losses”, “rewards - sparse and dense” …</p>

<p>Just take a look for yourself:</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!0jjF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F350f8f2b-e85f-4ee5-bd7d-49617bba8c86_685x368.png"><img src="https://substackcdn.com/image/fetch/$s_!0jjF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F350f8f2b-e85f-4ee5-bd7d-49617bba8c86_685x368.png" alt="" /></a></p>

<p>Figure 2. <a href="https://link.springer.com/chapter/10.1007/978-981-15-4095-0_3">Taxonomy of Reinforcement learning models</a>, Springer</p>

<p>And here’s a very good overview paper - <a href="https://arxiv.org/pdf/2209.14940">arXiv:2209.14940</a></p>

<p>This approach is employed by 99% of the whitepapers I’ve read during my career, and it makes each of them a <strong>delta</strong> that builds on top of other information, rather than a self-contained piece of work.</p>

<h3 id="arcane-knowledge-of-reinforcement-learning-demistified">Arcane knowledge of Reinforcement Learning demistified</h3>

<p><a href="https://substackcdn.com/image/fetch/$s_!rnkZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fec0d51-bfdb-4960-b7a8-25131e9c0a26_1532x448.png"><img src="https://substackcdn.com/image/fetch/$s_!rnkZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fec0d51-bfdb-4960-b7a8-25131e9c0a26_1532x448.png" alt="" /></a></p>

<p>Figure 3. Reinforcement learning system, simplified view</p>

<p>Reinforcement Learning describes a system of two entities - an environment and a policy - that trade three key pieces of data between themselves: observations (sometimes referred to as state), actions and rewards.</p>

<p>Both can be thought of as functions:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">environment</span><span class="p">(</span><span class="n">action</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">observation</span><span class="p">,</span> <span class="n">reward</span><span class="p">]:</span>
   <span class="k">pass</span>

<span class="k">def</span> <span class="nf">policy</span><span class="p">(</span><span class="n">observation</span><span class="p">,</span> <span class="n">reward</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">action</span><span class="p">:</span>
   <span class="k">pass</span>
</code></pre></div></div>
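<p>To make this exchange of data concrete, here is a tiny, runnable sketch of the interaction loop from Figure 3. <code class="language-plaintext highlighter-rouge">toy_environment</code> and <code class="language-plaintext highlighter-rouge">toy_policy</code> are stand-ins of my own making for the abstract functions above:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random


def toy_environment(action: float) -> tuple[float, float]:
    """A stand-in environment: the observation is noise, the reward prefers small actions."""
    observation = random.random()
    reward = -abs(action)
    return observation, reward


def toy_policy(observation: float, reward: float) -> float:
    """A stand-in policy: ignores its inputs and acts at random (no learning yet)."""
    return random.uniform(-1.0, 1.0)


# The interaction from Figure 3: observations, actions and rewards
# are traded back and forth between the two entities.
observation, reward = toy_environment(action=0.0)
for _ in range(10):
    action = toy_policy(observation, reward)
    observation, reward = toy_environment(action)
</code></pre></div></div>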

<p>The “learning” part of a Reinforcement Learning system pertains to the reward signal we are passing to the policy - we are stating that the policy learns and changes its definition based on that signal. The environment, on the other hand, never changes its definition and stays constant.</p>

<p>How the policy learns - that is the real reason behind the diversity of methods shown in Figure 2: some versions learn from each action, some require massive amounts of accumulated data, and some use auxiliary entities.</p>

<p>Enter PPO.</p>

<h3 id="ppo-demistified">PPO demistified</h3>

<p><a href="https://substackcdn.com/image/fetch/$s_!kZ4C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1795ae23-cf76-4d15-983b-8b98036a916b_3236x1242.png"><img src="https://substackcdn.com/image/fetch/$s_!kZ4C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1795ae23-cf76-4d15-983b-8b98036a916b_3236x1242.png" alt="" /></a></p>

<p>Figure 4. PPO overlaid on top of the training system. The system from Figure 3 is colored purple.</p>

<p>I arrived at the diagram in Figure 4 by stitching together the information scattered across sections 2, 3, 5 and 6. That information has to be grafted onto a framework that the paper does not mention but assumes familiarity with.</p>

<h4 id="step-1-overall-ppo-training-algorithm">Step 1 overall PPO training algorithm</h4>

<p><a href="https://substackcdn.com/image/fetch/$s_!RXc_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ba05ba8-0a78-48b2-accc-3d4fbbe7c2e9_2856x1744.png"><img src="https://substackcdn.com/image/fetch/$s_!RXc_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ba05ba8-0a78-48b2-accc-3d4fbbe7c2e9_2856x1744.png" alt="" /></a></p>

<p>Figure 5. Relationship between the PPO training algorithm and the RL system design that uses PPO.</p>

<p>We can find it in section 5 of the paper. In Figure 5 I colored the relevant components from Figure 4 to highlight the role they play in the algorithm.</p>
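<p>To give a feel for how those pieces fit together, here is a rough sketch of that training loop. It leans on code that only appears in the later steps (<code class="language-plaintext highlighter-rouge">ppo_loss</code>) and on helpers named by me rather than by the paper (<code class="language-plaintext highlighter-rouge">collect_trajectory</code>, the <code class="language-plaintext highlighter-rouge">agent</code> bundle), so treat it as an outline rather than a finished implementation:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch


def train_ppo(
    agent,  # bundles the policy and the value function, see the later steps
    env,
    iterations: int,
    epochs_per_iteration: int,
    discount: float = 0.99,
    clip_eps: float = 0.2,
) -> None:
    # a single optimizer over both networks; the learning rate is a tunable choice
    optimizer = torch.optim.Adam(
        list(agent.policy.parameters()) + list(agent.value_fn.parameters()), lr=3e-4
    )

    for _ in range(iterations):
        # 1. Run the current ("old") policy in the environment and record the
        #    states, actions, rewards and log probabilities it produced.
        trajectory = collect_trajectory(agent.policy, env)

        # 2. Re-optimize the surrogate loss on that fixed trajectory for a few epochs.
        for _ in range(epochs_per_iteration):
            loss = ppo_loss(agent, trajectory, discount, clip_eps)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
</code></pre></div></div>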

<h4 id="step-2-computing-advantages">Step 2 Computing advantages</h4>

<p>You will find references to these concepts showing up in sections 2-5, with the definition of generalized advantage estimation presented in equations 11 and 12:</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!_M_F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c2ea7d-91ee-43d8-94d6-67441adbeeed_2342x392.png"><img src="https://substackcdn.com/image/fetch/$s_!_M_F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c2ea7d-91ee-43d8-94d6-67441adbeeed_2342x392.png" alt="" /></a></p>

<p>Notice the use of V(s_{t+1}) and V(s_{t}) - these are values obtained from the <strong>Value function</strong>, an auxiliary neural network that the PPO algorithm introduces.</p>

<p>The code, implemented in PyTorch, looks like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">compute_gae</span><span class="p">(</span>
    <span class="n">rewards</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span>
    <span class="n">values</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span>
    <span class="n">dones</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span>
    <span class="n">discount</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span>
    <span class="n">gae_lambda</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.95</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
    <span class="s">"""Compute Generalized Advantage Estimation."""</span>
    <span class="n">advantages</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">gae</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="mf">0.0</span><span class="p">)</span>

    <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">reversed</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">rewards</span><span class="p">))):</span>
        <span class="k">if</span> <span class="n">t</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">rewards</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">:</span>
            <span class="n">next_value</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span>
                <span class="mf">0.0</span>
            <span class="p">)</span>  <span class="c1"># Bootstrap value (or get V(s_T) if not terminal)
</span>        <span class="k">else</span><span class="p">:</span>
            <span class="n">next_value</span> <span class="o">=</span> <span class="n">values</span><span class="p">[</span><span class="n">t</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span>

        <span class="c1"># Mask next_value if episode ended
</span>        <span class="n">next_value</span> <span class="o">=</span> <span class="n">next_value</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">dones</span><span class="p">[</span><span class="n">t</span><span class="p">])</span>

        <span class="c1"># δ_t = r_t + γV(s_{t+1}) - V(s_t)
</span>        <span class="n">delta</span> <span class="o">=</span> <span class="n">rewards</span><span class="p">[</span><span class="n">t</span><span class="p">]</span> <span class="o">+</span> <span class="n">discount</span> <span class="o">*</span> <span class="n">next_value</span> <span class="o">-</span> <span class="n">values</span><span class="p">[</span><span class="n">t</span><span class="p">]</span>

        <span class="c1"># A_t = δ_t + (γλ) δ_{t+1} + ... + (γλ)_{T-t+1}δ_{T-1}
</span>        <span class="n">gae</span> <span class="o">=</span> <span class="n">delta</span> <span class="o">+</span> <span class="n">discount</span> <span class="o">*</span> <span class="n">gae_lambda</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">dones</span><span class="p">[</span><span class="n">t</span><span class="p">])</span> <span class="o">*</span> <span class="n">gae</span>
        <span class="n">advantages</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">gae</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">FloatTensor</span><span class="p">(</span><span class="n">advantages</span><span class="p">)</span>
</code></pre></div></div>
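<p>As a quick sanity check of the shapes involved, the function can be exercised on a hand-made, three-step trajectory (the numbers are arbitrary):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rewards = torch.tensor([1.0, 0.0, 1.0])
values = torch.tensor([0.5, 0.4, 0.3])  # V(s_t) as predicted by the value function
dones = torch.tensor([0.0, 0.0, 1.0])   # the episode terminates at the last step

advantages = compute_gae(rewards, values, dones, discount=0.99)
print(advantages.shape)  # torch.Size([3]) - one advantage per timestep
</code></pre></div></div>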

<h4 id="step-3-optimize-surrogate-l-loss">Step 3 Optimize surrogate L (loss)</h4>

<p>The loss is defined in sections 3 and 5 of the paper.</p>

<p><strong>Clipped policy loss</strong> (section 3):</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!rHRx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12737b7c-6284-49f4-a9af-ced255b8cafb_2324x244.png"><img src="https://substackcdn.com/image/fetch/$s_!rHRx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12737b7c-6284-49f4-a9af-ced255b8cafb_2324x244.png" alt="" /></a></p>

<p>Full loss equation that combines the <strong>clipped policy loss</strong> with the <strong>value function loss</strong> and the <strong>entropy bonus</strong> (section 5):</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!773x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc546ba53-767c-4ddb-acf7-b7372f4578fe_2330x420.png"><img src="https://substackcdn.com/image/fetch/$s_!773x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc546ba53-767c-4ddb-acf7-b7372f4578fe_2330x420.png" alt="" /></a></p>

<p>PyTorch implementation of the loss looks like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Trajectory</span><span class="p">(</span><span class="n">typing</span><span class="p">.</span><span class="n">NamedTuple</span><span class="p">):</span>
    <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="n">actions</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="n">rewards</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="n">log_probs</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="n">dones</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>


<span class="k">def</span> <span class="nf">ppo_loss</span><span class="p">(</span>
    <span class="n">agent</span><span class="p">:</span> <span class="n">_PPOAgent</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">:</span> <span class="n">Trajectory</span><span class="p">,</span> <span class="n">discount</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">clip_eps</span><span class="p">:</span> <span class="nb">float</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
    <span class="c1"># LCLIP + L_VF + L_St(θ) = E_t[LCLIP_t(θ) − c_1 L_t^VF(θ) + c_2 S[πθ](st)]
</span>    <span class="n">log_probs</span><span class="p">,</span> <span class="n">entropies</span> <span class="o">=</span> <span class="n">agent</span><span class="p">.</span><span class="n">policy</span><span class="p">.</span><span class="n">evaluate_actions</span><span class="p">(</span>
        <span class="n">trajectory</span><span class="p">.</span><span class="n">states</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">.</span><span class="n">actions</span>
    <span class="p">)</span>
    <span class="n">values</span> <span class="o">=</span> <span class="n">agent</span><span class="p">.</span><span class="n">value_fn</span><span class="p">(</span><span class="n">trajectory</span><span class="p">.</span><span class="n">states</span><span class="p">)</span>
    <span class="n">advantages</span> <span class="o">=</span> <span class="n">compute_gae</span><span class="p">(</span><span class="n">trajectory</span><span class="p">.</span><span class="n">rewards</span><span class="p">,</span> <span class="n">values</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">.</span><span class="n">dones</span><span class="p">,</span> <span class="n">discount</span><span class="p">)</span>
    <span class="n">returns</span> <span class="o">=</span> <span class="n">advantages</span> <span class="o">+</span> <span class="n">values</span>
    <span class="c1"># Normalize advantages to stabilize training
</span>    <span class="n">advantages</span> <span class="o">=</span> <span class="p">(</span><span class="n">advantages</span> <span class="o">-</span> <span class="n">advantages</span><span class="p">.</span><span class="n">mean</span><span class="p">())</span> <span class="o">/</span> <span class="p">(</span><span class="n">advantages</span><span class="p">.</span><span class="n">std</span><span class="p">()</span> <span class="o">+</span> <span class="mf">1e-8</span><span class="p">)</span>

    <span class="k">assert</span> <span class="n">log_probs</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">trajectory</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>
    <span class="k">assert</span> <span class="n">trajectory</span><span class="p">.</span><span class="n">log_probs</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">trajectory</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>
    <span class="n">ratio</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">log_probs</span> <span class="o">-</span> <span class="n">trajectory</span><span class="p">.</span><span class="n">log_probs</span><span class="p">)</span>
    <span class="n">policy_loss</span> <span class="o">=</span> <span class="o">-</span><span class="n">torch</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span>
        <span class="n">ratio</span> <span class="o">*</span> <span class="n">advantages</span><span class="p">,</span>
        <span class="n">torch</span><span class="p">.</span><span class="n">clip</span><span class="p">(</span><span class="n">ratio</span><span class="p">,</span> <span class="mf">1.0</span> <span class="o">-</span> <span class="n">clip_eps</span><span class="p">,</span> <span class="mf">1.0</span> <span class="o">+</span> <span class="n">clip_eps</span><span class="p">)</span> <span class="o">*</span> <span class="n">advantages</span><span class="p">,</span>
    <span class="p">).</span><span class="n">mean</span><span class="p">()</span>
    <span class="n">value_loss</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="p">(</span><span class="n">returns</span> <span class="o">-</span> <span class="n">values</span><span class="p">).</span><span class="nb">pow</span><span class="p">(</span><span class="mi">2</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span>
    <span class="n">c_2</span> <span class="o">=</span> <span class="mf">0.01</span>
    <span class="n">entropy_loss</span> <span class="o">=</span> <span class="n">c_2</span> <span class="o">*</span> <span class="n">entropies</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">policy_loss</span> <span class="o">+</span> <span class="n">value_loss</span> <span class="o">-</span> <span class="n">entropy_loss</span>

    <span class="k">return</span> <span class="n">loss</span>
</code></pre></div></div>

<p>I am deliberately omitting the implementations of PPOPolicy and _PPOValue for the time being, in order not to distract from the main objective of this section - the loss function.</p>

<p>Note that the loss function expects several values to be provided either as input or to be computed by the policy, among them <strong>log probabilities</strong> and <strong>entropies</strong>.</p>

<p>Also notice that there are 2 sets of log probabilities - those calculated by the <code class="language-plaintext highlighter-rouge">policy.evaluate_actions</code>, and those provided as a part of the <code class="language-plaintext highlighter-rouge">trajectory</code> input.</p>

<p>Picking up on this implementation nuance requires jumping to the beginning of section 3 of the paper and looking at the definition of the <strong>probability ratio</strong> that is used in the <strong>clipped policy loss</strong>.</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!w4gM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1785999-9317-44ed-b457-ce05307a529f_886x150.png"><img src="https://substackcdn.com/image/fetch/$s_!w4gM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1785999-9317-44ed-b457-ce05307a529f_886x150.png" alt="" /></a></p>

<p>pi_{theta} and pi_{theta}_old refer to the <strong>probabilities</strong> (not the log probabilities) of actions generated by the new policy (in this context - the policy being trained) and by the old policy (in this context - the policy used to collect the trajectory).</p>

<p>An algorithmic trick that involves taking the logarithm of this ratio allows us to use <strong>log probabilities</strong>, which packages such as PyTorch readily provide from their distribution implementations.</p>
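<p>In code the trick boils down to a single identity, which is exactly what the <code class="language-plaintext highlighter-rouge">ratio</code> computation in <code class="language-plaintext highlighter-rouge">ppo_loss</code> relies on. A tiny standalone check, with made-up numbers:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

new_log_prob = torch.tensor(-1.2)  # log pi_theta(a|s), from the policy being trained
old_log_prob = torch.tensor(-1.5)  # log pi_theta_old(a|s), stored while collecting the trajectory

# pi_theta(a|s) / pi_theta_old(a|s) == exp(log pi_theta(a|s) - log pi_theta_old(a|s))
ratio = torch.exp(new_log_prob - old_log_prob)
assert torch.isclose(ratio, torch.exp(new_log_prob) / torch.exp(old_log_prob))
</code></pre></div></div>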

<h4 id="step-4-policy-and-value-function-network-architecture">Step 4 Policy and Value function network architecture</h4>

<p>This detail is described all the way in section 6.</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!qxdR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53f7dc4e-822d-4b5f-8cd7-274ba034d175_2348x282.png"><img src="https://substackcdn.com/image/fetch/$s_!qxdR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53f7dc4e-822d-4b5f-8cd7-274ba034d175_2348x282.png" alt="" /></a></p>

<h4 id="step-5-implementing-the-policy-function">Step 5 Implementing the Policy function</h4>

<p>At this point we can go ahead and implement the networks themselves.</p>

<p>To make a long story short - depending on the type of actions an environment works with - continuous or discrete - we need to interpret the logits returned by the model differently, and slightly change the architecture of the network.</p>

<p>Policy base class</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">PPOPolicy</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">,</span> <span class="n">abc</span><span class="p">.</span><span class="n">ABC</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span> <span class="o">=</span> <span class="n">state_dim</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">action_dim</span> <span class="o">=</span> <span class="n">action_dim</span>

    <span class="o">@</span><span class="n">abc</span><span class="p">.</span><span class="n">abstractmethod</span>
    <span class="k">def</span> <span class="nf">get_action</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">pass</span>

    <span class="o">@</span><span class="n">abc</span><span class="p">.</span><span class="n">abstractmethod</span>
    <span class="k">def</span> <span class="nf">evaluate_actions</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">actions</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">pass</span>
</code></pre></div></div>

<p>The implementation for continuous action spaces trains the policy model to return the parameters (means and log standard deviations) of a Gaussian distribution, and then samples that distribution to produce action values. An action is a vector of floating-point numbers - they may for example represent the speeds of the motors rotating a robot’s arm:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">PPOPolicyContinuous</span><span class="p">(</span><span class="n">PPOPolicy</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">(</span><span class="n">state_dim</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">)</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">policy</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">state_dim</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Tanh</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">action_dim</span><span class="p">),</span>
        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">_get_action_distribution</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">dist</span><span class="p">.</span><span class="n">Distribution</span><span class="p">:</span>
        <span class="n">action_gaussian_params</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">policy</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>
        <span class="n">means</span><span class="p">,</span> <span class="n">log_stds</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">chunk</span><span class="p">(</span><span class="n">action_gaussian_params</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">stds</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">log_stds</span><span class="p">.</span><span class="n">clamp</span><span class="p">(</span><span class="o">-</span><span class="mi">20</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>

        <span class="k">return</span> <span class="n">dist</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="n">means</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">stds</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">get_action</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"state should have shape (T, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="si">}</span><span class="s">) != </span><span class="si">{</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span>
            <span class="p">)</span>

        <span class="n">actions_dist</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_action_distribution</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
        <span class="n">action</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">sample</span><span class="p">()</span>

        <span class="c1"># aggregate across the last dimension - we want the probability
</span>        <span class="c1"># of the joint distribution of actions for a given timestep
</span>        <span class="n">log_prob</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">action</span><span class="p">).</span><span class="nb">sum</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>

        <span class="k">assert</span> <span class="n">action</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="bp">self</span><span class="p">.</span><span class="n">action_dim</span><span class="p">)</span>
        <span class="k">assert</span> <span class="n">log_prob</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>

        <span class="k">return</span> <span class="n">action</span><span class="p">,</span> <span class="n">log_prob</span>

    <span class="k">def</span> <span class="nf">evaluate_actions</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">actions</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"states should have shape (T, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="si">}</span><span class="s">) != </span><span class="si">{</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span>
            <span class="p">)</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">.</span><span class="n">action_dim</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"actions should have shape (T, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">action_dim</span><span class="si">}</span><span class="s">) != </span><span class="si">{</span><span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span>
            <span class="p">)</span>

        <span class="n">actions_dist</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_action_distribution</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>

        <span class="n">log_probs</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">actions</span><span class="p">)</span>
        <span class="n">entropy</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">entropy</span><span class="p">()</span>

        <span class="c1"># aggregate across the last dimension - we want the probability
</span>        <span class="c1"># and the entropy of the joint distribution of actions for a given timestep
</span>        <span class="n">log_probs</span> <span class="o">=</span> <span class="n">log_probs</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">entropy</span> <span class="o">=</span> <span class="n">entropy</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>

        <span class="k">assert</span> <span class="n">log_probs</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>
        <span class="k">assert</span> <span class="n">entropy</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>

        <span class="k">return</span> <span class="n">log_probs</span><span class="p">,</span> <span class="n">entropy</span>
</code></pre></div></div>

<p>The policy for discrete action spaces, on the other hand, assumes that the policy network returns a single integer that represents a discrete action to be taken in the environment. An example of such an action is the index of an arrow key on a keyboard when we play one of the Atari games.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">PPOPolicyDiscrete</span><span class="p">(</span><span class="n">PPOPolicy</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">(</span><span class="n">state_dim</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">)</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">policy</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">state_dim</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Tanh</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">),</span>
        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">_get_action_distribution</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">dist</span><span class="p">.</span><span class="n">Distribution</span><span class="p">:</span>
        <span class="n">logits</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">policy</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>
        <span class="n">logits</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">functional</span><span class="p">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">logits</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">dist</span><span class="p">.</span><span class="n">Categorical</span><span class="p">(</span><span class="n">logits</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">get_action</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"State should have shape (T, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="si">}</span><span class="s">) != </span><span class="si">{</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span>
            <span class="p">)</span>

        <span class="n">actions_dist</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_action_distribution</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
        <span class="n">action</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">sample</span><span class="p">()</span>
        <span class="n">log_prob</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">action</span><span class="p">)</span>

        <span class="k">assert</span> <span class="n">action</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>
        <span class="k">assert</span> <span class="n">log_prob</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>

        <span class="k">return</span> <span class="n">action</span><span class="p">,</span> <span class="n">log_prob</span>

    <span class="k">def</span> <span class="nf">evaluate_actions</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">actions</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"states should have shape (T, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="si">}</span><span class="s">) != </span><span class="si">{</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span>
            <span class="p">)</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"actions should have shape (T,) != </span><span class="si">{</span><span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

        <span class="n">actions_dist</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_action_distribution</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>

        <span class="n">log_probs</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">actions</span><span class="p">)</span>
        <span class="n">entropy</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">entropy</span><span class="p">()</span>

        <span class="k">assert</span> <span class="n">log_probs</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>
        <span class="k">assert</span> <span class="n">entropy</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>

        <span class="k">return</span> <span class="n">log_probs</span><span class="p">,</span> <span class="n">entropy</span>
</code></pre></div></div>
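<p>To show where a <code class="language-plaintext highlighter-rouge">Trajectory</code> comes from - and why its log probabilities belong to the “old” policy - here is a minimal sketch of a rollout. The <code class="language-plaintext highlighter-rouge">collect_trajectory</code> helper is my own name rather than something from the paper, and the sketch assumes a continuous (Box) action space:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def collect_trajectory(
    policy: PPOPolicy, env: gym.Env, max_steps: int = 2048
) -> Trajectory:
    states, actions, rewards, log_probs, dones = [], [], [], [], []

    observation, _ = env.reset()
    for _ in range(max_steps):
        state = torch.as_tensor(observation, dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():
            # log probabilities are recorded at collection time,
            # i.e. they come from the "old" policy
            action, log_prob = policy.get_action(state)

        # a discrete action space would require casting the action to int instead
        observation, reward, terminated, truncated, _ = env.step(
            action.squeeze(0).numpy()
        )
        done = terminated or truncated

        states.append(state.squeeze(0))
        actions.append(action.squeeze(0))
        rewards.append(torch.tensor(float(reward)))
        log_probs.append(log_prob.squeeze(0))
        dones.append(torch.tensor(float(done)))

        if done:
            observation, _ = env.reset()

    return Trajectory(
        states=torch.stack(states),
        actions=torch.stack(actions),
        rewards=torch.stack(rewards),
        log_probs=torch.stack(log_probs),
        dones=torch.stack(dones),
    )
</code></pre></div></div>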

<h4 id="step-6-implementing-the-value-function">Step 6 Implementing the Value function</h4>

<p>The value function uses the same network architecture, with the difference that it returns a single floating-point value - the value the function assigns to the state (observation) of an environment.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">_PPOValue</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">value</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">state_dim</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Tanh</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
        <span class="n">values</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">value</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">values</span><span class="p">.</span><span class="n">squeeze</span><span class="p">(</span>
            <span class="o">-</span><span class="mi">1</span>
        <span class="p">)</span>  <span class="c1"># squeeze returns a tensor that's shaped just like the rewards tensor
</span></code></pre></div></div>
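
<p>A minimal shape-check sketch for the value network (hypothetical dimensions, illustration only): a batch of T states goes in and a flat tensor of T value estimates comes out, matching the shape of the rewards tensor.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical shape check - illustration only, not part of the final solution
value_fn = _PPOValue(state_dim=4)

states = torch.randn(16, 4)  # (T, state_dim) batch of observations
values = value_fn(states)    # nn.Module.__call__ dispatches to forward()

assert values.shape == (16,)  # same shape as a (T,) rewards tensor
</code></pre></div></div>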

<h3 id="complete-code">Complete code</h3>

<p>Here’s the complete solution:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">abc</span>
<span class="kn">import</span> <span class="nn">logging</span>
<span class="kn">import</span> <span class="nn">typing</span>

<span class="kn">import</span> <span class="nn">gymnasium</span> <span class="k">as</span> <span class="n">gym</span>
<span class="kn">from</span> <span class="nn">gymnasium.wrappers</span> <span class="kn">import</span> <span class="n">RecordVideo</span>
<span class="kn">import</span> <span class="nn">mlflow</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.distributions</span> <span class="k">as</span> <span class="n">dist</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">trange</span>  <span class="c1"># type: ignore
</span>

<span class="k">def</span> <span class="nf">compute_gae</span><span class="p">(</span>
    <span class="n">rewards</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span>
    <span class="n">values</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span>
    <span class="n">dones</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span>
    <span class="n">discount</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span>
    <span class="n">gae_lambda</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.95</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
    <span class="s">"""Compute Generalized Advantage Estimation."""</span>
    <span class="n">advantages</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">gae</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="mf">0.0</span><span class="p">)</span>

    <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">reversed</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">rewards</span><span class="p">))):</span>
        <span class="k">if</span> <span class="n">t</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">rewards</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">:</span>
            <span class="n">next_value</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span>
                <span class="mf">0.0</span>
            <span class="p">)</span>  <span class="c1"># Bootstrap value (or get V(s_T) if not terminal)
</span>        <span class="k">else</span><span class="p">:</span>
            <span class="n">next_value</span> <span class="o">=</span> <span class="n">values</span><span class="p">[</span><span class="n">t</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span>

        <span class="c1"># Mask next_value if episode ended
</span>        <span class="n">next_value</span> <span class="o">=</span> <span class="n">next_value</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">dones</span><span class="p">[</span><span class="n">t</span><span class="p">])</span>

        <span class="c1"># δ_t = r_t + γV(s_{t+1}) - V(s_t)
</span>        <span class="n">delta</span> <span class="o">=</span> <span class="n">rewards</span><span class="p">[</span><span class="n">t</span><span class="p">]</span> <span class="o">+</span> <span class="n">discount</span> <span class="o">*</span> <span class="n">next_value</span> <span class="o">-</span> <span class="n">values</span><span class="p">[</span><span class="n">t</span><span class="p">]</span>

        <span class="c1"># A_t = δ_t + (γλ) δ_{t+1} + ... + (γλ)_{T-t+1}δ_{T-1}
</span>        <span class="n">gae</span> <span class="o">=</span> <span class="n">delta</span> <span class="o">+</span> <span class="n">discount</span> <span class="o">*</span> <span class="n">gae_lambda</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">dones</span><span class="p">[</span><span class="n">t</span><span class="p">])</span> <span class="o">*</span> <span class="n">gae</span>
        <span class="n">advantages</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">gae</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">FloatTensor</span><span class="p">(</span><span class="n">advantages</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">PPOPolicy</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">,</span> <span class="n">abc</span><span class="p">.</span><span class="n">ABC</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span> <span class="o">=</span> <span class="n">state_dim</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">action_dim</span> <span class="o">=</span> <span class="n">action_dim</span>

    <span class="o">@</span><span class="n">abc</span><span class="p">.</span><span class="n">abstractmethod</span>
    <span class="k">def</span> <span class="nf">get_action</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">pass</span>

    <span class="o">@</span><span class="n">abc</span><span class="p">.</span><span class="n">abstractmethod</span>
    <span class="k">def</span> <span class="nf">evaluate_actions</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">actions</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">pass</span>


<span class="k">class</span> <span class="nc">PPOPolicyContinuous</span><span class="p">(</span><span class="n">PPOPolicy</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">(</span><span class="n">state_dim</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">)</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">policy</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">state_dim</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Tanh</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">action_dim</span><span class="p">),</span>
        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">_get_action_distribution</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">dist</span><span class="p">.</span><span class="n">Distribution</span><span class="p">:</span>
        <span class="n">action_gaussian_params</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">policy</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>
        <span class="n">means</span><span class="p">,</span> <span class="n">log_stds</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">chunk</span><span class="p">(</span><span class="n">action_gaussian_params</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">stds</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">log_stds</span><span class="p">.</span><span class="n">clamp</span><span class="p">(</span><span class="o">-</span><span class="mi">20</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>

        <span class="k">return</span> <span class="n">dist</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="n">means</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">stds</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">get_action</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"state should have shape (T, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="si">}</span><span class="s">) != </span><span class="si">{</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span>
            <span class="p">)</span>

        <span class="n">actions_dist</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_action_distribution</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
        <span class="n">action</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">sample</span><span class="p">()</span>

        <span class="c1"># aggregate across the last dimension - we want the probability
</span>        <span class="c1"># of the joint distribution of actions for a given timestep
</span>        <span class="n">log_prob</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">action</span><span class="p">).</span><span class="nb">sum</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>

        <span class="k">assert</span> <span class="n">action</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="bp">self</span><span class="p">.</span><span class="n">action_dim</span><span class="p">)</span>
        <span class="k">assert</span> <span class="n">log_prob</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>

        <span class="k">return</span> <span class="n">action</span><span class="p">,</span> <span class="n">log_prob</span>

    <span class="k">def</span> <span class="nf">evaluate_actions</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">actions</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"states should have shape (T, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="si">}</span><span class="s">) != </span><span class="si">{</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span>
            <span class="p">)</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">.</span><span class="n">action_dim</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"actions should have shape (T, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">action_dim</span><span class="si">}</span><span class="s">) != </span><span class="si">{</span><span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span>
            <span class="p">)</span>

        <span class="n">actions_dist</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_action_distribution</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>

        <span class="n">log_probs</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">actions</span><span class="p">)</span>
        <span class="n">entropy</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">entropy</span><span class="p">()</span>

        <span class="c1"># aggregate across the last dimension - we want the probability
</span>        <span class="c1"># and the entropy of the joint distribution of actions for a given timestep
</span>        <span class="n">log_probs</span> <span class="o">=</span> <span class="n">log_probs</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">entropy</span> <span class="o">=</span> <span class="n">entropy</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>

        <span class="k">assert</span> <span class="n">log_probs</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>
        <span class="k">assert</span> <span class="n">entropy</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>

        <span class="k">return</span> <span class="n">log_probs</span><span class="p">,</span> <span class="n">entropy</span>


<span class="k">class</span> <span class="nc">PPOPolicyDiscrete</span><span class="p">(</span><span class="n">PPOPolicy</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">(</span><span class="n">state_dim</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">)</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">policy</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">state_dim</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Tanh</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">),</span>
        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">_get_action_distribution</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">dist</span><span class="p">.</span><span class="n">Distribution</span><span class="p">:</span>
        <span class="n">logits</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">policy</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>
        <span class="n">logits</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">functional</span><span class="p">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">logits</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">dist</span><span class="p">.</span><span class="n">Categorical</span><span class="p">(</span><span class="n">logits</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">get_action</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"State should have shape (T, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="si">}</span><span class="s">) != </span><span class="si">{</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span>
            <span class="p">)</span>

        <span class="n">actions_dist</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_action_distribution</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
        <span class="n">action</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">sample</span><span class="p">()</span>
        <span class="n">log_prob</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">action</span><span class="p">)</span>

        <span class="k">assert</span> <span class="n">action</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>
        <span class="k">assert</span> <span class="n">log_prob</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>

        <span class="k">return</span> <span class="n">action</span><span class="p">,</span> <span class="n">log_prob</span>

    <span class="k">def</span> <span class="nf">evaluate_actions</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">actions</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"states should have shape (T, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="si">}</span><span class="s">) != </span><span class="si">{</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span>
            <span class="p">)</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"actions should have shape (T,) != </span><span class="si">{</span><span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

        <span class="n">actions_dist</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_action_distribution</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>

        <span class="n">log_probs</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">actions</span><span class="p">)</span>
        <span class="n">entropy</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">entropy</span><span class="p">()</span>

        <span class="k">assert</span> <span class="n">log_probs</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>
        <span class="k">assert</span> <span class="n">entropy</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>

        <span class="k">return</span> <span class="n">log_probs</span><span class="p">,</span> <span class="n">entropy</span>


<span class="k">class</span> <span class="nc">_PPOValue</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">value</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">state_dim</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Tanh</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
        <span class="n">values</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">value</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">values</span><span class="p">.</span><span class="n">squeeze</span><span class="p">(</span>
            <span class="o">-</span><span class="mi">1</span>
        <span class="p">)</span>  <span class="c1"># squeeze returns a tensor that's shaped just like the rewards tensor
</span>

<span class="k">class</span> <span class="nc">Trajectory</span><span class="p">(</span><span class="n">typing</span><span class="p">.</span><span class="n">NamedTuple</span><span class="p">):</span>
    <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="n">actions</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="n">rewards</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="n">log_probs</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="n">dones</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>

    <span class="k">def</span> <span class="nf">enable_grad</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">dtype</span><span class="p">.</span><span class="n">is_floating_point</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">actions</span><span class="p">.</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">actions</span><span class="p">.</span><span class="n">dtype</span><span class="p">.</span><span class="n">is_floating_point</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">log_probs</span><span class="p">.</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">log_probs</span><span class="p">.</span><span class="n">dtype</span><span class="p">.</span><span class="n">is_floating_point</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">rewards</span><span class="p">.</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">rewards</span><span class="p">.</span><span class="n">dtype</span><span class="p">.</span><span class="n">is_floating_point</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dones</span><span class="p">.</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">dones</span><span class="p">.</span><span class="n">dtype</span><span class="p">.</span><span class="n">is_floating_point</span>

    <span class="k">def</span> <span class="nf">__len__</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">states</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">return</span> <span class="mi">0</span>

    <span class="o">@</span><span class="nb">staticmethod</span>
    <span class="k">def</span> <span class="nf">concat</span><span class="p">(</span><span class="n">lhs</span><span class="p">:</span> <span class="n">typing</span><span class="p">.</span><span class="n">Optional</span><span class="p">[</span><span class="s">"Trajectory"</span><span class="p">],</span> <span class="n">rhs</span><span class="p">:</span> <span class="s">"Trajectory"</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="s">"Trajectory"</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">lhs</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">rhs</span>

        <span class="k">return</span> <span class="n">Trajectory</span><span class="p">(</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">lhs</span><span class="p">.</span><span class="n">states</span><span class="p">,</span> <span class="n">rhs</span><span class="p">.</span><span class="n">states</span><span class="p">]),</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">lhs</span><span class="p">.</span><span class="n">actions</span><span class="p">,</span> <span class="n">rhs</span><span class="p">.</span><span class="n">actions</span><span class="p">]),</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">lhs</span><span class="p">.</span><span class="n">rewards</span><span class="p">,</span> <span class="n">rhs</span><span class="p">.</span><span class="n">rewards</span><span class="p">]),</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">lhs</span><span class="p">.</span><span class="n">log_probs</span><span class="p">,</span> <span class="n">rhs</span><span class="p">.</span><span class="n">log_probs</span><span class="p">]),</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">lhs</span><span class="p">.</span><span class="n">dones</span><span class="p">,</span> <span class="n">rhs</span><span class="p">.</span><span class="n">dones</span><span class="p">]),</span>
        <span class="p">)</span>


<span class="k">class</span> <span class="nc">_PPOAgent</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">policy</span><span class="p">:</span> <span class="n">PPOPolicy</span><span class="p">,</span> <span class="n">value_fn</span><span class="p">:</span> <span class="n">_PPOValue</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">policy</span> <span class="o">=</span> <span class="n">policy</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">value_fn</span> <span class="o">=</span> <span class="n">value_fn</span>


<span class="k">def</span> <span class="nf">ppo_loss</span><span class="p">(</span>
    <span class="n">agent</span><span class="p">:</span> <span class="n">_PPOAgent</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">:</span> <span class="n">Trajectory</span><span class="p">,</span> <span class="n">discount</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">clip_eps</span><span class="p">:</span> <span class="nb">float</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
    <span class="c1"># LCLIP + L_VF + L_St(θ) = E_t[LCLIP_t(θ) − c_1 L_t^VF(θ) + c_2 S[πθ](st)]
</span>    <span class="n">log_probs</span><span class="p">,</span> <span class="n">entropies</span> <span class="o">=</span> <span class="n">agent</span><span class="p">.</span><span class="n">policy</span><span class="p">.</span><span class="n">evaluate_actions</span><span class="p">(</span>
        <span class="n">trajectory</span><span class="p">.</span><span class="n">states</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">.</span><span class="n">actions</span>
    <span class="p">)</span>
    <span class="n">values</span> <span class="o">=</span> <span class="n">agent</span><span class="p">.</span><span class="n">value_fn</span><span class="p">(</span><span class="n">trajectory</span><span class="p">.</span><span class="n">states</span><span class="p">)</span>
    <span class="n">advantages</span> <span class="o">=</span> <span class="n">compute_gae</span><span class="p">(</span><span class="n">trajectory</span><span class="p">.</span><span class="n">rewards</span><span class="p">,</span> <span class="n">values</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">.</span><span class="n">dones</span><span class="p">,</span> <span class="n">discount</span><span class="p">)</span>
    <span class="n">returns</span> <span class="o">=</span> <span class="n">advantages</span> <span class="o">+</span> <span class="n">values</span>
    <span class="c1"># Normalize advantages to stabilize training
</span>    <span class="n">advantages</span> <span class="o">=</span> <span class="p">(</span><span class="n">advantages</span> <span class="o">-</span> <span class="n">advantages</span><span class="p">.</span><span class="n">mean</span><span class="p">())</span> <span class="o">/</span> <span class="p">(</span><span class="n">advantages</span><span class="p">.</span><span class="n">std</span><span class="p">()</span> <span class="o">+</span> <span class="mf">1e-8</span><span class="p">)</span>

    <span class="k">assert</span> <span class="n">log_probs</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">trajectory</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>
    <span class="k">assert</span> <span class="n">trajectory</span><span class="p">.</span><span class="n">log_probs</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">trajectory</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>
    <span class="n">ratio</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">log_probs</span> <span class="o">-</span> <span class="n">trajectory</span><span class="p">.</span><span class="n">log_probs</span><span class="p">)</span>
    <span class="n">policy_loss</span> <span class="o">=</span> <span class="o">-</span><span class="n">torch</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span>
        <span class="n">ratio</span> <span class="o">*</span> <span class="n">advantages</span><span class="p">,</span>
        <span class="n">torch</span><span class="p">.</span><span class="n">clip</span><span class="p">(</span><span class="n">ratio</span><span class="p">,</span> <span class="mf">1.0</span> <span class="o">-</span> <span class="n">clip_eps</span><span class="p">,</span> <span class="mf">1.0</span> <span class="o">+</span> <span class="n">clip_eps</span><span class="p">)</span> <span class="o">*</span> <span class="n">advantages</span><span class="p">,</span>
    <span class="p">).</span><span class="n">mean</span><span class="p">()</span>
    <span class="n">value_loss</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="p">(</span><span class="n">returns</span> <span class="o">-</span> <span class="n">values</span><span class="p">).</span><span class="nb">pow</span><span class="p">(</span><span class="mi">2</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span>
    <span class="n">c_2</span> <span class="o">=</span> <span class="mf">0.01</span>
    <span class="n">entropy_loss</span> <span class="o">=</span> <span class="n">c_2</span> <span class="o">*</span> <span class="n">entropies</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">policy_loss</span> <span class="o">+</span> <span class="n">value_loss</span> <span class="o">-</span> <span class="n">entropy_loss</span>

    <span class="k">return</span> <span class="n">loss</span>


<span class="k">def</span> <span class="nf">train_one_trajectory</span><span class="p">(</span>
    <span class="n">agent</span><span class="p">:</span> <span class="n">_PPOAgent</span><span class="p">,</span>
    <span class="n">optimizer</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">Optimizer</span><span class="p">,</span>
    <span class="n">trajectory</span><span class="p">:</span> <span class="n">Trajectory</span><span class="p">,</span>
    <span class="n">num_updates</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
    <span class="n">discount</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.95</span><span class="p">,</span>
    <span class="n">clip_eps</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.2</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
    <span class="s">"""Trains the PPO policy and value networks on a single trajectory.

    Multiple steps of training are performed, the number defined by the `num_updates` parameter.
    After calling this function, `trajectory` should be discarded and a new trajectory should be
    sampled.
    """</span>
    <span class="n">trajectory</span><span class="p">.</span><span class="n">enable_grad</span><span class="p">()</span>

    <span class="n">losses</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">trange</span><span class="p">(</span><span class="n">num_updates</span><span class="p">,</span> <span class="n">desc</span><span class="o">=</span><span class="s">"PPO update step"</span><span class="p">,</span> <span class="n">leave</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
        <span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>

        <span class="n">loss</span> <span class="o">=</span> <span class="n">ppo_loss</span><span class="p">(</span><span class="n">agent</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">,</span> <span class="n">discount</span><span class="p">,</span> <span class="n">clip_eps</span><span class="p">)</span>
        <span class="n">losses</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">loss</span><span class="p">)</span>

        <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
        <span class="c1"># TODO: does gradient clipping fix training?
</span>        <span class="c1"># nn.utils.clip_grad_norm_(agent.parameters(), args.max_grad_norm)
</span>        <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>

    <span class="k">return</span> <span class="n">losses</span>


<span class="k">def</span> <span class="nf">rollout_episode</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">policy</span><span class="p">:</span> <span class="n">PPOPolicy</span><span class="p">,</span> <span class="n">max_trajectory_len</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Trajectory</span><span class="p">:</span>
    <span class="n">states</span><span class="p">,</span> <span class="n">rewards</span><span class="p">,</span> <span class="n">actions</span><span class="p">,</span> <span class="n">log_probs</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">[],</span> <span class="p">[],</span> <span class="p">[]</span>
    <span class="n">dones</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">state</span> <span class="o">=</span> <span class="n">env</span><span class="p">.</span><span class="n">reset</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">done</span> <span class="o">=</span> <span class="bp">False</span>

    <span class="n">num_steps</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
        <span class="k">while</span> <span class="ow">not</span> <span class="n">done</span> <span class="ow">and</span> <span class="n">num_steps</span> <span class="o">&lt;</span> <span class="n">max_trajectory_len</span><span class="p">:</span>
            <span class="n">state_t</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>

            <span class="n">one_state_in_batch_t</span> <span class="o">=</span> <span class="n">state_t</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

            <span class="n">action_t</span><span class="p">,</span> <span class="n">log_prob_t</span> <span class="o">=</span> <span class="n">policy</span><span class="p">.</span><span class="n">get_action</span><span class="p">(</span><span class="n">one_state_in_batch_t</span><span class="p">)</span>
            <span class="c1"># Extract the only action and log probability from the batch tensor
</span>            <span class="n">log_prob_f</span> <span class="o">=</span> <span class="n">log_prob_t</span><span class="p">.</span><span class="n">detach</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span>
            <span class="n">action_f</span> <span class="o">=</span> <span class="n">action_t</span><span class="p">.</span><span class="n">detach</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span>

            <span class="n">next_state</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">done</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">env</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">action_f</span><span class="p">.</span><span class="n">numpy</span><span class="p">())</span>
            <span class="n">states</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">state_t</span><span class="p">)</span>
            <span class="n">actions</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">action_f</span><span class="p">)</span>
            <span class="n">rewards</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">reward</span><span class="p">)</span>
            <span class="n">log_probs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">log_prob_f</span><span class="p">)</span>
            <span class="n">dones</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="mi">1</span> <span class="k">if</span> <span class="n">done</span> <span class="k">else</span> <span class="mi">0</span><span class="p">)</span>
            <span class="n">state</span> <span class="o">=</span> <span class="n">next_state</span>

            <span class="n">num_steps</span> <span class="o">+=</span> <span class="mi">1</span>

    <span class="k">if</span> <span class="ow">not</span> <span class="n">states</span><span class="p">:</span>
        <span class="k">raise</span> <span class="nb">RuntimeError</span><span class="p">(</span><span class="s">"No trajectory rolled out"</span><span class="p">)</span>

    <span class="n">states_t</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
    <span class="n">actions_t</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span><span class="n">actions</span><span class="p">)</span>
    <span class="n">rewards_t</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">rewards</span><span class="p">)</span>
    <span class="n">log_probs_t</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span><span class="n">log_probs</span><span class="p">)</span>
    <span class="n">dones_t</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">dones</span><span class="p">)</span>

    <span class="k">assert</span> <span class="n">states_t</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">num_steps</span><span class="p">,</span> <span class="n">policy</span><span class="p">.</span><span class="n">state_dim</span><span class="p">)</span>
    <span class="k">assert</span> <span class="n">rewards_t</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">num_steps</span><span class="p">,)</span>
    <span class="k">assert</span> <span class="n">log_probs_t</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">num_steps</span><span class="p">,)</span>
    <span class="k">assert</span> <span class="n">dones_t</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">num_steps</span><span class="p">,)</span>
    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span> <span class="n">PPOPolicyContinuous</span><span class="p">):</span>
        <span class="k">assert</span> <span class="n">actions_t</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">num_steps</span><span class="p">,</span> <span class="n">policy</span><span class="p">.</span><span class="n">action_dim</span><span class="p">)</span>
    <span class="k">elif</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span> <span class="n">PPOPolicyDiscrete</span><span class="p">):</span>
        <span class="k">assert</span> <span class="n">actions_t</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">num_steps</span><span class="p">,)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">raise</span> <span class="nb">TypeError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Unsupported policy type </span><span class="si">{</span><span class="nb">type</span><span class="p">(</span><span class="n">policy</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">Trajectory</span><span class="p">(</span><span class="n">states_t</span><span class="p">,</span> <span class="n">actions_t</span><span class="p">,</span> <span class="n">rewards_t</span><span class="p">,</span> <span class="n">log_probs_t</span><span class="p">,</span> <span class="n">dones_t</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">train</span><span class="p">(</span>
    <span class="n">env</span><span class="p">,</span>
    <span class="n">policy</span><span class="p">:</span> <span class="n">PPOPolicy</span><span class="p">,</span>
    <span class="n">num_updates_per_epoch</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
    <span class="n">num_episodes_per_epoch</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
    <span class="n">num_epochs</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
    <span class="n">max_trajectory_len</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
    <span class="n">discount</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.95</span><span class="p">,</span>
    <span class="n">clip_eps</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.2</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>

    <span class="c1"># TODO: how to implement learning rate decay?
</span>
    <span class="n">value_fn</span> <span class="o">=</span> <span class="n">_PPOValue</span><span class="p">(</span><span class="n">policy</span><span class="p">.</span><span class="n">state_dim</span><span class="p">)</span>
    <span class="n">agent</span> <span class="o">=</span> <span class="n">_PPOAgent</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span> <span class="n">value_fn</span><span class="p">)</span>
    <span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">agent</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">3e-4</span><span class="p">)</span>

    <span class="n">experiment</span> <span class="o">=</span> <span class="n">mlflow</span><span class="p">.</span><span class="n">set_experiment</span><span class="p">(</span><span class="s">"PPO training"</span><span class="p">)</span>
    <span class="k">with</span> <span class="n">mlflow</span><span class="p">.</span><span class="n">start_run</span><span class="p">(</span><span class="n">experiment_id</span><span class="o">=</span><span class="n">experiment</span><span class="p">.</span><span class="n">experiment_id</span><span class="p">):</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"torch_version"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">__version__</span><span class="p">)</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"cuda_available"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">())</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"num_updates_per_epoch"</span><span class="p">,</span> <span class="n">num_updates_per_epoch</span><span class="p">)</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"num_episodes_per_epoch"</span><span class="p">,</span> <span class="n">num_episodes_per_epoch</span><span class="p">)</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"num_epochs"</span><span class="p">,</span> <span class="n">num_epochs</span><span class="p">)</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"max_trajectory_len"</span><span class="p">,</span> <span class="n">max_trajectory_len</span><span class="p">)</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"discount"</span><span class="p">,</span> <span class="n">discount</span><span class="p">)</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"clip_eps"</span><span class="p">,</span> <span class="n">clip_eps</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">():</span>
            <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"cuda_device_count"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">device_count</span><span class="p">())</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"cuda_device_name"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">get_device_name</span><span class="p">(</span><span class="mi">0</span><span class="p">))</span>
            <span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
                <span class="k">pass</span>

        <span class="n">total_params</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">numel</span><span class="p">()</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">agent</span><span class="p">.</span><span class="n">parameters</span><span class="p">())</span>
        <span class="n">trainable_params</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">numel</span><span class="p">()</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">agent</span><span class="p">.</span><span class="n">parameters</span><span class="p">()</span> <span class="k">if</span> <span class="n">p</span><span class="p">.</span><span class="n">requires_grad</span><span class="p">)</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"total_parameters"</span><span class="p">,</span> <span class="n">total_params</span><span class="p">)</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"trainable_parameters"</span><span class="p">,</span> <span class="n">trainable_params</span><span class="p">)</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span>
            <span class="s">"trainable_percentage"</span><span class="p">,</span> <span class="mf">100.0</span> <span class="o">*</span> <span class="n">trainable_params</span> <span class="o">/</span> <span class="n">total_params</span>
        <span class="p">)</span>

        <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="n">trange</span><span class="p">(</span><span class="n">num_epochs</span><span class="p">,</span> <span class="n">desc</span><span class="o">=</span><span class="s">"train"</span><span class="p">,</span> <span class="n">leave</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>

            <span class="n">trajectory</span> <span class="o">=</span> <span class="bp">None</span>
            <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">trange</span><span class="p">(</span>
                <span class="n">num_episodes_per_epoch</span><span class="p">,</span> <span class="n">desc</span><span class="o">=</span><span class="s">"Collecting trajectory"</span><span class="p">,</span> <span class="n">leave</span><span class="o">=</span><span class="bp">False</span>
            <span class="p">):</span>
                <span class="n">traj_step</span> <span class="o">=</span> <span class="n">rollout_episode</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">policy</span><span class="p">,</span> <span class="n">max_trajectory_len</span><span class="p">)</span>
                <span class="n">trajectory</span> <span class="o">=</span> <span class="n">Trajectory</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">trajectory</span><span class="p">,</span> <span class="n">traj_step</span><span class="p">)</span>

                <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">trajectory</span><span class="p">)</span> <span class="o">&gt;</span> <span class="n">max_trajectory_len</span><span class="p">:</span>
                    <span class="k">break</span>

            <span class="k">if</span> <span class="n">trajectory</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
                <span class="n">logging</span><span class="p">.</span><span class="n">warning</span><span class="p">(</span><span class="s">"Trajectory not collected"</span><span class="p">)</span>
                <span class="k">return</span>

            <span class="n">mlflow</span><span class="p">.</span><span class="n">log_metric</span><span class="p">(</span><span class="s">"traj_len"</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">step</span><span class="o">=</span><span class="n">epoch</span><span class="p">)</span>
            <span class="n">losses</span> <span class="o">=</span> <span class="n">train_one_trajectory</span><span class="p">(</span>
                <span class="n">agent</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">,</span> <span class="n">num_updates_per_epoch</span><span class="p">,</span> <span class="n">discount</span><span class="p">,</span> <span class="n">clip_eps</span>
            <span class="p">)</span>

            <span class="n">losses_t</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span><span class="n">losses</span><span class="p">)</span>
            <span class="n">mlflow</span><span class="p">.</span><span class="n">log_metric</span><span class="p">(</span><span class="s">"loss"</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">losses_t</span><span class="p">.</span><span class="n">mean</span><span class="p">()),</span> <span class="n">step</span><span class="o">=</span><span class="n">epoch</span><span class="p">)</span>

        <span class="c1"># evaluation
</span>        <span class="c1"># TODO: extract to a separate function - but it will require extracting mlflow initialization
</span>        <span class="c1"># to log artifacts to the same run
</span>        <span class="n">env_eval</span><span class="p">:</span> <span class="n">RecordVideo</span> <span class="o">=</span> <span class="n">RecordVideo</span><span class="p">(</span>
            <span class="n">env</span><span class="p">,</span>
            <span class="n">video_folder</span><span class="o">=</span><span class="s">"./videos/"</span><span class="p">,</span>
            <span class="n">episode_trigger</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="p">(</span><span class="n">x</span> <span class="o">%</span> <span class="mi">10</span> <span class="o">==</span> <span class="mi">0</span><span class="p">),</span>  <span class="c1"># record every 10th episode
</span>            <span class="n">name_prefix</span><span class="o">=</span><span class="s">"ppo-lunarlander"</span><span class="p">,</span>
        <span class="p">)</span>
        <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
            <span class="k">for</span> <span class="n">eval_epoch</span> <span class="ow">in</span> <span class="n">trange</span><span class="p">(</span><span class="n">num_epochs</span><span class="p">,</span> <span class="n">desc</span><span class="o">=</span><span class="s">"eval"</span><span class="p">,</span> <span class="n">leave</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
                <span class="n">trajectory</span> <span class="o">=</span> <span class="n">rollout_episode</span><span class="p">(</span><span class="n">env_eval</span><span class="p">,</span> <span class="n">policy</span><span class="p">,</span> <span class="n">max_trajectory_len</span><span class="p">)</span>
                <span class="n">loss</span> <span class="o">=</span> <span class="n">ppo_loss</span><span class="p">(</span><span class="n">agent</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">,</span> <span class="n">discount</span><span class="p">,</span> <span class="n">clip_eps</span><span class="p">)</span>
                <span class="n">mlflow</span><span class="p">.</span><span class="n">log_metric</span><span class="p">(</span><span class="s">"eval loss"</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">loss</span><span class="p">),</span> <span class="n">step</span><span class="o">=</span><span class="n">eval_epoch</span><span class="p">)</span>



<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">env</span> <span class="o">=</span> <span class="n">gym</span><span class="p">.</span><span class="n">make</span><span class="p">(</span><span class="s">"LunarLander-v3"</span><span class="p">,</span> <span class="n">render_mode</span><span class="o">=</span><span class="s">"rgb_array"</span><span class="p">)</span>
    <span class="n">action_dim</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">env</span><span class="p">.</span><span class="n">action_space</span><span class="p">.</span><span class="n">n</span><span class="p">)</span>
    <span class="n">state_dim</span> <span class="o">=</span> <span class="n">env</span><span class="p">.</span><span class="n">reset</span><span class="p">()[</span><span class="mi">0</span><span class="p">].</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

    <span class="n">policy</span> <span class="o">=</span> <span class="n">PPOPolicyDiscrete</span><span class="p">(</span><span class="n">state_dim</span><span class="o">=</span><span class="n">state_dim</span><span class="p">,</span> <span class="n">action_dim</span><span class="o">=</span><span class="n">action_dim</span><span class="p">)</span>

    <span class="n">train</span><span class="p">(</span>
        <span class="n">env</span><span class="p">,</span>
        <span class="n">policy</span><span class="p">,</span>
        <span class="n">num_updates_per_epoch</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
        <span class="n">num_episodes_per_epoch</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
        <span class="n">num_epochs</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span>
        <span class="n">max_trajectory_len</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span>
    <span class="p">)</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code></pre></div></div>
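
<p>The training loop logs its parameters and metrics to MLflow. To browse them after (or during) a run, you can start the local MLflow UI - a minimal sketch, assuming the default file-based tracking store in <code class="language-plaintext highlighter-rouge">./mlruns</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># run from the directory where the training script was started,
# then open http://localhost:5000 in a browser
mlflow ui
</code></pre></div></div>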

<p>It was written and tested with Python 3.14 and uses the following dependencies:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">gymnasium</span><span class="p">[</span><span class="n">box2d</span><span class="p">,</span><span class="n">other</span><span class="p">]</span><span class="o">&gt;=</span><span class="mf">1.2</span><span class="p">.</span><span class="mi">3</span>
<span class="n">mlflow</span><span class="o">&gt;=</span><span class="mf">3.3</span><span class="p">.</span><span class="mi">1</span>
<span class="n">pydantic</span><span class="o">&gt;=</span><span class="mf">2.11</span><span class="p">.</span><span class="mi">7</span>
<span class="n">swig</span><span class="o">&gt;=</span><span class="mf">4.4</span><span class="p">.</span><span class="mi">1</span>
<span class="n">torch</span><span class="o">&gt;=</span><span class="mf">2.9</span><span class="p">.</span><span class="mi">0</span>
<span class="n">tqdm</span><span class="o">&gt;=</span><span class="mf">4.67</span><span class="p">.</span><span class="mi">1</span>
</code></pre></div></div>
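
<p>A minimal way to install these dependencies with <code class="language-plaintext highlighter-rouge">pip</code> could look like the sketch below - the exact commands are an assumption, so adapt them to your environment. <code class="language-plaintext highlighter-rouge">swig</code> goes first because the Box2D extension needs it at build time:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># install the build-time dependency first
python -m pip install "swig&gt;=4.4.1"

# then the remaining packages
python -m pip install "gymnasium[box2d,other]&gt;=1.2.3" "mlflow&gt;=3.3.1" \
    "pydantic&gt;=2.11.7" "torch&gt;=2.9.0" "tqdm&gt;=4.67.1"
</code></pre></div></div>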

<h2 id="conclusion">Conclusion</h2>

<p>I’d love to hear your comments and thoughts on the topic.</p>

<p>The point here was to show that implementing a Deep Learning algorithm from a whitepaper is not a straightforward task, and to encourage you to try it yourself.</p>]]></content><author><name>Piotr Trochim</name></author><category term="ml" /><category term="rl" /><summary type="html"><![CDATA[How difficult is it really?]]></summary></entry><entry><title type="html">What Language Leaves Behind</title><link href="https://piotrtrochim.net/2025/12/12/what-language-leaves-behind/" rel="alternate" type="text/html" title="What Language Leaves Behind" /><published>2025-12-12T12:38:36+00:00</published><updated>2025-12-12T12:38:36+00:00</updated><id>https://piotrtrochim.net/2025/12/12/what-language-leaves-behind</id><content type="html" xml:base="https://piotrtrochim.net/2025/12/12/what-language-leaves-behind/"><![CDATA[<blockquote>
  <p>Language compresses thought. What gets lost?</p>
</blockquote>

<p><a href="https://substackcdn.com/image/fetch/$s_!ZMgV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc90630-e7fa-47bc-b02c-79acde8b1cde_2572x1165.jpeg"><img src="https://substackcdn.com/image/fetch/$s_!ZMgV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc90630-e7fa-47bc-b02c-79acde8b1cde_2572x1165.jpeg" alt="" /></a></p>

<p>Every thought you’ve ever communicated has been compressed.</p>

<p>Written and spoken language is a representation of human thought—one of many, alongside images, music, architecture, gesture, clothing. But unlike a painting or a melody, language carries an illusion of precision. We treat words as if they perfectly capture what we mean.</p>

<p>They don’t.</p>

<p>Language is a codec: it encodes the high-dimensional, tangled, contextual thing happening in your mind into a low-dimensional stream of symbols. And like every codec, it’s lossy. Information is discarded. Structure is flattened. Nuance bleeds away.</p>

<p>Worse, the decompression happens on the other end—in someone else’s mind, using their priors, their context, their understanding of what your words were supposed to mean.</p>

<p>This isn’t a flaw in how we use language. It’s a flaw in what language <em>is</em>.</p>

<h2 id="watching-the-compression-happen">Watching the Compression Happen</h2>

<p>Summarization makes this visible.</p>

<p>Take the <a href="https://en.wikipedia.org/wiki/Addition">Wikipedia article on </a><em><a href="https://en.wikipedia.org/wiki/Addition">addition</a></em>. If I asked you to summarize it in two sentences, you’d have to choose: which parts of your understanding deserve to survive the compression?</p>

<p>Here’s my instinctive attempt:</p>

<blockquote>
  <p><strong>Summary 1:</strong> Addition is an arithmetic operation that combines two or more numbers to produce their total or sum, typically denoted with the + sign.</p>
</blockquote>

<p>And here’s what I wrote when I deliberately reached for different information:</p>

<blockquote>
  <p><strong>Summary 2:</strong> Addition is an operation defined for all kinds of numbers, is associative and commutative, and 0 is its identity element.</p>
</blockquote>

<p>Same source. Same length constraint. Completely different outputs.</p>

<p>The difference isn’t in the facts—both are accurate. The difference is in <em>which thoughts I chose to encode</em>. My first summary emerged from resonance: the overlap between the webpage and my existing mental model. It felt like the “main point” because it matched what I already believed about addition.</p>

<p>My second summary required effort. I had to override my instincts and select information that wouldn’t naturally surface.</p>

<p>This is the compression in action. The thought “what addition means to me” is vast and associative. The two-sentence summary is a narrow channel. Something has to be thrown away.</p>

<h2 id="the-illusion-of-consensus">The Illusion of Consensus</h2>

<p>What’s remarkable is how reliably people throw away the same things.</p>

<p>I ran the same prompt through several LLMs:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Prompt: Summarize addition defined on this website https://en.wikipedia.org/wiki/Addition in 2 sentences
</code></pre></div></div>

<p><strong>Claude 4.5:</strong> Addition is one of the four basic operations of arithmetic that combines two or more numbers (called addends or summands) to produce their total or sum, typically denoted with the plus sign (+).</p>

<p><strong>GPT 5.0:</strong> Addition is a mathematical operation that combines two numbers, called addends, to produce a sum. It is one of the basic arithmetic operations and is widely used in various fields of mathematics, science, and everyday life.</p>

<p><strong>Gemini 3.0:</strong> Addition, typically symbolized by the plus sign (+), is one of the four core arithmetic operations, where the sum of two whole numbers represents the total when those values are combined. It is a fundamental concept in mathematics that is both commutative and associative.</p>

<p>All three converge on essentially the same framing as my instinctive summary. They share a prior—trained into them by human preferences—about what “important” means, what “summary” means, what deserves to survive the compression.</p>

<p>This feels like agreement. It’s actually shared bias.</p>

<h2 id="the-information-that-doesnt-exist">The Information That Doesn’t Exist</h2>

<p>Here are two more summaries of that same Wikipedia page:</p>

<blockquote>
  <p><strong>Summary 3:</strong> The webpage describes a mathematical operation that children as young as 5 months, and animals of some species, are capable of performing.</p>

  <p><strong>Summary 4:</strong> The webpage mentions the word “three” 16 times and the word “addition” 181 times.</p>
</blockquote>

<p>Both are accurate. Both are valid compressions. And both probably feel <em>wrong</em> to you—like they’ve violated some unspoken rule about what a summary should contain.</p>

<p>That feeling of wrongness? That’s your prior asserting itself. Your mental codec has a built-in definition of “main fact,” shaped by culture, education, and a thousand past encounters with the word “summary.”</p>

<p>The LLMs share that prior because they were trained on human outputs. The result is a systematic pattern in what gets discarded: not random noise, but <em>structured absence</em>. Statisticians call this “missing not at random.” The gaps in a summary aren’t accidents. They’re artifacts of the compression algorithm—which is to say, artifacts of how we collectively decided to encode thought into language.</p>

<h2 id="the-receivers-problem">The Receiver’s Problem</h2>

<p>But compression is only half the story.</p>

<p>When you read my summary of addition, you’re not recovering my original thought. You’re <em>reconstructing</em> something in your own mind, using your own priors. The same sentence means different things to a mathematician, a kindergarten teacher, and someone who’s never heard the word “commutative.”</p>

<p>Language doesn’t transmit meaning. It transmits symbols that <em>prompt</em> the receiver to construct meaning locally. The fidelity of that reconstruction depends on how well the sender’s and receiver’s codecs align.</p>

<p>Usually, we don’t notice the drift. Shared culture, shared context, shared assumptions—these create enough overlap that communication feels seamless. But the overlap is never complete. Every sentence you speak arrives slightly different than it left.</p>

<h2 id="what-this-means">What This Means</h2>

<p>Language is not a window into thought. It’s a compression artifact.</p>

<p>When we summarize, explain, argue, or describe, we’re not transferring our mental states intact. We’re lossy-encoding them into a symbolic stream, hoping the receiver’s decompression produces something close enough to be useful.</p>

<p>Sometimes it does. Sometimes what you meant and what I understood are close enough that we never notice the gap.</p>

<p>But the gap is always there.</p>

<p>Every act of communication is an act of interpretation—twice over. First by the speaker, who must choose what survives the encoding. Then by the listener, who must reconstruct meaning from the symbols that remain.</p>

<p>The question is never “did I say it clearly?” The question is: whose priors are doing the work?</p>

]]></content><author><name>Piotr Trochim</name></author><category term="psychology" /><summary type="html"><![CDATA[Language compresses thought. What gets lost?]]></summary></entry><entry><title type="html">Puer Aeternus (forever-child) and Procrastination</title><link href="https://piotrtrochim.net/2025/12/10/puer-aeternus-forever-child-and-procrastination/" rel="alternate" type="text/html" title="Puer Aeternus (forever-child) and Procrastination" /><published>2025-12-10T12:42:46+00:00</published><updated>2025-12-10T12:42:46+00:00</updated><id>https://piotrtrochim.net/2025/12/10/puer-aeternus-forever-child-and-procrastination</id><content type="html" xml:base="https://piotrtrochim.net/2025/12/10/puer-aeternus-forever-child-and-procrastination/"><![CDATA[<blockquote>
  <p>The Hidden Mental Traps Behind Procrastination</p>
</blockquote>

<p><a href="https://substackcdn.com/image/fetch/$s_!xuBS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff18e3462-712a-47a6-a4a4-7fb16627e65f_5299x3533.jpeg"><img src="https://substackcdn.com/image/fetch/$s_!xuBS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff18e3462-712a-47a6-a4a4-7fb16627e65f_5299x3533.jpeg" alt="" /></a></p>

<p>Procrastination isn’t just about laziness—it’s a complex web of mental traps that keep us stuck. Understanding these patterns can help us break free from them.</p>

<h2 id="the-six-faces-of-procrastination">The Six Faces of Procrastination</h2>

<p>When we procrastinate, we typically fall into one of these mental traps:</p>

<p><strong>The “Not Enough” Trap</strong>: We convince ourselves that taking action on this particular thing isn’t sufficient. For example, we might think, “<em>I can’t improve LLM summarization because I’d first need to research the right metric.</em>” We create arbitrary prerequisites that prevent us from starting.</p>

<p><strong>The “Why Bother” Trap</strong>: Past failures loom large and convince us that starting again is pointless. “<em>I’ve already tried writing a book so many times and failed—I’ll probably fail this time too, so why bother?</em>” This trap transforms previous attempts into evidence against future success.</p>

<p><strong>The Efficiency Trap</strong>: We postpone action under the guise of strategic timing. “<em>Why bother finding a partner now when I don’t have a job? If I wait until I’m financially stable, it will be much easier.</em>” We tell ourselves we’re being smart when we’re actually avoiding discomfort.</p>

<p><strong>The Guilt Spiral</strong>: After finally making progress, instead of celebrating, we punish ourselves: “<em>I should have done that years ago.</em>” This guilt actually strengthens the procrastination pattern by making achievement feel bad rather than good.</p>

<p><strong>The Devaluation Trap</strong>: We minimize our accomplishments to avoid recognizing progress: “<em>This was so easy—it wasn’t real progress at all.</em>” By dismissing what we’ve done, we rob ourselves of the momentum that success creates.</p>

<p><strong>The Comparison Trap</strong>: We measure our progress against others and come up short: “<em>In the same amount of time, I did so little while this other person did so much. Why even bother?</em>” This trap ensures that any progress we make feels inadequate.</p>

<h2 id="the-puer-aeternus-connection">The Puer Aeternus Connection</h2>

<p>The concept of Puer Aeternus—Latin for “<em>eternal child</em>”—describes a psychological pattern of avoiding commitment and maintaining unlimited potential. Three mental traps define this pattern:</p>

<p><strong>Failure to Constellate</strong>: The inability to commit to a cohesive identity or expertise—being a jack of all trades but master of none. There’s no clear narrative, no ability to say “<em>I am an expert in this particular thing.</em>”</p>

<p><strong>Fear of Wasting Time</strong>: An overwhelming anxiety about investing energy in something that might not be “<em>good enough</em>” or worthy of one’s talents.</p>

<p><strong>Focus on the Loss</strong>: Obsessive attention to what might be lost or sacrificed by committing to a particular path, rather than what might be gained.</p>

<h2 id="how-these-patterns-feed-each-other">How These Patterns Feed Each Other</h2>

<p>Procrastination becomes the perfect outlet for the Puer Aeternus mindset. Someone caught in this pattern will avoid engaging with tasks because they fear wasting time on something below their standards. They focus relentlessly on potential losses—the opportunities foreclosed, the paths not taken—rather than the value of actually doing the work.</p>

<p>This creates a vicious cycle. Procrastination inevitably leads to failed or never-started projects. These failures reinforce the belief that committing to any particular career area isn’t worthwhile, causing the person to drift between different fields. They remain forever potential, never actual—perpetually failing to constellate into something concrete.</p>

<p>The tragedy is that the very behavior meant to preserve unlimited potential actually destroys it. By avoiding commitment to protect ourselves from wasting time or making the wrong choice, we ensure that no choice ever bears fruit.</p>

<h2 id="references">References</h2>

<ul>
  <li>
    <p>“The Problem of the Puer Aeternus” Marie-Louise von Franz</p>
  </li>
  <li>
    <p><a href="https://www.youtube.com/watch?v=e0ec2-E5Xq8&amp;list=WL&amp;index=14">Dr. Alok Kanojia “Why You Still Haven’t Grown Up” vlog (Healthy GamerGG)</a></p>
  </li>
  <li>
    <p><a href="https://www.youtube.com/watch?v=ztoA0NpguT0&amp;t=1646s">Dr. Alok Kanojia “The Reason You Can Never Progress” vlog (HealthyGamerGG)</a></p>
  </li>
</ul>

]]></content><author><name>Piotr Trochim</name></author><category term="psychology" /><summary type="html"><![CDATA[The Hidden Mental Traps Behind Procrastination]]></summary></entry><entry><title type="html">Ubuntu 22.04 setup on Razer Blade 15 2022</title><link href="https://piotrtrochim.net/2023/07/10/razer-ubuntu-setup/" rel="alternate" type="text/html" title="Ubuntu 22.04 setup on Razer Blade 15 2022" /><published>2023-07-10T00:00:00+00:00</published><updated>2023-07-10T00:00:00+00:00</updated><id>https://piotrtrochim.net/2023/07/10/razer-ubuntu-setup</id><content type="html" xml:base="https://piotrtrochim.net/2023/07/10/razer-ubuntu-setup/"><![CDATA[<h2 id="closing-the-lid-doesnt-suspend-the-laptop">Closing the lid doesn’t suspend the laptop</h2>

<ol>
  <li>create a new file: <code class="language-plaintext highlighter-rouge">/etc/systemd/system/acpi-wake-andy.service</code></li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Unit]
Description=ACPI Wake Service
 
[Service]
Type=oneshot
ExecStart=/bin/sh -c "echo RP05 | sudo tee /proc/acpi/wakeup"
 
[Install]
WantedBy=multi-user.target
</code></pre></div></div>

<ol>
  <li>Enable that service</li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo systemctl start acpi-wake-andy.service
sudo systemctl enable acpi-wake-andy.service
sudo systemctl status acpi-wake-andy.service # check status
</code></pre></div></div>
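
<p>To verify that the service actually changed something, you can inspect the ACPI wakeup table. Note that <code class="language-plaintext highlighter-rouge">RP05</code> is the device name on this particular laptop - on other machines the relevant entry may be named differently, so check the full list first:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat /proc/acpi/wakeup           # list wakeup-capable devices and their current state
grep RP05 /proc/acpi/wakeup     # the last column toggles each time RP05 is echoed into the file
</code></pre></div></div>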

<ol>
  <li>Create a new file <code class="language-plaintext highlighter-rouge">/etc/modprobe.d/nvidia-s2idle.conf</code></li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>options nvidia NVreg_EnableS0ixPowerManagement=1
NVreg_S0ixPowerManagementVideoMemoryThreshold=10000
</code></pre></div></div>
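
<p>Depending on how early the <code class="language-plaintext highlighter-rouge">nvidia</code> module is loaded, the new options may only take effect after the initramfs is rebuilt (this extra step may not be needed on every setup):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo update-initramfs -u   # rebuild the initramfs so the module options are applied at boot
</code></pre></div></div>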

<ol>
  <li>Check status <code class="language-plaintext highlighter-rouge">cat /sys/power/mem_sleep</code></li>
</ol>

<p>If the contents are <code class="language-plaintext highlighter-rouge">[s2idle] deep</code>, you’re done - reboot and check that closing the lid now suspends the laptop.
If the contents are <code class="language-plaintext highlighter-rouge">s2idle [deep]</code>, continue with the next step.</p>

<ol>
  <li><em>Perform this step only if the above command output is</em> <code class="language-plaintext highlighter-rouge">s2idle [deep]</code></li>
</ol>

<p>Edit file <code class="language-plaintext highlighter-rouge">/etc/default/grub</code>, adding <code class="language-plaintext highlighter-rouge">mem_sleep_default=s2idle</code> to <code class="language-plaintext highlighter-rouge">GRUB_CMDLINE_LINUX_DEFAULT</code></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GRUB_CMDLINE_LINUX_DEFAULT="quiet splash mem_sleep_default=s2idle" 
</code></pre></div></div>

<ol>
  <li>
    <p>run <code class="language-plaintext highlighter-rouge">sudo update-grub</code></p>
  </li>
  <li>
    <p>reboot, and you’re done - test that the changes work</p>
  </li>
</ol>]]></content><author><name>ptrochim</name></author><category term="linux" /><summary type="html"><![CDATA[Closing the lid doesn’t suspend the laptop]]></summary></entry><entry><title type="html">Useful libraries</title><link href="https://piotrtrochim.net/2023/07/02/useful-libraries/" rel="alternate" type="text/html" title="Useful libraries" /><published>2023-07-02T00:00:00+00:00</published><updated>2023-07-02T00:00:00+00:00</updated><id>https://piotrtrochim.net/2023/07/02/useful-libraries</id><content type="html" xml:base="https://piotrtrochim.net/2023/07/02/useful-libraries/"><![CDATA[<h1 id="ml">ML</h1>

<table>
  <thead>
    <tr>
      <th>Name and link</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="https://www.pytorchlightning.ai/index.html">PyTorch Lighting</a></td>
      <td>Simplifies building scalable research workflows and production pipelines in PyTorch</td>
    </tr>
    <tr>
      <td><a href="https://omegaconf.readthedocs.io/en/2.3_branch/">OmegaConf</a></td>
      <td>Hierarchical configuration system, supports merging configurations</td>
    </tr>
    <tr>
      <td><a href="https://wandb.ai/">Weights &amp; Biases</a></td>
      <td>Online visualization, better than TensorBoard</td>
    </tr>
  </tbody>
</table>]]></content><author><name>ptrochim</name></author><category term="ml" /><summary type="html"><![CDATA[ML]]></summary></entry><entry><title type="html">Emotion of a memory</title><link href="https://piotrtrochim.net/2020/08/14/emotion-memory/" rel="alternate" type="text/html" title="Emotion of a memory" /><published>2020-08-14T00:00:00+00:00</published><updated>2020-08-14T00:00:00+00:00</updated><id>https://piotrtrochim.net/2020/08/14/emotion-memory</id><content type="html" xml:base="https://piotrtrochim.net/2020/08/14/emotion-memory/"><![CDATA[<p>Just the other day I was watching “Star Trek: Deep Space Nine”, when one of the characters seemed familiar to me - <a href="https://en.wikipedia.org/wiki/Julian_Bashir">Dr Julian Bashir</a>.</p>

<p>As it turned out, the actor who played Dr Bashir - <a href="https://en.wikipedia.org/wiki/Alexander_Siddig">Alexander Siddig</a> - starred in <a href="https://en.wikipedia.org/wiki/The_Spy_(TV_series)">“The Spy”</a> as <a href="https://en.wikipedia.org/wiki/Ahmed_Suidani">Ahmed Suidani</a>. This dark and menacing character stands in stark contrast to the character of the always eager, friendly and unassuming Dr Bashir.</p>

<p>A couple of episodes later, I noticed that my attitude towards Dr Bashir had changed - I no longer perceived him as friendly.
And then I asked myself - would it have been different had I watched Deep Space Nine first?</p>

<p>How and where are emotions associated with memories?</p>]]></content><author><name>ptrochim</name></author><category term="psychology" /><summary type="html"><![CDATA[Just the other day I was watching “Star Trek: Deep Space Nine”, when one of the characters seemed familiar to me - Dr Julian Bashir.]]></summary></entry></feed>