<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://piotrtrochim.net/feed.xml" rel="self" type="application/atom+xml" /><link href="https://piotrtrochim.net/" rel="alternate" type="text/html" /><updated>2026-05-06T11:22:56+00:00</updated><id>https://piotrtrochim.net/feed.xml</id><title type="html">Piotr Trochim</title><subtitle>Thoughts and results I wish to persist.
</subtitle><author><name>Piotr Trochim</name></author><entry><title type="html">Coding is not Software Engineering</title><link href="https://piotrtrochim.net/2026/03/26/coding-is-not-software-engineering/" rel="alternate" type="text/html" title="Coding is not Software Engineering" /><published>2026-03-26T10:20:56+00:00</published><updated>2026-03-26T10:20:56+00:00</updated><id>https://piotrtrochim.net/2026/03/26/coding-is-not-software-engineering</id><content type="html" xml:base="https://piotrtrochim.net/2026/03/26/coding-is-not-software-engineering/"><![CDATA[<p>This topic has been on my mind for many years now.</p>

<p>The software industry has been inventing novel “Engineering” titles, but it really seems to be peaking now, with “Research Engineers”, “ML Engineer”, “Data Engineers” etc. On the other hand, we start seeing the proliferation of “Scientist” titles, surrounded with similar adjectives.</p>

<p>But when you meet and work with those folks - at FAANG companies or at startups - what strikes you is their utter unfamiliarity with software engineering. At best they can be considered Hackers - not the kind that breaks into systems, but the kind that slaps two pieces of code together hoping it will do what they want.</p>

<p>The rise of generative AI and its applications to coding makes things even more confusing. I interview candidates who don’t even know how to code - folks I refer to as Wannabes.</p>

<h2 id="recruitment">Recruitment</h2>

<p>This becomes a problem when it affects the social contract. I’m actively recruiting for my teams, and I’m spending a significant chunk of my time interviewing. I look for Engineers, but the recruiters keep sending me either Hackers or Wannabes.</p>

<p>The reason? All of them have extensive resumes plastered with Engineering titles.</p>

<p>The situation is so ridiculous that one can’t really trust resumes anymore and has to spend a lot of time figuring out people’s actual expertise.</p>

<h2 id="incentives">Incentives</h2>

<p>The other element that adds to the confusion is the set of adjectives appended to the title “Engineer”, and their reflection of current industry trends.</p>

<p>These days, no one wants a “Software Engineering” role. But replace the word “Software” with “ML”, and applicants will swarm to it.</p>

<p>If you ask them about the difference, they will start describing their ambitions to work with AI and their fears of being rendered obsolete on the job market.</p>

<p>But not a single applicant focuses on the “Engineering” aspect of that job, and what it entails in both cases.</p>

<h2 id="coding-is-not-software-engineering">Coding is not Software Engineering</h2>

<p>So what is the difference, really?</p>

<p>Engineering is the process of building machines that don’t break. An engineer knows their creation through and through - they thoroughly understand the mechanisms that underlie it, and they know its operating specifications and safety margins. They have the skills and tools required to modify and repair those mechanisms without sacrificing the complete understanding of the creation.</p>

<p>The term “complexity management” is often used to describe what engineers spend a significant portion of their time doing. When a need to extend a mechanism arises, the engineer spends a lot of time rethinking the existing one in order to best accommodate the new mechanism and to comprehensively chart the new operating regime.</p>

<p>In software, these translate to refactoring and testing.</p>

<p>Coding on the other hand is just a tool an engineer has under their belt - and it’s one that is used at the lowest level of abstraction.</p>

<p>Engineers operate at 3 layers of abstraction:</p>

<ul>
  <li>
    <p>the lowest one is writing code. This amounts to writing very small bricks the larger mechanism will be composed of.</p>
  </li>
  <li>
    <p>above it sits mechanism design - the skill of composing virtual machines that operate in a loop, and of guaranteeing the stability of that loop - that it will not stop, break or go out of whack. In order to construct that, the engineer measures how the bricks “sit together”, how the larger construct works, and what might make it break. Testing is the main tool used at this level.</p>
  </li>
  <li>
    <p>the highest level of abstraction is thinking about the creation as a whole - planning how smaller mechanisms (a.k.a. components) will come together to fulfil user requirements. The main tool here is refactoring, which supports “complexity management”.</p>
  </li>
</ul>

<p>Given this ontology, we can now define the other terminology:</p>

<ul>
  <li>
    <p>Coding - a skill employed at the lowest level of engineering abstraction</p>
  </li>
  <li>
    <p>Hackers - folks operating only at the lowest level of abstraction - slapping pieces of code together and hoping for the best, without employing any heuristic that guarantees directionality of their efforts</p>
  </li>
  <li>
    <p>Wannabes - folks who rely on external tools (such as Gen AI) to do the lowest rung work for them - they operate completely outside this framework.</p>
  </li>
</ul>

<h2 id="hussle-culture-and-the-rise-of-fake-it-till-you-make-it">Hussle culture and the rise of “fake it till you make it”</h2>

<p>Corporate values such as “Move fast and break things” and the culture that prioritizes shipping code over understanding what is being built gave rise to this very confusing landscape.</p>

<p>I feel surrounded by people who keep slapping on titles and cheating their way through careers.</p>

<p>Meanwhile, the years meticulously spent on honing the craft of Software Engineering seem to be impairing my ability to grow my career. I keep encountering a very strange response to my skills - folks are afraid I will “slow the company down”, because “I want to write tests”, all the while they keep scoring me using “Engineering Excellence metrics”.</p>

<p>Now that I’m an entrepreneur, I’m making sure not to let such confusing cultural artifacts take root in my companies - but I still pity good engineers who suffer elsewhere, trying to make sense of these very confusing rules.</p>

<p>And I fear for the future in which Software Engineering will disappear in all, but its name. I fear it, because if the same trend took place in construction of aviation - I would start fearing for my life.</p>]]></content><author><name>Piotr Trochim</name></author><category term="engineering" /><summary type="html"><![CDATA[This topic has been on my mind for many years now.]]></summary></entry><entry><title type="html">PPO vs GRPO</title><link href="https://piotrtrochim.net/2026/02/04/ppo-vs-grpo/" rel="alternate" type="text/html" title="PPO vs GRPO" /><published>2026-02-04T12:43:14+00:00</published><updated>2026-02-04T12:43:14+00:00</updated><id>https://piotrtrochim.net/2026/02/04/ppo-vs-grpo</id><content type="html" xml:base="https://piotrtrochim.net/2026/02/04/ppo-vs-grpo/"><![CDATA[<blockquote>
  <p>A Deep Dive into Two Policy Gradient Implementations</p>
</blockquote>

<p>This post compares two policy optimization algorithms implemented in our RL codebase: Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO). We’ll examine their architectural differences, loss functions, and training loops from an engineering perspective.</p>

<h2 id="shared-abstractions-the-policy-interface">Shared Abstractions: The Policy Interface</h2>

<p>Both PPO and GRPO optimize a stochastic policy—a neural network that outputs a probability distribution over actions given a state. Despite their algorithmic differences, both algorithms need the same core operations from a policy:</p>

<ol>
  <li>
    <p><strong>Sample actions</strong> during environment rollout</p>
  </li>
  <li>
    <p><strong>Evaluate action probabilities</strong> during training (for importance sampling)</p>
  </li>
</ol>

<p>This shared requirement allows us to define a single <code class="language-plaintext highlighter-rouge">StochasticPolicy</code> interface that both algorithms consume:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">StochasticPolicy</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">get_action</span><span class="p">(</span><span class="n">states</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="p">(</span><span class="n">action</span><span class="p">,</span> <span class="n">log_prob</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">evaluate_actions</span><span class="p">(</span><span class="n">states</span><span class="p">,</span> <span class="n">actions</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="p">(</span><span class="n">log_probs</span><span class="p">,</span> <span class="n">entropy</span><span class="p">)</span>
</code></pre></div></div>

<p>The benefit is that policy architectures become algorithm-agnostic. Whether using a simple MLP, a continuous Gaussian policy, or a LoRA-finetuned LLM backbone, the same network can be trained with either PPO or GRPO without modification.</p>
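
<p>To make the interface concrete, here is a minimal sketch of a discrete MLP policy that satisfies it - the class name and layer sizes are illustrative, not the ones from the actual codebase:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.distributions as dist
import torch.nn as nn


class MLPDiscretePolicy(nn.Module):
    """Hypothetical minimal StochasticPolicy for discrete action spaces."""

    def __init__(self, state_dim: int, action_dim: int) -&gt; None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim)
        )

    def get_action(self, states: torch.Tensor) -&gt; tuple[torch.Tensor, torch.Tensor]:
        # Sample from the categorical distribution defined by the logits.
        actions_dist = dist.Categorical(logits=self.net(states))
        actions = actions_dist.sample()
        return actions, actions_dist.log_prob(actions)

    def evaluate_actions(
        self, states: torch.Tensor, actions: torch.Tensor
    ) -&gt; tuple[torch.Tensor, torch.Tensor]:
        # Re-score previously taken actions under the current policy.
        actions_dist = dist.Categorical(logits=self.net(states))
        return actions_dist.log_prob(actions), actions_dist.entropy()
</code></pre></div></div>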

<h2 id="the-core-architectural-difference-value-functions">The Core Architectural Difference: Value Functions</h2>

<p>The fundamental distinction between PPO and GRPO lies in how they estimate the baseline for variance reduction.</p>

<ul>
  <li>
    <p><strong>PPO</strong> trains a value network alongside the policy. This network learns to predict expected returns from each state, providing a per-timestep baseline for advantage computation.</p>
  </li>
  <li>
    <p><strong>GRPO</strong> eliminates the value network entirely. Instead, it collects multiple trajectories per update and normalizes rewards <em>across the group</em>. The baseline is simply the mean reward of the batch—no learned parameters required.</p>
  </li>
</ul>

<p>This trade-off is significant - PPO has more model complexity but needs fewer environment samples per update. GRPO has a simpler architecture but requires collecting <code class="language-plaintext highlighter-rouge">group_dim</code> trajectories before each optimization step.</p>

<h2 id="loss-function-comparison">Loss Function Comparison</h2>

<p>Both algorithms use the clipped surrogate objective from PPO, but differ in advantage computation and regularization.</p>

<h3 id="the-clipped-objective-shared">The Clipped Objective (Shared)</h3>

<p>Both algorithms compute the probability ratio between current and old policies, then clip it to prevent destructive updates:</p>

<p>\[ L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right] \]</p>

<p>where</p>

<p>\[ r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \]</p>

<p>is the importance sampling ratio.</p>
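
<p>In code, the clipped objective can be sketched roughly like this - tensor and function names are illustrative, not the ones from the codebase:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch


def clipped_surrogate_loss(
    new_log_probs: torch.Tensor,
    old_log_probs: torch.Tensor,
    advantages: torch.Tensor,
    clip_eps: float = 0.2,
) -&gt; torch.Tensor:
    # r_t(theta), computed in log-space for numerical stability.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # We maximize the surrogate, so the loss is its negation.
    return -torch.min(unclipped, clipped).mean()
</code></pre></div></div>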

<h3 id="advantage-estimaqtion">Advantage estimaqtion</h3>

<p>PPO uses Generalized Advantage Estimation (GAE):</p>

<p>\[ \hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l} \]</p>

<p>where</p>

<p>\[ \delta_t = r_t + \gamma\, V(s_{t+1}) - V(s_t). \]</p>

<p>Meanwhile, GRPO uses group-relative normalization (which gives rise to its name):</p>

<p>\[ \hat{A}_i = \frac{R_i - \mathrm{mean}(\{R_1, \dots, R_G\})}{\mathrm{std}(\{R_1, \dots, R_G\})} \]</p>

<p>where the mean and standard deviation are computed across multiple collected trajectories for the same episode rollout.</p>
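
<p>For reference, a minimal sketch of GAE as described above might look like the following - the function name echoes the pseudocode later in this post, but the exact signature is an assumption:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch


def compute_gae(
    rewards: torch.Tensor,  # shape (T,)
    values: torch.Tensor,   # shape (T + 1,), includes the bootstrap value
    dones: torch.Tensor,    # shape (T,), 1.0 where the episode ended
    gamma: float = 0.99,
    lam: float = 0.95,
) -&gt; torch.Tensor:
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    # Walk the trajectory backwards, accumulating discounted TD errors.
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages
</code></pre></div></div>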

<h3 id="value-loss-and-regularization">Value loss and regularization</h3>

<p>GRPO drops PPO’s value loss term, because it’s not using a value function.</p>

<p>Regularization also differs - the baseline PPO algorithm used an entropy bonus, while some later extensions migrated to a KL divergence penalty that minimizes the changes to the policy distribution during a training step. A KL divergence penalty is also what GRPO uses.</p>

<p>The GRPO paper also introduced a more stable way of approximating the KL divergence.</p>
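
<p>A sketch of that estimator - the unbiased, always non-negative approximation popularized by John Schulman and adopted in the GRPO paper - could look like this (variable names are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch


def approx_kl(policy_log_probs: torch.Tensor, ref_log_probs: torch.Tensor) -&gt; torch.Tensor:
    # KL(pi_theta || pi_ref) ~= exp(log pi_ref - log pi_theta)
    #                           - (log pi_ref - log pi_theta) - 1
    log_ratio = ref_log_probs - policy_log_probs
    return (torch.exp(log_ratio) - log_ratio - 1.0).mean()
</code></pre></div></div>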

<h3 id="final-loss">Final loss</h3>

<p>For PPO, it’s:</p>

<p>\[ L^{\mathrm{PPO}}(\theta) = \mathbb{E}_t\!\left[ L^{\mathrm{CLIP}}_t(\theta) - c_1\, L^{\mathrm{VF}}_t(\theta) + c_2\, S[\pi_\theta](s_t) \right] \]</p>

<p>while for GRPO it’s:</p>

<p>\[ L^{\mathrm{GRPO}}(\theta) = \mathbb{E}_t\!\left[ L^{\mathrm{CLIP}}_t(\theta) \right] - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right] \]</p>

<h3 id="key-insight-temporal-vs-batch-normalization">Key Insight: Temporal vs. Batch Normalization</h3>

<p>The advantage computation is where these algorithms fundamentally diverge:</p>

<ul>
  <li>
    <p><strong>PPO’s GAE</strong> performs <em>temporal</em> credit assignment—it propagates information backward through time using the learned value function to determine which actions led to good outcomes.</p>
  </li>
  <li>
    <p><strong>GRPO’s group normalization</strong> performs <em>batch</em> comparison—it asks “how did this trajectory’s reward compare to other trajectories from the same policy?” This sidesteps temporal credit assignment entirely.</p>
  </li>
</ul>

<h2 id="training-loop-comparison">Training Loop Comparison</h2>

<p>Because GRPO trains on multiple trajectories for the same initial condition (or, in LLM terms - multiple completions to the same input query), several aspects of the training loop change:</p>

<p>PPO’s tensor shapes for States, Actions, Rewards etc. take the form <code class="language-plaintext highlighter-rouge">(T, …)</code>, where T is the trajectory length. GRPO introduces an additional dimension G - the group dimension - changing the shape to <code class="language-plaintext highlighter-rouge">(T, G, …)</code>.</p>

<p>Because of this additional Group dimension, which can be thought of as a Batch dimension, trajectories for GRPO need to be <strong>padded</strong> and <strong>masked</strong>.</p>

<p>Lastly, because GRPO doesn’t use a value function, only the policy module is trained.</p>

<h3 id="the-masking-requirement-in-grpo">The Masking Requirement in GRPO</h3>

<p>Since GRPO stacks multiple trajectories of different lengths, it must:</p>

<ol>
  <li>
    <p>Pad shorter trajectories to match the longest</p>
  </li>
  <li>
    <p>Create a mask tensor indicating valid (non-padded) positions</p>
  </li>
  <li>
    <p>Apply the mask during:</p>

    <ol>
      <li>
        <p>Advantage normalization (compute mean/std only over valid rewards)</p>
      </li>
      <li>
        <p>Loss computation (exclude padded timesteps from gradients)</p>
      </li>
    </ol>
  </li>
</ol>

<p>This adds implementation complexity but is unavoidable when batching variable-length sequences.</p>
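
<p>As an illustration, masked group normalization might be sketched as follows - the function name and the (T, G) layout follow the shapes described above and are assumptions about the actual codebase:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch


def group_normalize(rewards: torch.Tensor, mask: torch.Tensor, eps: float = 1e-8) -&gt; torch.Tensor:
    """Normalize rewards across the whole group, ignoring padded positions.

    rewards, mask: shape (T, G); mask holds 1.0 for valid timesteps, 0.0 for padding.
    """
    valid_rewards = rewards[mask.bool()]
    mean = valid_rewards.mean()
    std = valid_rewards.std()
    advantages = (rewards - mean) / (std + eps)
    return advantages * mask  # zero-out contributions from padded timesteps
</code></pre></div></div>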

<h3 id="pseudocode-comparison">Pseudocode Comparison</h3>

<p>PPO Training Loop:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="n">epochs</span><span class="p">:</span>
    <span class="n">trajectory</span> <span class="o">=</span> <span class="n">collect_episodes</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span> <span class="n">env</span><span class="p">)</span>          <span class="c1"># Shape: (T, ...)
</span>
    <span class="n">advantages</span> <span class="o">=</span> <span class="n">compute_gae</span><span class="p">(</span><span class="n">trajectory</span><span class="p">,</span> <span class="n">value_fn</span><span class="p">)</span>      <span class="c1"># Temporal credit assignment
</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">ppo_loss</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span> <span class="n">value_fn</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">,</span> <span class="n">advantages</span><span class="p">)</span>

    <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">loss</span><span class="p">)</span>                                <span class="c1"># Updates policy AND value_fn
</span></code></pre></div></div>

<p>GRPO Training Loop:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="n">epochs</span><span class="p">:</span>
    <span class="n">trajectories</span> <span class="o">=</span> <span class="p">[</span><span class="n">collect_episodes</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span> <span class="n">env</span><span class="p">)</span>
                    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">group_dim</span><span class="p">)]</span>          <span class="c1"># List of G trajectories
</span>
    <span class="n">batched</span> <span class="o">=</span> <span class="n">stack_and_pad</span><span class="p">(</span><span class="n">trajectories</span><span class="p">)</span>               <span class="c1"># Shape: (T, G, ...)
</span>    <span class="n">mask</span> <span class="o">=</span> <span class="n">compute_mask</span><span class="p">(</span><span class="n">batched</span><span class="p">)</span>                        <span class="c1"># Valid position indicator
</span>    <span class="n">advantages</span> <span class="o">=</span> <span class="n">group_normalize</span><span class="p">(</span><span class="n">batched</span><span class="p">.</span><span class="n">rewards</span><span class="p">,</span> <span class="n">mask</span><span class="p">)</span> <span class="c1"># Batch comparison
</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">grpo_loss</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span> <span class="n">batched</span><span class="p">,</span> <span class="n">advantages</span><span class="p">,</span> <span class="n">mask</span><span class="p">)</span>

    <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">loss</span><span class="p">)</span>                                <span class="c1"># Updates policy only
</span></code></pre></div></div>

<h2 id="when-to-use-each-algorithm">When to Use Each Algorithm</h2>

<p>The obvious difference is that PPO is a general purpose policy optimization algorithm that can handle both LLM tuning as well as other RL use cases.</p>

<p>In addition, it has the edge when dealing with limited environment samples - the value function increases sample efficiency and enables learning from fewer trajectories.</p>

<p>GRPO on the other hand has been designed specifically for LLM tuning. I took the liberty of unleashing it on a regular RL environment; however, it’s not well suited to that - it’s much less sample efficient and assumes a strong policy baseline.</p>

<h2 id="conclusion">Conclusion</h2>

<p>PPO and GRPO share the same clipped surrogate objective but differ fundamentally in how they estimate advantages:</p>

<ul>
  <li>
    <p><strong>PPO</strong> learns a value function for temporal credit assignment, adding model complexity but reducing sample requirements</p>
  </li>
  <li>
    <p><strong>GRPO</strong> uses group-relative normalization, trading sample efficiency for architectural simplicity</p>
  </li>
</ul>

<p>The shared policy interface in our implementation means you can benchmark both algorithms with identical network architectures—only the training loop and loss computation differ.</p>

<p>Thank you for reading !</p>]]></content><author><name>Piotr Trochim</name></author><category term="ml" /><category term="rl" /><summary type="html"><![CDATA[A Deep Dive into Two Policy Gradient Implementations]]></summary></entry><entry><title type="html">Teaching Claude to Create Pixel Art</title><link href="https://piotrtrochim.net/2026/01/25/teaching-claude-to-create-pixel-art/" rel="alternate" type="text/html" title="Teaching Claude to Create Pixel Art" /><published>2026-01-25T23:27:28+00:00</published><updated>2026-01-25T23:27:28+00:00</updated><id>https://piotrtrochim.net/2026/01/25/teaching-claude-to-create-pixel-art</id><content type="html" xml:base="https://piotrtrochim.net/2026/01/25/teaching-claude-to-create-pixel-art/"><![CDATA[<blockquote>
  <p>Are Agentic Frameworks just slow, imperfect Finetuning frameworks?</p>
</blockquote>

<h2 id="the-challenge">The Challenge</h2>

<p>Like many developers working with AI assistants, I’ve noticed that Claude is quite good at understanding concepts, analyzing patterns, and explaining techniques—but there’s a significant gap between knowledge and capability.</p>

<p>With the recent conversations about the ever improving LLM capabilities, and the agentic frameworks approaching singularity, I thought I’d put Claude to the test.</p>

<h3 id="generating-a-pixel-art-character-using-claude-opus-45">Generating a pixel art character using Claude Opus 4.5</h3>

<p>The challenge I selected was something that lies slightly outside the Claude models’ area of expertise. I wanted to see how the model itself, and the agentic framework built around it - Claude Code - can handle such a task and potentially self-improve in order to execute it.</p>

<p>I started with this image <a href="https://kagi.com/proxy/pixel-art-haunted-house-spooky-night-with-full-moon-bats-graveyard-perfect-halloween-designs-games-illustrations_612609-2851.jpg?c=q4aTKI-vwxsHxiY44K7HIgHTp64x0XTwqKl3Cxm8W3LDmC7NOcmY0r8tCG1ekRkl94xBUm0FRiqU4oE6OkopSEbe3rxflH2pdmBpLkukAQhk52OMltSr5qqsp-KuolhmtFYlaZXBB_9lcsroaxtpWgabN_1yb6w-5pL7UTyNmkkThNe5cYHC8yB3cGTcftHuiHd7-ppIGqURzsaNewxkDgfeVqUyM1MMwWmGqkKSWfUYJQqtx1A2NgIJbPKnRDxK">downloaded from the internet</a> and this prompt:</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!pRrA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4731ebf-2516-43b6-8811-511ee085abb1_626x626.jpeg"><img src="https://substackcdn.com/image/fetch/$s_!pRrA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4731ebf-2516-43b6-8811-511ee085abb1_626x626.jpeg" alt="" /></a></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Here's a reference image. Can you generate a 64x64 pixel art image of a character, a detective in a coat with the extracted style and palette.
</code></pre></div></div>

<p>The result I got was:</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!9TS8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3064cd2-ff18-4418-9d2a-86a17c15a100_1016x1016.png"><img src="https://substackcdn.com/image/fetch/$s_!9TS8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3064cd2-ff18-4418-9d2a-86a17c15a100_1016x1016.png" alt="" /></a></p>

<p>This is not bad, however it’s far from some of the pixel art characters one can see in video games. Here’s an example:</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!k9_C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e5a89a-ee72-4f90-bd51-58fa74e668d2_665x240.png"><img src="https://substackcdn.com/image/fetch/$s_!k9_C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e5a89a-ee72-4f90-bd51-58fa74e668d2_665x240.png" alt="" /></a></p>

<h3 id="claude-code--skills">Claude Code + Skills</h3>

<p>I started with the following prompt</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You are not great at creating pixel art graphics. 

I want you to create a skill that would allow you to: analyze reference pictures, understand the style/composition/color palette/how to represent game characters/buildings/scenery/mood, and render pixel art images based on prompts.=
</code></pre></div></div>

<p>Leveraging Claude Skills, the model generated three skills that, given a reference image, would analyze the style &amp; color palette and render a pixel art image.</p>

<p>Claude successfully analyzed the style (“Dark atmospheric pixel art with warm lighting accents”), extracted a 64-color palette organized by function (sky gradients, house structure, warm lighting), and identified key techniques like dithering and color banding - so far so good.</p>

<p>But then came the rendering test. I prompted Claude to create:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A character, a detective in a coat with the extracted style and palette, 32x32
</code></pre></div></div>

<p>… and another with a slightly higher resolution. The result was… disappointing.</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!WuOD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92ead74-0351-45b8-a49b-cfee8388fc32_1740x878.png"><img src="https://substackcdn.com/image/fetch/$s_!WuOD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92ead74-0351-45b8-a49b-cfee8388fc32_1740x878.png" alt="" /></a></p>

<p>Strange shapes, flat colors, and a character that looked nothing like a detective. The results remained poor. The approach was fundamentally flawed.</p>

<h3 id="the-retrospective">The Retrospective</h3>

<p>I prompted Claude to analyze its own failure:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The results are disappointing - flat colors, strange shapes, bad design (not to mention very strange look of the character). Analyze your approach and explain why the skill you generated doesn’t give you ability to generate good pixel art
</code></pre></div></div>

<p>Some of Claude’s self-analysis was remarkably insightful. The response it generated explained that it had extensive knowledge about pixel art theory, but knowing these principles doesn’t translate to being able to apply them.</p>

<p>…Some, however, were outright hallucinations. It claimed the core problem was no visual feedback. As I mentioned at the beginning, it’s a multi-modal model that takes images as input. One might potentially consider the claim to be true - after all, an .svg document is XML - however it is also a widely recognized vector graphics image format, and I would assume that if the authors of the model went to the lengths of making it the output format of choice, they would also take pains to train the model with this format as input.</p>

<h3 id="pivoting-to-blender-3d">Pivoting to Blender 3D</h3>

<p>I decided to give the approach one more go - this time by pivoting to <a href="https://www.blender.org/">Blender 3D</a>. There were a few reasons:</p>

<ul>
  <li>
    <p>there are MCP servers for Blender 3D on GitHub</p>
  </li>
  <li>
    <p>there is <strong>ample</strong> documentation on Blender 3D API</p>
  </li>
  <li>
    <p>… Claude generated that suggestion itself!</p>
  </li>
</ul>

<p>That last part is particularly important - but I don’t want to focus on it now, so let’s put a pin in it.</p>

<p>Claude generated the new skills and started rendering my character. This is the result:</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!VgEr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa691d3b0-e54e-4ea9-b2b9-e32431351162_1004x1200.png"><img src="https://substackcdn.com/image/fetch/$s_!VgEr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa691d3b0-e54e-4ea9-b2b9-e32431351162_1004x1200.png" alt="" /></a></p>

<h2 id="what-was-this-experiment-really-about">What was this experiment really about</h2>

<p><a href="https://substackcdn.com/image/fetch/$s_!jSJe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58fdf988-ebff-465e-b39b-46885a11a37a_1892x892.png"><img src="https://substackcdn.com/image/fetch/$s_!jSJe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58fdf988-ebff-465e-b39b-46885a11a37a_1892x892.png" alt="" /></a></p>

<p>This was an example of In-Context Learning - the key technique Agentic Frameworks like Claude Code use to perform tasks and improve over time.</p>

<p>The agent was tasked with exploring an unknown domain, learning how to solve tasks in that domain, and ultimately demonstrating that newly acquired ability.</p>

<p>To achieve that it had at its disposal:</p>

<ul>
  <li>
    <p>one of the most powerful LLMs (Anthropic Claude Opus 4.5) capable of reasoning, planning and tool execution</p>
  </li>
  <li>
    <p>access to all of the built-in tools Claude Code comes with</p>
  </li>
  <li>
    <p>unlimited thinking time</p>
  </li>
</ul>

<h3 id="finetuning-perspective">Finetuning perspective</h3>

<p>The agent went ahead and built up a training dataset (skills, human feedback, reference images). It then used that information as input to an autoregressive model, effectively changing the model’s original output token distribution given the same prompt.</p>

<p>If we denoted the prompt to generate the character as X, all of the prompts that led to the model building skills as Y, the generated image as Z and the ideal (expected) image as Z’, then we would expect that:</p>

<p>\[ P(Z' \mid X, Y) > P(Z' \mid X) \]</p>

<p>and thus that the model would learn to generate outputs closer to our expectations.</p>

<p>But the results we obtained do not indicate any learning whatsoever. Quite the contrary - each step introduced a significant degradation wrt. the baseline.</p>

<h3 id="is-this-form-of-training-effective">Is this form of training effective?</h3>

<p>The key observation while working with Claude Code was that it didn’t actually attempt to learn:</p>

<ul>
  <li>
    <p>It generated the skills - as in hallucinated them. It didn’t search for any online resources.</p>
  </li>
  <li>
    <p>It didn’t attempt to test the skills it generated - it didn’t generate any images of its own volition, as a part of the process. I needed to do it myself, and then provide feedback.</p>
  </li>
</ul>

<p>It therefore didn’t follow any learning process in the classical sense of the word - create a train set, modify its beliefs, evaluate them on the test set, lather, rinse and repeat.</p>

<p>The actions it took also didn’t make it easy to understand what kind of feedback would be most useful to inject:</p>

<ul>
  <li>
    <p>how to “refactor” the skills it generated to make them better.</p>
  </li>
  <li>
    <p>what data to inject</p>
  </li>
</ul>

<p>In addition, the framework used context window compression which degraded the quality of the information it learned. The compression ran just before the Blender 3D skill refactor.</p>

<h3 id="skills-as-latent-knowledge-representation">Skills as latent knowledge representation</h3>

<p>The Skill definitions themselves are quite an interesting thing. It’s completely unclear how to write them or what to change in order to make them better.</p>

<p>Or more specifically - if a random word or sentence was removed from or added to a Skill definition, to what degree would it affect the generation result?</p>

<h2 id="conclusion">Conclusion</h2>

<p>This experiment in my opinion shows three large drawbacks of agentic frameworks:</p>

<ul>
  <li>
    <p>they are learning systems, but the learning techniques that apply to them are not well documented</p>
  </li>
  <li>
    <p>they build on top of LLMs, and it looks like they can only achieve results as good as the underlying LLM</p>
  </li>
  <li>
    <p>The systems themselves do not readily help the developer build a working training framework to improve on a particular skill - this still seems to belong to the esoteric realm of “prompt engineering”, with little science explaining how to author prompts to achieve desired effects.</p>
  </li>
</ul>

<p>I’m convinced that, given enough time and effort, one can coin sets of Skills that will aid an LLM in performing an arbitrary task - but is it worth it, given the powerful finetuning techniques we have at our disposal?</p>

<p>Thank you for reading !</p>]]></content><author><name>Piotr Trochim</name></author><category term="ml" /><category term="art" /><summary type="html"><![CDATA[Are Agentic Frameworks just slow, imperfect Finetuning frameworks ?]]></summary></entry><entry><title type="html">Where RL training algorithm ends and a Policy starts</title><link href="https://piotrtrochim.net/2026/01/20/where-rl-training-algorithm-ends/" rel="alternate" type="text/html" title="Where RL training algorithm ends and a Policy starts" /><published>2026-01-20T14:34:11+00:00</published><updated>2026-01-20T14:34:11+00:00</updated><id>https://piotrtrochim.net/2026/01/20/where-rl-training-algorithm-ends</id><content type="html" xml:base="https://piotrtrochim.net/2026/01/20/where-rl-training-algorithm-ends/"><![CDATA[<blockquote>
  <p>From MLP to LLM: Refactoring PPO Policy implementations and introducing modular policy design</p>
</blockquote>

<p><a href="https://substack.com/home/post/p-183151495">Previous post in the series</a></p>

<p>In the previous post we implemented the vanilla PPO policy optimization algorithm. In this one we’re going to play around with one crucial component of that vanilla implementation - the policy definition.</p>

<p>Specifically, we’re going to swap out the MLP based policy for one based around one of popular, open-source LLMs (we’ll use <a href="https://huggingface.co/Qwen/Qwen2-0.5B">Qwen2-0.5B</a> baseline model).</p>

<p>The main goal however is to gain a better understanding of the areas of responsibility of the components that make up a vanilla RL algorithm like this one. I admit to often coming away from reading a whitepaper with the impression that the description of such an algorithm conflates a lot of components. This post is an attempt to segregate them into exchangeable parts.</p>

<p>This will be a pure software engineering post - we’ll do a fair bit of refactoring and design analysis.</p>

<h3 id="vanilla-ppo-architecture-refresher">Vanilla PPO architecture refresher</h3>

<p><a href="https://substackcdn.com/image/fetch/$s_!Jh08!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81d8fc93-9f65-4dcb-bbb6-1eb07a07c8a4_2578x1190.png"><img src="https://substackcdn.com/image/fetch/$s_!Jh08!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81d8fc93-9f65-4dcb-bbb6-1eb07a07c8a4_2578x1190.png" alt="" /></a></p>

<p>Figure 1. Vanilla PPO architecture in two contexts - when a policy is trained using PPO, and when a trained policy is used in an application.</p>

<p>After coming away from reading the whitepaper and the code, I have to admit that the concept of PPO in my mind spanned three separate entities - the training algorithm, the trained policy and the value function. All three were written to complement and aid the training process.</p>

<p>But if we take a step back and consider other contexts - such as when the policy is used after training - we notice that we can jettison all of those entities but the policy itself.</p>

<p>The natural question then is how much of what PPOPolicy represents is related to PPO itself? Could we turn it into an entity that’s unrelated to PPO, one that could be trained with other approaches? We’ll complete this exploration in the next post, when we train it with GRPO, but here we’ll try nudging this concept from a different direction.</p>

<h3 id="llm-based-policy-implementation">LLM based policy implementation</h3>

<p><a href="https://substackcdn.com/image/fetch/$s_!IDEd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7d8a44c-10d2-45e1-ac9b-9328c846d312_1392x1652.png"><img src="https://substackcdn.com/image/fetch/$s_!IDEd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7d8a44c-10d2-45e1-ac9b-9328c846d312_1392x1652.png" alt="" /></a></p>

<p>Figure 2. Policy architecture that uses QWEN2-0.5B base model</p>

<p>Our current policy implementations are simple MLPs. Let’s say we wanted to replace them with a foundational language model - what would need to change?</p>

<p>There’s a few questions we need to answer:</p>

<p><strong>Q: Which model to pick - should it be one of large online LLMs, or something small?</strong></p>

<p>The key factors are our ability to train such a model, speed of execution, and cost. I chose QWEN2-0.5B, because I can host it locally and it’s fairly fast for this little project.</p>

<p><strong>Q: How to even train such a model?</strong></p>

<p>I consider <a href="https://huggingface.co/">HuggingFace</a> my main model repository. It saves me the trouble of adapting models that were written using different coding best practices.</p>

<p>That convenience hides an important aspect of a model though - access to its parameters. The other confusing aspect is the need to use a tokenizer to generate the input. A tokenizer takes strings as input - but we operate on states that are vectors of floating point values rather than strings.</p>

<p>The other issue is the training method - should we train the entire model, or append a few layers to it and train only those, freezing the weights of the underlying model (known as Parameter Efficient Fine Tuning - or PEFT for short)?</p>

<p>Generally speaking, training the entire model can lead to undesired effects, such as loss of generalization. But more importantly - even for a small model like QWEN2-0.5B, it would be very slow and very memory inefficient.</p>

<p>To solve both problems we will therefore use PEFT training method. And lucky for us, HuggingFace offers a neat PEFT implementation in its <a href="https://huggingface.co/docs/peft/en/index">peft</a> library. This library solves our first issue - access to parameters.</p>

<p><strong>Q: How to bypass the tokenizer?</strong></p>

<p><a href="https://substackcdn.com/image/fetch/$s_!NaFB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F096f36d0-0ff5-4e55-b115-8e5fb557363c_1942x1742.png"><img src="https://substackcdn.com/image/fetch/$s_!NaFB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F096f36d0-0ff5-4e55-b115-8e5fb557363c_1942x1742.png" alt="" /></a></p>

<p>Figure 3. part “a” shows the regular setup in which Tokenizer is used as a string encoder/decoder. part “b” shows the bespoke encoder and decoder we need to introduce to encode the state tensor and decode the action tensor.</p>

<p>Tokenizer acts as an input encoder and an output decoder. It is either jointly trained with the model, or a pretrained tokenizer is used to train an LLM. Either way, the LLM depends on how its tokenizer works.</p>

<p>We will therefore need to train our own encoder and decoder. This way, we’ll train those new layers to represent the inputs and decode the outputs in a way that best aligns with the LLM.</p>

<p>The final construct looks like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">peft</span>
<span class="kn">import</span> <span class="nn">transformers</span> <span class="k">as</span> <span class="n">tr</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>


<span class="n">llm_id</span> <span class="o">=</span> <span class="s">"Qwen/Qwen2-0.5B"</span>
<span class="n">llm</span> <span class="o">=</span> <span class="n">tr</span><span class="p">.</span><span class="n">AutoModelForCausalLM</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">llm_id</span><span class="p">)</span>
<span class="n">peft_config</span> <span class="o">=</span> <span class="n">peft</span><span class="p">.</span><span class="n">LoraConfig</span><span class="p">(</span>
    <span class="n">r</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span>  <span class="c1"># rank
</span>    <span class="n">lora_alpha</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>  <span class="c1"># scaling factor
</span>    <span class="n">target_modules</span><span class="o">=</span><span class="p">[</span>
        <span class="s">"q_proj"</span><span class="p">,</span>
        <span class="s">"k_proj"</span><span class="p">,</span>
        <span class="s">"v_proj"</span><span class="p">,</span>
        <span class="s">"o_proj"</span><span class="p">,</span>
        <span class="s">"gate_proj"</span><span class="p">,</span>
        <span class="s">"up_proj"</span><span class="p">,</span>
        <span class="s">"down_proj"</span><span class="p">,</span>
    <span class="p">],</span>
    <span class="n">lora_dropout</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">llm</span> <span class="o">=</span> <span class="n">peft</span><span class="p">.</span><span class="n">get_peft_model</span><span class="p">(</span><span class="n">llm</span><span class="p">,</span> <span class="n">peft_config</span><span class="p">)</span>

<span class="n">llm_hidden_dim</span> <span class="o">=</span> <span class="mi">896</span>
<span class="n">state_encoder</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
    <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="p">,</span> <span class="mi">256</span><span class="p">),</span> 
    <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span> 
    <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">256</span><span class="p">,</span> <span class="n">llm_hidden_dim</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">action_head</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
    <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">llm_hidden_dim</span><span class="p">,</span> <span class="mi">128</span><span class="p">),</span>
    <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
    <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">action_hidden_dim</span><span class="p">),</span>
<span class="p">)</span>
</code></pre></div></div>

<p>At the end of this code we have 3 entities:</p>

<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">state_encoder</code> which converts our state to input embeddings to be passed to the LLM</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">llm</code> which is a LORA wrapper around the base QWEN2-0.5B model</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">action_head</code> which converts the LLM output to action logits</p>
  </li>
</ul>

<p>The following is the code that executes that conversion:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">states_embeddings</span> <span class="o">=</span> <span class="n">state_encoder</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
<span class="n">states_embeddings</span> <span class="o">=</span> <span class="n">states_embeddings</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">llm_outputs</span> <span class="o">=</span> <span class="n">llm</span><span class="p">(</span>
    <span class="n">inputs_embeds</span><span class="o">=</span><span class="n">states_embeddings</span><span class="p">,</span> 
    <span class="n">output_hidden_states</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
<span class="n">last_hidden</span> <span class="o">=</span> <span class="n">llm_outputs</span><span class="p">.</span><span class="n">hidden_states</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">][:,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">:]</span>
<span class="n">action_logits</span> <span class="o">=</span> <span class="n">action_head</span><span class="p">(</span><span class="n">last_hidden</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="mlp-based-policy-refresher">MLP based policy refresher</h3>

<p>We could express the code above with this pseudocode:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_action_logits</span><span class="p">(</span><span class="n">state</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
</code></pre></div></div>

<p>Let’s see how this compares to our MLP policy implementation:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">PPODiscretePolicy</span><span class="p">(</span><span class="n">PPOPolicy</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">(</span>
             <span class="c1"># ...
</span>        <span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">policy</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
             <span class="c1"># ...
</span>        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">_get_action_distribution</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">dist</span><span class="p">.</span><span class="n">Distribution</span><span class="p">:</span>
        <span class="n">action_logits</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">policy</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>
        <span class="n">action_logits_positive</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">functional</span><span class="p">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">action_logits</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">torch_utils</span><span class="p">.</span><span class="n">CategoricalUnsqueezed</span><span class="p">(</span><span class="n">action_logits_positive</span><span class="p">)</span>

   <span class="k">def</span> <span class="nf">get_action</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
     <span class="n">actions_dist</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_action_distribution</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
     <span class="c1"># ...
</span>
   <span class="k">def</span> <span class="nf">evaluate_actions</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">actions</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
   <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
     <span class="n">actions_dist</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_action_distribution</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
     <span class="c1"># ...
</span></code></pre></div></div>

<p>Notice the patterns in this code:</p>

<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">__init__</code> is where we initialize our network</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">_get_action_distribution</code> calculates the logits and then wraps them in a distribution</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">get_action</code> and <code class="language-plaintext highlighter-rouge">evaluate_actions</code> use <code class="language-plaintext highlighter-rouge">_get_action_distribution</code></p>
  </li>
</ul>

<p>So for our purpose, we would need to replace the contents of <code class="language-plaintext highlighter-rouge">__init__</code> and the contents of <code class="language-plaintext highlighter-rouge">_get_action_distribution</code>.</p>

<p>Let’s go one step further though - we have another version of this class that represents a continuous policy (<a href="https://piotrtrochim.substack.com/i/183151495/complete-code">link to full code from the previous post</a>) - how does it differ from this code?</p>

<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">log_probs</code> tensor shape is different, so the continuous policy needs to reduce that shape</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">torch.distributions.Normal</code> is used instead of <code class="language-plaintext highlighter-rouge">torch.distributions.Categorical</code></p>
  </li>
</ul>
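
<p>Putting those two differences together, a hypothetical continuous counterpart might look like the sketch below - the class name, layer sizes and the state-independent <code class="language-plaintext highlighter-rouge">log_std</code> parameter are illustrative assumptions, not the code from the previous post:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.distributions as dist
import torch.nn as nn


class ContinuousPolicyNetwork(nn.Module):
    """Hypothetical continuous counterpart of the discrete policy above."""

    def __init__(self, state_dim: int, action_dim: int) -&gt; None:
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim)
        )
        # Learned, state-independent log standard deviation.
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def _get_action_distribution(self, state: torch.Tensor) -&gt; dist.Distribution:
        action_mean = self.policy(state)
        # Normal replaces Categorical; log_probs later need a sum over the action dim.
        return dist.Normal(action_mean, torch.exp(self.log_std))
</code></pre></div></div>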

<h3 id="policy-classes-refactored">Policy classes refactored</h3>

<p><a href="https://substackcdn.com/image/fetch/$s_!ZnpB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44d8c0d-5242-4deb-9c73-300630e69cae_2332x948.png"><img src="https://substackcdn.com/image/fetch/$s_!ZnpB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44d8c0d-5242-4deb-9c73-300630e69cae_2332x948.png" alt="" /></a></p>

<p>Figure 4. Refactored Policy is now a final class, strategized with a PolicyNetwork implementation.</p>

<p>What changes is network creation &amp; distribution calculation. What stays the same is <code class="language-plaintext highlighter-rouge">get_action</code> and <code class="language-plaintext highlighter-rouge">evaluate_actions</code>.</p>

<p>This leads to the refactoring shown in Figure 4 - where the <code class="language-plaintext highlighter-rouge">Policy</code> class will be closed and finalized, owning the methods for getting an action and for evaluating actions coming from other policies.</p>

<p>The responsibility for calculating the distribution of actions given input states will fall to the different implementations of <code class="language-plaintext highlighter-rouge">PolicyNetwork</code>.</p>

<p>Let’s consider for a moment what Policy class now represents though. It works with distributions of actions, and it samples those distributions to generate actions. It is a <code class="language-plaintext highlighter-rouge">StochasticPolicy</code>, as opposed to a deterministic policy that would employ a non-probabilistic mechanism for generating actions.</p>
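
<p>To make Figure 4 concrete, a minimal sketch of the refactored classes might look like the following - the class and method names reflect my reading of the figure and are not necessarily the exact ones in the codebase:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.distributions as dist
import torch.nn as nn


class PolicyNetwork(nn.Module):
    """Strategy interface: map a batch of states to an action distribution."""

    def get_action_distribution(self, states: torch.Tensor) -&gt; dist.Distribution:
        raise NotImplementedError


class StochasticPolicy:
    """Final, algorithm-agnostic policy that delegates to a PolicyNetwork."""

    def __init__(self, network: PolicyNetwork) -&gt; None:
        self.network = network

    def get_action(self, states: torch.Tensor) -&gt; tuple[torch.Tensor, torch.Tensor]:
        actions_dist = self.network.get_action_distribution(states)
        actions = actions_dist.sample()
        return actions, actions_dist.log_prob(actions)

    def evaluate_actions(
        self, states: torch.Tensor, actions: torch.Tensor
    ) -&gt; tuple[torch.Tensor, torch.Tensor]:
        actions_dist = self.network.get_action_distribution(states)
        return actions_dist.log_prob(actions), actions_dist.entropy()
</code></pre></div></div>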

<h3 id="summary">Summary</h3>

<p>As Figure 4 also shows, we have now, to a degree, isolated the concept of Policy from PPO.</p>

<p>Is that separation complete? That depends on:</p>

<ul>
  <li>
    <p>whether PPO would work with deterministic policies</p>
  </li>
  <li>
    <p>whether our Stochastic policy could be trained using algorithms other than PPO (e.g. MPO, TRPO, GRPO etc.)</p>
  </li>
</ul>

<p>In the next posts I’ll try to answer some of those questions. Thank you for reading !</p>]]></content><author><name>Piotr Trochim</name></author><category term="ml" /><category term="rl" /><summary type="html"><![CDATA[From MLP to LLM: Refactoring PPO Policy implementations and introducing modular policy design]]></summary></entry><entry><title type="html">Implementing an RL algorithm (PPO) from a whitepaper</title><link href="https://piotrtrochim.net/2026/01/07/implementing-an-rl-algorithm-ppo/" rel="alternate" type="text/html" title="Implementing an RL algorithm (PPO) from a whitepaper" /><published>2026-01-07T14:54:45+00:00</published><updated>2026-01-07T14:54:45+00:00</updated><id>https://piotrtrochim.net/2026/01/07/implementing-an-rl-algorithm-ppo</id><content type="html" xml:base="https://piotrtrochim.net/2026/01/07/implementing-an-rl-algorithm-ppo/"><![CDATA[<blockquote>
  <p>How difficult is it really?</p>
</blockquote>

<p>Deep Learning whitepapers are not straightforward to implement. They require multiple readings, reading related literature, multiple iterations and lots of trial and error.</p>

<p>It makes the task of implementing and engineering research breakthroughs an arduous one, especially for less experienced engineers.</p>

<p>I would like to show what that might look like on a random paper I selected.</p>

<h2 id="proximal-policy-optimization-algorithms-ppo">Proximal Policy Optimization Algorithms (PPO)</h2>

<p>…is an on-policy policy gradient reinforcement learning algorithm <a href="https://arxiv.org/abs/1707.06347">arXiv:1707.06347</a></p>

<p>I chose this one for a few reasons - it has stood the test of time, is widely used and generic, and several baseline implementations exist (e.g. <a href="https://docs.cleanrl.dev/rl-algorithms/ppo/">CleanRL PPO</a>) - these qualities make it a very worthwhile piece of code to implement and understand.</p>

<p>But most important of all - despite being a high quality paper, I found the experience of reading it very similar to many other Deep Learning whitepapers. So it’s not an outlier, but a good representative of the class of problems I want to focus on.</p>

<h2 id="first-read-of-the-paper">First read of the paper</h2>

<p>I want to take you, my dear reader, on a journey. And that means you need to get your hands dirty to fully appreciate the experience.</p>

<p>So please click on the link below and read the original whitepaper. Approach it with the sense of curiosity you will need, knowing that in a moment you will have to implement it. Please note your observations somewhere.</p>

<p>You may or may not be familiar with Reinforcement Learning. No matter your experience, I will assume you have no clue about it and that this is your first encounter with it. I will however assume that you are a software engineer - you have a working knowledge of a programming language of your choice (we’ll be using Python here), basic data structures and algorithms, code complexity management, and testing your code.</p>

<p>Go ahead - <a href="https://arxiv.org/abs/1707.06347">arXiv:1707.06347</a> - come back here when you’re done.</p>

<h2 id="engineering-approach-to-implementation">Engineering approach to implementation</h2>

<p>The first order questions I will attempt to answer are:</p>

<ul>
  <li>
    <p>what exactly am I implementing</p>
  </li>
  <li>
    <p>how should I validate my implementation</p>
  </li>
</ul>

<p>Note that we are translating a paper to an implementation, rather than designing our own algorithm/system - therefore we don’t need to worry about defining and fulfilling functional and non-functional requirements.</p>

<h3 id="evaluation">Evaluation</h3>

<p>In terms of evaluation, we can find references to the RL environments the original paper was evaluated on in the sections <strong>Experiments</strong> and <strong>Appendix B</strong>:</p>

<ul>
  <li>
    <p>environments that have been moved to <a href="https://gymnasium.farama.org/environments/mujoco/">Gymnasium</a> (the Farama Foundation’s continuation of OpenAI Gym) since the publication of this paper</p>
  </li>
  <li>
    <p><a href="https://github.com/openai/gym">Roboschool</a> has been deprecated, and the link to PyBullet mentioned in the deprecation note no longer works</p>
  </li>
</ul>

<p>Gymnasium is well documented and has a <strong>standardized environment API</strong>, which allows us to interact with any environment using the same code.</p>

<p>Cumulative score from each environment is used as the <strong>metric</strong>. Notice that the scores are specific to each environment (different Y axis scales):</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!lVy5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3364bf2d-0e60-4d69-8a39-f93eee3faead_2402x1182.png"><img src="https://substackcdn.com/image/fetch/$s_!lVy5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3364bf2d-0e60-4d69-8a39-f93eee3faead_2402x1182.png" alt="" /></a></p>

<p>Figure 1. Example PPO evaluation plots</p>

<p>The X axis on the plots shows the number of training steps, and the Y axis shows the evaluation score obtained after that many steps.</p>

<p>Based on this information, we can develop the following framework:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">logging</span>
<span class="kn">import</span> <span class="nn">gymnasium</span> <span class="k">as</span> <span class="n">gym</span>

<span class="k">def</span> <span class="nf">evaluate</span><span class="p">(</span><span class="o">**</span><span class="n">evaluated_algo</span><span class="p">:</span> <span class="err">???</span><span class="o">**</span> <span class="p">,</span> <span class="n">training_step</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
  <span class="n">eval_environment_ids</span> <span class="o">=</span> <span class="p">[</span><span class="s">"HalfCheetah-v1"</span><span class="p">,</span> <span class="s">"Hopper-v1"</span><span class="p">,</span> <span class="s">"Swimmer-v1"</span><span class="p">,</span> <span class="p">]</span> <span class="c1"># ... add others
</span>
  <span class="k">for</span> <span class="n">env_id</span> <span class="ow">in</span> <span class="n">eval_environment_ids</span><span class="p">:</span>
    <span class="n">env</span> <span class="o">=</span> <span class="n">gym</span><span class="p">.</span><span class="n">make</span><span class="p">(</span><span class="n">env_id</span><span class="p">)</span>

    <span class="c1"># Use the standardized OpenAI Gumnasium API to run the environments
</span>    <span class="n">observation</span> <span class="o">=</span> <span class="n">env</span><span class="p">.</span><span class="n">reset</span><span class="p">()</span>
    <span class="n">episode_finished</span> <span class="o">=</span> <span class="bp">False</span>
    <span class="n">score</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">while</span> <span class="ow">not</span> <span class="n">episode_finished</span><span class="p">:</span>
       <span class="o">**</span><span class="n">action</span> <span class="o">=</span> <span class="n">evaluated_algo</span><span class="p">(</span><span class="n">observation</span><span class="p">)</span><span class="o">**</span>       <span class="n">observation</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">episode_finished</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">env</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">action</span><span class="p">)</span>
       <span class="n">score</span> <span class="o">+=</span> <span class="n">reward</span>

    <span class="n">logging</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Score for %s after training step %d is %f"</span><span class="p">,</span> <span class="n">env_id</span><span class="p">,</span> <span class="n">training_step</span><span class="p">,</span> <span class="n">score</span><span class="p">)</span>
</code></pre></div></div>

<p>I highlighted the two open questions:</p>

<ul>
  <li>
    <p>what exactly the implemented algorithm is remains unclear</p>
  </li>
  <li>
    <p>we can technically figure out that the implemented algorithm is supposed to convert observations returned by the environment into actions that will be fed back to the environment - however, this information is NOT spelled out in the paper.</p>
  </li>
</ul>

<h3 id="arcane-knowledge">Arcane knowledge</h3>

<p>The second point specifically requires the reader to climb the ladder of references and build an understanding of what the essence of Reinforcement Learning algorithms is.</p>

<p>Today, with ample resources available, RL is not the mystery it used to be in 2017, when this paper was originally published. Back then however, this was arcane knowledge, with few materials and fewer available example implementations. What was worse - the knowledge that did exist introduced such vast terminology that it was very difficult to piece together a cohesive understanding of these algorithms.</p>

<p>This led (in my case, and perhaps in yours too) to <strong>losing the forest for the trees</strong> - instead of being able to implement e.g. PPO, one first had to understand “on-policy”, “off-policy”, “policy gradients”, “policy functions”, “value functions”, “Bellman equation”, “losses”, “rewards - sparse and dense” …</p>

<p>Just take a look for yourself:</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!0jjF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F350f8f2b-e85f-4ee5-bd7d-49617bba8c86_685x368.png"><img src="https://substackcdn.com/image/fetch/$s_!0jjF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F350f8f2b-e85f-4ee5-bd7d-49617bba8c86_685x368.png" alt="" /></a></p>

<p>Figure 2. <a href="https://link.springer.com/chapter/10.1007/978-981-15-4095-0_3">Taxonomy of Reinforcement learning models</a>, Springer</p>

<p>And here’s a very good overview paper - <a href="https://arxiv.org/pdf/2209.14940">arXiv:2209.14940</a></p>

<p>This approach is employed by 99% of the whitepapers I’ve read during my career, and it makes each of them a <strong>delta</strong> that builds on top of other information, rather than a self-contained piece of work.</p>

<h3 id="arcane-knowledge-of-reinforcement-learning-demistified">Arcane knowledge of Reinforcement Learning demistified</h3>

<p><a href="https://substackcdn.com/image/fetch/$s_!rnkZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fec0d51-bfdb-4960-b7a8-25131e9c0a26_1532x448.png"><img src="https://substackcdn.com/image/fetch/$s_!rnkZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fec0d51-bfdb-4960-b7a8-25131e9c0a26_1532x448.png" alt="" /></a></p>

<p>Figure 3. Reinforcement learning system, simplified view</p>

<p>Reinforcement Learning describes a system of two entities - an environment and a policy - that trade three key pieces of data between themselves: observations (sometimes referred to as state), actions and rewards.</p>

<p>Both can be thought of as functions:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">environment</span><span class="p">(</span><span class="n">action</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">observation</span><span class="p">,</span> <span class="n">reward</span><span class="p">]:</span>
   <span class="k">pass</span>

<span class="k">def</span> <span class="nf">policy</span><span class="p">(</span><span class="n">observation</span><span class="p">,</span> <span class="n">reward</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">action</span><span class="p">:</span>
   <span class="k">pass</span>
</code></pre></div></div>
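<p>To make this exchange of data concrete, here is a tiny, runnable sketch of the interaction loop from Figure 3. <code class="language-plaintext highlighter-rouge">toy_environment</code> and <code class="language-plaintext highlighter-rouge">toy_policy</code> are stand-ins of my own making for the abstract functions above:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random


def toy_environment(action: float) -> tuple[float, float]:
    """A stand-in environment: the observation is noise, the reward prefers small actions."""
    observation = random.random()
    reward = -abs(action)
    return observation, reward


def toy_policy(observation: float, reward: float) -> float:
    """A stand-in policy: ignores its inputs and acts at random (no learning yet)."""
    return random.uniform(-1.0, 1.0)


# The interaction from Figure 3: observations, actions and rewards
# are traded back and forth between the two entities.
observation, reward = toy_environment(action=0.0)
for _ in range(10):
    action = toy_policy(observation, reward)
    observation, reward = toy_environment(action)
</code></pre></div></div>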

<p>The “learning” part of a Reinforcement Learning system pertains to the reward signal we are passing to the policy - we are stating that the policy learns and changes its definition based on that signal. The environment, on the other hand, never changes its definition and stays constant.</p>

<p>How the policy learns - that is the real reason behind the diversity of methods shown in Figure 2: some versions learn from each action, some require massive amounts of accumulated data, and some use auxiliary entities.</p>

<p>Enter PPO.</p>

<h3 id="ppo-demistified">PPO demistified</h3>

<p><a href="https://substackcdn.com/image/fetch/$s_!kZ4C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1795ae23-cf76-4d15-983b-8b98036a916b_3236x1242.png"><img src="https://substackcdn.com/image/fetch/$s_!kZ4C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1795ae23-cf76-4d15-983b-8b98036a916b_3236x1242.png" alt="" /></a></p>

<p>Figure 4. PPO overlaid on top of the training system. The system from Figure 3 is colored purple.</p>

<p>I arrived at the diagram in Figure 4 by stitching together the information scattered across sections 2, 3, 5 and 6. That information has to be grafted onto a framework that the paper does not mention but assumes familiarity with.</p>

<h4 id="step-1-overall-ppo-training-algorithm">Step 1 overall PPO training algorithm</h4>

<p><a href="https://substackcdn.com/image/fetch/$s_!RXc_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ba05ba8-0a78-48b2-accc-3d4fbbe7c2e9_2856x1744.png"><img src="https://substackcdn.com/image/fetch/$s_!RXc_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ba05ba8-0a78-48b2-accc-3d4fbbe7c2e9_2856x1744.png" alt="" /></a></p>

<p>Figure 5. Relationship between the PPO training algorithm and the RL system design that uses PPO.</p>

<p>We can find it in section 5 of the paper. In Figure 5 I colored the relevant components from Figure 4 to highlight the role they play in the algorithm.</p>
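<p>To give a feel for how those pieces fit together, here is a rough sketch of that training loop. It leans on code that only appears in the later steps (<code class="language-plaintext highlighter-rouge">ppo_loss</code>) and on helpers named by me rather than by the paper (<code class="language-plaintext highlighter-rouge">collect_trajectory</code>, the <code class="language-plaintext highlighter-rouge">agent</code> bundle), so treat it as an outline rather than a finished implementation:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch


def train_ppo(
    agent,  # bundles the policy and the value function, see the later steps
    env,
    iterations: int,
    epochs_per_iteration: int,
    discount: float = 0.99,
    clip_eps: float = 0.2,
) -> None:
    # a single optimizer over both networks; the learning rate is a tunable choice
    optimizer = torch.optim.Adam(
        list(agent.policy.parameters()) + list(agent.value_fn.parameters()), lr=3e-4
    )

    for _ in range(iterations):
        # 1. Run the current ("old") policy in the environment and record the
        #    states, actions, rewards and log probabilities it produced.
        trajectory = collect_trajectory(agent.policy, env)

        # 2. Re-optimize the surrogate loss on that fixed trajectory for a few epochs.
        for _ in range(epochs_per_iteration):
            loss = ppo_loss(agent, trajectory, discount, clip_eps)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
</code></pre></div></div>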

<h4 id="step-2-computing-advantages">Step 2 Computing advantages</h4>

<p>You will find references to these concepts showing up in sections 2-5, with the definition of generalized advantage estimation presented in equations 11 and 12:</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!_M_F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c2ea7d-91ee-43d8-94d6-67441adbeeed_2342x392.png"><img src="https://substackcdn.com/image/fetch/$s_!_M_F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c2ea7d-91ee-43d8-94d6-67441adbeeed_2342x392.png" alt="" /></a></p>

<p>Notice the use of V(s_{t+1}) and V(s_{t}) - these are values obtained from the <strong>Value function</strong>, an auxiliary neural network that the PPO algorithm introduces.</p>

<p>The code, implemented in PyTorch, looks like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">compute_gae</span><span class="p">(</span>
    <span class="n">rewards</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span>
    <span class="n">values</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span>
    <span class="n">dones</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span>
    <span class="n">discount</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span>
    <span class="n">gae_lambda</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.95</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
    <span class="s">"""Compute Generalized Advantage Estimation."""</span>
    <span class="n">advantages</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">gae</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="mf">0.0</span><span class="p">)</span>

    <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">reversed</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">rewards</span><span class="p">))):</span>
        <span class="k">if</span> <span class="n">t</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">rewards</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">:</span>
            <span class="n">next_value</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span>
                <span class="mf">0.0</span>
            <span class="p">)</span>  <span class="c1"># Bootstrap value (or get V(s_T) if not terminal)
</span>        <span class="k">else</span><span class="p">:</span>
            <span class="n">next_value</span> <span class="o">=</span> <span class="n">values</span><span class="p">[</span><span class="n">t</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span>

        <span class="c1"># Mask next_value if episode ended
</span>        <span class="n">next_value</span> <span class="o">=</span> <span class="n">next_value</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">dones</span><span class="p">[</span><span class="n">t</span><span class="p">])</span>

        <span class="c1"># δ_t = r_t + γV(s_{t+1}) - V(s_t)
</span>        <span class="n">delta</span> <span class="o">=</span> <span class="n">rewards</span><span class="p">[</span><span class="n">t</span><span class="p">]</span> <span class="o">+</span> <span class="n">discount</span> <span class="o">*</span> <span class="n">next_value</span> <span class="o">-</span> <span class="n">values</span><span class="p">[</span><span class="n">t</span><span class="p">]</span>

        <span class="c1"># A_t = δ_t + (γλ) δ_{t+1} + ... + (γλ)_{T-t+1}δ_{T-1}
</span>        <span class="n">gae</span> <span class="o">=</span> <span class="n">delta</span> <span class="o">+</span> <span class="n">discount</span> <span class="o">*</span> <span class="n">gae_lambda</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">dones</span><span class="p">[</span><span class="n">t</span><span class="p">])</span> <span class="o">*</span> <span class="n">gae</span>
        <span class="n">advantages</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">gae</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">FloatTensor</span><span class="p">(</span><span class="n">advantages</span><span class="p">)</span>
</code></pre></div></div>
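<p>As a quick sanity check of the shapes involved, the function can be exercised on a hand-made, three-step trajectory (the numbers are arbitrary):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rewards = torch.tensor([1.0, 0.0, 1.0])
values = torch.tensor([0.5, 0.4, 0.3])  # V(s_t) as predicted by the value function
dones = torch.tensor([0.0, 0.0, 1.0])   # the episode terminates at the last step

advantages = compute_gae(rewards, values, dones, discount=0.99)
print(advantages.shape)  # torch.Size([3]) - one advantage per timestep
</code></pre></div></div>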

<h4 id="step-3-optimize-surrogate-l-loss">Step 3 Optimize surrogate L (loss)</h4>

<p>The loss is defined in sections 3 and 5 of the paper.</p>

<p><strong>Clipped policy loss</strong> (section 3):</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!rHRx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12737b7c-6284-49f4-a9af-ced255b8cafb_2324x244.png"><img src="https://substackcdn.com/image/fetch/$s_!rHRx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12737b7c-6284-49f4-a9af-ced255b8cafb_2324x244.png" alt="" /></a></p>

<p>Full loss equation that combines the <strong>clipped policy loss</strong> with the <strong>value function loss</strong> and the <strong>entropy bonus</strong> (section 5):</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!773x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc546ba53-767c-4ddb-acf7-b7372f4578fe_2330x420.png"><img src="https://substackcdn.com/image/fetch/$s_!773x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc546ba53-767c-4ddb-acf7-b7372f4578fe_2330x420.png" alt="" /></a></p>

<p>PyTorch implementation of the loss looks like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Trajectory</span><span class="p">(</span><span class="n">typing</span><span class="p">.</span><span class="n">NamedTuple</span><span class="p">):</span>
    <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="n">actions</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="n">rewards</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="n">log_probs</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="n">dones</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>


<span class="k">def</span> <span class="nf">ppo_loss</span><span class="p">(</span>
    <span class="n">agent</span><span class="p">:</span> <span class="n">_PPOAgent</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">:</span> <span class="n">Trajectory</span><span class="p">,</span> <span class="n">discount</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">clip_eps</span><span class="p">:</span> <span class="nb">float</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
    <span class="c1"># LCLIP + L_VF + L_St(θ) = E_t[LCLIP_t(θ) − c_1 L_t^VF(θ) + c_2 S[πθ](st)]
</span>    <span class="n">log_probs</span><span class="p">,</span> <span class="n">entropies</span> <span class="o">=</span> <span class="n">agent</span><span class="p">.</span><span class="n">policy</span><span class="p">.</span><span class="n">evaluate_actions</span><span class="p">(</span>
        <span class="n">trajectory</span><span class="p">.</span><span class="n">states</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">.</span><span class="n">actions</span>
    <span class="p">)</span>
    <span class="n">values</span> <span class="o">=</span> <span class="n">agent</span><span class="p">.</span><span class="n">value_fn</span><span class="p">(</span><span class="n">trajectory</span><span class="p">.</span><span class="n">states</span><span class="p">)</span>
    <span class="n">advantages</span> <span class="o">=</span> <span class="n">compute_gae</span><span class="p">(</span><span class="n">trajectory</span><span class="p">.</span><span class="n">rewards</span><span class="p">,</span> <span class="n">values</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">.</span><span class="n">dones</span><span class="p">,</span> <span class="n">discount</span><span class="p">)</span>
    <span class="n">returns</span> <span class="o">=</span> <span class="n">advantages</span> <span class="o">+</span> <span class="n">values</span>
    <span class="c1"># Normalize advantages to stabilize training
</span>    <span class="n">advantages</span> <span class="o">=</span> <span class="p">(</span><span class="n">advantages</span> <span class="o">-</span> <span class="n">advantages</span><span class="p">.</span><span class="n">mean</span><span class="p">())</span> <span class="o">/</span> <span class="p">(</span><span class="n">advantages</span><span class="p">.</span><span class="n">std</span><span class="p">()</span> <span class="o">+</span> <span class="mf">1e-8</span><span class="p">)</span>

    <span class="k">assert</span> <span class="n">log_probs</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">trajectory</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>
    <span class="k">assert</span> <span class="n">trajectory</span><span class="p">.</span><span class="n">log_probs</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">trajectory</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>
    <span class="n">ratio</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">log_probs</span> <span class="o">-</span> <span class="n">trajectory</span><span class="p">.</span><span class="n">log_probs</span><span class="p">)</span>
    <span class="n">policy_loss</span> <span class="o">=</span> <span class="o">-</span><span class="n">torch</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span>
        <span class="n">ratio</span> <span class="o">*</span> <span class="n">advantages</span><span class="p">,</span>
        <span class="n">torch</span><span class="p">.</span><span class="n">clip</span><span class="p">(</span><span class="n">ratio</span><span class="p">,</span> <span class="mf">1.0</span> <span class="o">-</span> <span class="n">clip_eps</span><span class="p">,</span> <span class="mf">1.0</span> <span class="o">+</span> <span class="n">clip_eps</span><span class="p">)</span> <span class="o">*</span> <span class="n">advantages</span><span class="p">,</span>
    <span class="p">).</span><span class="n">mean</span><span class="p">()</span>
    <span class="n">value_loss</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="p">(</span><span class="n">returns</span> <span class="o">-</span> <span class="n">values</span><span class="p">).</span><span class="nb">pow</span><span class="p">(</span><span class="mi">2</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span>
    <span class="n">c_2</span> <span class="o">=</span> <span class="mf">0.01</span>
    <span class="n">entropy_loss</span> <span class="o">=</span> <span class="n">c_2</span> <span class="o">*</span> <span class="n">entropies</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">policy_loss</span> <span class="o">+</span> <span class="n">value_loss</span> <span class="o">-</span> <span class="n">entropy_loss</span>

    <span class="k">return</span> <span class="n">loss</span>
</code></pre></div></div>

<p>I am deliberately omitting the implementations of PPOPolicy and _PPOValue for the time being, in order not to distract from the main objective of this section - the loss function.</p>

<p>Note that the loss function expects several values to be provided either as input or to be computed by the policy, among them <strong>log probabilities</strong> and <strong>entropies</strong>.</p>

<p>Also notice that there are 2 sets of log probabilities - those calculated by the <code class="language-plaintext highlighter-rouge">policy.evaluate_actions</code>, and those provided as a part of the <code class="language-plaintext highlighter-rouge">trajectory</code> input.</p>

<p>Picking up on this implementation nuance requires jumping to the beginning of section 3 of the paper and looking at the definition of the <strong>probability ratio</strong> that is used in the <strong>clipped policy loss</strong>.</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!w4gM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1785999-9317-44ed-b457-ce05307a529f_886x150.png"><img src="https://substackcdn.com/image/fetch/$s_!w4gM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1785999-9317-44ed-b457-ce05307a529f_886x150.png" alt="" /></a></p>

<p>pi_{theta} and pi_{theta}_old refer to the <strong>probabilities</strong> (not the log probabilities) of actions generated by the new policy (in this context - the policy being trained) and by the old policy (in this context - the policy used to collect the trajectory).</p>

<p>An algorithmic trick that involves taking the logarithm of this ratio allows us to use <strong>log probabilities</strong>, which packages such as PyTorch readily provide from their distribution implementations.</p>
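<p>In code the trick boils down to a single identity, which is exactly what the <code class="language-plaintext highlighter-rouge">ratio</code> computation in <code class="language-plaintext highlighter-rouge">ppo_loss</code> relies on. A tiny standalone check, with made-up numbers:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

new_log_prob = torch.tensor(-1.2)  # log pi_theta(a|s), from the policy being trained
old_log_prob = torch.tensor(-1.5)  # log pi_theta_old(a|s), stored while collecting the trajectory

# pi_theta(a|s) / pi_theta_old(a|s) == exp(log pi_theta(a|s) - log pi_theta_old(a|s))
ratio = torch.exp(new_log_prob - old_log_prob)
assert torch.isclose(ratio, torch.exp(new_log_prob) / torch.exp(old_log_prob))
</code></pre></div></div>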

<h4 id="step-4-policy-and-value-function-network-architecture">Step 4 Policy and Value function network architecture</h4>

<p>This detail is described all the way in section 6.</p>

<p><a href="https://substackcdn.com/image/fetch/$s_!qxdR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53f7dc4e-822d-4b5f-8cd7-274ba034d175_2348x282.png"><img src="https://substackcdn.com/image/fetch/$s_!qxdR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53f7dc4e-822d-4b5f-8cd7-274ba034d175_2348x282.png" alt="" /></a></p>

<h4 id="step-5-implementing-the-policy-function">Step 5 Implementing the Policy function</h4>

<p>At this point we can go ahead and implement the networks themselves.</p>

<p>To make a long story short - depending on the type of actions an environment works with - continuous or discrete - we need to interpret the logits returned by the model differently, and slightly change the architecture of the network.</p>

<p>Policy base class</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">PPOPolicy</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">,</span> <span class="n">abc</span><span class="p">.</span><span class="n">ABC</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span> <span class="o">=</span> <span class="n">state_dim</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">action_dim</span> <span class="o">=</span> <span class="n">action_dim</span>

    <span class="o">@</span><span class="n">abc</span><span class="p">.</span><span class="n">abstractmethod</span>
    <span class="k">def</span> <span class="nf">get_action</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">pass</span>

    <span class="o">@</span><span class="n">abc</span><span class="p">.</span><span class="n">abstractmethod</span>
    <span class="k">def</span> <span class="nf">evaluate_actions</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">actions</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">pass</span>
</code></pre></div></div>

<p>The implementation for continuous action spaces trains the policy model to return the parameters (means and log standard deviations) of a Gaussian distribution, and then samples that distribution to produce action values. An action is a vector of floating-point numbers - they may for example represent the speeds of the motors rotating a robot’s arm:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">PPOPolicyContinuous</span><span class="p">(</span><span class="n">PPOPolicy</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">(</span><span class="n">state_dim</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">)</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">policy</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">state_dim</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Tanh</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">action_dim</span><span class="p">),</span>
        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">_get_action_distribution</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">dist</span><span class="p">.</span><span class="n">Distribution</span><span class="p">:</span>
        <span class="n">action_gaussian_params</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">policy</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>
        <span class="n">means</span><span class="p">,</span> <span class="n">log_stds</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">chunk</span><span class="p">(</span><span class="n">action_gaussian_params</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">stds</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">log_stds</span><span class="p">.</span><span class="n">clamp</span><span class="p">(</span><span class="o">-</span><span class="mi">20</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>

        <span class="k">return</span> <span class="n">dist</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="n">means</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">stds</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">get_action</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"state should have shape (T, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="si">}</span><span class="s">) != </span><span class="si">{</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span>
            <span class="p">)</span>

        <span class="n">actions_dist</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_action_distribution</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
        <span class="n">action</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">sample</span><span class="p">()</span>

        <span class="c1"># aggregate across the last dimension - we want the probability
</span>        <span class="c1"># of the joint distribution of actions for a given timestep
</span>        <span class="n">log_prob</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">action</span><span class="p">).</span><span class="nb">sum</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>

        <span class="k">assert</span> <span class="n">action</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="bp">self</span><span class="p">.</span><span class="n">action_dim</span><span class="p">)</span>
        <span class="k">assert</span> <span class="n">log_prob</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>

        <span class="k">return</span> <span class="n">action</span><span class="p">,</span> <span class="n">log_prob</span>

    <span class="k">def</span> <span class="nf">evaluate_actions</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">actions</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"states should have shape (T, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="si">}</span><span class="s">) != </span><span class="si">{</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span>
            <span class="p">)</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">.</span><span class="n">action_dim</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"actions should have shape (T, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">action_dim</span><span class="si">}</span><span class="s">) != </span><span class="si">{</span><span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span>
            <span class="p">)</span>

        <span class="n">actions_dist</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_action_distribution</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>

        <span class="n">log_probs</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">actions</span><span class="p">)</span>
        <span class="n">entropy</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">entropy</span><span class="p">()</span>

        <span class="c1"># aggregate across the last dimension - we want the probability
</span>        <span class="c1"># and the entropy of the joint distribution of actions for a given timestep
</span>        <span class="n">log_probs</span> <span class="o">=</span> <span class="n">log_probs</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">entropy</span> <span class="o">=</span> <span class="n">entropy</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>

        <span class="k">assert</span> <span class="n">log_probs</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>
        <span class="k">assert</span> <span class="n">entropy</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>

        <span class="k">return</span> <span class="n">log_probs</span><span class="p">,</span> <span class="n">entropy</span>
</code></pre></div></div>

<p>The policy for discrete action spaces, on the other hand, assumes that the policy network returns a single integer that represents a discrete action to be taken in the environment. An example of such an action is the index of an arrow key on a keyboard when we play one of the Atari games.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">PPOPolicyDiscrete</span><span class="p">(</span><span class="n">PPOPolicy</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">(</span><span class="n">state_dim</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">)</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">policy</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">state_dim</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Tanh</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">),</span>
        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">_get_action_distribution</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">dist</span><span class="p">.</span><span class="n">Distribution</span><span class="p">:</span>
        <span class="n">logits</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">policy</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>
        <span class="n">logits</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">functional</span><span class="p">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">logits</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">dist</span><span class="p">.</span><span class="n">Categorical</span><span class="p">(</span><span class="n">logits</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">get_action</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"State should have shape (T, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="si">}</span><span class="s">) != </span><span class="si">{</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span>
            <span class="p">)</span>

        <span class="n">actions_dist</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_action_distribution</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
        <span class="n">action</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">sample</span><span class="p">()</span>
        <span class="n">log_prob</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">action</span><span class="p">)</span>

        <span class="k">assert</span> <span class="n">action</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>
        <span class="k">assert</span> <span class="n">log_prob</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>

        <span class="k">return</span> <span class="n">action</span><span class="p">,</span> <span class="n">log_prob</span>

    <span class="k">def</span> <span class="nf">evaluate_actions</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">actions</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"states should have shape (T, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="si">}</span><span class="s">) != </span><span class="si">{</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span>
            <span class="p">)</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"actions should have shape (T,) != </span><span class="si">{</span><span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

        <span class="n">actions_dist</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_action_distribution</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>

        <span class="n">log_probs</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">actions</span><span class="p">)</span>
        <span class="n">entropy</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">entropy</span><span class="p">()</span>

        <span class="k">assert</span> <span class="n">log_probs</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>
        <span class="k">assert</span> <span class="n">entropy</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>

        <span class="k">return</span> <span class="n">log_probs</span><span class="p">,</span> <span class="n">entropy</span>
</code></pre></div></div>
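<p>To show where a <code class="language-plaintext highlighter-rouge">Trajectory</code> comes from - and why its log probabilities belong to the “old” policy - here is a minimal sketch of a rollout. The <code class="language-plaintext highlighter-rouge">collect_trajectory</code> helper is my own name rather than something from the paper, and the sketch assumes a continuous (Box) action space:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def collect_trajectory(
    policy: PPOPolicy, env: gym.Env, max_steps: int = 2048
) -> Trajectory:
    states, actions, rewards, log_probs, dones = [], [], [], [], []

    observation, _ = env.reset()
    for _ in range(max_steps):
        state = torch.as_tensor(observation, dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():
            # log probabilities are recorded at collection time,
            # i.e. they come from the "old" policy
            action, log_prob = policy.get_action(state)

        # a discrete action space would require casting the action to int instead
        observation, reward, terminated, truncated, _ = env.step(
            action.squeeze(0).numpy()
        )
        done = terminated or truncated

        states.append(state.squeeze(0))
        actions.append(action.squeeze(0))
        rewards.append(torch.tensor(float(reward)))
        log_probs.append(log_prob.squeeze(0))
        dones.append(torch.tensor(float(done)))

        if done:
            observation, _ = env.reset()

    return Trajectory(
        states=torch.stack(states),
        actions=torch.stack(actions),
        rewards=torch.stack(rewards),
        log_probs=torch.stack(log_probs),
        dones=torch.stack(dones),
    )
</code></pre></div></div>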

<h4 id="step-6-implementing-the-value-function">Step 6 Implementing the Value function</h4>

<p>The value function uses the same network architecture, with the difference that it returns a single floating-point value - the value the function assigns to the state (observation) of an environment.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">_PPOValue</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">value</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">state_dim</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Tanh</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
        <span class="n">values</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">value</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">values</span><span class="p">.</span><span class="n">squeeze</span><span class="p">(</span>
            <span class="o">-</span><span class="mi">1</span>
        <span class="p">)</span>  <span class="c1"># squeeze returns a tensor that's shaped just like the rewards tensor
</span></code></pre></div></div>
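
<p>A minimal shape-check sketch for the value network (hypothetical dimensions, illustration only): a batch of T states goes in and a flat tensor of T value estimates comes out, matching the shape of the rewards tensor.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical shape check - illustration only, not part of the final solution
value_fn = _PPOValue(state_dim=4)

states = torch.randn(16, 4)  # (T, state_dim) batch of observations
values = value_fn(states)    # nn.Module.__call__ dispatches to forward()

assert values.shape == (16,)  # same shape as a (T,) rewards tensor
</code></pre></div></div>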

<h3 id="complete-code">Complete code</h3>

<p>Here’s the complete solution:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">abc</span>
<span class="kn">import</span> <span class="nn">logging</span>
<span class="kn">import</span> <span class="nn">typing</span>

<span class="kn">import</span> <span class="nn">gymnasium</span> <span class="k">as</span> <span class="n">gym</span>
<span class="kn">from</span> <span class="nn">gymnasium.wrappers</span> <span class="kn">import</span> <span class="n">RecordVideo</span>
<span class="kn">import</span> <span class="nn">mlflow</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.distributions</span> <span class="k">as</span> <span class="n">dist</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">trange</span>  <span class="c1"># type: ignore
</span>

<span class="k">def</span> <span class="nf">compute_gae</span><span class="p">(</span>
    <span class="n">rewards</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span>
    <span class="n">values</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span>
    <span class="n">dones</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span>
    <span class="n">discount</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span>
    <span class="n">gae_lambda</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.95</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
    <span class="s">"""Compute Generalized Advantage Estimation."""</span>
    <span class="n">advantages</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">gae</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="mf">0.0</span><span class="p">)</span>

    <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">reversed</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">rewards</span><span class="p">))):</span>
        <span class="k">if</span> <span class="n">t</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">rewards</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">:</span>
            <span class="n">next_value</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span>
                <span class="mf">0.0</span>
            <span class="p">)</span>  <span class="c1"># Bootstrap value (or get V(s_T) if not terminal)
</span>        <span class="k">else</span><span class="p">:</span>
            <span class="n">next_value</span> <span class="o">=</span> <span class="n">values</span><span class="p">[</span><span class="n">t</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span>

        <span class="c1"># Mask next_value if episode ended
</span>        <span class="n">next_value</span> <span class="o">=</span> <span class="n">next_value</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">dones</span><span class="p">[</span><span class="n">t</span><span class="p">])</span>

        <span class="c1"># δ_t = r_t + γV(s_{t+1}) - V(s_t)
</span>        <span class="n">delta</span> <span class="o">=</span> <span class="n">rewards</span><span class="p">[</span><span class="n">t</span><span class="p">]</span> <span class="o">+</span> <span class="n">discount</span> <span class="o">*</span> <span class="n">next_value</span> <span class="o">-</span> <span class="n">values</span><span class="p">[</span><span class="n">t</span><span class="p">]</span>

        <span class="c1"># A_t = δ_t + (γλ) δ_{t+1} + ... + (γλ)_{T-t+1}δ_{T-1}
</span>        <span class="n">gae</span> <span class="o">=</span> <span class="n">delta</span> <span class="o">+</span> <span class="n">discount</span> <span class="o">*</span> <span class="n">gae_lambda</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">dones</span><span class="p">[</span><span class="n">t</span><span class="p">])</span> <span class="o">*</span> <span class="n">gae</span>
        <span class="n">advantages</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">gae</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">FloatTensor</span><span class="p">(</span><span class="n">advantages</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">PPOPolicy</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">,</span> <span class="n">abc</span><span class="p">.</span><span class="n">ABC</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span> <span class="o">=</span> <span class="n">state_dim</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">action_dim</span> <span class="o">=</span> <span class="n">action_dim</span>

    <span class="o">@</span><span class="n">abc</span><span class="p">.</span><span class="n">abstractmethod</span>
    <span class="k">def</span> <span class="nf">get_action</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">pass</span>

    <span class="o">@</span><span class="n">abc</span><span class="p">.</span><span class="n">abstractmethod</span>
    <span class="k">def</span> <span class="nf">evaluate_actions</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">actions</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">pass</span>


<span class="k">class</span> <span class="nc">PPOPolicyContinuous</span><span class="p">(</span><span class="n">PPOPolicy</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">(</span><span class="n">state_dim</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">)</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">policy</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">state_dim</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Tanh</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">action_dim</span><span class="p">),</span>
        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">_get_action_distribution</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">dist</span><span class="p">.</span><span class="n">Distribution</span><span class="p">:</span>
        <span class="n">action_gaussian_params</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">policy</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>
        <span class="n">means</span><span class="p">,</span> <span class="n">log_stds</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">chunk</span><span class="p">(</span><span class="n">action_gaussian_params</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">stds</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">log_stds</span><span class="p">.</span><span class="n">clamp</span><span class="p">(</span><span class="o">-</span><span class="mi">20</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>

        <span class="k">return</span> <span class="n">dist</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="n">means</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">stds</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">get_action</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"state should have shape (T, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="si">}</span><span class="s">) != </span><span class="si">{</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span>
            <span class="p">)</span>

        <span class="n">actions_dist</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_action_distribution</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
        <span class="n">action</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">sample</span><span class="p">()</span>

        <span class="c1"># aggregate across the last dimension - we want the probability
</span>        <span class="c1"># of the joint distribution of actions for a given timestep
</span>        <span class="n">log_prob</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">action</span><span class="p">).</span><span class="nb">sum</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>

        <span class="k">assert</span> <span class="n">action</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="bp">self</span><span class="p">.</span><span class="n">action_dim</span><span class="p">)</span>
        <span class="k">assert</span> <span class="n">log_prob</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>

        <span class="k">return</span> <span class="n">action</span><span class="p">,</span> <span class="n">log_prob</span>

    <span class="k">def</span> <span class="nf">evaluate_actions</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">actions</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"states should have shape (T, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="si">}</span><span class="s">) != </span><span class="si">{</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span>
            <span class="p">)</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">.</span><span class="n">action_dim</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"actions should have shape (T, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">action_dim</span><span class="si">}</span><span class="s">) != </span><span class="si">{</span><span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span>
            <span class="p">)</span>

        <span class="n">actions_dist</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_action_distribution</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>

        <span class="n">log_probs</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">actions</span><span class="p">)</span>
        <span class="n">entropy</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">entropy</span><span class="p">()</span>

        <span class="c1"># aggregate across the last dimension - we want the probability
</span>        <span class="c1"># and the entropy of the joint distribution of actions for a given timestep
</span>        <span class="n">log_probs</span> <span class="o">=</span> <span class="n">log_probs</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">entropy</span> <span class="o">=</span> <span class="n">entropy</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>

        <span class="k">assert</span> <span class="n">log_probs</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>
        <span class="k">assert</span> <span class="n">entropy</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>

        <span class="k">return</span> <span class="n">log_probs</span><span class="p">,</span> <span class="n">entropy</span>


<span class="k">class</span> <span class="nc">PPOPolicyDiscrete</span><span class="p">(</span><span class="n">PPOPolicy</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">(</span><span class="n">state_dim</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">)</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">policy</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">state_dim</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Tanh</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="n">action_dim</span><span class="p">),</span>
        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">_get_action_distribution</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">dist</span><span class="p">.</span><span class="n">Distribution</span><span class="p">:</span>
        <span class="n">logits</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">policy</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>
        <span class="n">logits</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">functional</span><span class="p">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">logits</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">dist</span><span class="p">.</span><span class="n">Categorical</span><span class="p">(</span><span class="n">logits</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">get_action</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"State should have shape (T, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="si">}</span><span class="s">) != </span><span class="si">{</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span>
            <span class="p">)</span>

        <span class="n">actions_dist</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_action_distribution</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
        <span class="n">action</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">sample</span><span class="p">()</span>
        <span class="n">log_prob</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">action</span><span class="p">)</span>

        <span class="k">assert</span> <span class="n">action</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>
        <span class="k">assert</span> <span class="n">log_prob</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>

        <span class="k">return</span> <span class="n">action</span><span class="p">,</span> <span class="n">log_prob</span>

    <span class="k">def</span> <span class="nf">evaluate_actions</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">actions</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"states should have shape (T, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">state_dim</span><span class="si">}</span><span class="s">) != </span><span class="si">{</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span>
            <span class="p">)</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"actions should have shape (T,) != </span><span class="si">{</span><span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

        <span class="n">actions_dist</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_action_distribution</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>

        <span class="n">log_probs</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">actions</span><span class="p">)</span>
        <span class="n">entropy</span> <span class="o">=</span> <span class="n">actions_dist</span><span class="p">.</span><span class="n">entropy</span><span class="p">()</span>

        <span class="k">assert</span> <span class="n">log_probs</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>
        <span class="k">assert</span> <span class="n">entropy</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>

        <span class="k">return</span> <span class="n">log_probs</span><span class="p">,</span> <span class="n">entropy</span>


<span class="k">class</span> <span class="nc">_PPOValue</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">value</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">state_dim</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Tanh</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
        <span class="n">values</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">value</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">values</span><span class="p">.</span><span class="n">squeeze</span><span class="p">(</span>
            <span class="o">-</span><span class="mi">1</span>
        <span class="p">)</span>  <span class="c1"># squeeze returns a tensor that's shaped just like the rewards tensor
</span>

<span class="k">class</span> <span class="nc">Trajectory</span><span class="p">(</span><span class="n">typing</span><span class="p">.</span><span class="n">NamedTuple</span><span class="p">):</span>
    <span class="n">states</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="n">actions</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="n">rewards</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="n">log_probs</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>
    <span class="n">dones</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span>

    <span class="k">def</span> <span class="nf">enable_grad</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">dtype</span><span class="p">.</span><span class="n">is_floating_point</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">actions</span><span class="p">.</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">actions</span><span class="p">.</span><span class="n">dtype</span><span class="p">.</span><span class="n">is_floating_point</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">log_probs</span><span class="p">.</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">log_probs</span><span class="p">.</span><span class="n">dtype</span><span class="p">.</span><span class="n">is_floating_point</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">rewards</span><span class="p">.</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">rewards</span><span class="p">.</span><span class="n">dtype</span><span class="p">.</span><span class="n">is_floating_point</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dones</span><span class="p">.</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">dones</span><span class="p">.</span><span class="n">dtype</span><span class="p">.</span><span class="n">is_floating_point</span>

    <span class="k">def</span> <span class="nf">__len__</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">states</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">return</span> <span class="mi">0</span>

    <span class="o">@</span><span class="nb">staticmethod</span>
    <span class="k">def</span> <span class="nf">concat</span><span class="p">(</span><span class="n">lhs</span><span class="p">:</span> <span class="n">typing</span><span class="p">.</span><span class="n">Optional</span><span class="p">[</span><span class="s">"Trajectory"</span><span class="p">],</span> <span class="n">rhs</span><span class="p">:</span> <span class="s">"Trajectory"</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="s">"Trajectory"</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">lhs</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">rhs</span>

        <span class="k">return</span> <span class="n">Trajectory</span><span class="p">(</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">lhs</span><span class="p">.</span><span class="n">states</span><span class="p">,</span> <span class="n">rhs</span><span class="p">.</span><span class="n">states</span><span class="p">]),</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">lhs</span><span class="p">.</span><span class="n">actions</span><span class="p">,</span> <span class="n">rhs</span><span class="p">.</span><span class="n">actions</span><span class="p">]),</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">lhs</span><span class="p">.</span><span class="n">rewards</span><span class="p">,</span> <span class="n">rhs</span><span class="p">.</span><span class="n">rewards</span><span class="p">]),</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">lhs</span><span class="p">.</span><span class="n">log_probs</span><span class="p">,</span> <span class="n">rhs</span><span class="p">.</span><span class="n">log_probs</span><span class="p">]),</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">lhs</span><span class="p">.</span><span class="n">dones</span><span class="p">,</span> <span class="n">rhs</span><span class="p">.</span><span class="n">dones</span><span class="p">]),</span>
        <span class="p">)</span>


<span class="k">class</span> <span class="nc">_PPOAgent</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">policy</span><span class="p">:</span> <span class="n">PPOPolicy</span><span class="p">,</span> <span class="n">value_fn</span><span class="p">:</span> <span class="n">_PPOValue</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">policy</span> <span class="o">=</span> <span class="n">policy</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">value_fn</span> <span class="o">=</span> <span class="n">value_fn</span>


<span class="k">def</span> <span class="nf">ppo_loss</span><span class="p">(</span>
    <span class="n">agent</span><span class="p">:</span> <span class="n">_PPOAgent</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">:</span> <span class="n">Trajectory</span><span class="p">,</span> <span class="n">discount</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">clip_eps</span><span class="p">:</span> <span class="nb">float</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
    <span class="c1"># LCLIP + L_VF + L_St(θ) = E_t[LCLIP_t(θ) − c_1 L_t^VF(θ) + c_2 S[πθ](st)]
</span>    <span class="n">log_probs</span><span class="p">,</span> <span class="n">entropies</span> <span class="o">=</span> <span class="n">agent</span><span class="p">.</span><span class="n">policy</span><span class="p">.</span><span class="n">evaluate_actions</span><span class="p">(</span>
        <span class="n">trajectory</span><span class="p">.</span><span class="n">states</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">.</span><span class="n">actions</span>
    <span class="p">)</span>
    <span class="n">values</span> <span class="o">=</span> <span class="n">agent</span><span class="p">.</span><span class="n">value_fn</span><span class="p">(</span><span class="n">trajectory</span><span class="p">.</span><span class="n">states</span><span class="p">)</span>
    <span class="n">advantages</span> <span class="o">=</span> <span class="n">compute_gae</span><span class="p">(</span><span class="n">trajectory</span><span class="p">.</span><span class="n">rewards</span><span class="p">,</span> <span class="n">values</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">.</span><span class="n">dones</span><span class="p">,</span> <span class="n">discount</span><span class="p">)</span>
    <span class="n">returns</span> <span class="o">=</span> <span class="n">advantages</span> <span class="o">+</span> <span class="n">values</span>
    <span class="c1"># Normalize advantages to stabilize training
</span>    <span class="n">advantages</span> <span class="o">=</span> <span class="p">(</span><span class="n">advantages</span> <span class="o">-</span> <span class="n">advantages</span><span class="p">.</span><span class="n">mean</span><span class="p">())</span> <span class="o">/</span> <span class="p">(</span><span class="n">advantages</span><span class="p">.</span><span class="n">std</span><span class="p">()</span> <span class="o">+</span> <span class="mf">1e-8</span><span class="p">)</span>

    <span class="k">assert</span> <span class="n">log_probs</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">trajectory</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>
    <span class="k">assert</span> <span class="n">trajectory</span><span class="p">.</span><span class="n">log_probs</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">trajectory</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)</span>
    <span class="n">ratio</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">log_probs</span> <span class="o">-</span> <span class="n">trajectory</span><span class="p">.</span><span class="n">log_probs</span><span class="p">)</span>
    <span class="n">policy_loss</span> <span class="o">=</span> <span class="o">-</span><span class="n">torch</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span>
        <span class="n">ratio</span> <span class="o">*</span> <span class="n">advantages</span><span class="p">,</span>
        <span class="n">torch</span><span class="p">.</span><span class="n">clip</span><span class="p">(</span><span class="n">ratio</span><span class="p">,</span> <span class="mf">1.0</span> <span class="o">-</span> <span class="n">clip_eps</span><span class="p">,</span> <span class="mf">1.0</span> <span class="o">+</span> <span class="n">clip_eps</span><span class="p">)</span> <span class="o">*</span> <span class="n">advantages</span><span class="p">,</span>
    <span class="p">).</span><span class="n">mean</span><span class="p">()</span>
    <span class="n">value_loss</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="p">(</span><span class="n">returns</span> <span class="o">-</span> <span class="n">values</span><span class="p">).</span><span class="nb">pow</span><span class="p">(</span><span class="mi">2</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span>
    <span class="n">c_2</span> <span class="o">=</span> <span class="mf">0.01</span>
    <span class="n">entropy_loss</span> <span class="o">=</span> <span class="n">c_2</span> <span class="o">*</span> <span class="n">entropies</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">policy_loss</span> <span class="o">+</span> <span class="n">value_loss</span> <span class="o">-</span> <span class="n">entropy_loss</span>

    <span class="k">return</span> <span class="n">loss</span>


<span class="k">def</span> <span class="nf">train_one_trajectory</span><span class="p">(</span>
    <span class="n">agent</span><span class="p">:</span> <span class="n">_PPOAgent</span><span class="p">,</span>
    <span class="n">optimizer</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">Optimizer</span><span class="p">,</span>
    <span class="n">trajectory</span><span class="p">:</span> <span class="n">Trajectory</span><span class="p">,</span>
    <span class="n">num_updates</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
    <span class="n">discount</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.95</span><span class="p">,</span>
    <span class="n">clip_eps</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.2</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
    <span class="s">"""Trains the PPO policy and value networks on a single trajectory.

    Multiple steps of training are performed, the number defined by the `num_updates` parameter.
    After calling this function, `trajectory` should be discarded and a new trajectory should be
    sampled.
    """</span>
    <span class="n">trajectory</span><span class="p">.</span><span class="n">enable_grad</span><span class="p">()</span>

    <span class="n">losses</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">trange</span><span class="p">(</span><span class="n">num_updates</span><span class="p">,</span> <span class="n">desc</span><span class="o">=</span><span class="s">"PPO update step"</span><span class="p">,</span> <span class="n">leave</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
        <span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>

        <span class="n">loss</span> <span class="o">=</span> <span class="n">ppo_loss</span><span class="p">(</span><span class="n">agent</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">,</span> <span class="n">discount</span><span class="p">,</span> <span class="n">clip_eps</span><span class="p">)</span>
        <span class="n">losses</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">loss</span><span class="p">)</span>

        <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
        <span class="c1"># TODO: does gradient clipping fix training?
</span>        <span class="c1"># nn.utils.clip_grad_norm_(agent.parameters(), args.max_grad_norm)
</span>        <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>

    <span class="k">return</span> <span class="n">losses</span>


<span class="k">def</span> <span class="nf">rollout_episode</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">policy</span><span class="p">:</span> <span class="n">PPOPolicy</span><span class="p">,</span> <span class="n">max_trajectory_len</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Trajectory</span><span class="p">:</span>
    <span class="n">states</span><span class="p">,</span> <span class="n">rewards</span><span class="p">,</span> <span class="n">actions</span><span class="p">,</span> <span class="n">log_probs</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">[],</span> <span class="p">[],</span> <span class="p">[]</span>
    <span class="n">dones</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">state</span> <span class="o">=</span> <span class="n">env</span><span class="p">.</span><span class="n">reset</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">done</span> <span class="o">=</span> <span class="bp">False</span>

    <span class="n">num_steps</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
        <span class="k">while</span> <span class="ow">not</span> <span class="n">done</span> <span class="ow">and</span> <span class="n">num_steps</span> <span class="o">&lt;</span> <span class="n">max_trajectory_len</span><span class="p">:</span>
            <span class="n">state_t</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>

            <span class="n">one_state_in_batch_t</span> <span class="o">=</span> <span class="n">state_t</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

            <span class="n">action_t</span><span class="p">,</span> <span class="n">log_prob_t</span> <span class="o">=</span> <span class="n">policy</span><span class="p">.</span><span class="n">get_action</span><span class="p">(</span><span class="n">one_state_in_batch_t</span><span class="p">)</span>
            <span class="c1"># Extract the only action and log probability from the batch tensor
</span>            <span class="n">log_prob_f</span> <span class="o">=</span> <span class="n">log_prob_t</span><span class="p">.</span><span class="n">detach</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span>
            <span class="n">action_f</span> <span class="o">=</span> <span class="n">action_t</span><span class="p">.</span><span class="n">detach</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span>

            <span class="n">next_state</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">done</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">env</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">action_f</span><span class="p">.</span><span class="n">numpy</span><span class="p">())</span>
            <span class="n">states</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">state_t</span><span class="p">)</span>
            <span class="n">actions</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">action_f</span><span class="p">)</span>
            <span class="n">rewards</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">reward</span><span class="p">)</span>
            <span class="n">log_probs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">log_prob_f</span><span class="p">)</span>
            <span class="n">dones</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="mi">1</span> <span class="k">if</span> <span class="n">done</span> <span class="k">else</span> <span class="mi">0</span><span class="p">)</span>
            <span class="n">state</span> <span class="o">=</span> <span class="n">next_state</span>

            <span class="n">num_steps</span> <span class="o">+=</span> <span class="mi">1</span>

    <span class="k">if</span> <span class="ow">not</span> <span class="n">states</span><span class="p">:</span>
        <span class="k">raise</span> <span class="nb">RuntimeError</span><span class="p">(</span><span class="s">"No trajectory rolled out"</span><span class="p">)</span>

    <span class="n">states_t</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
    <span class="n">actions_t</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span><span class="n">actions</span><span class="p">)</span>
    <span class="n">rewards_t</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">rewards</span><span class="p">)</span>
    <span class="n">log_probs_t</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span><span class="n">log_probs</span><span class="p">)</span>
    <span class="n">dones_t</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">dones</span><span class="p">)</span>

    <span class="k">assert</span> <span class="n">states_t</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">num_steps</span><span class="p">,</span> <span class="n">policy</span><span class="p">.</span><span class="n">state_dim</span><span class="p">)</span>
    <span class="k">assert</span> <span class="n">rewards_t</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">num_steps</span><span class="p">,)</span>
    <span class="k">assert</span> <span class="n">log_probs_t</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">num_steps</span><span class="p">,)</span>
    <span class="k">assert</span> <span class="n">dones_t</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">num_steps</span><span class="p">,)</span>
    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span> <span class="n">PPOPolicyContinuous</span><span class="p">):</span>
        <span class="k">assert</span> <span class="n">actions_t</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">num_steps</span><span class="p">,</span> <span class="n">policy</span><span class="p">.</span><span class="n">action_dim</span><span class="p">)</span>
    <span class="k">elif</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span> <span class="n">PPOPolicyDiscrete</span><span class="p">):</span>
        <span class="k">assert</span> <span class="n">actions_t</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">num_steps</span><span class="p">,)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">raise</span> <span class="nb">TypeError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Unsupported policy type </span><span class="si">{</span><span class="nb">type</span><span class="p">(</span><span class="n">policy</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">Trajectory</span><span class="p">(</span><span class="n">states_t</span><span class="p">,</span> <span class="n">actions_t</span><span class="p">,</span> <span class="n">rewards_t</span><span class="p">,</span> <span class="n">log_probs_t</span><span class="p">,</span> <span class="n">dones_t</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">train</span><span class="p">(</span>
    <span class="n">env</span><span class="p">,</span>
    <span class="n">policy</span><span class="p">:</span> <span class="n">PPOPolicy</span><span class="p">,</span>
    <span class="n">num_updates_per_epoch</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
    <span class="n">num_episodes_per_epoch</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
    <span class="n">num_epochs</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
    <span class="n">max_trajectory_len</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
    <span class="n">discount</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.95</span><span class="p">,</span>
    <span class="n">clip_eps</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.2</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>

    <span class="c1"># TODO: how to implement learning rate decay?
</span>
    <span class="n">value_fn</span> <span class="o">=</span> <span class="n">_PPOValue</span><span class="p">(</span><span class="n">policy</span><span class="p">.</span><span class="n">state_dim</span><span class="p">)</span>
    <span class="n">agent</span> <span class="o">=</span> <span class="n">_PPOAgent</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span> <span class="n">value_fn</span><span class="p">)</span>
    <span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">agent</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">3e-4</span><span class="p">)</span>

    <span class="n">experiment</span> <span class="o">=</span> <span class="n">mlflow</span><span class="p">.</span><span class="n">set_experiment</span><span class="p">(</span><span class="s">"PPO training"</span><span class="p">)</span>
    <span class="k">with</span> <span class="n">mlflow</span><span class="p">.</span><span class="n">start_run</span><span class="p">(</span><span class="n">experiment_id</span><span class="o">=</span><span class="n">experiment</span><span class="p">.</span><span class="n">experiment_id</span><span class="p">):</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"torch_version"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">__version__</span><span class="p">)</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"cuda_available"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">())</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"num_updates_per_epoch"</span><span class="p">,</span> <span class="n">num_updates_per_epoch</span><span class="p">)</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"num_episodes_per_epoch"</span><span class="p">,</span> <span class="n">num_episodes_per_epoch</span><span class="p">)</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"num_epochs"</span><span class="p">,</span> <span class="n">num_epochs</span><span class="p">)</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"max_trajectory_len"</span><span class="p">,</span> <span class="n">max_trajectory_len</span><span class="p">)</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"discount"</span><span class="p">,</span> <span class="n">discount</span><span class="p">)</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"clip_eps"</span><span class="p">,</span> <span class="n">clip_eps</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">():</span>
            <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"cuda_device_count"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">device_count</span><span class="p">())</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"cuda_device_name"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">get_device_name</span><span class="p">(</span><span class="mi">0</span><span class="p">))</span>
            <span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
                <span class="k">pass</span>

        <span class="n">total_params</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">numel</span><span class="p">()</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">agent</span><span class="p">.</span><span class="n">parameters</span><span class="p">())</span>
        <span class="n">trainable_params</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">numel</span><span class="p">()</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">agent</span><span class="p">.</span><span class="n">parameters</span><span class="p">()</span> <span class="k">if</span> <span class="n">p</span><span class="p">.</span><span class="n">requires_grad</span><span class="p">)</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"total_parameters"</span><span class="p">,</span> <span class="n">total_params</span><span class="p">)</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"trainable_parameters"</span><span class="p">,</span> <span class="n">trainable_params</span><span class="p">)</span>
        <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span>
            <span class="s">"trainable_percentage"</span><span class="p">,</span> <span class="mf">100.0</span> <span class="o">*</span> <span class="n">trainable_params</span> <span class="o">/</span> <span class="n">total_params</span>
        <span class="p">)</span>

        <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="n">trange</span><span class="p">(</span><span class="n">num_epochs</span><span class="p">,</span> <span class="n">desc</span><span class="o">=</span><span class="s">"train"</span><span class="p">,</span> <span class="n">leave</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>

            <span class="n">trajectory</span> <span class="o">=</span> <span class="bp">None</span>
            <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">trange</span><span class="p">(</span>
                <span class="n">num_episodes_per_epoch</span><span class="p">,</span> <span class="n">desc</span><span class="o">=</span><span class="s">"Collecting trajectory"</span><span class="p">,</span> <span class="n">leave</span><span class="o">=</span><span class="bp">False</span>
            <span class="p">):</span>
                <span class="n">traj_step</span> <span class="o">=</span> <span class="n">rollout_episode</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">policy</span><span class="p">,</span> <span class="n">max_trajectory_len</span><span class="p">)</span>
                <span class="n">trajectory</span> <span class="o">=</span> <span class="n">Trajectory</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">trajectory</span><span class="p">,</span> <span class="n">traj_step</span><span class="p">)</span>

                <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">trajectory</span><span class="p">)</span> <span class="o">&gt;</span> <span class="n">max_trajectory_len</span><span class="p">:</span>
                    <span class="k">break</span>

            <span class="k">if</span> <span class="n">trajectory</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
                <span class="n">logging</span><span class="p">.</span><span class="n">warning</span><span class="p">(</span><span class="s">"Trajectory not collected"</span><span class="p">)</span>
                <span class="k">return</span>

            <span class="n">mlflow</span><span class="p">.</span><span class="n">log_metric</span><span class="p">(</span><span class="s">"traj_len"</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">step</span><span class="o">=</span><span class="n">epoch</span><span class="p">)</span>
            <span class="n">losses</span> <span class="o">=</span> <span class="n">train_one_trajectory</span><span class="p">(</span>
                <span class="n">agent</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">,</span> <span class="n">num_updates_per_epoch</span><span class="p">,</span> <span class="n">discount</span><span class="p">,</span> <span class="n">clip_eps</span>
            <span class="p">)</span>

            <span class="n">losses_t</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span><span class="n">losses</span><span class="p">)</span>
            <span class="n">mlflow</span><span class="p">.</span><span class="n">log_metric</span><span class="p">(</span><span class="s">"loss"</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">losses_t</span><span class="p">.</span><span class="n">mean</span><span class="p">()),</span> <span class="n">step</span><span class="o">=</span><span class="n">epoch</span><span class="p">)</span>

        <span class="c1"># evaluation
</span>        <span class="c1"># TODO: extract to a separate function - but it will require extracting mlflow initialization
</span>        <span class="c1"># to log artifacts to the same run
</span>        <span class="n">env_eval</span><span class="p">:</span> <span class="n">RecordVideo</span> <span class="o">=</span> <span class="n">RecordVideo</span><span class="p">(</span>
            <span class="n">env</span><span class="p">,</span>
            <span class="n">video_folder</span><span class="o">=</span><span class="s">"./videos/"</span><span class="p">,</span>
            <span class="n">episode_trigger</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="p">(</span><span class="n">x</span> <span class="o">%</span> <span class="mi">10</span> <span class="o">==</span> <span class="mi">0</span><span class="p">),</span>  <span class="c1"># record every 10th episode
</span>            <span class="n">name_prefix</span><span class="o">=</span><span class="s">"ppo-lunarlander"</span><span class="p">,</span>
        <span class="p">)</span>
        <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
            <span class="k">for</span> <span class="n">eval_epoch</span> <span class="ow">in</span> <span class="n">trange</span><span class="p">(</span><span class="n">num_epochs</span><span class="p">,</span> <span class="n">desc</span><span class="o">=</span><span class="s">"eval"</span><span class="p">,</span> <span class="n">leave</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
                <span class="n">trajectory</span> <span class="o">=</span> <span class="n">rollout_episode</span><span class="p">(</span><span class="n">env_eval</span><span class="p">,</span> <span class="n">policy</span><span class="p">,</span> <span class="n">max_trajectory_len</span><span class="p">)</span>
                <span class="n">loss</span> <span class="o">=</span> <span class="n">ppo_loss</span><span class="p">(</span><span class="n">agent</span><span class="p">,</span> <span class="n">trajectory</span><span class="p">,</span> <span class="n">discount</span><span class="p">,</span> <span class="n">clip_eps</span><span class="p">)</span>
                <span class="n">mlflow</span><span class="p">.</span><span class="n">log_metric</span><span class="p">(</span><span class="s">"eval loss"</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">loss</span><span class="p">),</span> <span class="n">step</span><span class="o">=</span><span class="n">eval_epoch</span><span class="p">)</span>



<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">env</span> <span class="o">=</span> <span class="n">gym</span><span class="p">.</span><span class="n">make</span><span class="p">(</span><span class="s">"LunarLander-v3"</span><span class="p">,</span> <span class="n">render_mode</span><span class="o">=</span><span class="s">"rgb_array"</span><span class="p">)</span>
    <span class="n">action_dim</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">env</span><span class="p">.</span><span class="n">action_space</span><span class="p">.</span><span class="n">n</span><span class="p">)</span>
    <span class="n">state_dim</span> <span class="o">=</span> <span class="n">env</span><span class="p">.</span><span class="n">reset</span><span class="p">()[</span><span class="mi">0</span><span class="p">].</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

    <span class="n">policy</span> <span class="o">=</span> <span class="n">PPOPolicyDiscrete</span><span class="p">(</span><span class="n">state_dim</span><span class="o">=</span><span class="n">state_dim</span><span class="p">,</span> <span class="n">action_dim</span><span class="o">=</span><span class="n">action_dim</span><span class="p">)</span>

    <span class="n">train</span><span class="p">(</span>
        <span class="n">env</span><span class="p">,</span>
        <span class="n">policy</span><span class="p">,</span>
        <span class="n">num_updates_per_epoch</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
        <span class="n">num_episodes_per_epoch</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
        <span class="n">num_epochs</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span>
        <span class="n">max_trajectory_len</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span>
    <span class="p">)</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code></pre></div></div>
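
<p>The training loop logs its parameters and metrics to MLflow. To browse them after (or during) a run, you can start the local MLflow UI - a minimal sketch, assuming the default file-based tracking store in <code class="language-plaintext highlighter-rouge">./mlruns</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># run from the directory where the training script was started,
# then open http://localhost:5000 in a browser
mlflow ui
</code></pre></div></div>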

<p>It was written and tested with Python 3.14 and uses the following dependencies:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">gymnasium</span><span class="p">[</span><span class="n">box2d</span><span class="p">,</span><span class="n">other</span><span class="p">]</span><span class="o">&gt;=</span><span class="mf">1.2</span><span class="p">.</span><span class="mi">3</span>
<span class="n">mlflow</span><span class="o">&gt;=</span><span class="mf">3.3</span><span class="p">.</span><span class="mi">1</span>
<span class="n">pydantic</span><span class="o">&gt;=</span><span class="mf">2.11</span><span class="p">.</span><span class="mi">7</span>
<span class="n">swig</span><span class="o">&gt;=</span><span class="mf">4.4</span><span class="p">.</span><span class="mi">1</span>
<span class="n">torch</span><span class="o">&gt;=</span><span class="mf">2.9</span><span class="p">.</span><span class="mi">0</span>
<span class="n">tqdm</span><span class="o">&gt;=</span><span class="mf">4.67</span><span class="p">.</span><span class="mi">1</span>
</code></pre></div></div>
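
<p>A minimal way to install these dependencies with <code class="language-plaintext highlighter-rouge">pip</code> could look like the sketch below - the exact commands are an assumption, so adapt them to your environment. <code class="language-plaintext highlighter-rouge">swig</code> goes first because the Box2D extension needs it at build time:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># install the build-time dependency first
python -m pip install "swig&gt;=4.4.1"

# then the remaining packages
python -m pip install "gymnasium[box2d,other]&gt;=1.2.3" "mlflow&gt;=3.3.1" \
    "pydantic&gt;=2.11.7" "torch&gt;=2.9.0" "tqdm&gt;=4.67.1"
</code></pre></div></div>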

<h2 id="conclusion">Conclusion</h2>

<p>I’d love to hear your comments and thoughts on the topic.</p>

<p>The point here was to show that implementing a Deep Learning algorithm from a whitepaper is not a straightforward task, and to encourage you to try it yourself.</p>]]></content><author><name>Piotr Trochim</name></author><category term="ml" /><category term="rl" /><summary type="html"><![CDATA[How difficult is it really?]]></summary></entry><entry><title type="html">What Language Leaves Behind</title><link href="https://piotrtrochim.net/2025/12/12/what-language-leaves-behind/" rel="alternate" type="text/html" title="What Language Leaves Behind" /><published>2025-12-12T12:38:36+00:00</published><updated>2025-12-12T12:38:36+00:00</updated><id>https://piotrtrochim.net/2025/12/12/what-language-leaves-behind</id><content type="html" xml:base="https://piotrtrochim.net/2025/12/12/what-language-leaves-behind/"><![CDATA[<blockquote>
  <p>Language compresses thought. What gets lost?</p>
</blockquote>

<p><a href="https://substackcdn.com/image/fetch/$s_!ZMgV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc90630-e7fa-47bc-b02c-79acde8b1cde_2572x1165.jpeg"><img src="https://substackcdn.com/image/fetch/$s_!ZMgV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc90630-e7fa-47bc-b02c-79acde8b1cde_2572x1165.jpeg" alt="" /></a></p>

<p>Every thought you’ve ever communicated has been compressed.</p>

<p>Written and spoken language is a representation of human thought—one of many, alongside images, music, architecture, gesture, clothing. But unlike a painting or a melody, language carries an illusion of precision. We treat words as if they perfectly capture what we mean.</p>

<p>They don’t.</p>

<p>Language is a codec: it encodes the high-dimensional, tangled, contextual thing happening in your mind into a low-dimensional stream of symbols. And like every codec, it’s lossy. Information is discarded. Structure is flattened. Nuance bleeds away.</p>

<p>Worse, the decompression happens on the other end—in someone else’s mind, using their priors, their context, their understanding of what your words were supposed to mean.</p>

<p>This isn’t a flaw in how we use language. It’s a flaw in what language <em>is</em>.</p>

<h2 id="watching-the-compression-happen">Watching the Compression Happen</h2>

<p>Summarization makes this visible.</p>

<p>Take the <a href="https://en.wikipedia.org/wiki/Addition">Wikipedia article on </a><em><a href="https://en.wikipedia.org/wiki/Addition">addition</a></em>. If I asked you to summarize it in two sentences, you’d have to choose: which parts of your understanding deserve to survive the compression?</p>

<p>Here’s my instinctive attempt:</p>

<blockquote>
  <p><strong>Summary 1:</strong> Addition is an arithmetic operation that combines two or more numbers to produce their total or sum, typically denoted with the + sign.</p>
</blockquote>

<p>And here’s what I wrote when I deliberately reached for different information:</p>

<blockquote>
  <p><strong>Summary 2:</strong> Addition is an operation defined for all kinds of numbers, is associative and commutative, and 0 is its identity element.</p>
</blockquote>

<p>Same source. Same length constraint. Completely different outputs.</p>

<p>The difference isn’t in the facts—both are accurate. The difference is in <em>which thoughts I chose to encode</em>. My first summary emerged from resonance: the overlap between the webpage and my existing mental model. It felt like the “main point” because it matched what I already believed about addition.</p>

<p>My second summary required effort. I had to override my instincts and select information that wouldn’t naturally surface.</p>

<p>This is the compression in action. The thought “what addition means to me” is vast and associative. The two-sentence summary is a narrow channel. Something has to be thrown away.</p>

<h2 id="the-illusion-of-consensus">The Illusion of Consensus</h2>

<p>What’s remarkable is how reliably people throw away the same things.</p>

<p>I ran the same prompt through several LLMs:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Prompt: Summarize addition defined on this website https://en.wikipedia.org/wiki/Addition in 2 sentences
</code></pre></div></div>

<p><strong>Claude 4.5:</strong> Addition is one of the four basic operations of arithmetic that combines two or more numbers (called addends or summands) to produce their total or sum, typically denoted with the plus sign (+).</p>

<p><strong>GPT 5.0:</strong> Addition is a mathematical operation that combines two numbers, called addends, to produce a sum. It is one of the basic arithmetic operations and is widely used in various fields of mathematics, science, and everyday life.</p>

<p><strong>Gemini 3.0:</strong> Addition, typically symbolized by the plus sign (+), is one of the four core arithmetic operations, where the sum of two whole numbers represents the total when those values are combined. It is a fundamental concept in mathematics that is both commutative and associative.</p>

<p>All three converge on essentially the same framing as my instinctive summary. They share a prior—trained into them by human preferences—about what “important” means, what “summary” means, what deserves to survive the compression.</p>

<p>This feels like agreement. It’s actually shared bias.</p>

<h2 id="the-information-that-doesnt-exist">The Information That Doesn’t Exist</h2>

<p>Here are two more summaries of that same Wikipedia page:</p>

<blockquote>
  <p><strong>Summary 3:</strong> The webpage describes a mathematical operation that children as young as 5 months, and animals of some species, are capable of performing.</p>

  <p><strong>Summary 4:</strong> The webpage mentions the word “three” 16 times and the word “addition” 181 times.</p>
</blockquote>

<p>Both are accurate. Both are valid compressions. And both probably feel <em>wrong</em> to you—like they’ve violated some unspoken rule about what a summary should contain.</p>

<p>That feeling of wrongness? That’s your prior asserting itself. Your mental codec has a built-in definition of “main fact,” shaped by culture, education, and a thousand past encounters with the word “summary.”</p>

<p>The LLMs share that prior because they were trained on human outputs. The result is a systematic pattern in what gets discarded: not random noise, but <em>structured absence</em>. Statisticians call this “missing not at random.” The gaps in a summary aren’t accidents. They’re artifacts of the compression algorithm—which is to say, artifacts of how we collectively decided to encode thought into language.</p>

<h2 id="the-receivers-problem">The Receiver’s Problem</h2>

<p>But compression is only half the story.</p>

<p>When you read my summary of addition, you’re not recovering my original thought. You’re <em>reconstructing</em> something in your own mind, using your own priors. The same sentence means different things to a mathematician, a kindergarten teacher, and someone who’s never heard the word “commutative.”</p>

<p>Language doesn’t transmit meaning. It transmits symbols that <em>prompt</em> the receiver to construct meaning locally. The fidelity of that reconstruction depends on how well the sender’s and receiver’s codecs align.</p>

<p>Usually, we don’t notice the drift. Shared culture, shared context, shared assumptions—these create enough overlap that communication feels seamless. But the overlap is never complete. Every sentence you speak arrives slightly different than it left.</p>

<h2 id="what-this-means">What This Means</h2>

<p>Language is not a window into thought. It’s a compression artifact.</p>

<p>When we summarize, explain, argue, or describe, we’re not transferring our mental states intact. We’re lossy-encoding them into a symbolic stream, hoping the receiver’s decompression produces something close enough to be useful.</p>

<p>Sometimes it does. Sometimes what you meant and what I understood are close enough that we never notice the gap.</p>

<p>But the gap is always there.</p>

<p>Every act of communication is an act of interpretation—twice over. First by the speaker, who must choose what survives the encoding. Then by the listener, who must reconstruct meaning from the symbols that remain.</p>

<p>The question is never “did I say it clearly?” The question is: whose priors are doing the work?</p>

]]></content><author><name>Piotr Trochim</name></author><category term="psychology" /><summary type="html"><![CDATA[Language compresses thought. What gets lost?]]></summary></entry><entry><title type="html">Puer Aeternus (forever-child) and Procrastination</title><link href="https://piotrtrochim.net/2025/12/10/puer-aeternus-forever-child-and-procrastination/" rel="alternate" type="text/html" title="Puer Aeternus (forever-child) and Procrastination" /><published>2025-12-10T12:42:46+00:00</published><updated>2025-12-10T12:42:46+00:00</updated><id>https://piotrtrochim.net/2025/12/10/puer-aeternus-forever-child-and-procrastination</id><content type="html" xml:base="https://piotrtrochim.net/2025/12/10/puer-aeternus-forever-child-and-procrastination/"><![CDATA[<blockquote>
  <p>The Hidden Mental Traps Behind Procrastination</p>
</blockquote>

<p><a href="https://substackcdn.com/image/fetch/$s_!xuBS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff18e3462-712a-47a6-a4a4-7fb16627e65f_5299x3533.jpeg"><img src="https://substackcdn.com/image/fetch/$s_!xuBS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff18e3462-712a-47a6-a4a4-7fb16627e65f_5299x3533.jpeg" alt="" /></a></p>

<p>Procrastination isn’t just about laziness—it’s a complex web of mental traps that keep us stuck. Understanding these patterns can help us break free from them.</p>

<h2 id="the-six-faces-of-procrastination">The Six Faces of Procrastination</h2>

<p>When we procrastinate, we typically fall into one of these mental traps:</p>

<p><strong>The “Not Enough” Trap</strong>: We convince ourselves that taking action on this particular thing isn’t sufficient. For example, we might think, “<em>I can’t improve LLM summarization because I’d first need to research the right metric.</em>” We create arbitrary prerequisites that prevent us from starting.</p>

<p><strong>The “Why Bother” Trap</strong>: Past failures loom large and convince us that starting again is pointless. “<em>I’ve already tried writing a book so many times and failed—I’ll probably fail this time too, so why bother?</em>” This trap transforms previous attempts into evidence against future success.</p>

<p><strong>The Efficiency Trap</strong>: We postpone action under the guise of strategic timing. “<em>Why bother finding a partner now when I don’t have a job? If I wait until I’m financially stable, it will be much easier.</em>” We tell ourselves we’re being smart when we’re actually avoiding discomfort.</p>

<p><strong>The Guilt Spiral</strong>: After finally making progress, instead of celebrating, we punish ourselves: “<em>I should have done that years ago.</em>” This guilt actually strengthens the procrastination pattern by making achievement feel bad rather than good.</p>

<p><strong>The Devaluation Trap</strong>: We minimize our accomplishments to avoid recognizing progress: “<em>This was so easy—it wasn’t real progress at all.</em>” By dismissing what we’ve done, we rob ourselves of the momentum that success creates.</p>

<p><strong>The Comparison Trap</strong>: We measure our progress against others and come up short: “<em>In the same amount of time, I did so little while this other person did so much. Why even bother?</em>” This trap ensures that any progress we make feels inadequate.</p>

<h2 id="the-puer-aeternus-connection">The Puer Aeternus Connection</h2>

<p>The concept of Puer Aeternus—Latin for “<em>eternal child</em>”—describes a psychological pattern of avoiding commitment and maintaining unlimited potential. Three mental traps define this pattern:</p>

<p><strong>Failure to Constellate</strong>: The inability to commit to a cohesive identity or expertise—being a jack of all trades but master of none. There’s no clear narrative, no ability to say “<em>I am an expert in this particular thing.</em>”</p>

<p><strong>Fear of Wasting Time</strong>: An overwhelming anxiety about investing energy in something that might not be “<em>good enough</em>” or worthy of one’s talents.</p>

<p><strong>Focus on the Loss</strong>: Obsessive attention to what might be lost or sacrificed by committing to a particular path, rather than what might be gained.</p>

<h2 id="how-these-patterns-feed-each-other">How These Patterns Feed Each Other</h2>

<p>Procrastination becomes the perfect outlet for the Puer Aeternus mindset. Someone caught in this pattern will avoid engaging with tasks because they fear wasting time on something below their standards. They focus relentlessly on potential losses—the opportunities foreclosed, the paths not taken—rather than the value of actually doing the work.</p>

<p>This creates a vicious cycle. Procrastination inevitably leads to failed or never-started projects. These failures reinforce the belief that committing to any particular career area isn’t worthwhile, causing the person to drift between different fields. They remain forever potential, never actual—perpetually failing to constellate into something concrete.</p>

<p>The tragedy is that the very behavior meant to preserve unlimited potential actually destroys it. By avoiding commitment to protect ourselves from wasting time or making the wrong choice, we ensure that no choice ever bears fruit.</p>

<h2 id="references">References</h2>

<ul>
  <li>
    <p>“The Problem of the Puer Aeternus” Marie-Louise von Franz</p>
  </li>
  <li>
    <p><a href="https://www.youtube.com/watch?v=e0ec2-E5Xq8&amp;list=WL&amp;index=14">Dr. Alok Kanojia “Why You Still Haven’t Grown Up” vlog (Healthy GamerGG)</a></p>
  </li>
  <li>
    <p><a href="https://www.youtube.com/watch?v=ztoA0NpguT0&amp;t=1646s">Dr. Alok Kanojia “The Reason You Can Never Progress” vlog (HealthyGamerGG)</a></p>
  </li>
</ul>

]]></content><author><name>Piotr Trochim</name></author><category term="psychology" /><summary type="html"><![CDATA[The Hidden Mental Traps Behind Procrastination]]></summary></entry><entry><title type="html">Ubuntu 22.04 setup on Razer Blade 15 2022</title><link href="https://piotrtrochim.net/2023/07/10/razer-ubuntu-setup/" rel="alternate" type="text/html" title="Ubuntu 22.04 setup on Razer Blade 15 2022" /><published>2023-07-10T00:00:00+00:00</published><updated>2023-07-10T00:00:00+00:00</updated><id>https://piotrtrochim.net/2023/07/10/razer-ubuntu-setup</id><content type="html" xml:base="https://piotrtrochim.net/2023/07/10/razer-ubuntu-setup/"><![CDATA[<h2 id="closing-the-lid-doesnt-suspend-the-laptop">Closing the lid doesn’t suspend the laptop</h2>

<ol>
  <li>create a new file: <code class="language-plaintext highlighter-rouge">/etc/systemd/system/acpi-wake-andy.service</code></li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Unit]
Description=ACPI Wake Service
 
[Service]
Type=oneshot
ExecStart=/bin/sh -c "echo RP05 | sudo tee /proc/acpi/wakeup"
 
[Install]
WantedBy=multi-user.target
</code></pre></div></div>

<ol>
  <li>Enable that service</li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo systemctl start acpi-wake-andy.service
sudo systemctl enable acpi-wake-andy.service
sudo systemctl status acpi-wake-andy.service # check status
</code></pre></div></div>
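
<p>To verify that the service actually changed something, you can inspect the ACPI wakeup table. Note that <code class="language-plaintext highlighter-rouge">RP05</code> is the device name on this particular laptop - on other machines the relevant entry may be named differently, so check the full list first:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat /proc/acpi/wakeup           # list wakeup-capable devices and their current state
grep RP05 /proc/acpi/wakeup     # the last column toggles each time RP05 is echoed into the file
</code></pre></div></div>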

<ol>
  <li>Create a new file <code class="language-plaintext highlighter-rouge">/etc/modprobe.d/nvidia-s2idle.conf</code></li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>options nvidia NVreg_EnableS0ixPowerManagement=1
NVreg_S0ixPowerManagementVideoMemoryThreshold=10000
</code></pre></div></div>
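
<p>Depending on how early the <code class="language-plaintext highlighter-rouge">nvidia</code> module is loaded, the new options may only take effect after the initramfs is rebuilt (this extra step may not be needed on every setup):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo update-initramfs -u   # rebuild the initramfs so the module options are applied at boot
</code></pre></div></div>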

<ol>
  <li>Check status <code class="language-plaintext highlighter-rouge">cat /sys/power/mem_sleep</code></li>
</ol>

<p>If the contents are <code class="language-plaintext highlighter-rouge">[s2idle] deep</code>, you’re done - reboot and check that closing the lid now suspends the laptop.
If the contents are <code class="language-plaintext highlighter-rouge">s2idle [deep]</code>, continue with the next step.</p>

<ol>
  <li><em>Perform this step only if the above command output is</em> <code class="language-plaintext highlighter-rouge">s2idle [deep]</code></li>
</ol>

<p>Edit file <code class="language-plaintext highlighter-rouge">/etc/default/grub</code>, adding <code class="language-plaintext highlighter-rouge">mem_sleep_default=s2idle</code> to <code class="language-plaintext highlighter-rouge">GRUB_CMDLINE_LINUX_DEFAULT</code></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GRUB_CMDLINE_LINUX_DEFAULT="quiet splash mem_sleep_default=s2idle" 
</code></pre></div></div>

<ol>
  <li>
    <p>run <code class="language-plaintext highlighter-rouge">sudo update-grub</code></p>
  </li>
  <li>
    <p>reboot, and you’re done - test that the changes work</p>
  </li>
</ol>]]></content><author><name>ptrochim</name></author><category term="linux" /><summary type="html"><![CDATA[Closing the lid doesn’t suspend the laptop]]></summary></entry><entry><title type="html">Useful libraries</title><link href="https://piotrtrochim.net/2023/07/02/useful-libraries/" rel="alternate" type="text/html" title="Useful libraries" /><published>2023-07-02T00:00:00+00:00</published><updated>2023-07-02T00:00:00+00:00</updated><id>https://piotrtrochim.net/2023/07/02/useful-libraries</id><content type="html" xml:base="https://piotrtrochim.net/2023/07/02/useful-libraries/"><![CDATA[<h1 id="ml">ML</h1>

<table>
  <thead>
    <tr>
      <th>Name and link</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="https://www.pytorchlightning.ai/index.html">PyTorch Lighting</a></td>
      <td>Simplifies building scalable research workflows and production pipelines in PyTorch</td>
    </tr>
    <tr>
      <td><a href="https://omegaconf.readthedocs.io/en/2.3_branch/">OmegaConf</a></td>
      <td>Hierarchical configuration system, supports merging configurations</td>
    </tr>
    <tr>
      <td><a href="https://wandb.ai/">Weights &amp; Biases</a></td>
      <td>Online visualization, better than TensorBoard</td>
    </tr>
  </tbody>
</table>]]></content><author><name>ptrochim</name></author><category term="ml" /><summary type="html"><![CDATA[ML]]></summary></entry><entry><title type="html">Emotion of a memory</title><link href="https://piotrtrochim.net/2020/08/14/emotion-memory/" rel="alternate" type="text/html" title="Emotion of a memory" /><published>2020-08-14T00:00:00+00:00</published><updated>2020-08-14T00:00:00+00:00</updated><id>https://piotrtrochim.net/2020/08/14/emotion-memory</id><content type="html" xml:base="https://piotrtrochim.net/2020/08/14/emotion-memory/"><![CDATA[<p>Just the other day I was watching “Star Trek: Deep Space Nine”, when one of the characters seemed familiar to me - <a href="https://en.wikipedia.org/wiki/Julian_Bashir">Dr Julian Bashir</a>.</p>

<p>As it turned out, the actor who played Dr Bashir - <a href="https://en.wikipedia.org/wiki/Alexander_Siddig">Alexander Siddig</a> - starred in <a href="https://en.wikipedia.org/wiki/The_Spy_(TV_series)">“The Spy”</a> as <a href="https://en.wikipedia.org/wiki/Ahmed_Suidani">Ahmed Suidani</a>. This dark and menacing character stands in stark contrast to the character of the always eager, friendly and unassuming Dr Bashir.</p>

<p>A couple of episodes later, I noticed that my attitude towards Dr Bashir had changed - I no longer perceived him as friendly.
And then I asked myself - would it have been different had I watched Deep Space Nine first?</p>

<p>How and where are emotions associated with memories?</p>]]></content><author><name>ptrochim</name></author><category term="psychology" /><summary type="html"><![CDATA[Just the other day I was watching “Star Trek: Deep Space Nine”, when one of the characters seemed familiar to me - Dr Julian Bashir.]]></summary></entry></feed>