Canadian Technology Magazine: Hermes Agent Is Insane, and It Changes How You Build With AI

Canadian Technology Magazine has covered plenty of AI tools, but every now and then something shows up that feels less like a chatbot and more like a real working system. Hermes Agent is one of those tools. It does not just answer questions. It installs, tests, iterates, remembers, and can coordinate other agents to get real work done while you sleep.

What makes this especially interesting for Canadian Technology Magazine readers is not just the novelty. It is the practicality. This is the kind of setup that can help with software prototyping, benchmark creation, automated testing, and long-horizon technical work that used to require constant babysitting.

I recently used Hermes Agent alongside GPT 5.5, Codex, and Claude Code to build something that is honestly one of the most fun AI benchmarks I have worked on: a gravity-well simulation where language models have to write code to pilot ships through orbital chaos. The result was not just a cool demo. It was proof that these agents can collaborate, improve over time, and produce work that would have taken far longer by hand.

What I Built With Hermes Agent

The benchmark is simple to explain and surprisingly hard to solve.

There are four gravity wells, basically four suns. Blue dots represent ships that an AI model must control. The model is not pressing buttons directly. It has to write code that decides how to fly those ships. Gravity is real. Fuel is limited. Thrusters matter. Momentum matters. Collisions matter. Get too close to a sun and you can lose everything.

The objective is to keep the ships inside a moving target circle for as long as possible. Every tick spent in the circle earns a point. The circle moves according to a fixed pattern, not randomly, so the best solutions are not just reactive. They need to learn the motion, anticipate it, and get in front of it.
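
To make the mechanics concrete, here is a rough sketch of the kind of per-tick physics update the simulation implies. Every name and constant below is illustrative, not the benchmark's actual code:

import math

GRAVITY = 1.0  # illustrative constant, not the benchmark's real value
DT = 1.0       # one simulation tick

def step_ship(ship, suns, thrust):
    """Advance one ship by one tick: gravity from every sun, then thrust.

    ship is a dict with x, y, vx, vy, fuel, alive; suns is a list of
    (x, y, mass) tuples; thrust is the (tx, ty) pair the model's script chose.
    """
    ax = ay = 0.0
    for sx, sy, mass in suns:
        dx, dy = sx - ship["x"], sy - ship["y"]
        r = math.hypot(dx, dy)
        if r < 5.0:                       # too close to a sun: ship is lost
            ship["alive"] = False
            return ship
        accel = GRAVITY * mass / (r * r)  # inverse-square pull toward the sun
        ax += accel * dx / r
        ay += accel * dy / r
    tx, ty = thrust
    cost = math.hypot(tx, ty)
    if ship["fuel"] >= cost:              # thrusters only fire while fuel lasts
        ship["fuel"] -= cost
        ax += tx
        ay += ty
    ship["vx"] += ax * DT                 # momentum carries across ticks
    ship["vy"] += ay * DT
    ship["x"] += ship["vx"] * DT
    ship["y"] += ship["vy"] * DT
    return ship

The real simulation will differ in its details, but this is the physics every pilot script has to reason about.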

That turns this into a great AI test.

A model receives a plain-English description of the game, including the mechanics of gravity, thrusters, and scoring. It then generates a script. That script gets dropped into the simulation and run. The score comes back. Then the model gets another chance. Not once, but over multiple iterations.

This is where things get interesting.

Why This Benchmark Is More Useful Than Typical AI Benchmarks

A lot of benchmark scores floating around AI right now are hard to trust. Some are useful. Some are overfit. Some may have been directly trained against. If you want to know whether a model is actually capable, you need tests that are harder to game and closer to real work.

This benchmark asks a set of very practical questions:

  • Can a model understand instructions written in English?

  • Can it turn those instructions into working code?

  • Can it debug and improve that code over repeated attempts?

  • Can it generalize across many different runs and conditions?

Each model gets 20 attempts to improve. After that, the best program is tested across 100 different seeds. Each seed changes the positions of the suns and slightly alters the movement of the target circle. That helps reduce flukes and reveals whether the solution is robust or just lucky.
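
In code, that protocol boils down to a harness like the sketch below. The generate_script and run_simulation callables are stand-ins for the model call and the simulator, which this article does not reproduce:

def evaluate(generate_script, run_simulation, attempts=20, eval_seeds=100):
    """Let a model refine its pilot script, then test the best one broadly.

    generate_script(feedback) asks the model for a new or revised script;
    run_simulation(script, seed) returns (score, diagnostics).
    """
    best_script, best_score = None, float("-inf")
    feedback = None
    for _ in range(attempts):              # 20 chances to improve
        script = generate_script(feedback)
        score, feedback = run_simulation(script, seed=0)
        if score > best_score:
            best_script, best_score = script, score
    # Robustness check: each seed moves the suns and alters the circle's path
    scores = [run_simulation(best_script, seed=s)[0] for s in range(eval_seeds)]
    return best_script, sum(scores) / len(scores)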

That is the kind of thing Canadian Technology Magazine readers should care about because it is much closer to business reality. Real AI work is not one prompt and one answer. It is repeated attempts, diagnostics, adaptation, and refinement.

What the Learning Curves Look Like

The first attempts are usually terrible.

Ships smash into suns. They burn all their fuel racing off into space. They overshoot the target. They drift out of bounds. Some early runs score almost nothing.

Then the better models begin to learn.

One strong model started low and eventually climbed to a score of 276 after enough iterations. A mid-sized model improved too, but plateaued much earlier around 78. Another early run scored just 1, and that point looked more like an accident than a strategy.

This is exactly what you want from a benchmark. You can see the intelligence curve. You can see where models hit a ceiling. You can see which ones really benefit from feedback and which ones just flail more confidently.

Why Hermes Agent Was So Useful

Technically, I did not need Hermes Agent to build all of this.

But practically, it saved me in a few tight spots and made the whole process far more efficient.

Most of the benchmark itself was built with coding agents like Codex. Hermes helped because it can act as a coordinator. It has built-in skills, can call other agent systems, and can manage repeated tasks without needing constant hand-holding. It can even launch Claude Code and Codex as tools and have them work on separate jobs.

That is where it starts feeling less like a chatbot and more like an orchestration layer.

For example, I had Hermes open instances of Claude Code and Codex and literally guide them through a game of tic-tac-toe just to test whether the workflow worked. Ridiculous? Absolutely. Useful? Also yes. It confirmed Hermes could coordinate turn-based interaction between separate agent environments.

That same idea later got applied to the benchmark itself.

The Real Point of Hermes Agent

The coolest part is not that Hermes can generate an image of a flower or answer a question in Telegram. The real point is this: it can turn your day and night into two different modes of AI work.

Daytime: collaborative building. You steer. The agent helps create, troubleshoot, and implement.

Nighttime: automated grinding. The agent runs tests, benchmarks models, repeats workflows, and comes back with results by morning.

That is exactly how I used it.

During the day, the benchmark got built through prompting, direction, and iteration. During the night, the agents ran model evaluations for hours across a stack of systems including GPT 5.4, GPT 5.5, GPT 5.5 Pro, Grok, DeepSeek, Gemini, and Anthropic models. The runs started after 2 a.m. and wrapped around 5:32 a.m.

That is the kind of labour you want to automate.

Installing Hermes Agent on a VPS

If you want Hermes Agent always available, a VPS is the cleanest option. You can also run it on a local desktop, an old laptop, a mini PC, or a Mac mini. All of those work. But a virtual private server has some big advantages:

  • It stays online

  • You do not need to keep your main machine running

  • Hardware maintenance becomes someone else’s problem

  • You can isolate the agent from your personal computer

The setup used here was an Ubuntu VPS, specifically an LTS release. That matters because long-term support versions tend to be the most stable, and Hermes is tested against them.

Basic VPS setup flow

  1. Create a VPS with a plain Ubuntu installation.

  2. Set a root password.

  3. Copy the server IP address.

  4. Open your terminal or PowerShell.

  5. SSH into the machine as root.

ssh root@your-server-ip

Once connected, you are operating directly on the remote machine.
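
Before installing anything, it is worth bringing the fresh server up to date. These are standard Ubuntu commands, nothing Hermes-specific:

apt update && apt upgrade -y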

From there, Hermes can be installed using the installer command recommended by its setup flow. One smart habit here is to ask an AI assistant for the latest install steps for a fresh Ubuntu environment before you run anything. Even when you already know roughly what to do, that extra context can catch small but important details.

Hermes Setup: What Actually Matters

During setup, Hermes walks through a few important choices.

1. Provider selection

You can use direct providers or a model aggregator. Two options stand out:

  • OpenRouter, which gives access to a huge range of models through one API key

  • Nous Portal, which bundles model access plus useful tools like web search, image generation, text-to-speech, and browser automation

The convenience of a bundled option is obvious. Instead of setting up multiple third-party tool APIs one by one, you can enable a tool gateway and get moving faster.

2. Terminal backend

You will typically choose between:

  • Local, which runs directly on the machine

  • Docker, which sandboxes the agent more safely

On a fresh server, local is often the practical starting point because Docker may not yet be installed. Later, if you want more isolation, Docker is the better choice.
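
If you do switch to Docker later, the command takes the usual shape. The image name below is purely a placeholder; Hermes's own documentation specifies the real one:

docker run -it --rm your-hermes-image   # "your-hermes-image" is a placeholder, not the real image name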

3. Iteration limits and compression

Hermes lets you set limits for tool calls and decide when to compress its context. A moderate number such as 90 tool iterations is a good default for substantial tasks. The compression threshold controls how full the context window is allowed to get before Hermes summarizes or resets it.

4. Messaging and tools

Hermes can be controlled through interfaces like Telegram, and it can be configured to use image generation and other tools. For image generation, one setup demonstrated using OpenAI authentication rather than an API key, which made it possible to use GPT Image tools inside Hermes.

Using GPT 5.5 Inside Hermes

One especially useful trick is running GPT 5.5 inside Hermes. The process involved updating Hermes, authenticating through OpenAI’s Codex login flow, selecting GPT 5.5 as the active model, and then launching Hermes again.

That setup matters because GPT 5.5 performed very well on long-horizon project work. It was good at building complicated systems over time, even if it occasionally needed some troubleshooting. And that is the key distinction. A model does not need to be perfect in one shot to be valuable. It needs to be capable of improving over repeated steps.

Hermes Agent as a Manager of Other Agents

This is probably the most exciting part.

Hermes can act as a controller for sub-agents like Claude Code and Codex. So instead of you manually switching between tools, Hermes can prompt them, collect their outputs, feed them diagnostics, and coordinate the loop.

For the gravity-well duel mode, the workflow looked like this (a code sketch follows the list):

  1. Hermes receives the game state and instructions.

  2. Hermes asks Claude Code and Codex to generate an initial script.

  3. The scripts are submitted into the simulation.

  4. The simulation returns diagnostics and scores.

  5. Hermes passes those diagnostics back to each coding agent.

  6. Each agent revises its script for the next round.

  7. The loop continues for a set number of iterations.
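
Here is a minimal sketch of that loop, assuming thin wrappers around each coding agent. All of the names are illustrative rather than Hermes's actual API:

def duel(agents, run_simulation, rounds=10):
    """Head-to-head loop: each agent revises its script from its own diagnostics.

    Each agent needs a .name and a .write_script(feedback) method;
    run_simulation(script) returns (score, diagnostics).
    """
    wins = {agent.name: 0 for agent in agents}
    feedback = {agent.name: None for agent in agents}
    for _ in range(rounds):
        scores = {}
        for agent in agents:
            script = agent.write_script(feedback[agent.name])  # steps 2 and 6
            score, diagnostics = run_simulation(script)        # steps 3 and 4
            feedback[agent.name] = diagnostics                 # step 5
            scores[agent.name] = score
        winner = max(scores, key=scores.get)                   # round goes to the top score
        wins[winner] += 1
    return wins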

This produced a live head-to-head benchmark where two models improve in parallel and battle each other over time. In one test, GPT 5.5 high beat Claude Opus 4.7 seven rounds to three.

That is not just fun. It is a genuinely useful framework for model evaluation.

Why Iteration Beats First Impressions

One of the strongest lessons from this whole process is that first attempts can be wildly misleading.

At iteration one, the ships were crashing, overshooting, and wasting fuel. By iteration ten, the behaviour looked disciplined. Tiny thrust adjustments. Better anticipation. Cleaner orbital control. Less panic. Less chaos.

That progression matters for how we evaluate AI systems in general.

For Canadian Technology Magazine, one of the big takeaways here is that a model’s value should not be judged only by its first output. Some models are dramatically better when allowed to reason through feedback loops and refine their work.

Security: Do Not Be Reckless With Agents

Now for the part that is a little less fun but absolutely necessary.

These agents can be run in modes that bypass frequent approval prompts. That is useful for long jobs because otherwise they stop every few minutes asking whether it is okay to create a file or a folder. For serious autonomous runs, that friction becomes unbearable.

But there is obvious risk.

If you allow an agent to operate with broad permissions, do not run it on your main personal machine. That is asking for trouble.

A better setup is:

  • Use a VPS

  • Use an old laptop or spare mini PC

  • Consider Docker for sandboxing

  • Use password managers and isolated credentials

  • Think in terms of blast radius if something goes wrong

That is the right mental model. Not “will it ever go wrong,” but “if it goes wrong, how contained is the damage?”

Where This Is Going Next

The gravity-well benchmark is just one test. Several more are already queued up. The broader goal is to create a battery of original, difficult evaluations that reveal how well AI models actually perform in messy, practical environments.

There is also an interesting future direction inspired by evolutionary approaches. Instead of simple iteration, you can generate many candidate solutions, keep the strong ones, discard the weak ones, and let better strategies evolve over time.
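
That idea is just a selection loop. Here is a generic sketch, where mutate stands in for asking a model to revise a script and fitness stands in for running it in the simulator; none of this is Hermes-specific code:

import random

def evolve(population, fitness, mutate, generations=10, keep_fraction=0.2):
    """Generic evolutionary loop: score everything, keep the strong, refill by mutation."""
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: max(1, int(len(ranked) * keep_fraction))]
        children = [mutate(random.choice(survivors))
                    for _ in range(len(population) - len(survivors))]
        population = survivors + children
    return max(population, key=fitness)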

That is where things get really spicy.

Once a model becomes a strong pilot in the simulation, it can do more than chase the circle. It can perform intentional manoeuvres. Slingshot around a star. Hover near a gravity well. Execute precision movement objectives. At that point, you are no longer testing one narrow task. You are testing whether the model has developed a deeper understanding of the environment.

Why This Matters for Canadian Technology Magazine Readers

The reason this belongs in Canadian Technology Magazine is simple. This is not hype for hype’s sake. This is a working example of how AI agents are evolving from novelty tools into operational systems.

For businesses, developers, and technical teams, the implications are huge:

  • Faster prototyping

  • Cheaper experimentation

  • Overnight testing and analysis

  • More autonomous benchmark creation

  • Better collaboration between specialised models

There is still friction. Things still break. You still need judgment. But the amount of useful work these systems can now do is hard to ignore.

If you are still thinking AI agents are mostly toys, I get the skepticism. A lot of them have been oversold. But once you start using a setup where an agent helps build during the day and runs experiments overnight, it becomes very difficult to go back.

FAQ

What is Hermes Agent?

Hermes Agent is an open-source AI agent system designed to perform tasks, use tools, remember how it solved problems, and improve workflows over time. It can also call other agent environments such as Claude Code and Codex.

Can Hermes Agent run on a VPS?

Yes. Running Hermes on a VPS is one of the most practical options because it keeps the agent online, separates it from your main computer, and makes long autonomous runs easier to manage.

Do I need Hermes Agent to build AI benchmarks?

No. You can build benchmarks using other coding agents. Hermes becomes valuable when you want coordination, memory, tool use, and automation across longer workflows.

What makes the gravity-well benchmark useful?

It tests whether a model can understand instructions, write code, improve through feedback, and generalize across many randomized scenarios. That makes it more realistic than many static benchmarks.

Is it safe to run AI agents with broad permissions?

It can be done, but it should be approached carefully. A safer practice is to run agents on isolated machines or VPS environments, use Docker when possible, and avoid giving dangerous autonomy to systems running on your primary personal computer.

Why is this relevant to Canadian Technology Magazine readers?

Canadian Technology Magazine focuses on practical technology trends and useful tools. Hermes Agent fits that perfectly because it demonstrates how AI can move beyond chat and into real software work, testing, and automation.

Final Thoughts

The best way I can describe Hermes Agent is this: it is one of those tools that makes the near future feel very close.

You can prompt during the day, hand off the grind at night, and wake up to actual progress. Not just words. Not just ideas. Real iterations, test results, code, and systems.

For anyone building with AI right now, that is a big deal. And for Canadian Technology Magazine, it is exactly the kind of shift worth paying attention to.
