<?xml version="1.0" encoding="utf-8"?><?xml-stylesheet type="text/xsl" href="atom.xsl"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://jaymaverick.github.io/blog</id>
    <title>Jay Pitkänen Blog</title>
    <updated>2026-05-13T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://jaymaverick.github.io/blog"/>
    <subtitle>Jay Pitkänen Blog</subtitle>
    <icon>https://jaymaverick.github.io/img/favicon.ico</icon>
    <entry>
        <title type="html"><![CDATA[Version 3: Back to OCR - This Time For Real]]></title>
        <id>https://jaymaverick.github.io/blog/v3-easyocr-llm</id>
        <link href="https://jaymaverick.github.io/blog/v3-easyocr-llm"/>
        <updated>2026-05-13T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[The coordinate grid was DOA. Back to square one.]]></summary>
        <content type="html"><![CDATA[<p>The coordinate grid was DOA. Back to square one.</p>
<p>Python-powered OCR would be the only way forward.</p>
<p>I have to get this right. If you are working on a car, guessing doesn't cut it. Accuracy is everything. Get a torque spec wrong, and you snap a bolt inside an engine block.</p>
<p>But instead of trying to become a computer vision engineer overnight, I decided to use an established solution created by someone way smarter than me: EasyOCR.</p>
<p>Why reinvent the wheel, right?</p>
<!-- -->
<p>EasyOCR is a proven, open-source Python library built specifically for reading text in images. No messy algorithms, no manual contour filtering. You just feed it an image, and it hands back the text along with the exact pixel coordinates.</p>
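<p>Here is a minimal sketch of that hand-off, not the exact code from the project. It assumes the result shape EasyOCR's <code>readtext</code> returns: a list of (bounding box, text, confidence) entries, where the box is four corner points. The helper names <code>read_screen</code>, <code>box_center</code> and <code>find_text</code> are made up for illustration, and so is the sample data.</p>

```python
from typing import List, Optional, Sequence, Tuple

# One OCR hit in the shape EasyOCR's readtext() returns:
# (four corner points, recognized text, confidence between 0 and 1).
Detection = Tuple[Sequence[Sequence[float]], str, float]


def read_screen(image_path: str) -> List[Detection]:
    """Run EasyOCR on a screenshot (requires `pip install easyocr`)."""
    import easyocr  # imported lazily so the helpers below work without it

    reader = easyocr.Reader(["en"])  # loads the English recognition model
    return reader.readtext(image_path)


def box_center(box: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Average the four corners to get a single point to click."""
    xs = [point[0] for point in box]
    ys = [point[1] for point in box]
    return round(sum(xs) / len(xs)), round(sum(ys) / len(ys))


def find_text(detections: List[Detection], wanted: str,
              min_confidence: float = 0.5) -> Optional[Tuple[int, int]]:
    """Return the click point of the first confident match, or None."""
    for box, text, confidence in detections:
        if confidence >= min_confidence and wanted.lower() in text.lower():
            return box_center(box)
    return None


# Made-up sample in readtext() format, so this runs without a screenshot.
sample = [
    ([[100, 40], [220, 40], [220, 70], [100, 70]], "Part number", 0.93),
    ([[300, 200], [380, 200], [380, 230], [300, 230]], "Search", 0.88),
]
print(find_text(sample, "search"))  # (340, 215)
```

<p>The confidence cutoff of 0.5 is an arbitrary placeholder. Given the torque-spec problem, a stricter value is probably the safer default.</p>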
<p>Suddenly, the pipeline became incredibly clean.</p>
<ol>
<li class="">The JetKVM grabs the screenshot.</li>
<li class="">Python hands the screenshot to EasyOCR, which extracts the text and maps out the exact pixel coordinates for every word on the screen.</li>
<li class="">The LLM acts purely as the brain, reading the text data to decide what to do.</li>
<li class="">Python commands the JetKVM to click the exact coordinates EasyOCR found.</li>
</ol>
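<p>The four steps above can be sketched as a single function. This is a rough outline under assumptions, not the real implementation: <code>capture</code>, <code>ocr</code>, <code>decide</code> and <code>click</code> are hypothetical callables standing in for the JetKVM, EasyOCR and LLM pieces, and the stand-ins at the bottom are fake.</p>

```python
from typing import Callable, Dict, List, Optional, Tuple

Point = Tuple[int, int]


def run_step(
    capture: Callable[[], bytes],
    ocr: Callable[[bytes], Dict[str, Point]],
    decide: Callable[[List[str]], Optional[str]],
    click: Callable[[int, int], None],
) -> Optional[Point]:
    """One pass of the loop: screenshot, OCR, LLM decision, click."""
    screenshot = capture()             # 1. the JetKVM grabs the screenshot
    targets = ocr(screenshot)          # 2. OCR maps each piece of text to a pixel point
    choice = decide(sorted(targets))   # 3. the LLM sees text only, never pixels
    if choice is None or choice not in targets:
        return None                    # never click something OCR did not find
    x, y = targets[choice]
    click(x, y)                        # 4. click the coordinates OCR reported
    return (x, y)


# Stand-ins so the sketch runs without hardware, OCR models, or an LLM.
clicks: List[Point] = []
result = run_step(
    capture=lambda: b"fake-png-bytes",
    ocr=lambda image: {"Search": (340, 215), "Cancel": (460, 215)},
    decide=lambda labels: "Search",
    click=lambda x, y: clicks.append((x, y)),
)
print(result, clicks)  # (340, 215) [(340, 215)]
```

<p>The point of this shape is that the LLM only ever picks from text OCR actually found. It never produces a coordinate, so it has no coordinate to hallucinate.</p>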
<p>For the first time in this entire project, I looked at the workflow and smiled. This might actually work.</p>
<p>This is the exact blueprint I used for my tech demo. You can read the full breakdown here.</p>
<p>Go check it out.</p>
<p>Oh and hire me already, wtf.</p>]]></content>
        <author>
            <name>Jay Pitkänen</name>
            <uri>https://www.linkedin.com/in/jaypitkanen/</uri>
        </author>
        <category label="LLM" term="LLM"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Version 2: Blinded By The Coordinate]]></title>
        <id>https://jaymaverick.github.io/blog/v2-llm-ocr-failure</id>
        <link href="https://jaymaverick.github.io/blog/v2-llm-ocr-failure"/>
        <updated>2026-05-12T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[The LLM-powered vision pipeline was starting to take shape.]]></summary>
        <content type="html"><![CDATA[<p>The LLM-powered vision pipeline was starting to take shape.</p>
<p>The JetKVM grabs the screenshot.</p>
<p>A short piece of JavaScript sends that image over to the LLM. The LLM reads the screen and tells Python exactly where to click.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">[JetKVM Laptop Screen] </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">       │ </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">       ▼ (WebRTC Stream) </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">[Your Browser Frontend] ──(Captures Frame)──&gt; [Converts to Base64 Image String] </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                                       │ </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                                       ▼ (Fetch POST Request) </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                                 [Ollama / Local Server] </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                                       │ (Runs Ministral 8B) </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                                       ▼ </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                                                 [Structured JSON Data Out]</span><br></div></code></pre></div></div>
<p>Clean. Simple.</p>
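<p>For reference, the request to Ollama could look something like the sketch below. The original request came from browser JavaScript; this version uses Python instead to show the payload shape. It assumes Ollama's default local endpoint and its <code>/api/generate</code> fields (<code>model</code>, <code>prompt</code>, <code>images</code>, <code>format</code>, <code>stream</code>). The model tag is a placeholder, and the function names are invented for illustration.</p>

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "ministral-8b"  # placeholder tag; use whatever `ollama list` shows


def strip_data_url(data_url: str) -> str:
    """Drop the 'data:image/...;base64,' prefix; Ollama expects bare Base64."""
    header, sep, payload = data_url.partition(",")
    return payload if sep and header.startswith("data:") else data_url


def build_request(image_data_url: str, prompt: str) -> dict:
    """Assemble the JSON body for a single image-plus-prompt request."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "images": [strip_data_url(image_data_url)],
        "format": "json",  # ask the model for JSON-only output
        "stream": False,   # one complete reply instead of chunks
    }


def ask_ollama(image_data_url: str, prompt: str) -> dict:
    """POST the frame to a running Ollama server and parse its JSON answer."""
    body = json.dumps(build_request(image_data_url, prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return json.loads(reply["response"])  # the model's text, which is itself JSON


payload = build_request("data:image/jpeg;base64,AAAA", "Which button opens search?")
print(payload["images"])  # ['AAAA']
```

<p>Only <code>build_request</code> runs here. <code>ask_ollama</code> needs a live server and a model that accepts images.</p>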
<p>But there was one glaring flaw. How does the LLM describe a location? It can't just say "click the top right button." Python needs exact pixels.</p>
<p>The LLM needed a map.</p>
<!-- -->
<p>I wrote some JavaScript to overlay a bright grid coordinate system directly onto the screenshot. Like a game of Battleship. Rows A through Z, columns 1 through 100.</p>
<details class="details_lb9f alert alert--info details_b_Ee" data-collapsed="true"><summary>Click to see awful JS!</summary><div><div class="collapsibleContent_i85q"><div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">&lt;!DOCTYPE html&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">&lt;html lang="en"&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">&lt;head&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    &lt;meta charset="UTF-8"&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    &lt;meta name="viewport" content="width=device-width, initial-scale=1.0"&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    &lt;title&gt;Mistral Visual Agent - Interface Demo&lt;/title&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    &lt;style&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        body {</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            background-color: #121212;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            color: #ffffff;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            display: 
flex;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            flex-direction: column;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            align-items: center;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            justify-content: center;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            min-height: 100vh;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            margin: 0;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        }</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        h2 {</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            margin-bottom: 20px;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            font-weight: 400;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            letter-spacing: 0.5px;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        }</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        .view-container {</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            position: relative;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            box-shadow: 0 10px 30px rgba(0,0,0,0.5);</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            border-radius: 
4px;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            overflow: hidden;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        }</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        /* The base image simulating the incoming JetKVM feed */</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        #targetImage {</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            display: block;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            width: 1024px;   /* Constrained for demo display */</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            height: auto;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            max-width: 100%;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        }</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        /* The canvas that draws the alpha grid layer exactly over the image */</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        #gridOverlay {</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            position: absolute;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            top: 0;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            left: 0;</span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">            width: 100%;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            height: 100%;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            pointer-events: none; /* Allows mouse clicks to pass through if needed */</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            z-index: 10;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        }</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        .controls {</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            margin-top: 20px;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        }</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        button {</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            background-color: #ff4a4a;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            color: white;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            border: none;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            padding: 10px 20px;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            font-size: 14px;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            border-radius: 4px;</span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">            cursor: pointer;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            font-weight: bold;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            transition: background 0.2s;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        }</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        button:hover {</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            background-color: #d13232;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        }</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    &lt;/style&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">&lt;/head&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">&lt;body&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    &lt;h2&gt;Mistral AI Agent - Visual Grid Overlay Pipeline&lt;/h2&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    &lt;div class="view-container"&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        &lt;!-- Demo Target Image (Emulating the WebRTC feed) --&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        &lt;img id="targetImage" 
src="epc_screenshot.png" alt="Mercedes EPC Screenshot"&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        &lt;!-- The Alpha Matrix Overlay Layer --&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        &lt;canvas id="gridOverlay"&gt;&lt;/canvas&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    &lt;/div&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    &lt;div class="controls"&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        &lt;button id="captureBtn"&gt;Capture and Compile Frame&lt;/button&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    &lt;/div&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    &lt;script&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        const img = document.getElementById('targetImage');</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        const canvas = document.getElementById('gridOverlay');</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        const ctx = canvas.getContext('2d');</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        // Matrix Definition (10 columns, 10 rows)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        const COLS = 
10;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        const ROWS = 10;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        // Ensure the canvas resolution matches the image's layout dimensions once loaded</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        img.onload = function() {</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            initializeGrid();</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        };</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        // Fallback execution if the image is already cached/loaded by the browser</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        if (img.complete) {</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            initializeGrid();</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        }</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        function initializeGrid() {</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            // Set internal drawing resolution to match natural image size</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            canvas.width = img.naturalWidth;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> 
           canvas.height = img.naturalHeight;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            drawAlphaGrid();</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        }</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        function drawAlphaGrid() {</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            const colWidth = canvas.width / COLS;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            const rowHeight = canvas.height / ROWS;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            // Clear any previous artifacts</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            ctx.clearRect(0, 0, canvas.width, canvas.height);</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            // Styling configuration for the grid layout</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            ctx.strokeStyle = 'rgba(255, 0, 0, 0.35)'; // Red alpha lines</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            ctx.lineWidth = 2;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            ctx.fillStyle = 'rgba(255, 50, 50, 0.85)';  // High-contrast font 
fill</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            // Scaled typography dynamically matching resolution depth</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            const fontSize = Math.max(14, Math.floor(canvas.width / 65));</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            ctx.font = `bold ${fontSize}px monospace`;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            for (let i = 0; i &lt; COLS; i++) {</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                for (let j = 0; j &lt; ROWS; j++) {</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    const x = i * colWidth;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    const y = j * rowHeight;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    // Draw bounds bounding box matrix element</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    ctx.strokeRect(x, y, colWidth, rowHeight);</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    // Map grid markers (A0, B4, C2...)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"> 
                   // Column indicator uses character ASCII offsets (65 = 'A')</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    const colLetter = String.fromCharCode(65 + i);</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    const gridLabel = `${colLetter}${j}`;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    // Position label text offset cleanly in top-left quadrant of cell</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    ctx.fillText(gridLabel, x + 8, y + fontSize + 6);</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                }</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            }</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        }</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        // Window resize compliance tracking to maintain grid mapping</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        window.addEventListener('resize', () =&gt; {</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            initializeGrid();</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        });</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        // --- 
COMPILATION TRIGGER (Preparing Data payload for Mistral API) ---</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        document.getElementById('captureBtn').addEventListener('click', () =&gt; {</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            // Create a hidden canvas to fuse the screenshot image and grid overlay into one stream</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            const compileCanvas = document.createElement('canvas');</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            compileCanvas.width = canvas.width;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            compileCanvas.height = canvas.height;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            const compileCtx = compileCanvas.getContext('2d');</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            // Layer 1: Draw target image payload</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            compileCtx.drawImage(img, 0, 0, compileCanvas.width, compileCanvas.height);</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            // Layer 2: Overlay active coordinate array canvas</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            compileCtx.drawImage(canvas, 0, 0, compileCanvas.width, compileCanvas.height);</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" 
style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            // Export the combined image as a JPEG data URL (Base64-encoded)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            const payloadBase64 = compileCanvas.toDataURL('image/jpeg', 0.85);</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            console.log("--- BASE64 DATA READY FOR MISTRAL ---");</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            console.log(payloadBase64.substring(0, 100) + "..."); // Truncated log verification</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            alert("Image and Grid compiled successfully into Base64 format! 
Inspect browser console.");</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            // Note: In your production script, you will drop this data payload right into a </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            // fetch() payload calling Ollama's local runtime interface.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        });</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    &lt;/script&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">&lt;/body&gt;</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">&lt;/html&gt;</span><br></div></code></pre></div></div></div></div></details>
<p>I injected the grid, pulled the combined image, and fed it to the model.</p>
<p>In theory, the LLM should look at the button, see it sitting in box G42, and tell Python to click G42.</p>
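<p>Turning a label like G42 back into pixels is simple arithmetic. Here is a rough sketch, assuming rows A through Z and columns 1 through 100 as described above, and aiming for the center of the cell. The function name <code>cell_center</code> and the 1620 by 913 image size are assumptions for the example.</p>

```python
import re
from typing import Tuple


def cell_center(label: str, width: int, height: int,
                rows: int = 26, cols: int = 100) -> Tuple[int, int]:
    """Convert a Battleship-style label such as 'G42' to the cell's center pixel."""
    match = re.fullmatch(r"([A-Z])(\d+)", label.strip().upper())
    if match is None:
        raise ValueError(f"not a grid label: {label!r}")
    row = ord(match.group(1)) - ord("A")  # 'A' is row 0
    col = int(match.group(2)) - 1         # columns are numbered from 1
    if row not in range(rows) or col not in range(cols):
        raise ValueError(f"label outside the grid: {label!r}")
    x = (col + 0.5) * width / cols        # half a cell in from the cell's left edge
    y = (row + 0.5) * height / rows       # half a cell down from the cell's top edge
    return round(x), round(y)


print(cell_center("G42", width=1620, height=913))  # (672, 228)
```

<p>With those numbers, each cell is about 16 pixels wide and 35 pixels tall.</p>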
<p>It worked flawlessly in my head. In practice, the LLM was completely lost.</p>
<p>It hallucinated coordinates. It guessed wildly. It would look at a button clearly resting in the middle of the screen and tell Python to click the bottom left corner.</p>
<p>Oh no. I tried changing the grid colors.</p>
<p>I tried rewriting the Python server code.</p>
<p>I tried multiple LLM prompts, from obsessively pedantic to loosey-goosey.</p>
<p>Nothing helped.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="hey-llm-wheres-waldo">Hey LLM, Where's Waldo?<a href="https://jaymaverick.github.io/blog/v2-llm-ocr-failure#hey-llm-wheres-waldo" class="hash-link" aria-label="Direct link to Hey LLM, Where's Waldo?" title="Direct link to Hey LLM, Where's Waldo?" translate="no">​</a></h2>
<p>I opened the processed image to see what went wrong. Then it hit me.
<img decoding="async" loading="lazy" src="https://jaymaverick.github.io/assets/images/testing-1a49073d9d998b7b33a448c6e5f9050f.png" width="1620" height="913" class="img_ev3q">
The ancient Xentry interface is already a crowded mess of tiny text, gray tables, and jagged pixel fonts. Squinting at it gives you a headache.</p>
<p>And I was adding a bright JavaScript grid on top. A good idea on paper turned into a mess of visual noise in practice. The grid lines sliced right through words, turning legible text into unreadable digital confetti. The model couldn't tell where the software ended and the grid began.</p>
<p>Instead of giving the LLM a map, I had overwhelmed it with a thousand maps.</p>
<p>Once again, the automation loop was broken.</p>
<p>...back to the drawing board.</p>]]></content>
        <author>
            <name>Jay Pitkänen</name>
            <uri>https://www.linkedin.com/in/jaypitkanen/</uri>
        </author>
        <category label="LLM" term="LLM"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Version 1: OpenCV, How Hard Could it Be?]]></title>
        <id>https://jaymaverick.github.io/blog/v1-opencv-failure</id>
        <link href="https://jaymaverick.github.io/blog/v1-opencv-failure"/>
        <updated>2026-05-11T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[My first idea was pretty simple.]]></summary>
        <content type="html"><![CDATA[<p>My first idea was pretty simple.</p>
<ol>
<li class="">JetKVM forwards video to my browser.</li>
<li class="">Python reads the screen and finds the buttons.</li>
<li class="">Python shows the content to the LLM.</li>
<li class="">Python tells the JetKVM what to do based on the LLM's reasoning.</li>
</ol>
<p>Simple, elegant.</p>
<p>OpenCV was the industry standard six years ago, when I was actively coding Python. Surely it would still work fine today, right?</p>
<p><em>Surely.</em></p>
<!-- -->
<p>I started writing Python code to process the image.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">import cv2</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">import numpy as np</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">import pytesseract</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">import pandas as pd</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">def extract_epc_table(image_path):</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    # 1. Load image and convert to grayscale</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    img = cv2.imread(image_path)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    # 2. Thresholding to get a binary image (black and white)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    # 3. 
Isolate horizontal and vertical lines to detect the table grid</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    # Define a kernel length (adjust based on your image resolution)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    kernel_len = np.array(img).shape[1] // 80</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    # Vertical lines (kernel is 1 px wide and kernel_len px tall)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ver_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, kernel_len))</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    vertical_lines = cv2.erode(thresh, ver_kernel, iterations=3)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    vertical_lines = cv2.dilate(vertical_lines, ver_kernel, iterations=3)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    # Horizontal lines (kernel is kernel_len px wide and 1 px tall)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    hor_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_len, 1))</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    horizontal_lines = cv2.erode(thresh, hor_kernel, iterations=3)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    horizontal_lines = cv2.dilate(horizontal_lines, hor_kernel, iterations=3)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    # Combine lines to form the complete grid mask</span><br></div><div class="token-line" 
style="color:#393A34"><span class="token plain">    table_mask = cv2.addWeighted(vertical_lines, 0.5, horizontal_lines, 0.5, 0.0)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    table_mask = cv2.threshold(table_mask, 0, 255, cv2.THRESH_BINARY)[1]</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    # 4. Find contours of the table cells</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    contours, _ = cv2.findContours(table_mask, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    # Filter out very small or massive contours that aren't cells</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    bounding_boxes = [cv2.boundingRect(c) for c in contours]</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    # Sort bounding boxes: Sort by Y coordinate primarily, then X coordinate</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    # This allows us to read rows from top-to-bottom, left-to-right</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    # Note: For complex tables, you might need a bucketing algorithm for row alignment.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    bounding_boxes = sorted(bounding_boxes, key=lambda b: (b[1], b[0]))</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">    # 5. Extract Text from Each Cell</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    data = []</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    current_row = []</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    previous_y = -1</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    row_threshold = 10 # Pixels variance allowed to consider cells on the same row</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    for x, y, w, h in bounding_boxes:</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        # Ignore bounds that are too tiny to contain text</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        if w &lt; 40 or h &lt; 20:</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            continue</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        # Crop the cell from the original grayscale image</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        cell = gray[y:y+h, x:x+w]</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        # Clean up cell image slightly for better OCR accuracy</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        cell = cv2.threshold(cell, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]</span><br></div><div class="token-line" style="color:#393A34"><span 
class="token plain">        </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        # Run Tesseract OCR (using --psm 6: Assume a single uniform block of text)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        text = pytesseract.image_to_string(cell, config='--psm 6').strip()</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        # Group cells into rows based on their Y position</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        if previous_y == -1 or abs(y - previous_y) &lt;= row_threshold:</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            current_row.append((x, text))</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        else:</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            # Sort current row by X position so columns are in order</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            current_row.sort(key=lambda item: item[0])</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            data.append([item[1] for item in current_row])</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            current_row = [(x, text)]</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        previous_y = y</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    # Append the final row</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    if 
current_row:</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        current_row.sort(key=lambda item: item[0])</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        data.append([item[1] for item in current_row])</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    # 6. Convert to Pandas DataFrame for cleaning</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    df = pd.DataFrame(data)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    return df</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"># --- Execution ---</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"># df = extract_epc_table('mercedes_epc_screenshot.png')</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"># print(df)</span><br></div></code></pre></div></div>
<p>First, convert the image to grayscale and apply a heavy binary threshold. Next, isolate the vertical and horizontal grid lines using morphology kernels, blending them together to map out the entire table grid. Finally, find the cell contours, sort them top-to-bottom, and crop out every individual box so Tesseract OCR can scrape the text inside.</p>
<p><em><strong>Even on paper, this project started sounding ridiculously messy.</strong></em></p>
<p>In reality, my garbage code made it even worse.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="letting-the-robot-do-the-hard-work">Letting The Robot Do The Hard Work<a href="https://jaymaverick.github.io/blog/v1-opencv-failure#letting-the-robot-do-the-hard-work" class="hash-link" aria-label="Direct link to Letting The Robot Do The Hard Work" title="Direct link to Letting The Robot Do The Hard Work" translate="no">​</a></h2>
<p>The Xentry software uses an ancient Windows UI. The buttons don't have clean, modern borders. They have weird gradients, tiny text, and non-standard layouts.</p>
<p>My script was either missing the buttons entirely or highlighting random blank space. Adjusting the threshold numbers felt like tuning a radio from 1940. Fix it for one screen, and it breaks on the next.</p>
<p>I was drowning in a sea of nested loops and pixel matrices. It was tedious, overcomplicated, and brittle.</p>
<p>Then it hit me.</p>
<p><em>Why am I trying to be a computer vision engineer?</em></p>
<p>I have a local Large Language Model sitting right there on my main desktop. Modern vision-language models don't just read text; they understand spatial layouts.</p>
<p>Instead of writing fifty lines of terrible OpenCV code to crop and isolate bounding boxes, I can just feed the raw screenshot straight to the model.</p>
<p>"Look at this image. Find the text that says 'Fuel System Diagnostics'. Give me its approximate X and Y coordinates on a 1920x1080 grid."</p>
<p>The LLM doesn't care about noise reduction, grayscaling, or edge contours. It just looks at the screen, understands what it sees, and hands over the coordinates.</p>
<p>Python becomes a simple mail carrier. It takes the image from the JetKVM, hands it to the LLM, gets the coordinates back, and tells the JetKVM where to click.</p>
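<p>A rough sketch of the "gets the coordinates back" part of that mail-carrier job. It assumes the model answers with a small JSON object such as <code>{"x": 640, "y": 360}</code>, possibly wrapped in chatty text. The reply format and the <code>parse_click_target</code> name are made up for illustration; they are not part of JetKVM or any LLM API.</p>

```python
import json
import re

SCREEN_W, SCREEN_H = 1920, 1080  # grid size named in the prompt above


def parse_click_target(reply: str) -> tuple[int, int]:
    """Pull an (x, y) pair out of a model reply and clamp it to the screen.

    Assumes the reply contains one flat JSON object with numeric "x" and "y"
    keys. This reply format is hypothetical.
    """
    match = re.search(r"\{[^{}]*\}", reply)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(match.group(0))
    x = int(round(float(data["x"])))
    y = int(round(float(data["y"])))
    # Clamp so a sloppy estimate can never land outside the screen.
    x = min(max(x, 0), SCREEN_W - 1)
    y = min(max(y, 0), SCREEN_H - 1)
    return x, y
```

<p>Whatever comes back still has to be treated as an estimate: the clamp only keeps the click on the screen, it does not make the guess any more accurate.</p>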
<p>No complicated math. No brittle thresholds.</p>
<p>Just pure, lazy efficiency.</p>
<p>With version 0.1 dead on arrival, version 0.2 was starting to take shape.</p>]]></content>
        <author>
            <name>Jay Pitkänen</name>
            <uri>https://www.linkedin.com/in/jaypitkanen/</uri>
        </author>
        <category label="LLM" term="LLM"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Fixing Enterprise Problems With a Dusty IBM Laptop]]></title>
        <id>https://jaymaverick.github.io/blog/dusty-ibm-laptop</id>
        <link href="https://jaymaverick.github.io/blog/dusty-ibm-laptop"/>
        <updated>2026-05-10T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Ever owned an old Mercedes?]]></summary>
        <content type="html"><![CDATA[<p>Ever owned an old Mercedes?</p>
<p>Let me tell you about a regular Tuesday evening owning an old Mercedes.</p>
<p>You're lying under the car, back against the pavement, face inches away from a hot exhaust pipe, trying to fix another "tiny thing".</p>
<p>It's dark. Your eyes are full of dirt and dust. The wrench you're holding is starting to feel real heavy. Just gotta tighten this one last bolt and then it's sauna and beer time.</p>
<p>Then your mind goes blank. <em>What torque?</em></p>
<p>Was it 50 Nm or 90 Nm? You check your notes: nothing. Google? No way, this information is way too niche. ChatGPT? Forget it. You can't trust it to be accurate.</p>
<p>Get it wrong and the bolt might snap and cause a fuel leak.</p>
<p>The Mercedes Xentry Diagnosis Software could give the exact value, but it's locked away in a laptop in your office, some 900 m away.</p>
<p><em>If only there was a way to access it remotely with voice commands...</em></p>
<!-- -->
<p>And that's where I got the idea.</p>
<p>I've been playing around with LLMs since 2023.</p>
<ol>
<li class="">
<p>My first huge milestone was to create a very rudimentary LangChain pipeline to create authoritative SEO content for Smash Digital. Interesting project, but I knew text content was on its way out.</p>
</li>
<li class="">
<p>The next milestone was local LLMs. Stable Diffusion Forge to create images. Character bibles to govern chat behavior. Local LLM coding assistants.</p>
</li>
<li class="">
<p>Time for my third milestone: a remotely accessible assistant, run by a local LLM, that physically controls my Mercedes Xentry laptop using a JetKVM interface.</p>
</li>
</ol>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-problem">The Problem<a href="https://jaymaverick.github.io/blog/dusty-ibm-laptop#the-problem" class="hash-link" aria-label="Direct link to The Problem" title="Direct link to The Problem" translate="no">​</a></h2>
<p>My (very real, very legitimate I promise) copy of Xentry is trapped on an ancient IBM laptop, on a custom Windows 10 installation.</p>
<ul>
<li class="">It CANNOT access the internet, otherwise the Windows installation will update itself and implode.</li>
<li class="">It CANNOT be modified with any data crawling software due to the very fragile software installation.</li>
<li class="">The data is trapped in unstructured PDFs and tables, within ancient proprietary software.</li>
</ul>
<p>The only way to get any repair guides, part numbers, circuit board diagrams, etc. is for a physical user to access the software and slowly navigate to the correct location.</p>
<p>This is not only awkward to do in a garage environment, but risky as well. Damaging the laptop would be catastrophic.</p>
<p>No. We need a way to remotely access the database device in a way that's read-only and scalable.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="possible-solutions">Possible Solutions<a href="https://jaymaverick.github.io/blog/dusty-ibm-laptop#possible-solutions" class="hash-link" aria-label="Direct link to Possible Solutions" title="Direct link to Possible Solutions" translate="no">​</a></h2>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="approach-1-manual-bulk-exporting">Approach 1: Manual Bulk Exporting<a href="https://jaymaverick.github.io/blog/dusty-ibm-laptop#approach-1-manual-bulk-exporting" class="hash-link" aria-label="Direct link to Approach 1: Manual Bulk Exporting" title="Direct link to Approach 1: Manual Bulk Exporting" translate="no">​</a></h4>
<p>One option is to click through every single catalog page, circuit diagram, and repair guide. Print them one by one to local PDF files. Then pipe those documents into a modern database.</p>
<p>It is theoretically possible. It is also completely impractical.</p>
<p>Systems like Xentry are locked down to prevent bulk scraping. Manually exporting tens of thousands of deeply nested data sheets would take months.</p>
<p>(And it's boring too.)</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="approach-2-jetkvm-and-semantic-routing">Approach 2: JetKVM and Semantic Routing<a href="https://jaymaverick.github.io/blog/dusty-ibm-laptop#approach-2-jetkvm-and-semantic-routing" class="hash-link" aria-label="Direct link to Approach 2: JetKVM and Semantic Routing" title="Direct link to Approach 2: JetKVM and Semantic Routing" translate="no">​</a></h4>
<p>Instead of messing with the fragile laptop, we can build an automation layer from the outside. We use a hardware device called a JetKVM.</p>
<p>You plug the JetKVM directly into the laptop's video and USB ports. Now you can hijack the raw screen output and simulate keyboard and mouse inputs.</p>
<p>The laptop has no idea it is being automated. To the operating system, it just looks like a human using a monitor and a mouse.</p>
<p>The workflow is a simple, read-only loop:</p>
<ul>
<li class="">
<p><strong>Video Capture:</strong> Python grabs the raw video feed from the JetKVM. A lightweight local OCR engine maps out all text elements and their exact coordinates.</p>
</li>
<li class="">
<p><strong>The Reasoning:</strong> A local LLM acts as a router. It reads the unstructured text from the OCR and decides which menu item matches your voice request.</p>
</li>
<li class="">
<p><strong>Execution:</strong> Once the LLM selects the correct text, Python looks up the coordinates. It tells the JetKVM to fire a pixel-perfect mouse click.</p>
</li>
</ul>
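<p>The last two steps of that loop can be sketched in a few lines. This is only an illustration: the <code>click_point</code> helper, the 0.8 similarity cutoff, and the sample data are made up, and the hit format simply mirrors the (box, text, confidence) triples that EasyOCR's <code>readtext</code> returns. No OCR library is needed to run it.</p>

```python
from difflib import SequenceMatcher

# One OCR hit: (four corner points, text, confidence). Plain data here,
# shaped like EasyOCR's readtext() output (an assumption for this sketch).
OcrHit = tuple[list[list[int]], str, float]


def click_point(hits: list[OcrHit], wanted: str, min_ratio: float = 0.8) -> tuple[int, int]:
    """Return the center pixel of the OCR hit whose text best matches `wanted`."""
    if not hits:
        raise LookupError("OCR returned no text")

    def score(hit: OcrHit) -> float:
        # Fuzzy, case-insensitive similarity so small OCR typos still match.
        return SequenceMatcher(None, hit[1].lower(), wanted.lower()).ratio()

    best = max(hits, key=score)
    if min_ratio > score(best):
        raise LookupError(f"nothing on screen looks like {wanted!r}")
    corners = best[0]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    # Click the middle of the bounding box, not its top-left corner.
    return (min(xs) + max(xs)) // 2, (min(ys) + max(ys)) // 2


hits = [
    ([[100, 200], [300, 200], [300, 230], [100, 230]], "Fuel System Diagnostics", 0.97),
    ([[100, 250], [260, 250], [260, 280], [100, 280]], "Engine Control", 0.95),
]
target = click_point(hits, "fuel system diagnostics")
```

<p>The point of the split is that the LLM only ever picks a piece of text. The pixel position comes from the OCR data, so the model never has to guess a coordinate.</p>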
<p>It's a safe, scalable, and completely isolated bridge between a fragile legacy database and the modern world.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="from-a-garage-hack-to-the-fortune-500">From a Garage Hack to the Fortune 500<a href="https://jaymaverick.github.io/blog/dusty-ibm-laptop#from-a-garage-hack-to-the-fortune-500" class="hash-link" aria-label="Direct link to From a Garage Hack to the Fortune 500" title="Direct link to From a Garage Hack to the Fortune 500" translate="no">​</a></h2>
<p>A dusty IBM laptop sitting in a closet with a KVM taped to it sounds like a joke.</p>
<p>But bear with me for a moment.</p>
<p>This exact nightmare exists inside almost every major enterprise on the planet.</p>
<p>Walk into a multibillion-dollar insurance company, a massive legal firm, or a global bank. You won't just find a closet laptop. You will find entire data centers hosting "sacred" legacy software. We are talking about 40-year-old COBOL banking systems and fragile terminal databases from the 90s.</p>
<p>These massive companies face the exact same constraints I do with my garage setup.</p>
<ul>
<li class="">
<p><strong>The Implosion Risk:</strong> The software is too critical to turn off. It is too fragile to update. Rewriting the code from scratch would cost tens of millions of dollars and take years.</p>
</li>
<li class="">
<p><strong>The Isolation Trap:</strong> You cannot install modern APIs or data crawlers directly onto these machines without risking a catastrophic crash.</p>
</li>
<li class="">
<p><strong>The Human Bottleneck:</strong> Highly paid analysts spend thousands of hours doing exactly what I did in my garage. They manually type, click, scroll, and copy-paste data from an ancient screen into a modern spreadsheet.</p>
</li>
</ul>
<p>The problem is very real.</p>
<p>The solution is very real. I built a prototype of it in my living room.</p>
<p>In this blog, I'll share how.</p>
<p>-Jay</p>]]></content>
        <author>
            <name>Jay Pitkänen</name>
            <uri>https://www.linkedin.com/in/jaypitkanen/</uri>
        </author>
        <category label="LLM" term="LLM"/>
    </entry>
</feed>