Jay Pitkänen Blog

Version 3: Back to OCR - This Time For Real

2026-05-13T00:00:00.000Z

The coordinate grid was DOA. Back to square one.

Python-powered OCR would be the only way forward.

I have to get this right. If you are working on a car, guessing doesn't cut it. Accuracy is everything. Get a torque spec wrong, and you snap a bolt inside an engine block.

But instead of trying to become a computer vision engineer overnight, I decided to use an established solution created by someone way smarter than me: EasyOCR.

Why reinvent the wheel, right?

EasyOCR is a proven, open-source Python library built specifically for reading text in images. No messy algorithms, no manual contour filtering. You just feed it an image, and it hands back the text along with the exact pixel coordinates.

Suddenly, the pipeline became incredibly clean.

The JetKVM grabs the screenshot.
Python hands the screenshot to EasyOCR, which extracts the text and maps out the exact pixel coordinates for every word on the screen.
The LLM acts purely as the brain, reading the text data to decide what to do.
Python commands the JetKVM to click the exact coordinates EasyOCR found.
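The middle two steps can be sketched in a few lines of Python. EasyOCR's readtext call returns a list of (bounding box, text, confidence) tuples, where the box is four corner points. The helper names and the 0.5 confidence cutoff below are placeholders of mine, and the sample data is hardcoded so the snippet runs without EasyOCR installed.

```python
# Sketch only: helper names, the 0.5 cutoff, and the sample data are assumptions.
# In a real run, `results` would come from EasyOCR:
#   import easyocr
#   results = easyocr.Reader(['en']).readtext('screenshot.png')

def box_center(bbox):
    """Return the (x, y) center of a four-corner bounding box."""
    xs = [point[0] for point in bbox]
    ys = [point[1] for point in bbox]
    return (int(sum(xs) / len(xs)), int(sum(ys) / len(ys)))

def to_click_targets(results, min_confidence=0.5):
    """Map each confidently read text to the pixel where a click should land."""
    targets = {}
    for bbox, text, confidence in results:
        if confidence >= min_confidence:
            targets[text] = box_center(bbox)
    return targets

# Hardcoded sample in the same shape EasyOCR returns
sample_results = [
    ([[100, 40], [300, 40], [300, 80], [100, 80]], "Fuel System Diagnostics", 0.93),
    ([[100, 120], [220, 120], [220, 150], [100, 150]], "Engine", 0.98),
    ([[5, 5], [20, 5], [20, 12], [5, 12]], "~|", 0.12),  # low-confidence noise, dropped
]

targets = to_click_targets(sample_results)
menu_text_for_llm = list(targets)  # the LLM only ever sees the text
print(targets["Fuel System Diagnostics"])
```

The LLM gets menu_text_for_llm and nothing else. The coordinates stay on the Python side, so the model never has to guess a pixel. One caveat of this sketch: if the same text appears twice on screen, the later box overwrites the earlier one.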

For the first time in this entire project, I looked at the workflow and smiled. This might actually work.

This is the exact blueprint I used for my tech demo. You can read the full breakdown here.

Go check it out.

Oh and hire me already, wtf.

Version 2: Blinded By The Coordinate

2026-05-12T00:00:00.000Z

The LLM-powered vision pipeline was starting to take shape.

The JetKVM grabs the screenshot.

A quick JavaScript snippet sends that image over to the LLM. The LLM reads the screen and tells Python exactly where to click.

[JetKVM Laptop Screen]
       │
       ▼ (WebRTC Stream)
[Your Browser Frontend] ──(Captures Frame)──> [Converts to Base64 Image String]
                                                       │
                                                       ▼ (Fetch POST Request)
                                              [Ollama / Local Server]
                                                       │ (Runs Ministral 8B)
                                                       ▼
                                              [Structured JSON Data Out]

Clean. Simple.

But there was one glaring flaw. How does the LLM describe a location? It can't just say "click the top right button." Python needs exact pixels.

The LLM needed a map.

I wrote some JavaScript to overlay a bright grid coordinate system directly onto the screenshot. Like a game of Battleship. Rows A through Z, columns 1 through 100.
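To make the Battleship idea concrete, here is a rough Python sketch of how a grid label such as C42 could be turned back into a pixel. The 1920x1080 screen size and the letter-plus-number label format are assumptions for illustration. This is not the original JavaScript overlay.

```python
import string

# Assumed values for illustration: 26 rows (A-Z), 100 columns, 1920x1080 screen.
ROWS = string.ascii_uppercase
COLS = 100
SCREEN_W, SCREEN_H = 1920, 1080

def cell_to_pixel(label):
    """Convert a label like 'C42' to the pixel at the center of that cell."""
    row_letter, col_number = label[0].upper(), int(label[1:])
    if row_letter not in ROWS or not 1 <= col_number <= COLS:
        raise ValueError(f"Label outside the grid: {label!r}")
    cell_w = SCREEN_W / COLS        # 19.2 px wide
    cell_h = SCREEN_H / len(ROWS)   # roughly 41.5 px tall
    x = int((col_number - 0.5) * cell_w)
    y = int((ROWS.index(row_letter) + 0.5) * cell_h)
    return x, y

print(cell_to_pixel("C42"))
```

With these numbers each cell is about 19 by 41 pixels, so even a perfectly read label only narrows the click down to a box that size.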


Version 1: OpenCV, How Hard Could it Be?

2026-05-11T00:00:00.000Z

My first idea was pretty simple.

JetKVM forwards video to my browser.
Python reads the screen and finds the buttons.
Python shows content to LLM.
Python tells JetKVM what to do based on LLM reasoning.

Simple, elegant.

OpenCV was the industry standard six years ago, when I was actively coding Python. Surely it would work fine here today, right?

Surely.

I started writing Python code to process the image.

import cv2
import numpy as np
import pytesseract
import pandas as pd

def extract_epc_table(image_path):
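    # 1. Load image and convert to grayscale
    img = cv2.imread(image_path)
    if img is None:
        # cv2.imread returns None instead of raising when the file can't be read
        raise FileNotFoundError(f"Could not read image: {image_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)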
    
    # 2. Thresholding to get a binary image (black and white)
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    
    # 3. Isolate horizontal and vertical lines to detect the table grid
    # Kernel length scales with image width (adjust based on your image resolution)
    kernel_len = img.shape[1] // 80
    
    # Vertical lines (tall kernel, 1 pixel wide)
    ver_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, kernel_len))
    vertical_lines = cv2.erode(thresh, ver_kernel, iterations=3)
    vertical_lines = cv2.dilate(vertical_lines, ver_kernel, iterations=3)
    
    # Horizontal lines (wide kernel, 1 pixel tall)
    hor_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_len, 1))
    horizontal_lines = cv2.erode(thresh, hor_kernel, iterations=3)
    horizontal_lines = cv2.dilate(horizontal_lines, hor_kernel, iterations=3)
    
    # Combine lines to form the complete grid mask
    table_mask = cv2.addWeighted(vertical_lines, 0.5, horizontal_lines, 0.5, 0.0)
    table_mask = cv2.threshold(table_mask, 0, 255, cv2.THRESH_BINARY)[1]
    
    # 4. Find contours of the table cells
    contours, _ = cv2.findContours(table_mask, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    
    # Get a bounding box for every contour (boxes too small for text are skipped in step 5)
    bounding_boxes = [cv2.boundingRect(c) for c in contours]
    
    # Sort bounding boxes: Sort by Y coordinate primarily, then X coordinate
    # This allows us to read rows from top-to-bottom, left-to-right
    # Note: For complex tables, you might need a bucketing algorithm for row alignment.
    bounding_boxes = sorted(bounding_boxes, key=lambda b: (b[1], b[0]))
    
    # 5. Extract Text from Each Cell
    data = []
    current_row = []
    previous_y = -1
    row_threshold = 10 # Pixels variance allowed to consider cells on the same row
    
    for x, y, w, h in bounding_boxes:
        # Ignore bounds that are too tiny to contain text
        if w < 40 or h < 20:
            continue
            
        # Crop the cell from the original grayscale image
        cell = gray[y:y+h, x:x+w]
        
        # Clean up cell image slightly for better OCR accuracy
        cell = cv2.threshold(cell, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
        
        # Run Tesseract OCR (using --psm 6: Assume a single uniform block of text)
        text = pytesseract.image_to_string(cell, config='--psm 6').strip()
        
        # Group cells into rows based on their Y position
        if previous_y == -1 or abs(y - previous_y) <= row_threshold:
            current_row.append((x, text))
        else:
            # Sort current row by X position so columns are in order
            current_row.sort(key=lambda item: item[0])
            data.append([item[1] for item in current_row])
            current_row = [(x, text)]
        previous_y = y
        
    # Append the final row
    if current_row:
        current_row.sort(key=lambda item: item[0])
        data.append([item[1] for item in current_row])
        
    # 6. Convert to Pandas DataFrame for cleaning
    df = pd.DataFrame(data)
    return df

# --- Execution ---
# df = extract_epc_table('mercedes_epc_screenshot.png')
# print(df)

First, convert the image to grayscale and apply a heavy binary threshold. Next, isolate the vertical and horizontal grid lines using morphology kernels, blending them together to map out the entire table grid. Finally, find the cell contours, sort them top-to-bottom, and crop out every individual box so Tesseract OCR can scrape the text inside.

Even on paper, this project started sounding ridiculously messy.

In reality, my garbage code made it even worse.

Letting The Robot Do The Hard Work

The Xentry software uses an ancient Windows UI. The buttons don't have clean, modern borders. They have weird gradients, tiny text, and non-standard layouts.

My script was either missing the buttons entirely or highlighting random blank space. Adjusting the threshold numbers felt like tuning a radio from 1940. Fix it for one screen, and it breaks on the next.

I was drowning in a sea of nested loops and pixel matrices. It was tedious, overcomplicated, and brittle.

Then it hit me.

Why am I trying to be a computer vision engineer?

I have a local Large Language Model sitting right there on my main desktop. Modern vision-language models don't just read text; they understand spatial layouts.

Instead of writing fifty lines of terrible OpenCV code to crop and isolate bounding boxes, I can just feed the raw screenshot straight to the model.

"Look at this image. Find the text that says 'Fuel System Diagnostics'. Give me its approximate X and Y coordinates on a 1920x1080 grid."

The LLM doesn't care about noise reduction, gray scaling, or edge contours. It just looks at the screen, understands what it sees, and hands over the coordinates.

Python becomes a simple mail carrier. It takes the image from the JetKVM, hands it to the LLM, gets the coordinates back, and tells the JetKVM where to click.
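Here is roughly what that mail carrier could look like, assuming the model is served by a local Ollama instance on its default port. The model tag, the function names, and the JSON reply shape are placeholders of mine, so treat this as a sketch rather than the code I actually ran.

```python
import base64
import json
import urllib.request

# Assumptions: Ollama is listening on its default local port, and MODEL is a
# placeholder tag for whichever vision-capable model you have pulled.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "your-vision-model"

def build_payload(image_bytes, target_text, width=1920, height=1080):
    """Build the JSON body: the prompt plus the screenshot as a base64 string."""
    prompt = (
        f"Look at this image. Find the text that says '{target_text}'. "
        f"Give me its approximate X and Y coordinates on a {width}x{height} grid. "
        'Reply as JSON like {"x": 0, "y": 0}.'
    )
    return {
        "model": MODEL,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "format": "json",   # ask Ollama for a JSON-formatted reply
        "stream": False,    # one complete response instead of chunks
    }

def ask_for_coordinates(image_bytes, target_text):
    """POST the payload and parse the model's reply (needs a running server)."""
    body = json.dumps(build_payload(image_bytes, target_text)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.loads(response.read())
    return json.loads(reply["response"])  # the model's own JSON text, parsed
```

Only build_payload can be checked offline. ask_for_coordinates needs the server running, and nothing here guarantees the coordinates that come back are any good.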

No complicated math. No brittle thresholds.

Just pure, lazy efficiency.

With version 1 dead on arrival, version 2 was starting to take shape.

Fixing Enterprise Problems With a Dusty IBM Laptop

2026-05-10T00:00:00.000Z

Ever owned an old Mercedes?

Let me tell you about a regular Tuesday evening owning an old Mercedes.

You're lying under the car, back against the pavement, face inches away from a hot exhaust pipe, trying to fix another "tiny thing".

It's dark. Your eyes are full of dirt and dust. The wrench you're holding is starting to feel real heavy. Just gotta tighten this one last bolt and then it's sauna and beer time.

Then your mind goes blank. What torque?

Was it 50Nm or 90Nm? You check your notes - nothing. Google? No way, this information is way too niche. ChatGPT? Forget it. You can't trust it to be accurate.

Get it wrong and the bolt might snap and cause a fuel leak.

The Mercedes Xentry Diagnosis Software could give the exact value, but it's locked away on a laptop in your office, some 900 m away.

If only there was a way to access it remotely with voice commands...

And that's where I got the idea.

I've been playing around with LLMs since 2023.

My first huge milestone was building a very rudimentary LangChain pipeline to generate authoritative SEO content for Smash Digital. Interesting project, but I knew text content was on its way out.
The next milestone was local LLMs. Stable Diffusion Forge to create images. Character bibles to govern chat behavior. Local LLM coding assistants.
Time for my third milestone: a remotely accessible assistant, run by a local LLM, that physically controls my Mercedes Xentry laptop through a JetKVM interface.

The Problem

My (very real, very legitimate I promise) copy of Xentry is trapped on an ancient IBM laptop, on a custom Windows 10 installation.

It CANNOT access the internet, otherwise the Windows installation will update itself and implode.
It CANNOT be modified with any data crawling software, because the software installation is very fragile.
The data is trapped in unstructured PDFs and tables, within ancient proprietary software.

The only way to get any repair guides, part numbers, circuit board diagrams, etc. is for a physical user to access the software and slowly navigate to the correct location.

This is not only awkward to do in a garage environment, but risky as well. Damaging the laptop would be catastrophic.

No. We need a way to remotely access the database device in a way that's read-only and scalable.

Possible Solutions

Approach 1: Manual Bulk Exporting

One option is to click through every single catalog page, circuit diagram, and repair guide. Print them one by one to local PDF files. Then pipe those documents into a modern database.

It is theoretically possible. It is also completely impractical.

Systems like Xentry are locked down to prevent bulk scraping. Manually exporting tens of thousands of deeply nested data sheets would take months.

(And it's boring too.)

Approach 2: JetKVM and Semantic Routing

Instead of messing with the fragile laptop, we can build an automation layer from the outside. We use a hardware device called a JetKVM.

You plug the JetKVM directly into the laptop's video and USB ports. Now you can hijack the raw screen output and simulate keyboard and mouse inputs.

The laptop has no idea it is being automated. To the operating system, it just looks like a human using a monitor and a mouse.

The workflow is a simple, read-only loop:

Video Capture: Python grabs the raw video feed from the JetKVM. A lightweight local OCR engine maps out all text elements and their exact coordinates.
The Reasoning: A local LLM acts as a router. It reads the unstructured text from the OCR and decides which menu item matches your voice request.
Execution: Once the LLM selects the correct text, Python looks up the coordinates. It tells the JetKVM to fire a pixel-perfect mouse click.
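The Execution step hinges on matching whatever the LLM answers to a piece of OCR text. Below is a minimal sketch, assuming the OCR output has already been reduced to a text-to-coordinates dictionary and that the LLM's answer may differ slightly from what the OCR read. The fuzzy matching with difflib and the 0.8 cutoff are my own choices for illustration, not a fixed part of the design.

```python
import difflib

def resolve_click(llm_choice, ocr_targets, cutoff=0.8):
    """Return (matched_text, (x, y)) for the LLM's choice, or None if nothing is close."""
    # Compare case-insensitively, but keep the original OCR text for the lookup
    lowered = {text.lower(): text for text in ocr_targets}
    matches = difflib.get_close_matches(
        llm_choice.lower(), list(lowered), n=1, cutoff=cutoff
    )
    if not matches:
        return None  # safer to click nothing than to click the wrong thing
    original = lowered[matches[0]]
    return original, ocr_targets[original]

# Hypothetical OCR output for one Xentry menu screen
ocr_targets = {
    "Fuel System Diagnostics": (200, 60),
    "Engine": (160, 135),
    "Transmission": (180, 210),
}

print(resolve_click("fuel system diagnostic", ocr_targets))
print(resolve_click("Brake fluid", ocr_targets))
```

Returning None on a weak match keeps the loop read-only in spirit: if the router is unsure, nothing gets clicked.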

It's a safe, scalable, and completely isolated bridge between a fragile legacy database and the modern world.

From a Garage Hack to the Fortune 500

A dusty IBM laptop sitting in a closet with a KVM taped to it sounds like a joke.

But bear with me for a moment.

This exact nightmare exists inside almost every major enterprise on the planet.

Walk into a multi-billion dollar insurance company, a massive legal firm, or a global bank. You won't just find a closet laptop. You will find entire data centers hosting "sacred" legacy software. We are talking about 40-year-old COBOL banking systems and fragile terminal databases from the 90s.

These massive companies face the exact same constraints I do with my garage setup.

The Implosion Risk: The software is too critical to turn off. It is too fragile to update. Rewriting the code from scratch would cost tens of millions of dollars and take years.
The Isolation Trap: You cannot install modern APIs or data crawlers directly onto these machines without risking a catastrophic crash.
The Human Bottleneck: Highly paid analysts spend thousands of hours doing exactly what I did in my garage. They manually type, click, scroll, and copy-paste data from an ancient screen into a modern spreadsheet.

The problem is very real.

The solution is very real. I built a prototype of it in my living room.

In this blog, I'll share how.

-Jay
