Build Log #003

How I Built AIHRPilot in a Weekend

A Portfolio Executive's walkthrough of building an HR policy engine without being a developer. The full stack, the exact prompts, the code, and the lessons.

Yuri Kruman

3x CHRO · AI Trainer (OpenAI, Meta, Microsoft) · Jun 2026

The 30-Second Version

AIHRPilot is an HR policy intelligence engine that classifies inbound HR tickets, answers ~80% of them automatically from a company's own policy corpus, and routes the rest to the right human with a pre-drafted reply.

The Problem

HR teams burn 30-40 hours a week answering the same 8-12 recurring questions.

The Stack

Flask + scikit-learn (TF-IDF) + Claude API + a Lattice-inspired UI.

What It Doesn't Need

No vector database. No fine-tuning. No MLOps team.

Build Time

One weekend for v1. Three weeks to production-grade.

If you are a non-developer executive who has ever thought "I wish something like this existed for my team," this walkthrough is for you. Read it like a build log, not a tutorial. The point is not that you should clone AIHRPilot; the point is that you should build the equivalent for your recurring pain.

Part 1

Why This Tool, and Why in a Weekend

I had been doing fractional CHRO work for a Fortune 500 client with a ~60-person HR team on Lattice. Every quarter we ran the same post-mortem on their ticketing metrics. Every quarter the same pattern emerged:

~200

inbound policy tickets per week across regions

~80%

were variations on the same 8-12 questions

10-15 min

per ticket: read, verify, draft reply, follow up

30-40 hrs

of senior HRBP time burned weekly on recurrence
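A quick back-of-the-envelope check on these numbers, as a sketch that treats the approximate figures as exact and assumes every ticket takes the full 10-15 minutes. The recurring 80% alone works out to roughly 27-40 hours a week, and all 200 tickets to roughly 33-50. The variable names are illustrative only.

```python
# Figures taken from the ticketing post-mortem above (all approximate).
TICKETS_PER_WEEK = 200
RECURRING_PERCENT = 80
MINUTES_PER_TICKET = (10, 15)  # low and high estimate per ticket


def hours_per_week(tickets, minutes_each):
    """Convert a weekly ticket count and per-ticket minutes into hours."""
    return tickets * minutes_each / 60


recurring_tickets = TICKETS_PER_WEEK * RECURRING_PERCENT // 100
recurring_hours = tuple(
    hours_per_week(recurring_tickets, m) for m in MINUTES_PER_TICKET
)
total_hours = tuple(
    hours_per_week(TICKETS_PER_WEEK, m) for m in MINUTES_PER_TICKET
)

print(f"Recurring tickets: {recurring_hours[0]:.0f}-{recurring_hours[1]:.0f} hours/week")
print(f"All tickets: {total_hours[0]:.0f}-{total_hours[1]:.0f} hours/week")
```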

The company's first instinct was classic: "Let's buy an AI HR copilot." They had demos lined up with four vendors, each quoting $60K-$180K ARR for a black-box tool that would still require custom policy ingestion and still would not integrate with Lattice the way they wanted.

My instinct was different. I'd seen enough vendor demos to know two things:

1 Nothing the vendors were offering was meaningfully harder than what Claude + a retrieval layer could already do.
2 The real work was not the model. It was the policy corpus, the classification taxonomy, and the UX that HR actually wanted to use. Vendors solve the model; the client has to solve the rest regardless.

So I proposed a two-week spike: let me build a prototype over a weekend, test it against the last 500 tickets, and if the accuracy was acceptable we'd deploy a v1 inside their existing HR workflow. If it failed, they could go back to vendor shopping two weeks later with better requirements in hand.

The Single Most Important Design Call

Augment, Don't Replace.

I was not replacing Lattice. I was building a layer on top of Lattice. Every ticket still lived in the system of record; AIHRPilot was the intelligence layer that read, classified, drafted, and routed. Replacing a system of record is an 18-month change-management project. Augmenting one is a weekend build.

Part 2

The Stack (and Why Each Piece)

Here is why I chose each layer. If you're non-technical, the "why" matters more than the "what."

Language

Python

Largest ecosystem for NLP, and what the AI labs use. If you're building anything with machine learning or LLM APIs, Python is the default and the right choice. Most libraries, tutorials, and Stack Overflow answers in this space assume Python.

Web Framework

Flask

Simplest possible Python web app. Avoid Django overhead. For an internal tool that serves 60 users, Flask gives you routing, templating and nothing else. That's the point. You don't need an ORM, admin panel, or migration system at v1.

Retrieval

scikit-learn TF-IDF

Works at this corpus size. No vector DB needed. Everybody reaches for Pinecone, Weaviate, or Chroma at the start. Don't.

Rule of thumb: Under ~500 pages? TF-IDF is fine. Over 500? Move to embeddings + vector DB. Over ~50,000? Hire a real ML engineer.
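To make the "no vector DB" point concrete, here is a minimal pure-Python sketch of what TF-IDF retrieval does at this corpus size. It is a stand-in for illustration, not the build's code: the build used scikit-learn's TfidfVectorizer, and this version skips bigrams and stop-word filtering. The function names and the sample passages in the usage note are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def normalize(vec):
    """Scale a sparse vector to unit length so dot products are cosines."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec


def build_index(chunks):
    """Return (idf, vectors): one unit-length TF-IDF vector per chunk."""
    docs = [Counter(tokenize(c)) for c in chunks]
    n = len(docs)
    doc_freq = Counter(term for d in docs for term in d)
    # Smoothed inverse document frequency: rarer terms weigh more.
    idf = {t: math.log((1 + n) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = [normalize({t: tf * idf[t] for t, tf in d.items()}) for d in docs]
    return idf, vectors


def retrieve(question, chunks, idf, vectors, k=5):
    """Return up to k (chunk, score) pairs ranked by cosine similarity."""
    counts = Counter(t for t in tokenize(question) if t in idf)
    query = normalize({t: tf * idf[t] for t, tf in counts.items()})
    scores = [sum(w * v.get(t, 0.0) for t, w in query.items()) for v in vectors]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [(chunks[i], scores[i]) for i in ranked[:k] if scores[i] > 0]
```

Usage: call `idf, vectors = build_index(chunks)` once over the policy passages, then `retrieve("How much parental leave do I get?", chunks, idf, vectors)` per ticket. A few hundred pages fit comfortably in memory, which is why nothing heavier is needed.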

Reasoning

Claude API (Sonnet)

Best reasoning/writing quality at this price point. RAG gives you 90%+ of the benefit of fine-tuning with 10% of the engineering overhead. The policy corpus lives in a folder; Claude reads it at query time.

Why not fine-tuning? You lose general reasoning quality, gain marginal domain accuracy, and introduce a training/evaluation/retraining loop you do not want to maintain.

UI

HTML + Tailwind + HTMX

No React, no framework war. Lattice-inspired look. HTMX lets you build interactive UIs with server-rendered HTML and ~20 lines of JavaScript. A non-developer can actually read and modify HTMX code; React code requires you to know React. For this class of tool, always choose the simpler stack.

Hosting

Render + Cloudflare

~$20/month combined. Render handles the Python backend on a $7/month hobby plan. Cloudflare provides CDN, DDoS protection and SSL for free. Total cost to keep the whole system running at a mid-sized company's ticket volume: ~$40/month including Claude API calls.

Part 3

The Six-Phase Build Sequence

Each phase takes 2-10 hours. Run them in order; don't try to parallelize until Phase 4.

Ingest → Retrieve → Reason → Route → UI → Deploy

Phase 1 (4-6 hours)

Corpus Ingestion

Get every piece of HR policy content into a single, searchable format. I asked the client for everything they considered "authoritative policy": the PDF handbook, regional addenda, benefits summary plan descriptions, code of conduct, equity plan doc. About 180 pages total across 40 files.

The ingestion pipeline:

1 Extract text from PDFs (pypdf or pdfplumber)
2 Chunk each document into ~500-word passages with 50-word overlap
3 Tag each chunk with source document, section heading and effective date
4 Store as a JSONL file for Phase 2

Exact Prompt Used

"Write a Python script that reads all PDF files in a folder, extracts the text, chunks each document into 500-word passages with 50-word overlap, and outputs a JSONL file with fields: id, source_doc, section_heading, effective_date, text. Use pdfplumber. Preserve section headings by detecting lines that are all caps or bold in the source PDF."
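The chunking and JSONL steps that this prompt asks for can be sketched as follows. This is a minimal illustration, not the script Claude actually produced: it assumes the text has already been extracted from the PDF, and the `id` format and function names are hypothetical.

```python
import json


def chunk_words(text, size=500, overlap=50):
    """Split text into passages of up to `size` words.

    Each passage starts `size - overlap` words after the previous one,
    so neighbouring passages share `overlap` words.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last passage already reaches the end of the text
    return chunks


def build_records(source_doc, section_heading, effective_date, text):
    """Tag every chunk with the metadata fields listed in the prompt."""
    return [
        {
            "id": f"{source_doc}-{i}",  # hypothetical id scheme
            "source_doc": source_doc,
            "section_heading": section_heading,
            "effective_date": effective_date,
            "text": chunk,
        }
        for i, chunk in enumerate(chunk_words(text))
    ]


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)
```

The overlap matters because a policy sentence that straddles a chunk boundary would otherwise be split in two and match neither half well at retrieval time.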

Don't skip this: The single biggest mistake non-developers make is skipping section-heading preservation. You need it for Phase 5 (citations).
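As a rough sketch of what heading preservation can look like, the function below treats short all-caps lines as section headings and groups the following text under them. It covers only the all-caps half of the heuristic in the prompt (bold detection needs font data from the PDF library), and the length limits and the "UNTITLED" fallback are assumptions for illustration.

```python
def is_heading(line):
    """Heuristic: a short line whose letters are all upper case."""
    letters = [ch for ch in line if ch.isalpha()]
    return len(letters) >= 3 and line == line.upper() and len(line.split()) <= 10


def split_by_headings(lines):
    """Group body text under the most recent heading.

    Returns a list of (heading, body_text) tuples in document order.
    """
    sections = []
    heading = "UNTITLED"  # hypothetical label for text before any heading
    body = []
    for line in lines:
        stripped = line.strip()
        if is_heading(stripped):
            if body:
                sections.append((heading, " ".join(body)))
            heading, body = stripped, []
        elif stripped:
            body.append(stripped)
    if body:
        sections.append((heading, " ".join(body)))
    return sections
```

Each chunk then carries its heading forward as `section_heading`, which is what lets a draft reply point to a specific policy section instead of a whole document.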

Phase 3 (4-6 hours)

Claude Reasoning Layer

Given a question and the retrieved passages, generate: (a) a classification, (b) a confidence score, (c) a draft reply, and (d) a citation to the specific policy passage used.

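reasoning.py

import json

from anthropic import Anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = Anthropic()

# Holds the taxonomy, confidence rules, style guide and refusal triggers
# described below; the full text is not reproduced in this log.
SYSTEM_PROMPT = "..."


def parse_json(text):
    # Assumes the model returns a bare JSON object; raises ValueError otherwise.
    return json.loads(text)


def answer_ticket(question, retrieved_chunks):
    # Label each passage with its source so the model can cite it.
    context = "\n\n".join(
        f"[Source: {c['source_doc']}, §{c['section_heading']}]\n{c['text']}"
        for c in retrieved_chunks
    )

    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1000,
        system=SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": f"""Policy context:
{context}

Employee question:
{question}

Return JSON: category, confidence,
draft_reply, citation.""",
        }],
    )
    return parse_json(response.content[0].text)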

Where 70% of the work lives: the system prompt

● The taxonomy of 8-12 question categories (clustered from the last 500 tickets with Claude's help)
● Rules for confidence scoring: high if policy directly answers; medium if partial; low if silent
● Reply style guide: professional, warm, concise, always cite policy section, always offer to escalate
● Refusal triggers: discrimination, harassment, legal liability → route straight to a human
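Downstream of the system prompt, the confidence and refusal rules have to turn into a routing decision. A minimal sketch of that step, assuming the model returns a 0-1 confidence and flags refusal-trigger tickets through its category. The thresholds, route names, and category labels are hypothetical, not values from the original build.

```python
# Hypothetical category labels for the refusal triggers listed above.
SENSITIVE_CATEGORIES = {"discrimination", "harassment", "legal_liability"}


def route(result, auto_threshold=0.85, review_threshold=0.5):
    """Map a model result to one of three routes.

    auto_reply   - policy answers the question directly; send the draft
    human_review - partial answer; an HRBP edits the draft before sending
    escalate     - policy is silent or the topic is sensitive; human only
    """
    if result["category"] in SENSITIVE_CATEGORIES:
        return "escalate"  # never auto-answer a refusal-trigger ticket
    confidence = float(result["confidence"])
    if confidence >= auto_threshold:
        return "auto_reply"
    if confidence >= review_threshold:
        return "human_review"
    return "escalate"
```

Checking the sensitive categories before the confidence score is deliberate: a highly confident answer to a harassment question is still the wrong thing to send automatically.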

Exact Prompt Used

"I need a system prompt for Claude that will classify inbound HR tickets into one of [list your categories], assign a confidence score, generate a draft reply in a warm but professional tone citing the specific policy section, and refuse to answer any ticket involving potential discrimination, harassment, or legal liability by routing to a human. The system prompt should be self-contained and explicit about all three outputs."

Part 4

What I'd Do Differently Today

1 Start with Claude Projects for the reasoning layer

I would move to direct API calls only once the prompt was stable. Projects lets you iterate on the system prompt and corpus in the same interface; switch to the API when you're ready to embed the tool in the workflow.

2 Use Cursor or Claude Code from Phase 1

The ability to iterate on code, tests and deployment in one environment is a 3x speedup over the workflow I actually used (copy-pasting prompts into chat).

3 Instrument the accuracy log from day one

Not Phase 6. The data from the first two weeks of real usage is the single most valuable thing you get out of the build. Don't lose it.
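One possible shape for that log, as a minimal sketch. The field names, the JSONL format, and the choice to measure accuracy as "a human accepted the draft" are assumptions for illustration, not details from the original build.

```python
import json
from datetime import datetime, timezone


def log_outcome(log_file, ticket_id, category, confidence, draft_accepted):
    """Append one record per handled ticket to an open, writable text file."""
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "ticket_id": ticket_id,
        "category": category,
        "confidence": confidence,
        "draft_accepted": draft_accepted,  # True if sent without a rewrite
    }
    log_file.write(json.dumps(record) + "\n")


def acceptance_rate(lines):
    """Share of logged drafts that were accepted; None if the log is empty."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return None
    accepted = sum(1 for r in records if r["draft_accepted"])
    return accepted / len(records)
```

Usage: open a file such as `accuracy_log.jsonl` in append mode, call `log_outcome` whenever an HRBP sends or rewrites a draft, and run `acceptance_rate` over the file's lines each week. Grouping the same records by category or confidence band is the obvious next step.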

Part 5

Adapt This for YOUR Recurring Task

The architecture (corpus → retrieval → reasoning → routing → UI) is the template for any recurring cognitive task. Here are five adaptations built off this same pattern:

Recurring Task	Corpus	Classification	Routing
Candidate screening	Job descriptions + resume rubric	Fit tier (1-4)	Auto-reject / auto-advance / human review
Vendor RFP scoring	Past RFPs + rubric	Vendor tier	Shortlist / review / reject
Board packet pre-read	Prior board packets + company context	Question clusters	Pre-drafted responses for CEO
Deal flow triage (VC/PE)	IC memo template + thesis doc	Fit score	Pass / diligence / follow-up
Policy/compliance Q&A	Policy corpus	Question categories	Auto-reply / human review / escalate

Part 6

Starter Prompts for Claude / Cursor

If you want to start today, these are the four prompts that got me from zero to a working prototype. Copy them directly. Substitute the bracketed placeholders for your domain.

PROMPT 1 Corpus Ingestion

"I have a folder of PDFs containing [TYPE OF DOCUMENTS, e.g., HR policies]. Write a Python script that extracts text from every PDF, chunks each document into 500-word passages with 50-word overlap, preserves section headings, and outputs a JSONL file with fields: id, source_doc, section_heading, effective_date, text. Use pdfplumber."

PROMPT 2 Retrieval

"Using the JSONL corpus from the previous script, write a retrieval function retrieve(question, k=5) that returns the top k most relevant chunks using TfidfVectorizer with (1,2) n-grams, English stop words, max_df=0.8, min_df=2. Persist the fitted vectorizer and vectors with joblib."

PROMPT 3 Reasoning System Prompt

"I'm building a [YOUR TOOL TYPE] that will classify inbound [TICKETS / CANDIDATES / RFPs / DEALS] into one of these categories: [YOUR LIST]. Write a system prompt for Claude that: (a) classifies into exactly one category, (b) assigns a 0-1 confidence score, (c) generates a draft response, (d) cites the specific source passage, and (e) refuses to answer anything involving [YOUR REFUSAL TRIGGERS]. Output must be a valid JSON object with fields: category, confidence, draft_reply, citation."
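Because this prompt asks for a JSON object, it is worth checking the reply before anything downstream trusts it. A minimal validation sketch, assuming the four fields named in the prompt and a 0-1 confidence. The function name and the brace-matching approach are illustrative choices, not part of the original build.

```python
import json

REQUIRED_FIELDS = ("category", "confidence", "draft_reply", "citation")


def parse_reply(raw_text, allowed_categories):
    """Parse and sanity-check a model reply; raise ValueError if unusable."""
    # Tolerate extra prose or a code fence around the JSON object.
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    try:
        reply = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc

    missing = [f for f in REQUIRED_FIELDS if f not in reply]
    if missing:
        raise ValueError(f"reply is missing fields: {missing}")
    if reply["category"] not in allowed_categories:
        raise ValueError(f"unknown category: {reply['category']!r}")
    try:
        confidence = float(reply["confidence"])
    except (TypeError, ValueError) as exc:
        raise ValueError("confidence is not a number") from exc
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")

    reply["confidence"] = confidence
    return reply
```

Any reply that fails validation can be treated as low confidence and sent to a human, which keeps a malformed model output from ever reaching an employee.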

PROMPT 4 UI

"Build a Flask + HTMX web app with three pages: an inbox showing [TICKETS / CANDIDATES / DEALS] color-coded by confidence tier, a detail page with draft reply and source citation, and an admin page to upload source documents and adjust thresholds. Use Tailwind for styling. Feel: clean, white, [BRAND-ADJACENT] accents. No React."

What AIHRPilot Is Not

It is not a replacement for Lattice, Workday, SAP, BambooHR, or any HRIS. It is not a replacement for a general counsel, an employment attorney, or a compliance officer. It is not a "chatbot." It does not handle PII, PHI, or anything that would require SOC 2 Type II compliance without additional hardening.

What it is: a thin intelligence layer that sits on top of your real systems of record and removes the recurring cognitive tax of answering the same questions repeatedly. The value is narrow, deep and immediate. That narrowness is the point. The tools that actually ship and stick are the ones that solve one problem for one team. The tools that die in demo are the ones that try to be platforms.

The question is not
"Can I build this?"

The question is:

"What is the one repeating task in my week that, if I removed it, would free up 10+ hours for higher-leverage work?"

If you can answer that in one sentence, you have a build. If you can't, the first hour of your weekend is answering that question. The next 40-120 hours go into building the thing.

This walkthrough is part of the Portfolio Leverage Co. Build Bench series.
