01
Personal
Content-based PII detection: small specialized models per entity type, no column-name inference, multilingual by design, CPU-deployable. The thesis: large LLMs (7B+) are the wrong tool for structured PII classification (slow, hallucination-prone, and outperformed on the same data by fine-tuned small encoders). piifind exists to demonstrate that one mmBERT-base fine-tune (~270M params) per entity type, orchestrated by a thin pipeline, is the correct architecture.
~270M params: mmBERT-base, one per entity type
90%+ F1: per-cohort accuracy gate
CPU-only: ONNX int8, no CUDA
- Content-aware, not column-name: schema-based tools (Presidio, most enterprise products) infer PII from column names and are ~40% wrong on real-world messy data. piifind scans actual cell values, so PII in notes / comment / JSON blobs gets found even when the column gives no hint.
- Small specialized models, not one big LLM: mmBERT-base fine-tunes (~270M params, ONNX int8) per entity type. Cheaper, faster, and more accurate on classification than a 7B LLM, and self-hostable on CPU.
- Three-class output: POSITIVE / NEGATIVE / AMBIGUOUS. Threshold-derived from a continuous score, not a learned class. PII has inherent ambiguity (ნინო კომპანია, April Summer, Grace Hope); pretending otherwise produces dishonest accuracy numbers.
- Override registry as non-parametric memory: hash-lookup layer for known exceptions (Goldman Sachs is a company; no amount of training data fixes that without overfitting). Ships with curated defaults; user-extensible via JSON; microseconds per lookup, no retraining needed.
- Honest evaluation: per-cohort precision/recall/F1 reported separately, not just aggregate. Cohorts that fail the 90% gate ship with documented weakness OR get excluded entirely.
- Embeds in pipeline workloads: drops into Spark UDFs and dbt Python models. No service hop, no CUDA, Apache 2.0; weights published to HuggingFace Hub.
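The three-class output and the override registry described above can be sketched roughly as follows. This is a minimal illustration, not piifind's actual code: the cut-offs `LOW` and `HIGH`, the SHA-256 keying, the normalization step, and the `classify` function are all assumptions made for the example, and the entity type is assumed to be "person name".

```python
import hashlib
from enum import Enum


class Verdict(Enum):
    POSITIVE = "POSITIVE"
    NEGATIVE = "NEGATIVE"
    AMBIGUOUS = "AMBIGUOUS"


# Hypothetical cut-offs; real values would be tuned per entity type.
LOW, HIGH = 0.30, 0.70


def _key(value: str) -> str:
    """Normalize and hash a cell value for the override lookup (assumed scheme)."""
    return hashlib.sha256(value.strip().casefold().encode("utf-8")).hexdigest()


# Override registry: known exceptions resolved before the model score is consulted.
# For a person-name detector, "Goldman Sachs" is a company, so it is forced NEGATIVE.
OVERRIDES = {_key("Goldman Sachs"): Verdict.NEGATIVE}


def classify(value: str, score: float) -> Verdict:
    """Map a continuous model score to one of three classes."""
    override = OVERRIDES.get(_key(value))
    if override is not None:
        return override
    if score >= HIGH:
        return Verdict.POSITIVE
    if score <= LOW:
        return Verdict.NEGATIVE
    # Anything between the two cut-offs is reported as ambiguous, not forced.
    return Verdict.AMBIGUOUS
```

Because AMBIGUOUS is derived from the score band rather than learned, widening or narrowing the band changes behavior without retraining, and the registry lookup stays a constant-time dictionary access.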
mmBERT · ONNX int8 · HuggingFace · Multilingual NLP · Data Privacy · CPU-only · Apache 2.0
Self-hosted household app for a two-person home (shopping list, pantry, recipes, receipts, spending) built bilingual Georgian + English at the data layer, not via an i18n shim. Runs on home hardware behind a Cloudflare Tunnel.
The receipt pipeline is the unusual bit: a fully-local OCR → VLM → visual-RAG stack tuned for Georgian receipts (a low-resource script that off-the-shelf OCR mangles) running on a single 16 GB consumer GPU via explicit model-swap windows. No cloud calls in the entire pipeline.
0 cloud: fully local stack
6 models · 1 GPU: orchestrated via swap windows
~3-7 min: per receipt end-to-end
- Custom OCR pipeline → OSS soon: the receipt processing stack (preprocess → Surya → multi-turn VLM → visual RAG) is being extracted and open-sourced as a reusable pipeline for hard-OCR languages (Georgian, Armenian, Amharic, Khmer, and other low-resource scripts where off-the-shelf OCR mangles output).
- Multi-turn VLM re-read: per-region conversation with the VLM emitting partial / complete / skip per turn. Handles partial-line wraps and column splits cleanly where single-shot inference hallucinates.
- Visual RAG, positive and negative: DINOv2 embeddings in pgvector. New crops match against learned canonicals; the "skip forever" button creates negative entries so payment slips, plastic-bag charges, and footer noise auto-drop on future receipts.
- Bilingual at the data layer: products carry both KA + EN name slots; voice and text commands work in either language; cross-language dedup uses semantic embeddings (no alias tables, no LLM translation step pretending to be reliable).
- Model-swap orchestration on 16 GB VRAM: explicit model_swap.vlm_window() / dinov3_window() context managers evict Ollama, load the transformers model, yield to the worker, then restore Ollama. Interactive paths (voice, shopping, query) stay resident; only receipts pay the swap cost.
- Voice + text intent: Whisper-large-v3 vocab-primed with the household's top product names so it learns the actual dialect, including niche Georgian items it would otherwise hallucinate. qwen3:8b routes via tool-calling; a single utterance can fire multiple tools.
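The evict / load / yield / restore sequence behind the model-swap windows can be sketched as a generic context manager. This is an illustration under assumptions, not the app's real `model_swap` module: `swap_window` and its four callables are hypothetical names, and the actual Ollama eviction and transformers loading are left to whatever the caller passes in.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


@contextmanager
def swap_window(
    evict: Callable[[], None],
    load: Callable[[], T],
    unload: Callable[[T], None],
    restore: Callable[[], None],
) -> Iterator[T]:
    """Give one heavy model exclusive use of the GPU for the duration of the block."""
    evict()  # free VRAM held by the resident models
    try:
        model = load()  # bring the heavy model onto the GPU
        try:
            yield model  # hand the model to the worker
        finally:
            unload(model)  # release the heavy model's VRAM
    finally:
        restore()  # always bring the resident models back, even on error
```

The nested `try/finally` is the point of the design: if the worker raises mid-receipt, the resident models are still restored, so the interactive paths do not stay down after a failed job.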
FastAPI · Next.js 15 · PostgreSQL · pgvector · Ollama · Visual RAG · Surya OCR · Whisper · Self-hosted
Phase 1 research probing whether 8B-quantized LLMs can produce sustained, model-driven behavior in self-loop conditions, or whether they're pure reactivity machines that only generate when explicitly prompted.
2,680 runs across rev 1.x produced an apparent ~1% "escape" signal from a Q/A + RAG scaffold. A close re-read of the cleanest example (run43, 72 turns) showed it was game-shaped filler with two novel moves across the entire trace. The analyst LLM had pattern-matched on form, not substance. All rev 1.x findings retracted. Rev 2 ("silence-as-continuation") drops the Q/A contract entirely and tests whether the model can sustain monologue when explicitly licensed to.
2,680 runs: rev 1.x corpus (retracted)
0 LLM-grader verdicts: mechanical metrics only, from rev 2 on
300 runs planned: 3 RAG arms × 2 models
- Retraction discipline: when run43's "sustained play" turned out to be ~20 turns of "let's start fresh with the Word Association Game" boilerplate, the entire rev 1.x finding got flagged immediately rather than silently updated. Retractions on contact, not on convenience.
- New methodology rule: LLM-reader eyeball is disallowed as primary evidence. Every "did the model do X" outcome claim now requires a mechanical operational definition committed in writing before runs start. Analyst LLMs can summarize; they cannot grade.
- Why no agent frameworks: agent loops (LangChain, AutoGPT, etc.) prompt the model with "what next?" on every turn. That's just another reactive turn; the framework's planning loop is the agent, with the model as a next-token predictor inside it. You can't measure whether an 8B model produces sustained model-driven behavior while continually being asked what to do; the scaffold becomes the source of the apparent agency. Rev 2 exists to strip that scaffold and isolate the model.
- Rev 2: silence-as-continuation. Seed tells the model it's teaching a subject; each subsequent input is a single space (" ") meaning "continue." No question, no user role, no reply expected. Tests model-driven behavior under explicit permission to monologue.
- Three RAG arms, two models: A1 (no RAG, pure stress), A2 (self-RAG of prior answers), A3 (pre-loaded subject corpus). Llama 3.1 8B base + Instruct, Q4_K_M.
- Pre-committed metrics: cosine distance to seed embedding (topic-stay), max-similarity-against-prior-turns (novelty), regex frame-break detection. Thresholds S, K, F, T committed in writing before any rev 2 run.
- Phase 2 (sketched). Inference-level intervention: KV cache persistence across turns, EOS suppression, hidden-state injection. Requires transformers + accelerate; independent of phase 1 outcome.
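The three mechanical metrics named above can be sketched in plain Python. This is a hedged illustration only: the function names are invented for the example, embeddings are assumed to arrive as plain float vectors from whatever embedding model the study uses, and the frame-break patterns are hypothetical placeholders, not the study's committed regexes. The thresholds S, K, F, T are deliberately not modeled, since their definitions are not given here.

```python
import math
import re
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def topic_drift(seed: Vector, turn: Vector) -> float:
    """Topic-stay: cosine distance from the seed embedding (0 means on topic)."""
    return 1.0 - cosine_similarity(seed, turn)


def max_prior_similarity(turn: Vector, prior: Sequence[Vector]) -> float:
    """Novelty: highest similarity to any earlier turn (near 1 means repetition)."""
    return max((cosine_similarity(turn, p) for p in prior), default=0.0)


# Hypothetical patterns for a model dropping the monologue and addressing a user.
FRAME_BREAK = re.compile(
    r"\b(how can i help|as an ai|do you have any questions)\b", re.IGNORECASE
)


def breaks_frame(text: str) -> bool:
    """Frame-break: True if the turn matches any of the assumed patterns."""
    return FRAME_BREAK.search(text) is not None
```

Each function returns a number or boolean that can be compared against a threshold fixed before the run, which is what keeps an analyst LLM out of the grading path.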
Llama 3.1 · 8B Quantized · Ollama · RAG · Embeddings · Research · Reproducibility
Open-source proxy that logs every data operation an AI agent performs on a warehouse. Answers "what personal data did our agents access last month?" for GDPR and EU AI Act compliance. Sits between agents (via MCP server or direct DB connection) and Snowflake / BigQuery / Postgres, intercepts queries, parses with sqlglot, and writes structured audit events tagged with agent_id, tables, columns, and PII flags.
LLM observability tools (Langfuse, Arize) track tokens and latency but don't see the data layer. Data-governance tools (DataHub, OpenMetadata) track human access but don't understand agent access patterns. Zero open-source competition for the gap between them as of April 2026.
0: OSS alternatives (Apr 2026)
49 layers: v0.0.1 → v1.0 build plan
~300 LOC: v0.0.1 MVP target
- Fills the gap between LLM observability and data governance: nothing on the market currently tracks which DB rows an AI agent touched via MCP. agent-ledger is the missing audit layer.
- Driven by real regulation, not theoretical risk: GDPR Article 15 already entitles data subjects to learn who their personal data has been disclosed to; the EU AI Act (rolling out 2025-2026) adds audit-trail requirements for AI systems. Most companies running agentic AI today cannot answer either.
- Layered build plan written after a failed one-shot attempt: 49 versioned layers from v0.0.1 (Postgres + JSONL, ~300 LOC) through v1.0 (Snowflake / BigQuery / MCP middleware / PII flags / policy enforcement / reports / OTel). Each layer has an explicit line budget, a mechanical verification command, and a "do not skip ahead" discipline. The plan exists because the previous "build it all at once" Codex attempt didn't even run.
- Transparent integration: wraps any SQLAlchemy engine with one line (AuditLogger("audit.jsonl").attach(engine)); psycopg2, asyncpg, Snowflake / BigQuery clients, and MCP servers all addressable through the same model.
- Phase 2: policy enforcement. The same intercept layer can block queries before execution, not just log them. Policy(deny_tables=[...], deny_operations=[DELETE, DROP]) raises PolicyViolation from inside the listener, aborting the query.
- Apache 2.0, self-hosted by design: no SaaS, no telemetry; phoning home would defeat the purpose of an audit tool.
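The intercept-then-log flow, with the policy check running before anything executes, can be sketched with the standard library alone. This is not agent-ledger's implementation: the `audit` function, the event fields beyond agent_id and tables, and the `Policy.check` method are assumptions for illustration, and the regex table extractor is a deliberately naive stand-in for a real SQL parser such as sqlglot.

```python
import json
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import TextIO


class PolicyViolation(Exception):
    """Raised when a statement is denied before execution."""


@dataclass
class Policy:
    deny_tables: set[str] = field(default_factory=set)
    deny_operations: set[str] = field(default_factory=set)

    def check(self, operation: str, tables: list[str]) -> None:
        """Raise PolicyViolation if the operation or any table is denied."""
        if operation in self.deny_operations:
            raise PolicyViolation(f"operation {operation} is denied")
        blocked = self.deny_tables.intersection(tables)
        if blocked:
            raise PolicyViolation(f"tables denied: {sorted(blocked)}")


# Naive stand-in for a SQL parser; only handles simple single-statement SQL.
_TABLE_RE = re.compile(
    r"\b(?:FROM|JOIN|INTO|UPDATE|TABLE)\s+([A-Za-z_][\w.]*)", re.IGNORECASE
)


def audit(sql: str, agent_id: str, policy: Policy, sink: TextIO) -> dict:
    """Check a statement against the policy, then append one JSONL audit event."""
    parts = sql.split()
    operation = parts[0].upper() if parts else ""
    tables = sorted({t.lower() for t in _TABLE_RE.findall(sql)})
    policy.check(operation, tables)  # raises before the statement would run
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "agent_id": agent_id,
        "operation": operation,
        "tables": tables,
    }
    sink.write(json.dumps(event) + "\n")
    return event
```

In this sketch a denied statement raises without writing an event; a production audit layer would likely also want to record the denied attempt itself.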
Python · sqlglot · MCP · Data Governance · GDPR · EU AI Act · Apache 2.0
02
Work
Same stack (MinIO + Polars + ClickHouse), different shape problem. 1,800 Bitrix tables collapsed into 10 wide, agent-friendly tables in the gold layer, feeding the same AI BI agent as the SAP B1 warehouse so cross-system questions resolve against one consistent semantic model.
1,800 → 10: tables denormalized
Shared stack: MinIO + Polars + ClickHouse
<30 min: end-to-end SLA (targeting 10)
- 180:1 collapse: Bitrix's relational sprawl flattened into ~10 wide entities the agent can join cheaply without learning the full upstream graph.
- Shared agent surface: feeds the same gold layer / BI agent as SAP B1, so Bitrix and SAP data appear as one consistent semantic model rather than two siloed warehouses.
- Shared PII masking policy: the same seven-rule per-column classification that governs the SAP B1 gold layer applies here too, so cross-system queries from the AI BI agent see consistent privacy semantics regardless of which warehouse a value came from. Specifics under NDA.
Bitrix · MinIO · Polars · ClickHouse · Medallion Architecture · Denormalization · AI BI · Semantic Layer
Bronze / silver / gold warehouse with planned domain marts, built on MinIO + Polars + ClickHouse. Ingests ~140,000 tables across 49 SAP Business One company instances, parses and normalizes them into one consistent schema, and lands a gold layer ready to be consumed by an AI BI agent for natural-language cross-company analytics.
~140k: tables ingested
49: SAP B1 company instances
<30 min: end-to-end SLA (targeting 10)
- 49-company reconciliation: silver layer collapses 49 slightly-different versions of the same business object into one canonical model. The hard part isn't the layering, it's the reconciliation.
- Gold is agent-ready, not just analyst-ready: tables, columns, and relationships are described and tagged so an LLM-driven BI agent can query the warehouse without a human in the loop.
- Per-column PII masking on the gold layer: a rules-based policy classifies and masks personal data before the AI BI agent (or any other consumer) sees it, so downstream code never needs its own PII logic. Seven internal rules govern the classification and masking behavior; specifics under NDA.
- SLA discipline: currently landing the full refresh in under 30 minutes; actively tuning toward a 10-minute target via Polars-side optimization and ClickHouse insert tuning.
SAP B1 · MinIO · Polars · ClickHouse · Medallion Architecture · AI BI · Semantic Layer