Privacy-First AI for Document Digitization
An on-premise AI system automating the digitization of physical government records containing mixed English and Devanagari scripts — built for strict data sovereignty with zero cloud dependency.

Client: Regional Public Sector Authority (NDA)
Role: AI Lead & Backend Engineer
Timeline: 1 week
Team: 1 developer
Overview
A regional government authority needed to digitize thousands of physical archival records containing mixed English and Devanagari scripts. The challenge: data privacy laws required all processing to stay entirely on-premise, ruling out cloud OCR or LLM APIs. We built a fully local AI pipeline delivering high-accuracy extraction, automated script filtering, and structured database output.
Process
Designed a multi-stage pipeline: image enhancement via OpenCV → OCR text extraction → local LLM script filtering via Ollama → structured SQL storage → Streamlit dashboard for staff review. Each stage was optimized for accuracy and speed on government-spec hardware.
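Between the OCR and LLM stages, raw word output benefits from a confidence gate. A minimal sketch, assuming word-level output shaped like pytesseract's `image_to_data(..., output_type=Output.DICT)` (parallel `text` and `conf` lists); the helper name and the 60-point threshold are illustrative, not the project's actual values:

```python
def filter_confident_words(data: dict, min_conf: float = 60.0) -> list[str]:
    """Keep only OCR words whose confidence meets the threshold.

    `data` mirrors pytesseract.image_to_data(..., output_type=Output.DICT):
    parallel lists under the keys 'text' and 'conf'. Tesseract reports
    -1 confidence for non-word boxes, which this drops automatically.
    """
    kept = []
    for word, conf in zip(data["text"], data["conf"]):
        if word.strip() and float(conf) >= min_conf:
            kept.append(word)
    return kept
```

Words that fall below the gate can be routed to the staff review dashboard rather than silently stored.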
Challenges & Solutions
Built custom OpenCV pipeline with adaptive thresholding, morphological transformations, and contrast normalization — increasing OCR confidence scores by 35%.
Prompt-engineered a local Llama-3 model via Ollama as a linguistic classifier, achieving fast, accurate script separation without external APIs.
Architected the entire system to run on-premise using local LLMs and local OCR engines — eliminating all cloud dependencies (OpenAI, Google Vision, etc.).
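The storage layer can stay equally local. A minimal sketch with Python's built-in sqlite3; the table and column names are illustrative, not the project's actual schema:

```python
import sqlite3

def init_db(path: str = "records.db") -> sqlite3.Connection:
    """Create the (illustrative) records table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS records (
               id INTEGER PRIMARY KEY,
               source_file TEXT NOT NULL,
               script TEXT CHECK (script IN ('DEVANAGARI', 'ENGLISH', 'MIXED')),
               extracted_text TEXT
           )"""
    )
    return conn

def store_record(conn: sqlite3.Connection, source_file: str,
                 script: str, text: str) -> int:
    """Insert one processed page and return its row id."""
    cur = conn.execute(
        "INSERT INTO records (source_file, script, extracted_text) "
        "VALUES (?, ?, ?)",
        (source_file, script, text),
    )
    conn.commit()
    return cur.lastrowid
```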
Deployed a Streamlit UI allowing staff to upload scanned images, monitor processing, and review extracted database entries in real time, with no technical knowledge required.
Results
- OCR Confidence: after preprocessing
- Pilot Documents Digitized: searchable database
- Manual Language Sorting: eliminated
- Cloud Dependency: fully on-premise
- Staff Onboarding Time: Streamlit UI
- Data Privacy Compliance: sovereign pipeline
Goals
- Create a scalable, secure pipeline for digitizing thousands of physical archives
- Achieve high OCR accuracy on poor-quality legacy scans
- Automatically separate Devanagari and English content without manual tagging
- Maintain full data sovereignty with zero external API exposure
Tech Stack
- Python
- OpenCV
- Ollama
- FastAPI
- Streamlit
- Tesseract
- SQL
Target Users
- Government administrative officers
- Data entry departments
- Records management staff
Key Learnings
- Local LLMs are highly capable for niche linguistic tasks like script classification
- Image preprocessing quality is the single biggest factor in OCR accuracy
- Privacy-first architecture is achievable without sacrificing AI capability
- Simple interfaces (Streamlit) are transformative for non-technical government users
Future Plans
- Add Devanagari-to-English translation module
- Expand to handwriting recognition (HTR) for older archival records
- Build batch processing queue for bulk archive digitization
- Implement searchable full-text index across the entire document database