Privacy-First AI for Document Digitization
An on-premise AI system automating the digitization of physical government records containing mixed English and Devanagari scripts — built for strict data sovereignty with zero cloud dependency.

Entity_Client
Regional Public Sector Authority (NDA)
Primary_Role
AI Lead & Backend Engineer
Duration_Log
1 week
Resource_Team
1 dev
Project_Overview
A regional government authority needed to digitize thousands of physical archival records containing mixed English and Devanagari scripts. The challenge: data privacy laws required all processing to stay entirely on-premise, ruling out cloud OCR or LLM APIs. We built a fully local AI pipeline delivering high-accuracy extraction, automated script filtering, and structured database output.
Operational_Process
Designed a multi-stage pipeline: image enhancement via OpenCV → OCR text extraction → local LLM script filtering via Ollama → structured SQL storage → Streamlit dashboard for staff review. Each stage was optimized for accuracy and speed on government-spec hardware.
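As an illustration of the structured SQL storage stage, here is a minimal sketch that uses Python's built-in sqlite3 module. The database engine, the `records` table, its columns, and the helper names (`open_store`, `save_record`, `search`) are assumptions made for this example, not the schema used in the project.

```python
import sqlite3

# Hypothetical schema: one row per extracted text block.
SCHEMA = """
CREATE TABLE IF NOT EXISTS records (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_file TEXT NOT NULL,
    script TEXT NOT NULL CHECK (script IN ('english', 'devanagari', 'mixed')),
    text TEXT NOT NULL,
    ocr_confidence REAL
)
"""


def open_store(path=":memory:"):
    """Open (or create) the local database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_record(conn, source_file, script, text, ocr_confidence):
    """Insert one extracted block and return its row id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO records (source_file, script, text, ocr_confidence) "
            "VALUES (?, ?, ?, ?)",
            (source_file, script, text, ocr_confidence),
        )
    return cur.lastrowid


def search(conn, term):
    """Return (id, source_file, script) for rows whose text contains term."""
    rows = conn.execute(
        "SELECT id, source_file, script FROM records "
        "WHERE text LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return rows.fetchall()
```

Parameterized queries keep OCR output, which is untrusted text, out of the SQL string itself. The CHECK constraint rejects any script label outside the three expected values.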
Core_Capabilities
Performance_Metrics
OCR Confidence
DATA_POINT: after preprocessing
Pilot Documents Digitized
DATA_POINT: searchable database
Manual Language Sorting
DATA_POINT: eliminated
Cloud Dependency
DATA_POINT: fully on-premise
Staff Onboarding Time
DATA_POINT: Streamlit UI
Data Privacy Compliance
DATA_POINT: sovereign pipeline
Conflict_Resolution
Built a custom OpenCV pipeline with adaptive thresholding, morphological transformations, and contrast normalization, increasing OCR confidence scores by 35%.
Prompt-engineered a local Llama 3 model via Ollama as a linguistic classifier, achieving fast, accurate script separation without external APIs.
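The classifier call can be sketched against Ollama's local REST endpoint using only the standard library. The model tag `llama3`, the prompt wording, the label set, and the helper names are assumptions for this example; the production prompt is not reproduced here.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "llama3"  # assumed model tag
LABELS = ("english", "devanagari", "mixed")  # assumed label set


def build_prompt(text):
    """Build a constrained, single-word classification prompt (hypothetical wording)."""
    return (
        "Classify the script of the text below. "
        "Answer with exactly one word: english, devanagari, or mixed.\n\n"
        f"Text:\n{text}\n\nAnswer:"
    )


def parse_label(raw):
    """Map the model's raw reply to a known label, or 'unknown'."""
    words = raw.strip().lower().split()
    first = words[0].strip(".,:;\"'") if words else ""
    return first if first in LABELS else "unknown"


def classify_script(text, timeout=60):
    """Send one non-streaming request to the local Ollama server."""
    payload = json.dumps({
        "model": MODEL,
        "prompt": build_prompt(text),
        "stream": False,
        "options": {"temperature": 0},  # reduce run-to-run variation
    }).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return parse_label(body.get("response", ""))
```

Because the request only ever targets localhost, no document text leaves the machine. Replies that do not start with a known label fall through to `unknown`, which gives a natural hook for routing those records to manual review.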
Architected the entire system to run on-premise using local LLMs and local OCR engines — eliminating all cloud dependencies (OpenAI, Google Vision, etc.).
Deployed a Streamlit UI that lets staff upload scanned images, monitor processing, and review extracted database entries in real time, with no technical knowledge required.