
Privacy-First AI for Document Digitization

An on-premise AI system automating the digitization of physical government records containing mixed English and Devanagari scripts — built for strict data sovereignty with zero cloud dependency.

Industry_Sector: Government
Core_Classification: AI & Automation
Deployment_Year: 2024

Entity_Client

Regional Public Sector Authority (NDA)

Primary_Role

AI Lead & Backend Engineer

Duration_Log

1 week

Resource_Team

1 dev

Project_Overview

A regional government authority needed to digitize thousands of physical archival records containing mixed English and Devanagari scripts. The challenge: data privacy laws required all processing to stay entirely on-premise, ruling out cloud OCR or LLM APIs. We built a fully local AI pipeline delivering high-accuracy extraction, automated script filtering, and structured database output.

Operational_Process

Designed a multi-stage pipeline: image enhancement via OpenCV → OCR text extraction → local LLM script filtering via Ollama → structured SQL storage → Streamlit dashboard for staff review. Each stage was optimized for accuracy and speed on government-spec hardware.
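
The stage order above can be sketched as a chain of functions that each take and return a record. This is a minimal illustration, not the project's code: the stage bodies are placeholders, and the names `enhance`, `extract_text`, `filter_script`, `make_store`, `run_pipeline`, and the `records` table layout are all assumptions. SQLite stands in for whichever SQL database was actually used.

```python
import sqlite3
from typing import Callable, Dict, List

Record = Dict[str, object]
Stage = Callable[[Record], Record]


def enhance(rec: Record) -> Record:
    # Placeholder for the OpenCV image enhancement stage.
    return {**rec, "enhanced": True}


def extract_text(rec: Record) -> Record:
    # Placeholder for OCR; a real stage would read the enhanced image.
    return {**rec, "text": rec.get("stub_text", "")}


def filter_script(rec: Record) -> Record:
    # Placeholder for the local LLM script-filtering stage.
    return {**rec, "script": "unknown"}


def make_store(conn: sqlite3.Connection) -> Stage:
    # Hypothetical table layout; the real schema is not given in the source.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (source TEXT, text TEXT, script TEXT)"
    )

    def store(rec: Record) -> Record:
        conn.execute(
            "INSERT INTO records (source, text, script) VALUES (?, ?, ?)",
            (rec["source"], rec["text"], rec["script"]),
        )
        conn.commit()
        return rec

    return store


def run_pipeline(rec: Record, stages: List[Stage]) -> Record:
    # Apply each stage in order, passing the record along.
    for stage in stages:
        rec = stage(rec)
    return rec


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    pipeline = [enhance, extract_text, filter_script, make_store(db)]
    run_pipeline({"source": "scan_001.png", "stub_text": "sample"}, pipeline)
    print(db.execute("SELECT * FROM records").fetchall())
```

Keeping every stage behind the same signature makes it straightforward to swap a placeholder for a real implementation or to time stages individually.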

Core_Capabilities

Custom OpenCV image preprocessing (denoising, deskewing, thresholding)
Multi-engine OCR using Tesseract and EasyOCR for maximum accuracy
Local LLM (Llama-3 via Ollama) for Devanagari/English script separation
FastAPI pipeline orchestrating end-to-end document processing
Automated schema mapping from raw OCR to structured SQL database
Streamlit dashboard for upload, review, and verification by staff
Fully air-gapped operation — zero external API calls
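
The source does not say how the outputs of the two OCR engines were combined. One plausible approach, sketched here purely as an assumption, is to keep whichever engine reports the higher mean word confidence for a given page. `OcrResult` and `pick_best` are hypothetical names, and confidences are assumed to have been normalized to the 0..1 range beforehand.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class OcrResult:
    engine: str
    words: List[str]
    confidences: List[float]  # assumed normalized to 0..1 per word

    @property
    def score(self) -> float:
        # Mean word confidence; an empty result scores zero.
        return mean(self.confidences) if self.confidences else 0.0


def pick_best(results: List[OcrResult]) -> OcrResult:
    # Keep the engine output with the highest mean confidence.
    # Raises ValueError if `results` is empty.
    return max(results, key=lambda r: r.score)
```

For example, `pick_best([tesseract_result, easyocr_result]).words` would yield the word list from the more confident engine for that page. A finer-grained variant could choose per line or per word instead of per page.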

Performance_Metrics

OCR Confidence

baseline → +35%

DATA_POINT: after preprocessing

Pilot Documents Digitized

0% → 100%

DATA_POINT: searchable database

Manual Language Sorting

100% manual → automated

DATA_POINT: eliminated

Cloud Dependency

required → zero

DATA_POINT: fully on-premise

Staff Onboarding Time

weeks of training → 2 hours

DATA_POINT: Streamlit UI

Data Privacy Compliance

at risk → 100%

DATA_POINT: sovereign pipeline

Conflict_Resolution

Solution

Built a custom OpenCV pipeline with adaptive thresholding, morphological transformations, and contrast normalization, increasing OCR confidence scores by 35% over the unprocessed baseline.

Resolution_Status: OK | Protocol: Direct_Intervention

Solution

Prompt-engineered a local Llama-3 model via Ollama as a script classifier, achieving fast, accurate separation of Devanagari and English text without external APIs.

Resolution_Status: OK | Protocol: Direct_Intervention

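
A hedged sketch of how a local model might be prompted as a script classifier through Ollama's local HTTP endpoint. The prompt wording, the label set, the `llama3` model tag, and the helper names `build_request`, `parse_label`, and `classify` are assumptions for illustration; the actual prompt used in the project is not given.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local port
MODEL = "llama3"  # assumed model tag

LABELS = ("ENGLISH", "DEVANAGARI", "MIXED")
PROMPT = (
    "Classify the script of the text below. "
    "Answer with exactly one word: ENGLISH, DEVANAGARI, or MIXED.\n\n"
    "Text: {text}"
)


def build_request(text: str) -> dict:
    # Non-streaming request with temperature 0 for repeatable labels.
    return {
        "model": MODEL,
        "prompt": PROMPT.format(text=text),
        "stream": False,
        "options": {"temperature": 0},
    }


def parse_label(reply: str) -> str:
    # Accept only the first word of the reply; anything else is UNKNOWN.
    parts = reply.strip().split()
    if not parts:
        return "UNKNOWN"
    word = parts[0].strip(".,:;!\"'").upper()
    return word if word in LABELS else "UNKNOWN"


def classify(text: str) -> str:
    # Sends the request to the locally running Ollama server.
    data = json.dumps(build_request(text)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return parse_label(json.load(resp)["response"])
```

Constraining the answer to a single word and rejecting anything outside the label set keeps the model's output machine-checkable, so unexpected replies can be routed to staff review instead of being stored.
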
Solution

Architected the entire system to run on-premise using local LLMs and local OCR engines — eliminating all cloud dependencies (OpenAI, Google Vision, etc.).

Resolution_Status: OK | Protocol: Direct_Intervention

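
One way to guard a zero-external-calls design is to validate every configured service URL at startup. The check below is a hypothetical illustration, not something the source describes: `is_local_endpoint` accepts only `localhost` or literal loopback/private IP addresses and deliberately rejects other hostnames rather than resolving them.

```python
import ipaddress
from urllib.parse import urlparse


def is_local_endpoint(url: str) -> bool:
    """Return True only for localhost or loopback/private IP literals."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: reject instead of performing a DNS lookup.
        return False
    return addr.is_loopback or addr.is_private
```

For instance, `is_local_endpoint("http://localhost:11434")` passes while a public API hostname is refused, so a misconfigured cloud URL would fail fast at startup.
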
Solution

Deployed a Streamlit UI allowing staff to upload scanned images, monitor processing, and review extracted database entries in real-time — no technical knowledge required.

Resolution_Status: OK | Protocol: Direct_Intervention