A Django MVP for multi-file upload, OCR + text extraction, per-document (and batch) processing, JSON results storage, protected download, and search/filter by keywords and presets.
✅ Current focus: a navigable document database (e.g., HR uploads 50 resumes and filters by keywords).
🔜 Next: stronger extraction rules, better classification, and production hardening.
This system provides:
- Upload of multiple documents (PDF) at once
- Each file becomes one row in the documents table
- Processing extracts the text from the PDF and falls back to OCR (Tesseract) when needed
- The result is saved as a clean JSON (extracted data only)
- Debug output goes to the log, as structured events per document/field
- File download is login-protected
- Accent-insensitive search by keywords and phrases (terms separated by ;)
- Filter presets by keywords, age, and experience
- Bulk download of the JSON results and the original files
- Contact (phone number) extracted into a WhatsApp link
- Backend: Django (Python 3.11+)
- DB (recommended): PostgreSQL (via Docker)
- OCR: Tesseract + Poppler (pdftoppm) + pdf2image/pytesseract
- Run modes:
  - ✅ Docker + Docker Compose (reproducible environment)
  - Alternative: venv + python manage.py runserver
- Docker
- Docker Compose
- Python 3.11+
- pip
- (Optional) system OCR deps: tesseract-ocr + poppler-utils
Create a .env file at the project root:
DEBUG=1
SECRET_KEY=change-me
ALLOWED_HOSTS=127.0.0.1,localhost
CSRF_TRUSTED_ORIGINS=http://localhost:8000,http://127.0.0.1:8000
# Postgres (docker compose)
DATABASE_URL=postgres://automacao:automacao@db:5432/automacao_contas
# OCR (optional)
OCR_LANG=por
In Docker Compose, these variables are already set in docker-compose.yml. Use .env to override them.
If ALLOWED_HOSTS is blocking access from the local network, add the machine's IP (e.g., 192.168.0.10) and/or 0.0.0.0.
docker compose up -d --build
This starts web, worker, db, and redis.
docker compose exec web python manage.py migrate
docker compose exec web python manage.py createsuperuser
- Login: http://127.0.0.1:8000/login/
- List: http://127.0.0.1:8000/documents/
- Upload: http://127.0.0.1:8000/documents/upload/
- Presets: http://127.0.0.1:8000/documents/presets/
If docker-compose.yml mounts the project into the container as a volume (bind mount), changes usually show up immediately (just refresh the browser).
If it does not, or if you prefer a controlled rebuild:
docker compose up -d --build web
When a rebuild is needed:
docker compose build web
docker compose up -d web
docker compose exec web python manage.py migrate
docker compose exec web python manage.py collectstatic --noinput
docker compose logs -f web
docker compose logs -f worker
Useful for very fast iteration. Keeping Docker as the "source of truth" for the environment is recommended.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python manage.py migrate
python manage.py createsuperuser
python manage.py runserver
In another terminal, run the worker:
celery -A automacao_contas worker -l INFO --concurrency=1
To run locally, start Redis and use:
CELERY_BROKER_URL=redis://localhost:6379/0
OCR is triggered when the PDF has no "selectable" text.
Python dependencies:
- pdf2image
- pytesseract
System dependencies (Linux):
- tesseract-ocr
- poppler-utils (provides pdftoppm)
Optional variable:
- OCR_LANG=por (if the language package is installed in Tesseract)
Force OCR (optional):
- Send force_ocr=1 on processing/reprocessing to ignore the embedded text.
- Normalized search: no accents, lowercase, and collapsed whitespace.
- Phrases: use ; to separate terms (e.g., gerente geral;compras).
- Presets apply keywords + age/experience ranges.
- Age/experience/contact only appear after processing; older documents may need to be reprocessed.
- Log in
- Upload PDFs
- Process a document (or a batch, if enabled)
- (Optional) Create presets and apply filters
- View the extracted JSON
- Filter/search by keywords in the list view
- Download individually or in bulk
- automacao_contas/: settings/urls
- documents/: models/views/forms/services/extractors
- templates/: HTML
- static/: CSS/JS
- staticfiles/: collectstatic output (Docker/prod)
- media/: uploads
- The saved JSON must stay clean (extracted data only).
- Debug output goes to the log, with events such as:
  - upload_documents
  - process_document_start
  - ocr_fallback
  - extract_ok / extract_missing
  - process_document_done
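The split between a clean result JSON and structured debug events could look roughly like the sketch below. It only illustrates the idea: the helper names `log_event` and `clean_result`, the `documents` logger name, and the rule of dropping `None` values are assumptions, not the project's actual code.

```python
import json
import logging

# Hypothetical logger name; the project may use a different one.
logger = logging.getLogger("documents")


def log_event(event: str, document_id: int, **fields) -> str:
    """Emit one structured (JSON) log line per event and return the payload."""
    payload = json.dumps(
        {"event": event, "document_id": document_id, **fields},
        sort_keys=True,
    )
    logger.info(payload)
    return payload


def clean_result(extracted: dict) -> dict:
    """Assumption: the stored JSON keeps only fields that were actually extracted."""
    return {key: value for key, value in extracted.items() if value is not None}


# Debug details go to the log, keyed by document and field...
log_event("extract_missing", 7, field="age")
# ...while the stored result contains extracted data only.
result = clean_result({"name": "Ana", "age": None})
```

Keeping the event name and document id in every log line makes it possible to filter the log per document without touching the stored JSON.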
- Update your local main
git switch main
git pull origin main
- Create a feature branch
git switch -c feature/minha-feature
- Commit + push
git add .
git commit -m "feat: minha feature"
git push -u origin feature/minha-feature
- Open a Merge Request / Pull Request on GitLab/GitHub
- Review → Merge → delete the branch (optional)
Make sure the package is listed in requirements.txt and installed.
- Local:
pip install dj-database-url
- Docker:
docker compose build --no-cache web
docker compose up -d web
The document is probably scanned and needs OCR. Check whether ocr_fallback appeared in the log.
docker compose exec web python manage.py migrate
This system provides:
- Multi-file PDF upload
- Each file becomes one row in the documents table
- Processing extracts PDF text; falls back to OCR (Tesseract) for scanned PDFs
- Results are stored as a clean JSON (only extracted fields)
- Debug/telemetry lives in structured logs
- File download is login-protected
- List page supports accent-insensitive search, phrase terms, and presets
- Presets can filter by keywords, age, experience
- Bulk download of JSON and original files
- Backend: Django (Python 3.11+)
- DB (recommended): PostgreSQL (Docker)
- OCR: Tesseract + Poppler (pdftoppm) + pdf2image/pytesseract
- Run modes:
  - ✅ Docker + Docker Compose (reproducible environment)
  - Alternative: venv + python manage.py runserver
- Docker
- Docker Compose
- Python 3.11+
- pip
- (Optional) OCR deps: tesseract-ocr + poppler-utils
Create a .env file at the project root:
DEBUG=1
SECRET_KEY=change-me
ALLOWED_HOSTS=127.0.0.1,localhost
CSRF_TRUSTED_ORIGINS=http://localhost:8000,http://127.0.0.1:8000
# Postgres (docker compose)
DATABASE_URL=postgres://automacao:automacao@db:5432/automacao_contas
# OCR (optional)
OCR_LANG=por
Docker Compose already sets these env vars in docker-compose.yml. Use .env to override them.
For LAN access, add your machine IP (e.g., 192.168.0.10) and/or 0.0.0.0 to ALLOWED_HOSTS.
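As an illustration of how comma-separated values such as ALLOWED_HOSTS and CSRF_TRUSTED_ORIGINS can be turned into Django settings, here is a minimal stdlib-only sketch. The helpers `env_bool` and `env_list` are hypothetical names, and the project's real settings module may read these variables differently (DATABASE_URL, for instance, is commonly parsed with the dj-database-url package).

```python
import os


def env_bool(name: str, default: bool = False) -> bool:
    """Read a flag such as DEBUG=1 from the environment."""
    raw = os.environ.get(name, str(int(default)))
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_list(name: str, default: str = "") -> list[str]:
    """Split a comma-separated variable, ignoring blanks and stray spaces."""
    raw = os.environ.get(name, default)
    return [item.strip() for item in raw.split(",") if item.strip()]


# How a settings module might consume the .env values (sketch only).
DEBUG = env_bool("DEBUG")
ALLOWED_HOSTS = env_list("ALLOWED_HOSTS", "127.0.0.1,localhost")
CSRF_TRUSTED_ORIGINS = env_list("CSRF_TRUSTED_ORIGINS")
```

With this approach, allowing LAN access is just a matter of appending the machine IP to the ALLOWED_HOSTS value in .env.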
docker compose up -d --build
This starts web, worker, db, and redis.
docker compose exec web python manage.py migrate
docker compose exec web python manage.py createsuperuser
- Login: http://127.0.0.1:8000/login/
- Documents list: http://127.0.0.1:8000/documents/
- Upload: http://127.0.0.1:8000/documents/upload/
- Presets: http://127.0.0.1:8000/documents/presets/
If your compose uses a bind mount (project folder mapped into the container), changes are usually instant.
If not, or for a controlled rebuild:
docker compose up -d --build web
Rebuild:
docker compose build web
docker compose up -d web
docker compose exec web python manage.py migrate
docker compose exec web python manage.py collectstatic --noinput
docker compose logs -f web
docker compose logs -f worker
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python manage.py migrate
python manage.py createsuperuser
python manage.py runserver
In another terminal, start the worker:
celery -A automacao_contas worker -l INFO --concurrency=1
For local runs, start Redis and set:
CELERY_BROKER_URL=redis://localhost:6379/0
OCR is used when PDFs have no selectable text.
Python deps:
- pdf2image
- pytesseract
System deps (Linux):
- tesseract-ocr
- poppler-utils (provides pdftoppm)
Optional:
- OCR_LANG=por (if the language pack is installed)
Force OCR (optional):
- Send force_ocr=1 on processing/reprocessing to ignore embedded PDF text.
- Normalized search: lowercase, no accents, collapsed spaces.
- Phrases: use ; to separate terms (e.g., gerente geral;compras).
- Presets apply keywords + age/experience ranges.
- Age/experience/contact are filled during processing; older docs may need reprocessing.
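The normalization and phrase splitting described in the search notes can be sketched with the standard library alone. The function names are hypothetical, and requiring every term to be present (AND semantics) is an assumption; the real list view may combine terms differently.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse whitespace runs into single spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", without_accents).strip().lower()


def parse_query(query: str) -> list[str]:
    """Split a query on ';' and normalize each term, dropping empty ones."""
    terms = (normalize(part) for part in query.split(";"))
    return [term for term in terms if term]


def matches(document_text: str, query: str) -> bool:
    """Assumption: a document matches when it contains every term."""
    haystack = normalize(document_text)
    return all(term in haystack for term in parse_query(query))


# "gerente geral" is kept as one phrase; "compras" is a second term.
found = matches("Atuou como Gerente  Geral na área de Compras.", "gerente geral;compras")
```

Normalizing both the stored text and the query with the same function is what makes the search accent-insensitive in both directions.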
git switch main
git pull origin main
git switch -c feature/my-feature
# work...
git add .
git commit -m "feat: my feature"
git push -u origin feature/my-feature
Then open a Merge Request / Pull Request.
Add a license (MIT/Apache-2.0/etc.) if the repo is public.