johnlockejrr/README.md

👋 Hi there, I'm John Locke Jr.

🎯 Multilingual AI Researcher & Semitic Language Specialist

I'm passionate about Natural Language Processing, Semitic linguistics, and AI optimization. My work spans Hebrew, Aramaic, Syriac, and Samaritan text processing, with expertise in eGPU optimization, transformer architectures, and historical text digitization.


🚀 Current Projects & Research

🔤 Unikud - Advanced Hebrew NLP System

Complete Hebrew text processing pipeline with advanced nikud restoration capabilities using transformer-based models.

  • Model Architecture: Custom CANINE-based models optimized for Hebrew
  • eGPU Optimization: Specialized training scripts for an RTX 3090 connected over Thunderbolt 3
  • Datasets: Mishnaic, Rabbinic, and Modern Hebrew text processing (100K+ samples)
  • Performance: Memory-efficient training with GPU-cached datasets
  • Applications: Biblical text analysis, modern Hebrew processing, educational tools
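
Nikud restoration is usually trained on pairs of unpointed input and pointed target text. The following is a minimal sketch of how such pairs and per-character labels could be derived from vocalized text. The function names (`strip_nikud`, `split_labels`) and the exact set of code points treated as nikud are assumptions for illustration, not taken from the Unikud codebase.

```python
# Hebrew points: U+05B0..U+05BD, plus rafe, shin dot, sin dot and qamats qatan.
# Which marks count as "nikud" is an assumption here; cantillation marks are left alone.
NIKUD = {chr(cp) for cp in range(0x05B0, 0x05BE)} | {"\u05BF", "\u05C1", "\u05C2", "\u05C7"}


def strip_nikud(text: str) -> str:
    """Return the text with all characters in NIKUD removed."""
    return "".join(ch for ch in text if ch not in NIKUD)


def split_labels(vocalized: str) -> list[tuple[str, str]]:
    """Pair each base character with the run of nikud that follows it."""
    pairs: list[tuple[str, str]] = []
    for ch in vocalized:
        if ch not in NIKUD:
            pairs.append((ch, ""))
        elif pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        # A mark with no preceding base character is dropped.
    return pairs


# shin + qamats + shin dot, lamed, vav + holam, final mem
vocalized = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD"
model_input = strip_nikud(vocalized)   # unpointed text fed to the model
char_labels = split_labels(vocalized)  # one (base, marks) target per input character
```

Keeping one label per base character means the input and target sequences have the same length, which suits a character-level encoder such as CANINE.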

🕊️ Samaritan-Aramaic Translation Models

Complete pipeline for fine-tuning MarianMT models on Hebrew-Aramaic parallel texts, specifically designed for translating between Hebrew (Samaritan) and Aramaic (Targum) texts.

  • Translation Models: Hebrew ↔ Aramaic bidirectional translation
  • Custom Tokenizers: Specialized for Semitic languages
  • Dataset Engineering: Aligned corpus processing and quality analysis
  • Model Optimization: Early stopping, learning rate scheduling, mixed precision training
  • Applications: Biblical studies, linguistic research, text preservation
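
One simple quality check for an aligned corpus is to drop pairs that are empty or whose lengths differ too much, since these are often misalignments. The sketch below illustrates the idea; the function name `filter_pairs` and the `max_ratio` threshold are hypothetical and would need tuning on the actual Hebrew-Aramaic data.

```python
def filter_pairs(pairs, max_ratio=2.0):
    """Keep (source, target) pairs whose character lengths are comparable.

    A pair is dropped if either side is empty after stripping whitespace,
    or if the longer side is more than max_ratio times the shorter side.
    """
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio <= max_ratio:
            kept.append((src, tgt))
    return kept


# Placeholder strings stand in for real verse pairs.
sample = [("abc", "abcd"), ("", "x"), ("a", "abcdefgh")]
clean = filter_pairs(sample)
```

Character length is only a rough proxy; a token-based ratio or verse-reference matching could be substituted without changing the structure.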

📚 Targumic Aramaic Diacritizer

Character-level diacritization of Targumic Aramaic text using lightweight BiLSTM + Attention architecture.

  • Model Architecture: 1-layer BiLSTM encoder, LSTM decoder with Luong-style attention
  • Training Data: ~15,000 aligned verses from Targum Onkelos
  • Performance: Lightweight model suitable for deployment
  • Applications: Biblical text vocalization, linguistic research, educational tools
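
To make the attention step concrete, here is a dependency-free sketch of Luong-style attention using the dot-product score. It assumes the "dot" variant; the actual model may use a different score function, and the name `luong_dot_attention` is illustrative only.

```python
import math


def luong_dot_attention(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step.

    score(h_t, h_s) = h_t . h_s, weights = softmax(scores),
    context = sum of encoder states weighted by the attention weights.
    """
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(decoder_state)
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context


# Two encoder positions; the decoder state is aligned with the first one.
weights, context = luong_dot_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In a full decoder the context vector would then be combined with the decoder state to predict the diacritic for the current character.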

🔍 Post-OCR Correction Systems

Advanced OCR post-processing for historical and medieval texts across multiple languages.

  • Multi-language Support: Swedish, medieval texts, various scripts
  • Architectures: BiLSTM, CATMuS-medieval, custom OCR correction models
  • Applications: Historical document digitization, manuscript preservation, research accessibility

🖼️ Kraken HTR Training & Deployment

Handwritten Text Recognition (HTR) system using Kraken framework for historical manuscripts.

  • Model Training: Custom HTR models for specific scripts and languages
  • Segmentation: Advanced page segmentation and text recognition
  • Deployment: Web applications and API services for HTR
  • Applications: Manuscript digitization, historical research, cultural preservation

🌟 Samaritan Torah Search Platform

Modern web interface for searching Samaritan Torah text, built with React and FastAPI.

  • Search Features: Fuzzy matching, exact phrase matching, pagination
  • Responsive Design: Mobile-friendly interface with Hebrew text support
  • Backend: FastAPI with Elasticsearch integration
  • Applications: Biblical research, text study, educational platforms
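
The platform delegates search to Elasticsearch. As a rough, standard-library illustration of what exact phrase matching, fuzzy matching, and pagination mean together, the sketch below uses `difflib` as a stand-in. The function name, the `(reference, text)` verse format, and the threshold are all assumptions and do not reflect the platform's actual API.

```python
from difflib import SequenceMatcher


def fuzzy_search(query, verses, threshold=0.8, page=1, page_size=10):
    """Search (reference, text) pairs and return one page of results.

    A verse containing the query verbatim scores 1.0; otherwise its score is
    the best similarity between the query and any single word of the verse.
    """
    scored = []
    for ref, text in verses:
        if query in text:
            score = 1.0
        else:
            score = max(
                (SequenceMatcher(None, query, word).ratio() for word in text.split()),
                default=0.0,
            )
        if score >= threshold:
            scored.append((score, ref, text))
    scored.sort(key=lambda item: -item[0])  # best matches first, stable for ties
    start = (page - 1) * page_size
    results = [(ref, text) for _, ref, text in scored[start:start + page_size]]
    return {"total": len(scored), "page": page, "results": results}


verses = [
    ("1:1", "in the beginning"),
    ("1:2", "the earth was void"),
    ("1:3", "let there be light"),
]
hits = fuzzy_search("begining", verses)  # misspelled query still finds 1:1
```

A real index would tokenize and rank far more efficiently; this only shows the shape of the request and response.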

🛠️ Technical Expertise

Programming Languages & Frameworks

  • Python: PyTorch, TensorFlow, FastAPI, Streamlit, Gradio
  • JavaScript/TypeScript: React, Node.js, modern web development
  • C++/Rust: Performance-critical applications and systems programming
  • SQL/NoSQL: Database design and optimization

AI/ML Technologies

  • Deep Learning: PyTorch, Transformers (Hugging Face), TensorFlow, Keras
  • NLP Models: MarianMT, CANINE, BiLSTM, Attention mechanisms
  • Computer Vision: OCR, HTR, image processing with Kraken
  • Model Optimization: Mixed precision training, gradient checkpointing, early stopping
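
Of the optimization techniques listed, early stopping is the easiest to show without a framework. The following is a minimal sketch that assumes validation loss is the monitored metric; the class name and the `patience` and `min_delta` parameters are illustrative, not taken from the training scripts.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
for epoch, loss in enumerate([1.0, 0.8, 0.81, 0.79, 0.80, 0.80, 0.80], start=1):
    if stopper.step(loss):
        stopped_at = epoch  # training would halt here
        break
```

In practice the best checkpoint is saved whenever `best` is updated, so that stopping late does not cost model quality.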

eGPU & Hardware Optimization

  • RTX 3090 24GB optimization for large-scale training
  • Thunderbolt 3 bandwidth management and optimization
  • Memory-efficient training strategies for large datasets
  • GPU-cached datasets and distributed training

Semitic Language Processing

  • Hebrew: Biblical, Mishnaic, Modern Hebrew with nikud restoration
  • Aramaic: Targumic, Syriac, and various Aramaic dialects
  • Samaritan: Samaritan Hebrew script and text processing
  • Unicode normalization and text segmentation for Semitic scripts
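
Unicode normalization matters for pointed Semitic text because the same letter can be encoded with its marks in different orders, or as a precomposed presentation form. The sketch below shows the effect using Python's standard `unicodedata` module; the choice of NFC and the helper name `normalize_hebrew` are assumptions for illustration.

```python
import unicodedata


def normalize_hebrew(text: str) -> str:
    """Apply NFC so equivalent encodings of pointed text compare equal."""
    return unicodedata.normalize("NFC", text)


# Bet with dagesh and patah, typed in two different mark orders.
dagesh_first = "\u05D1\u05BC\u05B7"
patah_first = "\u05D1\u05B7\u05BC"
same_after_nfc = normalize_hebrew(dagesh_first) == normalize_hebrew(patah_first)

# The presentation form U+FB31 (bet with dagesh) decomposes to bet + dagesh.
decomposed = normalize_hebrew("\uFB31")
```

Canonical ordering can reorder marks in ways that some Biblical Hebrew workflows prefer to avoid, so whether to normalize, and at which stage, is a project-level decision.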

🔬 Research Areas & Specializations

Semitic Linguistics & NLP

  • Biblical Hebrew text analysis and processing
  • Targumic Aramaic translation and diacritization
  • Samaritan Hebrew script recognition and processing
  • Syriac Aramaic language models and translation
  • Cross-lingual Semitic language processing

AI Model Optimization

  • Memory-efficient training for large-scale datasets (100K+ samples)
  • eGPU performance optimization for external GPU setups
  • Mixed precision training strategies (bfloat16, fp16)
  • Gradient checkpointing and advanced memory management

Historical Text Digitization

  • OCR post-processing for medieval and historical manuscripts
  • Handwritten Text Recognition (HTR) for various scripts
  • Text cleaning and normalization for ancient languages
  • Dataset creation for historical text corpora

Web Applications & Deployment

  • Modern web interfaces for linguistic research tools
  • API development for NLP services
  • Docker containerization and production deployment
  • Responsive design with multilingual text support

📊 Recent Achievements & Impact

  • ✅ Developed comprehensive Hebrew NLP system with eGPU optimization
  • ✅ Created bidirectional Hebrew-Aramaic translation models for biblical studies
  • ✅ Built lightweight Aramaic diacritizer using BiLSTM + Attention
  • ✅ Implemented advanced OCR correction for multiple languages and scripts
  • ✅ Deployed HTR system for historical manuscript processing
  • ✅ Created modern web platform for Samaritan Torah research
  • ✅ Optimized training pipelines for memory efficiency and speed
  • ✅ Processed large-scale datasets (100K+ samples) with custom preprocessing

🌟 Featured Projects & Repositories

Complete Hebrew text processing pipeline with advanced nikud restoration, eGPU optimization, and large-scale dataset processing.

Complete pipeline for fine-tuning MarianMT models on Hebrew-Aramaic parallel texts with custom tokenizers and optimization strategies.

Lightweight BiLSTM + Attention model for character-level diacritization of Targumic Aramaic text.

Advanced OCR post-processing for historical texts across multiple languages and scripts.

Handwritten Text Recognition system using Kraken framework for historical manuscript processing.

Modern web interface for searching Samaritan Torah text with React, FastAPI, and Elasticsearch.


🔗 Connect & Collaborate

  • GitHub: @johnlockejrr
  • Hugging Face: @johnlockejrr
  • Research Focus: Semitic NLP, AI Optimization, Historical Text Digitization, eGPU Computing

🎯 Current Research Focus

Currently working on:

  • Advanced Semitic language model training with eGPU optimization
  • Large-scale historical text dataset creation and preprocessing
  • Cross-lingual Semitic language processing and translation
  • Memory-efficient training strategies for transformer models
  • Historical manuscript digitization and text recognition
  • Web platform development for linguistic research tools

πŸ’‘ Collaboration Opportunities

I'm always interested in:

  • Semitic linguistics research collaborations
  • Historical text digitization projects
  • AI model optimization for ancient languages
  • eGPU computing challenges and optimization
  • Cross-cultural linguistic research partnerships
  • Open-source NLP tool development

Feel free to reach out if you'd like to work together on Semitic language processing, historical text digitization, AI optimization, or any other exciting projects!


🌍 Impact & Vision

My work focuses on preserving and making accessible ancient Semitic texts through modern AI technology. By combining linguistic expertise with cutting-edge machine learning, I aim to:

  • Bridge ancient and modern through technology
  • Preserve cultural heritage through digital means
  • Advance linguistic research with AI tools
  • Make historical texts accessible to researchers worldwide
  • Develop sustainable solutions for text preservation

"Language is the key to understanding culture, and AI is the key to processing language at scale. When we combine both, we unlock the wisdom of the ages." πŸš€πŸ“šπŸ•ŠοΈ

Popular repositories

  1. page-to-yolo-training (Public)

    This repository contains scripts for converting PAGE-XML annotations to YOLO format and training a YOLO11 model for text line segmentation.

    Python 6

  2. doc-ufcn (Public)

    Python 3

  3. plutushybrid (Public)

    Forked from alwaysminingbtc/plutushybrid

    Python 2 1

  4. eynollah (Public)

    Forked from qurator-spk/eynollah

    Document Layout Analysis

    Python 1

  5. pylaia (Public)

    Python 1

  6. PyLaia-models (Public)

    1