I'm passionate about Natural Language Processing, Semitic linguistics, and AI optimization. My work spans Hebrew, Aramaic, Syriac, and Samaritan text processing, with expertise in eGPU optimization, transformer architectures, and historical text digitization.
Complete Hebrew text processing pipeline with advanced nikud restoration capabilities using transformer-based models.
- Model Architecture: Custom CANINE-based models optimized for Hebrew
- eGPU Optimization: Specialized training scripts for an RTX 3090 connected over Thunderbolt 3
- Datasets: Mishnaic, Rabbinic, and Modern Hebrew text processing (100K+ samples)
- Performance: Memory-efficient training with GPU-cached datasets
- Applications: Biblical text analysis, modern Hebrew processing, educational tools
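One common way to prepare training data for nikud restoration is to strip the points from already vocalized text and keep them as per-character labels. The sketch below illustrates that idea with the standard library only. It is not taken from the repository: the function names `strip_nikud` and `to_char_labels` are hypothetical, and treating every nonspacing mark in the Hebrew block (vowel points, dagesh, cantillation) as a label is an assumption.

```python
import unicodedata

# Hebrew Unicode block; combining marks in it have category "Mn".
HEBREW_BLOCK = range(0x0590, 0x0600)


def is_hebrew_mark(ch: str) -> bool:
    """True for Hebrew combining marks (vowel points, dagesh, cantillation)."""
    return ord(ch) in HEBREW_BLOCK and unicodedata.category(ch) == "Mn"


def strip_nikud(text: str) -> str:
    """Remove Hebrew combining marks, keeping letters and punctuation."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not is_hebrew_mark(ch))


def to_char_labels(text: str) -> list[tuple[str, str]]:
    """Pair each base character with the marks that follow it.

    NFD puts the marks on one letter into canonical order, so the same
    vocalization always yields the same label string.
    """
    pairs: list[tuple[str, str]] = []
    for ch in unicodedata.normalize("NFD", text):
        if is_hebrew_mark(ch) and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            # A base character (or a stray leading mark) starts a new pair.
            pairs.append((ch, ""))
    return pairs
```

The unpointed string from `strip_nikud` would be the model input and the label strings from `to_char_labels` the per-character targets, which fits a character-level encoder such as CANINE.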
Complete pipeline for fine-tuning MarianMT models on Hebrew-Aramaic parallel texts, specifically designed for translating between Hebrew (Samaritan) and Aramaic (Targum) texts.
- Translation Models: Hebrew ↔ Aramaic bidirectional translation
- Custom Tokenizers: Specialized for Semitic languages
- Dataset Engineering: Aligned corpus processing and quality analysis
- Model Optimization: Early stopping, learning rate scheduling, mixed precision training
- Applications: Biblical studies, linguistic research, text preservation
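The early stopping and learning rate scheduling mentioned for the translation pipeline can be sketched in plain Python. This is an illustration of the logic only, not the pipeline's actual configuration: the class `EarlyStopping`, the function `warmup_linear_lr`, and all parameter values are hypothetical, and a linear warmup followed by linear decay is an assumed schedule.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` evaluations."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def warmup_linear_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to `base_lr`, then linear decay to zero at `total_steps`.

    Assumes 0 < warmup_steps < total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)
```

In a training loop, `warmup_linear_lr` would be evaluated once per optimizer step and `EarlyStopping.step` once per validation pass.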
Character-level diacritization of Targumic Aramaic text using a lightweight BiLSTM + Attention architecture.
- Model Architecture: 1-layer BiLSTM encoder, LSTM decoder with Luong-style attention
- Training Data: ~15,000 aligned verses from Targum Onkelos
- Performance: Lightweight model suitable for deployment
- Applications: Biblical text vocalization, linguistic research, educational tools
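To make the attention step of the diacritizer concrete, here is a minimal pure-Python sketch of Luong-style attention for a single decoder step. The project description does not say which Luong score function is used, so the dot-product variant is an assumption, and the function names are hypothetical. A real model would do this with batched tensors.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def luong_dot_attention(
    decoder_state: list[float],
    encoder_states: list[list[float]],
) -> tuple[list[float], list[float]]:
    """Return (attention weights, context vector) for one decoder step.

    Score for each source position is the dot product of the decoder
    state with that position's encoder state (Luong "dot" scoring).
    """
    scores = [
        sum(d * h for d, h in zip(decoder_state, enc))
        for enc in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: attention-weighted sum of the encoder states.
    context = [
        sum(w * enc[i] for w, enc in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context
```

In Luong's formulation the context vector is then concatenated with the decoder state and passed through a tanh layer before predicting the next output symbol, here the diacritic for the current character.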
Advanced OCR post-processing for historical and medieval texts across multiple languages.
- Multi-language Support: Swedish, medieval texts, various scripts
- Architectures: BiLSTM, CATMuS-medieval, custom OCR correction models
- Applications: Historical document digitization, manuscript preservation, research accessibility
Handwritten Text Recognition (HTR) system using the Kraken framework for historical manuscripts.
- Model Training: Custom HTR models for specific scripts and languages
- Segmentation: Advanced page segmentation and text recognition
- Deployment: Web applications and API services for HTR
- Applications: Manuscript digitization, historical research, cultural preservation
Modern web interface for searching Samaritan Torah text, built with React and FastAPI.
- Search Features: Fuzzy matching, exact phrase matching, pagination
- Responsive Design: Mobile-friendly interface with Hebrew text support
- Backend: FastAPI with Elasticsearch integration
- Applications: Biblical research, text study, educational platforms
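The three search features listed above map naturally onto the Elasticsearch query DSL: a `match` query with `fuzziness` for fuzzy matching, `match_phrase` for exact phrases, and `from`/`size` for pagination. The helper below is a sketch of how a FastAPI backend might build such a request body. The function name `build_search_body` and the field name `text` are hypothetical, and the project's actual index mapping is not known from this description.

```python
from typing import Any


def build_search_body(
    query: str,
    exact: bool = False,
    page: int = 1,
    page_size: int = 10,
    field: str = "text",  # hypothetical field name
) -> dict[str, Any]:
    """Build an Elasticsearch search body with fuzzy or exact-phrase matching."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")

    if exact:
        # Exact phrase: terms must appear in order, with no edits allowed.
        clause: dict[str, Any] = {"match_phrase": {field: query}}
    else:
        # Fuzzy: "AUTO" scales the allowed edit distance with term length.
        clause = {"match": {field: {"query": query, "fuzziness": "AUTO"}}}

    return {
        "query": clause,
        "from": (page - 1) * page_size,  # zero-based offset of the first hit
        "size": page_size,
    }
```

The returned dictionary can be sent as the JSON body of a search request, so the same function serves both the fuzzy and exact-phrase modes of the interface.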
- Python: PyTorch, TensorFlow, FastAPI, Streamlit, Gradio
- JavaScript/TypeScript: React, Node.js, modern web development
- C++/Rust: Performance-critical applications and systems programming
- SQL/NoSQL: Database design and optimization
- Deep Learning: PyTorch, Transformers (Hugging Face), TensorFlow, Keras
- NLP Models: MarianMT, CANINE, BiLSTM, Attention mechanisms
- Computer Vision: OCR, HTR, image processing with Kraken
- Model Optimization: Mixed precision training, gradient checkpointing, early stopping
- RTX 3090 24GB optimization for large-scale training
- Thunderbolt 3 bandwidth management and optimization
- Memory-efficient training strategies for large datasets
- GPU-cached datasets and distributed training
- Hebrew: Biblical, Mishnaic, Modern Hebrew with nikud restoration
- Aramaic: Targumic, Syriac, and various Aramaic dialects
- Samaritan: Samaritan Hebrew script and text processing
- Unicode normalization and text segmentation for Semitic scripts
- Biblical Hebrew text analysis and processing
- Targumic Aramaic translation and diacritization
- Samaritan Hebrew script recognition and processing
- Syriac Aramaic language models and translation
- Cross-lingual Semitic language processing
- Memory-efficient training for large-scale datasets (100K+ samples)
- eGPU performance optimization for external GPU setups
- Mixed precision training strategies (bfloat16, fp16)
- Gradient checkpointing and advanced memory management
- OCR post-processing for medieval and historical manuscripts
- Handwritten Text Recognition (HTR) for various scripts
- Text cleaning and normalization for ancient languages
- Dataset creation for historical text corpora
- Modern web interfaces for linguistic research tools
- API development for NLP services
- Docker containerization and production deployment
- Responsive design with multilingual text support
- Developed a comprehensive Hebrew NLP system with eGPU optimization
- Created bidirectional Hebrew-Aramaic translation models for biblical studies
- Built a lightweight Aramaic diacritizer using BiLSTM + Attention
- Implemented advanced OCR correction for multiple languages and scripts
- Deployed an HTR system for historical manuscript processing
- Created a modern web platform for Samaritan Torah research
- Optimized training pipelines for memory efficiency and speed
- Processed large-scale datasets (100K+ samples) with custom preprocessing
Complete Hebrew text processing pipeline with advanced nikud restoration, eGPU optimization, and large-scale dataset processing.
Complete pipeline for fine-tuning MarianMT models on Hebrew-Aramaic parallel texts with custom tokenizers and optimization strategies.
Lightweight BiLSTM + Attention model for character-level diacritization of Targumic Aramaic text.
Advanced OCR post-processing for historical texts across multiple languages and scripts.
Handwritten Text Recognition system using the Kraken framework for historical manuscript processing.
Modern web interface for searching Samaritan Torah text with React, FastAPI, and Elasticsearch.
- GitHub: @johnlockejrr
- Hugging Face: @johnlockejrr
- Research Focus: Semitic NLP, AI Optimization, Historical Text Digitization, eGPU Computing
Currently working on:
- Advanced Semitic language model training with eGPU optimization
- Large-scale historical text dataset creation and preprocessing
- Cross-lingual Semitic language processing and translation
- Memory-efficient training strategies for transformer models
- Historical manuscript digitization and text recognition
- Web platform development for linguistic research tools
I'm always interested in:
- Semitic linguistics research collaborations
- Historical text digitization projects
- AI model optimization for ancient languages
- eGPU computing challenges and optimization
- Cross-cultural linguistic research partnerships
- Open-source NLP tool development
Feel free to reach out if you'd like to work together on Semitic language processing, historical text digitization, AI optimization, or any other exciting projects!
My work focuses on preserving and making accessible ancient Semitic texts through modern AI technology. By combining linguistic expertise with cutting-edge machine learning, I aim to:
- Bridge ancient and modern through technology
- Preserve cultural heritage through digital means
- Advance linguistic research with AI tools
- Make historical texts accessible to researchers worldwide
- Develop sustainable solutions for text preservation
"Language is the key to understanding culture, and AI is the key to processing language at scale. When we combine both, we unlock the wisdom of the ages."