Commit 1fbc47b · Parent: f9edbb4
chore: snapshot current refinements
This view is limited to 50 files because the commit contains too many changes; see the raw diff for the full changeset.
- README.md +45 -153
- configs/data/datasets.yaml +26 -0
- configs/model/base.yaml +6 -50
- configs/model/large.yaml +6 -0
- configs/model/small.yaml +6 -23
- configs/training/default.yaml +12 -0
- configs/training/full.yaml +12 -0
- configs/training/quick_test.yaml +9 -0
- docker/Dockerfile +0 -0
- docker/docker-compose.yml +0 -0
- docs/api.md +79 -0
- docs/architecture.md +57 -0
- docs/training.md +59 -0
- pyproject.toml +6 -11
- requirements.txt +6 -16
- scripts/download_data.py +182 -0
- scripts/download_data.sh +5 -0
- scripts/evaluate.py +134 -0
- scripts/export_model.py +69 -0
- scripts/inference.py +112 -0
- scripts/preprocess_data.py +321 -0
- scripts/test_gpu.py +0 -27
- scripts/train.py +217 -6
- setup.py +16 -6
- src/__init__.py +1 -0
- src/api/__init__.py +1 -0
- src/api/app.py +10 -0
- src/api/dependencies.py +42 -0
- src/api/inference/__init__.py +0 -7
- src/api/inference/inference.py +0 -133
- src/api/routes.py +34 -0
- src/api/schemas.py +14 -0
- src/data/__init__.py +1 -0
- src/data/dataloader.py +117 -0
- src/data/dataset.py +229 -0
- src/data/download.py +39 -60
- src/data/preprocessing.py +95 -225
- src/data/tokenization.py +122 -0
- src/inference/__init__.py +10 -5
- src/inference/baseline_summarizer.py +0 -41
- src/inference/factory.py +75 -0
- src/inference/generation.py +14 -0
- src/inference/pipeline.py +166 -0
- src/inference/postprocessing.py +6 -0
- src/models/factory.py +105 -0
- src/models/multitask.py +48 -9
- src/training/__init__.py +1 -0
- src/training/callbacks.py +37 -0
- src/training/losses.py +13 -0
- src/training/metrics.py +36 -0
README.md
CHANGED

@@ -1,175 +1,67 @@

Removed:

# LexiMind

##

- Custom encoder-decoder Transformer architecture with shared representations
- Multi-task loss function with learnable task weighting
- Attention weight visualization for model interpretability
- Interactive web interface for real-time inference
- Trained on diverse corpora: news articles (CNN/DailyMail, BBC) and literary texts (Project Gutenberg)

## 🏗️ Architecture

```
Input Text
    ↓
┌─────────────────────┐
│   Shared Encoder    │ ← TransformerEncoder (6 layers)
│  (Multi-head Attn)  │
└─────────────────────┘
    ↓        ↓        ↓
    │        │        └──────────────┐
    │        │                       │
    │        └─────────┐             │
    │                  │             │
    ↓                  ↓             ↓
┌─────────┐      ┌────────┐    ┌─────────┐
│ Decoder │      │Classify│    │ Project │
│  Head   │      │  Head  │    │  Head   │
└─────────┘      └────────┘    └─────────┘
    ↓                  ↓             ↓
 Summary           Emotions     Embeddings
                               (for clustering)
```

## 📊 Datasets

- **CNN/DailyMail**: 300k+ news articles with human-written summaries
- **BBC News**: 2,225 articles across 5 categories
- **Project Gutenberg**: Classic literature for long-form text analysis

## 🚀 Quick Start

### Installation
```bash
git clone https://github.com/OliverPerrin/LexiMind.git
cd LexiMind
pip install -r requirements.txt
```

```bash
python src/train.py --config configs/default.yaml
```

### Launch Interface
```bash
python src/app.py
```

```
├── ...
│   │   └── clustering.py        # Projection head for embeddings
│   ├── data/
│   │   ├── download_datasets.py # Data acquisition
│   │   ├── preprocessing.py     # Text cleaning & tokenization
│   │   └── dataset.py           # PyTorch Dataset classes
│   ├── training/
│   │   ├── train.py             # Training loop
│   │   ├── losses.py            # Multi-task loss functions
│   │   └── metrics.py           # ROUGE, F1, silhouette scores
│   ├── inference/
│   │   └── pipeline.py          # End-to-end inference
│   ├── visualization/
│   │   └── attention.py         # Attention heatmap generation
│   └── app.py                   # Gradio/FastAPI interface
├── configs/
│   └── default.yaml             # Model & training hyperparameters
├── tests/
│   └── test_*.py                # Unit tests
├── notebooks/
│   └── exploratory.ipynb        # Data exploration & analysis
├── requirements.txt
└── README.md
```

| Task | Metric | Score |
|------|--------|-------|
| Emotion Classification | Macro F1 | TBD |
| Topic Clustering | Silhouette Score | TBD |

- **Decoder**: 6-layer autoregressive Transformer
- **Vocab Size**: 32,000 (SentencePiece tokenizer)
- **Parameters**: ~60M total

### Training
- **Optimizer**: AdamW (lr=1e-4, weight_decay=0.01)
- **Scheduler**: Linear warmup (5000 steps) + cosine decay
- **Loss**: Weighted sum of cross-entropy (summarization), BCE (emotions), triplet loss (clustering)
- **Hardware**: Trained on single NVIDIA RTX 3090 (24GB VRAM)
- **Time**: ~48 hours for 10 epochs

### Multi-Task Learning Strategy
Uses uncertainty weighting ([Kendall et al., 2018](https://arxiv.org/abs/1705.07115)) to automatically balance task losses:

```
L_total = Σ_i ( L_i / (2σ_i²) + log σ_i )
```

## 🎨 Interface Preview

The web interface provides:
- Text input with real-time token count
- Compression level slider (20%-80%)
- Side-by-side original/summary comparison
- Emotion probability bars with color coding
- Interactive attention heatmap (click tokens to highlight attention)
- Downloadable results (JSON/CSV)

## 📈 Future Enhancements

- [ ] Add multilingual support (mBART)
- [ ] Implement beam search for better summaries
- [ ] Fine-tune on domain-specific corpora (medical, legal)
- [ ] Add semantic search across document embeddings
- [ ] Deploy as REST API with Docker
- [ ] Implement model distillation for mobile deployment

## 📚 References

- Vaswani et al. (2017) - [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- Lewis et al. (2019) - [BART: Denoising Sequence-to-Sequence Pre-training](https://arxiv.org/abs/1910.13461)
- Caruana (1997) - [Multitask Learning](https://link.springer.com/article/10.1023/A:1007379606734)
- Demszky et al. (2020) - [GoEmotions Dataset](https://arxiv.org/abs/2005.00547)

## 📄 License

GNU General Public License v3.0

## 👤 Author

**Oliver Perrin**
- Portfolio: [oliverperrin.com](https://oliverperrin.com)
- LinkedIn: [linkedin.com/in/oliverperrin](https://linkedin.com/in/oliverperrin)
- Email: [email protected]

Added:

# LexiMind (Inference Edition)

LexiMind now ships as a focused inference sandbox for the custom multitask Transformer found in `src/models`. Training, dataset downloaders, and legacy scripts have been removed so it is easy to load a checkpoint, run the Streamlit demo, and experiment with summarization, emotion classification, and topic cues on your own text.

## What Stays
- Transformer encoder/decoder and task heads under `src/models`
- Unit tests for the model stack (`tests/test_models`)
- Streamlit UI (`src/ui/streamlit_app.py`) wired to the inference helpers in `src/api/inference`

## What Changed
- Hugging Face tokenizers provide all tokenization (see `TextPreprocessor`)
- Training, dataset downloaders, and CLI scripts have been removed
- Scikit-learn powers light text normalization (stop-word removal optional)
- Requirements trimmed to inference-only dependencies

## Quick Start
```bash
git clone https://github.com/OliverPerrin/LexiMind.git
cd LexiMind
pip install -r requirements.txt

# Optional extras via setup.py packaging metadata
pip install .[web]  # installs streamlit + plotly
pip install .[api]  # installs fastapi
pip install .[all]  # installs both groups

streamlit run src/ui/streamlit_app.py
```

Configure the Streamlit app via the sidebar to point at your tokenizer directory and model checkpoint (defaults assume `artifacts/hf_tokenizer` and `checkpoints/best.pt`).

## Minimal Project Map
```
src/
├── api/        # load_models + helpers
├── data/       # TextPreprocessor using Hugging Face + sklearn
├── inference/  # thin summarizer facade
├── models/     # core Transformer architecture (untouched)
└── ui/         # Streamlit interface
```

Everything outside `src/` now holds optional assets such as checkpoints, tokenizer exports, and documentation stubs.

## Loading a Checkpoint Programmatically
```python
from src.api.inference import load_models, summarize_text

models = load_models({
    "checkpoint_path": "checkpoints/best.pt",
    "tokenizer_path": "artifacts/hf_tokenizer",
    "hf_tokenizer_name": "facebook/bart-base",
})

summary, _ = summarize_text("Paste any article here.", models=models)
print(summary)
```

## License
GPL-3.0

## Author
Oliver Perrin · [email protected]
configs/data/datasets.yaml
CHANGED

@@ -0,0 +1,26 @@
```yaml
raw:
  summarization: data/raw/summarization/cnn_dailymail
  emotion: data/raw/emotion
  topic: data/raw/topic
  books: data/raw/books
processed:
  summarization: data/processed/summarization
  emotion: data/processed/emotion
  topic: data/processed/topic
  books: data/processed/books
tokenizer:
  pretrained_model_name: facebook/bart-base
  max_length: 512
  lower: false
downloads:
  summarization:
    dataset: gowrishankarp/newspaper-text-summarization-cnn-dailymail
    output: data/raw/summarization/cnn_dailymail
  books:
    - name: pride_and_prejudice
      url: https://www.gutenberg.org/cache/epub/1342/pg1342.txt
      output: data/raw/books/pride_and_prejudice.txt
  emotion:
    dataset: dair-ai/emotion
  topic:
    dataset: ag_news
```
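For context, these paths are consumed through the repo's `load_yaml` helper, the same call `scripts/download_data.py` makes further down this diff. A minimal sketch, assuming only that `load_yaml(...).data` returns this mapping as a plain dict (the fallback defaults mirror those in the download script):

```python
# Hedged sketch: resolve dataset paths and tokenizer settings from
# configs/data/datasets.yaml via the repo's load_yaml helper.
from pathlib import Path

from src.utils.config import load_yaml

cfg = load_yaml("configs/data/datasets.yaml").data
raw_paths = cfg.get("raw", {})
tokenizer_cfg = cfg.get("tokenizer", {})

# Fallbacks follow the defaults used in scripts/download_data.py.
emotion_dir = Path(raw_paths.get("emotion", "data/raw/emotion"))
max_length = int(tokenizer_cfg.get("max_length", 512))
print(emotion_dir, max_length)
```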
configs/model/base.yaml
CHANGED

@@ -1,50 +1,6 @@

Removed (the first six lines of the old file did not survive this view):
```yaml
d_ff: 2048
dropout: 0.1
max_seq_length: 512

tasks:
  summarization:
    enabled: true
    decoder_layers: 6
  emotion:
    enabled: true
    num_classes: 27
    pool_strategy: "mean"  # Options: mean, max, cls, attention
  clustering:
    enabled: true
    embedding_dim: 128
    normalize: true

training:
  batch_size: 16
  gradient_accumulation_steps: 2  # Effective batch = 32
  learning_rate: 1e-4
  weight_decay: 0.01
  num_epochs: 10
  warmup_steps: 1000
  max_grad_norm: 1.0

  scheduler:
    type: "cosine"  # Options: linear, cosine, polynomial

  mixed_precision: true  # Use AMP for faster training

data:
  max_length: 512
  summary_max_length: 128
  train_split: 0.8
  val_split: 0.1
  test_split: 0.1

preprocessing:
  lowercase: true
  remove_stopwords: false
  min_token_length: 3
```

Added:
```yaml
d_model: 512
num_encoder_layers: 6
num_decoder_layers: 6
num_attention_heads: 8
ffn_dim: 2048
dropout: 0.1
```
configs/model/large.yaml
CHANGED

@@ -0,0 +1,6 @@
```yaml
d_model: 768
num_encoder_layers: 12
num_decoder_layers: 12
num_attention_heads: 12
ffn_dim: 3072
dropout: 0.1
```
configs/model/small.yaml
CHANGED

@@ -1,23 +1,6 @@

Removed (the first seven lines of the old file did not survive this view):
```yaml
training:
  batch_size: 32                  # ~4GB VRAM
  gradient_accumulation_steps: 1
  mixed_precision: true           # Essential!

# configs/model/base.yaml (production)
model:
  d_model: 512
  num_encoder_layers: 6
  num_decoder_layers: 6
  num_heads: 8

training:
  batch_size: 8                   # ~8GB VRAM
  gradient_accumulation_steps: 4  # Effective batch = 32
  mixed_precision: true
```

Added:
```yaml
d_model: 256
num_encoder_layers: 4
num_decoder_layers: 4
num_attention_heads: 4
ffn_dim: 1024
dropout: 0.1
```
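The three model configs (small/base/large) share the same flat schema of six keys. A minimal sketch of reading one into a typed structure, using plain PyYAML; the repo's own `src.models.factory.load_model_config` is the canonical loader, so the dataclass here is purely illustrative:

```python
# Hedged sketch: parse a flat model config such as configs/model/base.yaml.
# The dataclass fields mirror the YAML keys above; this is NOT the repo's
# load_model_config, just an illustration of the schema.
from dataclasses import dataclass

import yaml


@dataclass
class ModelConfig:
    d_model: int
    num_encoder_layers: int
    num_decoder_layers: int
    num_attention_heads: int
    ffn_dim: int
    dropout: float


with open("configs/model/base.yaml", "r", encoding="utf-8") as fh:
    cfg = ModelConfig(**yaml.safe_load(fh))

# The model dimension must split evenly across attention heads.
assert cfg.d_model % cfg.num_attention_heads == 0
```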
configs/training/default.yaml
CHANGED

@@ -0,0 +1,12 @@
```yaml
dataloader:
  batch_size: 8
  shuffle: true
optimizer:
  name: adamw
  lr: 3.0e-5
scheduler:
  name: cosine
  warmup_steps: 500
trainer:
  max_epochs: 5
  gradient_clip_norm: 1.0
```
configs/training/full.yaml
CHANGED

@@ -0,0 +1,12 @@
```yaml
dataloader:
  batch_size: 16
  shuffle: true
optimizer:
  name: adamw
  lr: 2.0e-5
scheduler:
  name: cosine
  warmup_steps: 1000
trainer:
  max_epochs: 15
  gradient_clip_norm: 1.0
```
configs/training/quick_test.yaml
CHANGED

@@ -0,0 +1,9 @@
```yaml
dataloader:
  batch_size: 2
  shuffle: false
optimizer:
  name: adamw
  lr: 1.0e-4
trainer:
  max_epochs: 1
  gradient_clip_norm: 0.5
```
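These training configs name an `adamw` optimizer and a `cosine` schedule with warmup. A minimal sketch of how such keys could be wired into torch objects; the mapping is an assumption (the repo's `Trainer` owns the real wiring), and `total_steps` here is a placeholder that would normally be derived from the dataloader length:

```python
# Hedged sketch: turn configs/training/default.yaml into an AdamW optimizer
# plus linear-warmup + cosine-decay schedule. Illustrative only.
import math

import torch
import yaml

with open("configs/training/default.yaml", "r", encoding="utf-8") as fh:
    cfg = yaml.safe_load(fh)

model = torch.nn.Linear(8, 8)  # stand-in for the multitask model
optimizer = torch.optim.AdamW(model.parameters(), lr=float(cfg["optimizer"]["lr"]))

warmup = int(cfg["scheduler"]["warmup_steps"])
total_steps = 10_000  # assumption; derive from len(dataloader) * max_epochs in practice


def lr_lambda(step: int) -> float:
    # Linear warmup, then cosine decay to zero.
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```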
docker/Dockerfile
DELETED (file without changes)

docker/docker-compose.yml
DELETED (file without changes)
docs/api.md
CHANGED

@@ -0,0 +1,79 @@

# API & CLI Documentation

## FastAPI Service
The FastAPI application is defined in `src/api/app.py` and wires routes from `src/api/routes.py`. All dependencies resolve through `src/api/dependencies.py`, which lazily constructs the shared inference pipeline.

### POST `/summarize`
- **Request Body** (`SummaryRequest`):

  ```json
  {
    "text": "Your input document"
  }
  ```
- **Response** (`SummaryResponse`):

  ```json
  {
    "summary": "Generated abstractive summary",
    "emotion_labels": ["joy", "surprise"],
    "emotion_scores": [0.91, 0.63],
    "topic": "news",
    "topic_confidence": 0.82
  }
  ```
- **Behaviour:**
  1. Text is preprocessed through `TextPreprocessor` (with an optional sklearn transformer if configured).
  2. The multitask model generates a summary via greedy decoding.
  3. Emotion and topic heads produce logits, which are converted to probabilities and mapped to human-readable labels using `artifacts/labels.json`.
  4. Results are returned as structured JSON suitable for a future Gradio interface.

### Error Handling
- If the checkpoint or label metadata is missing, the dependency raises an HTTP 503 error with an explanatory message.
- Validation errors (missing `text`) are handled automatically by FastAPI/Pydantic.

## Command-Line Interface
`scripts/inference.py` provides a CLI that mirrors the API behaviour.

### Usage
```bash
python scripts/inference.py "Document to analyse" \
  --checkpoint checkpoints/best.pt \
  --labels artifacts/labels.json \
  --tokenizer artifacts/hf_tokenizer \
  --model-config configs/model/base.yaml \
  --device cpu
```

Options:
- `text` – zero or more positional arguments. If omitted, use `--file` to point to a newline-delimited text file.
- `--file` – optional path containing one text per line.
- `--checkpoint` – path to the trained model weights.
- `--labels` – JSON containing emotion/topic vocabularies (defaults to `artifacts/labels.json`).
- `--tokenizer` – optional tokenizer directory; defaults to the exported artifact if present.
- `--model-config` – YAML describing the architecture.
- `--device` – `cpu` or `cuda`. Passing `cuda` attempts to run inference on GPU.
- `--summary-max-length` – overrides the default maximum generation length.

### Output
The CLI prints a JSON array where each entry contains the original text, summary, emotion labels with scores, and topic prediction. This format is identical to the REST response, facilitating integration tests and future Gradio UI rendering.

## Future Gradio UI
- The planned UI will call the same inference pipeline and display results interactively.
- Given the response schema, the UI can show:
  - Generated summary text.
  - Emotion chips with probability bars.
  - Topic confidence gauges.
  - Placeholder panel for attention heatmaps and explanations.
- Once implemented, documentation updates will add a `docs/ui.md` section and screenshots under `docs/images/`.

## Testing
- `tests/test_api/test_routes.py` stubs the pipeline to ensure response fields and dependency overrides behave as expected.
- `tests/test_inference/test_pipeline.py` validates pipeline methods end-to-end with dummy models, guaranteeing API and CLI consumers receive consistent payload shapes.
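A minimal client sketch for the `/summarize` endpoint documented above, using `requests` (not part of the trimmed requirements, so install it separately); the host and port are assumptions, while the request and response fields follow the `SummaryRequest`/`SummaryResponse` schemas:

```python
# Hedged sketch: call POST /summarize against a locally running FastAPI app.
import requests

resp = requests.post(
    "http://localhost:8000/summarize",  # assumed host/port
    json={"text": "Your input document"},
    timeout=30,
)
resp.raise_for_status()
payload = resp.json()

# Fields mirror SummaryResponse in the docs above.
print(payload["summary"])
print(list(zip(payload["emotion_labels"], payload["emotion_scores"])))
print(payload["topic"], payload["topic_confidence"])
```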
docs/architecture.md
CHANGED

@@ -0,0 +1,57 @@

# LexiMind Architecture

## Overview
LexiMind couples a from-scratch Transformer implementation with a modern data and inference stack. The project consists of three major layers:

1. **Data & Preprocessing** – lightweight text cleaning built on top of scikit-learn primitives and a Hugging Face tokenizer wrapper with deterministic batching helpers.
2. **Model Composition** – the bespoke encoder/decoder stack with task heads assembled via `MultiTaskModel`, plus `models.factory.build_multitask_model` to rebuild the network from configuration files.
3. **Inference & Serving** – a multi-task pipeline capable of summarization, emotion, and topic classification; surfaced through a CLI and FastAPI service with plans for a Gradio UI.

## Custom Transformer Stack
- `src/models/encoder.py` and `src/models/decoder.py` implement Pre-LayerNorm Transformer blocks with explicit positional encoding, masking logic, and incremental decoding support.
- `src/models/heads.py` provides modular output heads. Summarization uses an `LMHead` tied to the decoder embedding weights; emotion and topic tasks use `ClassificationHead` instances.
- `src/models/multitask.py` routes inputs to the correct head, computes task-specific losses, and exposes a single forward API used by the trainer and inference pipeline.
- `src/models/factory.py` rebuilds the encoder, decoder, and heads directly from YAML config and tokenizer metadata so inference rebuilds the exact architecture used in training.

## Data, Tokenization, and Preprocessing
- `src/data/tokenization.py` wraps `AutoTokenizer` to provide tensor-aware batching and helper utilities for decoder input shifting, BOS/EOS resolution, and vocab size retrieval.
- `src/data/preprocessing.py` introduces `TextPreprocessor`, layering a `BasicTextCleaner` with optional scikit-learn transformers (via `sklearn_transformer`) before tokenization. This keeps the default cleaning minimal while allowing future reuse of `sklearn.preprocessing` utilities without changing calling code.
- `src/data/dataset.py` and `src/data/dataloader.py` define strongly typed dataset containers and collators that encode inputs with the shared tokenizer and set up task-specific labels (multi-label emotions, categorical topics, seq2seq summaries).

## Training Pipeline
- `src/training/trainer.py` coordinates multi-task optimization with per-task loss functions, gradient clipping, and shared tokenizer decoding for metric computation.
- Metrics in `src/training/metrics.py` include accuracy, multi-label F1, and a ROUGE-like overlap score for summarization. These metrics mirror the trainer outputs logged per task.
- Label vocabularies are serialized to `artifacts/labels.json` after training so inference can decode class indices consistently.

## Inference & Serving
- `src/inference/pipeline.py` exposes summarization, emotion, and topic predictions with shared pre-processing, generation, and thresholding logic. It expects label vocabularies from the serialized metadata file.
- `src/inference/factory.py` rebuilds the full pipeline by loading the tokenizer (preferring the exported tokenizer artifact), reconstructing the model via the factory helpers, restoring checkpoints, and injecting label metadata.
- The CLI (`scripts/inference.py`) drives the pipeline from the command line. The FastAPI app (`src/api/routes.py`) exposes the `/summarize` endpoint that returns summaries, emotion labels + scores, and topic predictions. Test coverage in `tests/test_inference` and `tests/test_api` validates both layers with lightweight stubs.

## Gradio UI Roadmap
- The inference pipeline returns structured outputs that are already suitable for a web UI.
- Planned steps for a Gradio demo:
  1. Wrap `InferencePipeline.batch_predict` inside Gradio callbacks for text input.
  2. Display summaries alongside emotion tag chips and topic confidence bars.
  3. Surface token-level attention visualizations by extending the pipeline to emit decoder attention maps (hooks already exist in the decoder).
- Documentation and code paths were structured to keep the Gradio integration isolated in a future `src/ui/gradio_app.py` module without altering core logic.

## Key Decisions
- **Custom Transformer Preservation** – all modeling remains on the bespoke encoder/decoder, satisfying the constraint to avoid Hugging Face model classes while still leveraging their tokenizer implementation.
- **Tokenizer Artifact Preference** – inference automatically favors the exported tokenizer in `artifacts/hf_tokenizer`, guaranteeing consistent vocabularies between training and serving.
- **Sklearn-friendly Preprocessing** – the text preprocessor now accepts an optional `TransformerMixin` so additional normalization (lemmatization, custom token filters, etc.) can be injected using familiar scikit-learn tooling without rewriting the batching code.
- **Documentation Alignment** – the `docs/` folder mirrors the structure requested, capturing design reasoning and paving the way for future diagrams in `docs/images`.
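To make the "Sklearn-friendly Preprocessing" decision concrete: the kind of `TransformerMixin` the docs describe could look like the sketch below. The sklearn protocol (`fit`/`transform`) is standard; the constructor keyword and the lowercasing rule are illustrative assumptions, not the repo's actual API:

```python
# Hedged sketch: a stateless sklearn transformer of the sort that could be
# injected into TextPreprocessor via its documented sklearn_transformer hook.
from sklearn.base import BaseEstimator, TransformerMixin


class LowercaseNormalizer(BaseEstimator, TransformerMixin):
    """Lowercase and strip each text; follows the sklearn fit/transform protocol."""

    def fit(self, X, y=None):
        return self  # nothing to learn; kept for protocol compatibility

    def transform(self, X):
        return [text.lower().strip() for text in X]


# Hypothetical wiring, mirroring the documented hook:
# preprocessor = TextPreprocessor(sklearn_transformer=LowercaseNormalizer())
```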
docs/training.md
CHANGED

@@ -0,0 +1,59 @@

# Training Procedure

## Data Sources
- **Summarization** – expects JSONL files with `source` and `summary` fields under `data/processed/summarization`.
- **Emotion Classification** – multi-label samples loaded from JSONL files with `text` and `emotions` arrays. The dataset owns a `MultiLabelBinarizer` for consistent encoding.
- **Topic Classification** – single-label categorical samples with `text` and `topic` fields, encoded via `LabelEncoder`.

Paths and tokenizer defaults are configured in `configs/data/datasets.yaml`. The tokenizer section chooses the Hugging Face backbone (`facebook/bart-base` by default) and maximum length. Gutenberg book downloads are controlled via the `downloads.books` list (each entry includes `name`, `url`, and `output`).

## Dataloaders & Collators
- `SummarizationCollator` encodes encoder/decoder inputs, prepares decoder input IDs via `Tokenizer.prepare_decoder_inputs`, and masks padding tokens with `-100` for loss computation.
- `EmotionCollator` applies the dataset's `MultiLabelBinarizer`, returning dense float tensors suitable for `BCEWithLogitsLoss`.
- `TopicCollator` emits integer class IDs via the dataset's `LabelEncoder` for `CrossEntropyLoss`.

These collators keep all tokenization centralized, reducing duplication and making it easy to swap in additional sklearn transformations through `TextPreprocessor` should we wish to extend cleaning or normalization.

## Model Assembly
- `src/models/factory.build_multitask_model` rebuilds the encoder, decoder, and heads from the tokenizer metadata and YAML config. This factory is used both during training and inference to eliminate drift between environments.
- The model wraps:
  - Transformer encoder/decoder stacks with shared positional encodings.
  - LM head tied to decoder embeddings for summarization.
  - Mean-pooled classification heads for emotion and topic tasks.

## Optimisation Loop
- `src/training/trainer.Trainer` orchestrates multi-task training.
- Cross-entropy is used for summarization (seq2seq logits vs. shifted labels).
- `BCEWithLogitsLoss` handles multi-label emotions.
- `CrossEntropyLoss` handles topic classification.
- Gradient clipping ensures stability, and per-task weights can be configured via `TrainerConfig.task_weights` to balance gradients if needed.
- Metrics tracked per task:
  - **Summarization** – ROUGE-like overlap metric (`training.metrics.rouge_like`).
  - **Emotion** – micro F1 score for multi-label predictions.
  - **Topic** – categorical accuracy.

## Checkpoints & Artifacts
- `src/utils/io.save_state` stores model weights; checkpoints live under `checkpoints/`.
- `artifacts/labels.json` captures the ordered emotion/topic vocabularies immediately after training. This file is required for inference so class indices map back to human-readable labels.
- The tokenizer is exported to `artifacts/hf_tokenizer/` for reproducible vocabularies.

## Running Training
1. Ensure processed datasets are available (see `data/processed/` structure).
2. Choose a configuration (e.g., `configs/training/default.yaml`) for hyperparameters and data splits.
3. Instantiate the tokenizer via `TokenizerConfig` and build datasets/dataloaders.
4. Use `build_multitask_model` to construct the model, create an optimizer, and run `Trainer.fit(train_loaders, val_loaders)`.
5. Save checkpoints and update `artifacts/labels.json` with the dataset label order.

> **Note:** A full CLI for training is forthcoming. The scripts in `scripts/` currently act as scaffolding; once the Gradio UI is introduced we will extend these utilities to launch training jobs with configuration files directly.

## Future Enhancements
- Integrate curriculum scheduling or task-balanced sampling once empirical results dictate.
- Capture attention maps during training to support visualization in the planned Gradio UI.
- Leverage the optional `sklearn_transformer` hook in `TextPreprocessor` for lemmatization or domain-specific normalization when datasets require it.
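The `-100` masking rule mentioned in the collator section above relies on a PyTorch default worth seeing in isolation: positions labeled `-100` are skipped by `CrossEntropyLoss` (its default `ignore_index`). A self-contained sketch with toy tensors:

```python
# Sketch of the padding-masking convention SummarizationCollator uses:
# label positions set to -100 contribute nothing to the loss.
import torch
import torch.nn.functional as F

logits = torch.randn(2, 5, 100)          # (batch, seq_len, vocab_size)
labels = torch.randint(0, 100, (2, 5))   # target token ids
labels[:, 3:] = -100                     # pretend the last two positions are padding

loss = F.cross_entropy(
    logits.view(-1, logits.size(-1)),    # flatten to (batch * seq_len, vocab_size)
    labels.view(-1),
    ignore_index=-100,                   # the default; shown explicitly for clarity
)
print(loss)
```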
pyproject.toml
CHANGED

@@ -13,19 +13,14 @@ license = {text = "GPL-3.0"}

 dependencies = [
     "torch>=2.0.0",
-    "
-    "datasets>=2.14.0",
-    "tokenizers>=0.13.0",
+    "scikit-learn>=1.4.0",
     "numpy>=1.24.0",
     "pandas>=2.0.0",
-    "
-    "
-    "
-    "
-    "
-    "omegaconf>=2.3.0",
-    "tensorboard>=2.13.0",
-    "gradio>=3.35.0",
+    "streamlit>=1.25.0",
+    "plotly>=5.18.0",
+    "transformers>=4.40.0",
+    "fastapi>=0.110.0",
+    "datasets>=4.4.0",
 ]

 [project.optional-dependencies]
requirements.txt
CHANGED

@@ -1,22 +1,12 @@

 # requirements.txt
 torch>=2.0.0
-transformers>=4.
-tokenizers>=0.13.0
+transformers>=4.40.0
+scikit-learn>=1.4.0
 numpy>=1.24.0
 pandas>=2.0.0
-scikit-learn>=1.3.0
-matplotlib>=3.7.0
-seaborn>=0.12.0
-nltk>=3.8.0
-tqdm>=4.65.0
-pyyaml>=6.0
-omegaconf>=2.3.0
-tensorboard>=2.13.0
-gradio>=3.35.0
-requests>=2.31.0
-kaggle>=1.5.12
 streamlit>=1.25.0
 plotly>=5.18.0
+fastapi>=0.110.0
+datasets>=4.4.0
+pytest
+matplotlib
scripts/download_data.py
ADDED

@@ -0,0 +1,182 @@
```python
"""Download datasets used by LexiMind."""
from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path
from typing import Iterable, Iterator, cast

from datasets import ClassLabel, Dataset, DatasetDict, load_dataset


PROJECT_ROOT = Path(__file__).resolve().parents[1]
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from src.data.download import gutenberg_download, kaggle_download
from src.utils.config import load_yaml


DEFAULT_SUMMARIZATION_DATASET = "gowrishankarp/newspaper-text-summarization-cnn-dailymail"
DEFAULT_EMOTION_DATASET = "dair-ai/emotion"
DEFAULT_TOPIC_DATASET = "ag_news"
DEFAULT_BOOK_URL = "https://www.gutenberg.org/cache/epub/1342/pg1342.txt"
DEFAULT_BOOK_OUTPUT = "data/raw/books/pride_and_prejudice.txt"


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Download datasets required for LexiMind training")
    parser.add_argument(
        "--config",
        default="configs/data/datasets.yaml",
        help="Path to the dataset configuration YAML.",
    )
    parser.add_argument("--skip-kaggle", action="store_true", help="Skip downloading the Kaggle summarization dataset.")
    parser.add_argument("--skip-book", action="store_true", help="Skip downloading Gutenberg book texts.")
    return parser.parse_args()


def _safe_load_config(path: str | None) -> dict:
    if not path:
        return {}
    config_path = Path(path)
    if not config_path.exists():
        raise FileNotFoundError(f"Config file not found: {config_path}")
    return load_yaml(str(config_path)).data


def _write_jsonl(records: Iterable[dict[str, object]], destination: Path) -> None:
    destination.parent.mkdir(parents=True, exist_ok=True)
    with destination.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def _emotion_records(dataset_split: Dataset, label_names: list[str] | None) -> Iterator[dict[str, object]]:
    for item in dataset_split:
        data = dict(item)
        text = data.get("text", "")
        label_value = data.get("label")

        def resolve_label(index: object) -> str:
            if isinstance(index, int) and label_names and 0 <= index < len(label_names):
                return label_names[index]
            return str(index)

        if isinstance(label_value, list):
            labels = [resolve_label(idx) for idx in label_value]
        else:
            labels = [resolve_label(label_value)]
        yield {"text": text, "emotions": labels}


def _topic_records(dataset_split: Dataset, label_names: list[str] | None) -> Iterator[dict[str, object]]:
    for item in dataset_split:
        data = dict(item)
        text = data.get("text") or data.get("content") or ""
        label_value = data.get("label")

        def resolve_topic(raw: object) -> str:
            if label_names:
                idx: int | None = None
                if isinstance(raw, int):
                    idx = raw
                elif isinstance(raw, str):
                    try:
                        idx = int(raw)
                    except ValueError:
                        idx = None
                if idx is not None and 0 <= idx < len(label_names):
                    return label_names[idx]
            return str(raw) if raw is not None else ""

        if isinstance(label_value, list):
            topic = resolve_topic(label_value[0]) if label_value else ""
        else:
            topic = resolve_topic(label_value)
        yield {"text": text, "topic": topic}


def main() -> None:
    args = parse_args()
    config = _safe_load_config(args.config)

    raw_paths = config.get("raw", {}) if isinstance(config, dict) else {}
    downloads_cfg = config.get("downloads", {}) if isinstance(config, dict) else {}

    summarization_cfg = downloads_cfg.get("summarization", {}) if isinstance(downloads_cfg, dict) else {}
    summarization_dataset = summarization_cfg.get("dataset", DEFAULT_SUMMARIZATION_DATASET)
    summarization_output = summarization_cfg.get("output", raw_paths.get("summarization", "data/raw/summarization"))

    if not args.skip_kaggle and summarization_dataset:
        print(f"Downloading summarization dataset '{summarization_dataset}' -> {summarization_output}")
        kaggle_download(summarization_dataset, summarization_output)
    else:
        print("Skipping Kaggle summarization download.")

    books_root = Path(raw_paths.get("books", "data/raw/books"))
    books_root.mkdir(parents=True, exist_ok=True)

    books_entries: list[dict[str, object]] = []
    if isinstance(downloads_cfg, dict):
        raw_entries = downloads_cfg.get("books")
        if isinstance(raw_entries, list):
            books_entries = [entry for entry in raw_entries if isinstance(entry, dict)]

    if not args.skip_book:
        if not books_entries:
            books_entries = [
                {
                    "name": "pride_and_prejudice",
                    "url": DEFAULT_BOOK_URL,
                    "output": DEFAULT_BOOK_OUTPUT,
                }
            ]
        for entry in books_entries:
            name = str(entry.get("name") or "gutenberg_text")
            url = str(entry.get("url") or DEFAULT_BOOK_URL)
            output_value = entry.get("output")
            destination = Path(output_value) if isinstance(output_value, str) and output_value else books_root / f"{name}.txt"
            destination.parent.mkdir(parents=True, exist_ok=True)
            print(f"Downloading Gutenberg text '{name}' from {url} -> {destination}")
            gutenberg_download(url, str(destination))
    else:
        print("Skipping Gutenberg downloads.")

    emotion_cfg = downloads_cfg.get("emotion", {}) if isinstance(downloads_cfg, dict) else {}
    emotion_name = emotion_cfg.get("dataset", DEFAULT_EMOTION_DATASET)
    emotion_dir = Path(raw_paths.get("emotion", "data/raw/emotion"))
    emotion_dir.mkdir(parents=True, exist_ok=True)
    print(f"Downloading emotion dataset '{emotion_name}' -> {emotion_dir}")
    emotion_dataset = cast(DatasetDict, load_dataset(emotion_name))
    first_emotion_key = next(iter(emotion_dataset.keys()), None) if emotion_dataset else None
    emotion_label_feature = (
        emotion_dataset[first_emotion_key].features.get("label")
        if first_emotion_key is not None
        else None
    )
    emotion_label_names = emotion_label_feature.names if isinstance(emotion_label_feature, ClassLabel) else None
    for split_name, split in emotion_dataset.items():
        output_path = emotion_dir / f"{str(split_name)}.jsonl"
        _write_jsonl(_emotion_records(split, emotion_label_names), output_path)

    topic_cfg = downloads_cfg.get("topic", {}) if isinstance(downloads_cfg, dict) else {}
    topic_name = topic_cfg.get("dataset", DEFAULT_TOPIC_DATASET)
    topic_dir = Path(raw_paths.get("topic", "data/raw/topic"))
    topic_dir.mkdir(parents=True, exist_ok=True)
    print(f"Downloading topic dataset '{topic_name}' -> {topic_dir}")
    topic_dataset = cast(DatasetDict, load_dataset(topic_name))
    first_topic_key = next(iter(topic_dataset.keys()), None) if topic_dataset else None
    topic_label_feature = (
        topic_dataset[first_topic_key].features.get("label")
        if first_topic_key is not None
        else None
    )
    topic_label_names = topic_label_feature.names if isinstance(topic_label_feature, ClassLabel) else None
    for split_name, split in topic_dataset.items():
        output_path = topic_dir / f"{str(split_name)}.jsonl"
        _write_jsonl(_topic_records(split, topic_label_names), output_path)

    print("Download routine finished.")


if __name__ == "__main__":
    main()
```
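Usage follows directly from the argparse definitions above; for example, to pull only the Hugging Face emotion/topic datasets while skipping Kaggle credentials and Gutenberg fetches:

```bash
python scripts/download_data.py \
  --config configs/data/datasets.yaml \
  --skip-kaggle \
  --skip-book
```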
scripts/download_data.sh
ADDED

@@ -0,0 +1,5 @@
```bash
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
python3 "${SCRIPT_DIR}/download_data.py"
```
scripts/evaluate.py
ADDED

@@ -0,0 +1,134 @@
```python
"""Evaluate the multitask model on processed validation/test splits."""
from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path
from typing import List

import torch
from sklearn.preprocessing import MultiLabelBinarizer

PROJECT_ROOT = Path(__file__).resolve().parents[1]
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from src.data.dataset import (
    load_emotion_jsonl,
    load_summarization_jsonl,
    load_topic_jsonl,
)
from src.inference.factory import create_inference_pipeline
from src.training.metrics import accuracy, multilabel_f1, rouge_like
from src.utils.config import load_yaml


SPLIT_ALIASES = {
    "train": ("train",),
    "val": ("val", "validation"),
    "test": ("test",),
}


def _read_split(root: Path, split: str, loader) -> list:
    aliases = SPLIT_ALIASES.get(split, (split,))
    for alias in aliases:
        for ext in ("jsonl", "json"):
            candidate = root / f"{alias}.{ext}"
            if candidate.exists():
                return loader(str(candidate))
    raise FileNotFoundError(f"Missing {split} split under {root}")


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Evaluate the LexiMind multitask model")
    parser.add_argument("--split", default="val", choices=["train", "val", "test"], help="Dataset split to evaluate.")
    parser.add_argument("--checkpoint", default="checkpoints/best.pt", help="Path to the trained checkpoint.")
    parser.add_argument("--labels", default="artifacts/labels.json", help="Label metadata JSON.")
    parser.add_argument("--data-config", default="configs/data/datasets.yaml", help="Data configuration YAML.")
    parser.add_argument("--model-config", default="configs/model/base.yaml", help="Model architecture YAML.")
    parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu", help="Device for evaluation.")
    parser.add_argument("--batch-size", type=int, default=16, help="Batch size for generation/classification during evaluation.")
    return parser.parse_args()


def chunks(items: List, size: int):
    for start in range(0, len(items), size):
        yield items[start : start + size]


def main() -> None:
    args = parse_args()
    data_cfg = load_yaml(args.data_config).data

    pipeline, metadata = create_inference_pipeline(
        checkpoint_path=args.checkpoint,
        labels_path=args.labels,
        tokenizer_config=None,
        model_config_path=args.model_config,
        device=args.device,
    )

    summarization_dir = Path(data_cfg["processed"]["summarization"])
    emotion_dir = Path(data_cfg["processed"]["emotion"])
    topic_dir = Path(data_cfg["processed"]["topic"])

    summary_examples = _read_split(summarization_dir, args.split, load_summarization_jsonl)
    emotion_examples = _read_split(emotion_dir, args.split, load_emotion_jsonl)
    topic_examples = _read_split(topic_dir, args.split, load_topic_jsonl)

    emotion_binarizer = MultiLabelBinarizer(classes=metadata.emotion)
    # Ensure scikit-learn initializes the attributes using metadata ordering.
    emotion_binarizer.fit([[label] for label in metadata.emotion])

    # Summarization
    summaries_pred = []
    summaries_ref = []
    for batch in chunks(summary_examples, args.batch_size):
        inputs = [example.source for example in batch]
        summaries_pred.extend(pipeline.summarize(inputs))
        summaries_ref.extend([example.summary for example in batch])
    rouge_score = rouge_like(summaries_pred, summaries_ref)

    # Emotion
    emotion_preds_tensor = []
    emotion_target_tensor = []
    label_to_index = {label: idx for idx, label in enumerate(metadata.emotion)}
    for batch in chunks(emotion_examples, args.batch_size):
        inputs = [example.text for example in batch]
        predictions = pipeline.predict_emotions(inputs)
        target_matrix = emotion_binarizer.transform([list(example.emotions) for example in batch])
        for pred, target_row in zip(predictions, target_matrix):
            vector = torch.zeros(len(metadata.emotion), dtype=torch.float32)
            for label in pred.labels:
                idx = label_to_index.get(label)
                if idx is not None:
                    vector[idx] = 1.0
            emotion_preds_tensor.append(vector)
            emotion_target_tensor.append(torch.tensor(target_row, dtype=torch.float32))
    emotion_f1 = multilabel_f1(torch.stack(emotion_preds_tensor), torch.stack(emotion_target_tensor))

    # Topic
    topic_preds = []
    topic_targets = []
    for batch in chunks(topic_examples, args.batch_size):
        inputs = [example.text for example in batch]
        predictions = pipeline.predict_topics(inputs)
        topic_preds.extend([pred.label for pred in predictions])
        topic_targets.extend([example.topic for example in batch])
    topic_accuracy = accuracy(topic_preds, topic_targets)

    print(json.dumps(
        {
            "split": args.split,
            "rouge_like": rouge_score,
            "emotion_f1": emotion_f1,
            "topic_accuracy": topic_accuracy,
        },
        indent=2,
    ))


if __name__ == "__main__":
    main()
```
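A typical invocation, built from the script's own defaults and flags; it prints one JSON object with `split`, `rouge_like`, `emotion_f1`, and `topic_accuracy` keys:

```bash
python scripts/evaluate.py \
  --split test \
  --checkpoint checkpoints/best.pt \
  --labels artifacts/labels.json \
  --batch-size 32
```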
scripts/export_model.py
ADDED

@@ -0,0 +1,69 @@
```python
"""Rebuild and export the trained multitask model for downstream use."""
from __future__ import annotations

import argparse
from pathlib import Path

import torch

from src.data.tokenization import Tokenizer, TokenizerConfig
from src.models.factory import build_multitask_model, load_model_config
from src.utils.config import load_yaml
from src.utils.labels import load_label_metadata


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Export LexiMind model weights")
    parser.add_argument("--checkpoint", default="checkpoints/best.pt", help="Path to the trained checkpoint.")
    parser.add_argument("--output", default="outputs/model.pt", help="Output path for the exported state dict.")
    parser.add_argument("--labels", default="artifacts/labels.json", help="Label metadata JSON produced after training.")
    parser.add_argument("--model-config", default="configs/model/base.yaml", help="Model architecture configuration.")
    parser.add_argument("--data-config", default="configs/data/datasets.yaml", help="Data configuration (for tokenizer settings).")
    return parser.parse_args()


def main() -> None:
    """Export multitask model weights from a training checkpoint to a standalone state dict."""
    args = parse_args()

    checkpoint = Path(args.checkpoint)
    if not checkpoint.exists():
        raise FileNotFoundError(checkpoint)

    labels = load_label_metadata(args.labels)
    data_cfg = load_yaml(args.data_config).data
    tokenizer_section = data_cfg.get("tokenizer", {})
    tokenizer_config = TokenizerConfig(
        pretrained_model_name=tokenizer_section.get("pretrained_model_name", "facebook/bart-base"),
        max_length=int(tokenizer_section.get("max_length", 512)),
        lower=bool(tokenizer_section.get("lower", False)),
    )
    tokenizer = Tokenizer(tokenizer_config)

    model = build_multitask_model(
        tokenizer,
        num_emotions=labels.emotion_size,
        num_topics=labels.topic_size,
        config=load_model_config(args.model_config),
    )

    raw_state = torch.load(checkpoint, map_location="cpu")
    if isinstance(raw_state, dict):
        if "model_state_dict" in raw_state and isinstance(raw_state["model_state_dict"], dict):
            state_dict = raw_state["model_state_dict"]
        elif "state_dict" in raw_state and isinstance(raw_state["state_dict"], dict):
            state_dict = raw_state["state_dict"]
        else:
            state_dict = raw_state
    else:
        raise TypeError(f"Unsupported checkpoint format: expected dict, got {type(raw_state)!r}")
    model.load_state_dict(state_dict)

    output_path = Path(args.output)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    torch.save(model.state_dict(), output_path)
    print(f"Model exported to {output_path}")


if __name__ == "__main__":
    main()
```
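A minimal sketch of consuming the exported state dict on the other side. The model has to be rebuilt with the same config and label sizes before loading; the helper calls mirror the script above, and the literal paths are the script's defaults rather than requirements:

```python
# Hedged sketch: reload outputs/model.pt produced by scripts/export_model.py.
import torch

from src.data.tokenization import Tokenizer, TokenizerConfig
from src.models.factory import build_multitask_model, load_model_config
from src.utils.labels import load_label_metadata

labels = load_label_metadata("artifacts/labels.json")
tokenizer = Tokenizer(TokenizerConfig(pretrained_model_name="artifacts/hf_tokenizer"))

# Rebuild the same architecture the weights were exported from.
model = build_multitask_model(
    tokenizer,
    num_emotions=labels.emotion_size,
    num_topics=labels.topic_size,
    config=load_model_config("configs/model/base.yaml"),
)
model.load_state_dict(torch.load("outputs/model.pt", map_location="cpu"))
model.eval()
```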
scripts/inference.py
ADDED
|
@@ -0,0 +1,112 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""Run inference with the multitask model."""
from __future__ import annotations

import argparse
import json
from pathlib import Path
from typing import List, cast

from src.data.tokenization import TokenizerConfig
from src.inference import EmotionPrediction, TopicPrediction, create_inference_pipeline


def _load_texts(positional: List[str], file_path: Path | None) -> List[str]:
    texts = [text for text in positional if text]
    if file_path is not None:
        if not file_path.exists():
            raise FileNotFoundError(file_path)
        with file_path.open("r", encoding="utf-8") as handle:
            texts.extend([line.strip() for line in handle if line.strip()])
    if not texts:
        raise ValueError("No input texts provided. Pass text arguments or use --file.")
    return texts


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run LexiMind multitask inference.")
    parser.add_argument("text", nargs="*", help="Input text(s) to analyse.")
    parser.add_argument("--file", type=Path, help="Path to a file containing one text per line.")
    parser.add_argument(
        "--checkpoint",
        type=Path,
        default=Path("checkpoints/best.pt"),
        help="Path to the model checkpoint produced during training.",
    )
    parser.add_argument(
        "--labels",
        type=Path,
        default=Path("artifacts/labels.json"),
        help="JSON file containing emotion/topic label vocabularies.",
    )
    parser.add_argument(
        "--tokenizer",
        type=Path,
        default=None,
        help="Optional path to a tokenizer directory exported during training.",
    )
    parser.add_argument(
        "--model-config",
        type=Path,
        default=Path("configs/model/base.yaml"),
        help="Model architecture config used to rebuild the transformer stack.",
    )
    parser.add_argument("--device", default="cpu", help="Device to run inference on (cpu or cuda).")
    parser.add_argument(
        "--summary-max-length",
        type=int,
        default=None,
        help="Optional maximum length for generated summaries.",
    )
    return parser.parse_args()


def main() -> None:
    args = parse_args()
    texts = _load_texts(args.text, args.file)

    tokenizer_config = None
    if args.tokenizer is not None:
        tokenizer_config = TokenizerConfig(pretrained_model_name=str(args.tokenizer))
    else:
        local_dir = Path("artifacts/hf_tokenizer")
        if local_dir.exists():
            tokenizer_config = TokenizerConfig(pretrained_model_name=str(local_dir))

    pipeline, _ = create_inference_pipeline(
        checkpoint_path=args.checkpoint,
        labels_path=args.labels,
        tokenizer_config=tokenizer_config,
        model_config_path=args.model_config,
        device=args.device,
        summary_max_length=args.summary_max_length,
    )

    results = pipeline.batch_predict(texts)
    summaries = cast(List[str], results["summaries"])
    emotion_preds = cast(List[EmotionPrediction], results["emotion"])
    topic_preds = cast(List[TopicPrediction], results["topic"])

    packaged = []
    for idx, text in enumerate(texts):
        emotion = emotion_preds[idx]
        topic = topic_preds[idx]
        packaged.append(
            {
                "text": text,
                "summary": summaries[idx],
                "emotion": {
                    "labels": emotion.labels,
                    "scores": emotion.scores,
                },
                "topic": {
                    "label": topic.label,
                    "confidence": topic.confidence,
                },
            }
        )

    print(json.dumps(packaged, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    main()
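The script prints a JSON array with one object per input, mirroring the `packaged` dict built above. An illustrative element (the label names and scores here are invented):

```python
# Shape of one output element; values are made up for illustration only.
example_output = {
    "text": "the raw input text",
    "summary": "the generated summary",
    "emotion": {"labels": ["joy", "surprise"], "scores": [0.91, 0.40]},
    "topic": {"label": "business", "confidence": 0.87},
}
```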
scripts/preprocess_data.py
ADDED
@@ -0,0 +1,321 @@
"""Preprocess raw datasets into JSONL splits for LexiMind training."""
from __future__ import annotations

import argparse
import csv
import json
import sys
from pathlib import Path
from typing import Dict, Iterable, Iterator, Sequence, Tuple

from sklearn.model_selection import train_test_split

PROJECT_ROOT = Path(__file__).resolve().parents[1]
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from src.data.preprocessing import BasicTextCleaner
from src.utils.config import load_yaml


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Preprocess datasets configured for LexiMind")
    parser.add_argument(
        "--config",
        default="configs/data/datasets.yaml",
        help="Path to data configuration YAML.",
    )
    parser.add_argument(
        "--val-ratio",
        type=float,
        default=0.1,
        help="Validation split size for the topic dataset when no validation split is present.",
    )
    parser.add_argument("--seed", type=int, default=17, help="Random seed for deterministic splitting.")
    return parser.parse_args()


def _resolve_csv(base: Path, filename: str) -> Path | None:
    primary = base / filename
    if primary.exists():
        return primary
    nested = base / "cnn_dailymail" / filename
    if nested.exists():
        return nested
    return None


def _write_jsonl(records: Iterable[Dict[str, object]], destination: Path) -> None:
    destination.parent.mkdir(parents=True, exist_ok=True)
    with destination.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def _read_jsonl(path: Path) -> Iterator[Dict[str, object]]:
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            row = line.strip()
            if not row:
                continue
            yield json.loads(row)


def preprocess_books(
    raw_dir: Path,
    processed_dir: Path,
    cleaner: BasicTextCleaner,
    *,
    min_tokens: int = 30,
) -> None:
    if not raw_dir.exists():
        print(f"Skipping book preprocessing (missing directory: {raw_dir})")
        return

    processed_dir.mkdir(parents=True, exist_ok=True)
    index: list[Dict[str, object]] = []

    for book_path in sorted(raw_dir.glob("*.txt")):
        text = book_path.read_text(encoding="utf-8").lstrip("\ufeff")
        normalized = text.replace("\r\n", "\n")
        paragraphs = [paragraph.strip() for paragraph in normalized.split("\n\n") if paragraph.strip()]

        records: list[Dict[str, object]] = []
        for paragraph_id, paragraph in enumerate(paragraphs):
            cleaned = cleaner.transform([paragraph])[0]
            tokens = cleaned.split()
            if len(tokens) < min_tokens:
                continue
            record = {
                "book": book_path.stem,
                "title": book_path.stem.replace("_", " ").title(),
                "paragraph_id": paragraph_id,
                "text": paragraph,
                "clean_text": cleaned,
                "token_count": len(tokens),
                "char_count": len(paragraph),
            }
            records.append(record)

        if not records:
            print(f"No suitably sized paragraphs found in {book_path}; skipping.")
            continue

        output_path = processed_dir / f"{book_path.stem}.jsonl"
        print(f"Writing book segments for '{book_path.stem}' to {output_path}")
        _write_jsonl(records, output_path)
        index.append(
            {
                "book": book_path.stem,
                "title": records[0]["title"],
                "paragraphs": len(records),
                "source": str(book_path),
                "output": str(output_path),
            }
        )

    if index:
        index_path = processed_dir / "index.json"
        with index_path.open("w", encoding="utf-8") as handle:
            json.dump(index, handle, ensure_ascii=False, indent=2)
        print(f"Book index written to {index_path}")


def preprocess_summarization(raw_dir: Path, processed_dir: Path) -> None:
    if not raw_dir.exists():
        print(f"Skipping summarization preprocessing (missing directory: {raw_dir})")
        return

    for split in ("train", "validation", "test"):
        source_path = _resolve_csv(raw_dir, f"{split}.csv")
        if source_path is None:
            print(f"Skipping summarization split '{split}' (file not found)")
            continue

        output_path = processed_dir / f"{split}.jsonl"
        output_path.parent.mkdir(parents=True, exist_ok=True)
        print(f"Writing summarization split '{split}' to {output_path}")
        with source_path.open("r", encoding="utf-8", newline="") as source_handle, output_path.open("w", encoding="utf-8") as sink:
            reader = csv.DictReader(source_handle)
            for row in reader:
                article = row.get("article") or row.get("Article") or ""
                highlights = row.get("highlights") or row.get("summary") or ""
                payload = {"source": article.strip(), "summary": highlights.strip()}
                sink.write(json.dumps(payload, ensure_ascii=False) + "\n")


def preprocess_emotion(raw_dir: Path, processed_dir: Path, cleaner: BasicTextCleaner) -> None:
    if not raw_dir.exists():
        print(f"Skipping emotion preprocessing (missing directory: {raw_dir})")
        return

    split_aliases: Dict[str, Sequence[str]] = {
        "train": ("train",),
        "val": ("val", "validation"),
        "test": ("test",),
    }

    for split, aliases in split_aliases.items():
        source_path: Path | None = None
        for alias in aliases:
            for extension in ("jsonl", "txt", "csv"):
                candidate = raw_dir / f"{alias}.{extension}"
                if candidate.exists():
                    source_path = candidate
                    break
            if source_path is not None:
                break
        if source_path is None:
            print(f"Skipping emotion split '{split}' (file not found)")
            continue

        assert source_path is not None
        path = source_path

        def iter_records() -> Iterator[Dict[str, object]]:
            if path.suffix == ".jsonl":
                for row in _read_jsonl(path):
                    raw_text = str(row.get("text", ""))
                    text = cleaner.transform([raw_text])[0]
                    labels = row.get("emotions") or row.get("labels") or []
                    if isinstance(labels, str):
                        labels = [label.strip() for label in labels.split(",") if label.strip()]
                    elif isinstance(labels, Sequence):
                        labels = [str(label) for label in labels]
                    else:
                        labels = [str(labels)] if labels else []
                    if not labels:
                        labels = ["neutral"]
                    yield {"text": text, "emotions": labels}
            else:
                delimiter = ";" if path.suffix == ".txt" else ","
                with path.open("r", encoding="utf-8", newline="") as handle:
                    reader = csv.reader(handle, delimiter=delimiter)
                    for row in reader:
                        if not row:
                            continue
                        raw_text = str(row[0])
                        text = cleaner.transform([raw_text])[0]
                        raw_labels = row[1] if len(row) > 1 else ""
                        labels = [label.strip() for label in raw_labels.split(",") if label.strip()]
                        if not labels:
                            labels = ["neutral"]
                        yield {"text": text, "emotions": labels}

        output_path = processed_dir / f"{split}.jsonl"
        print(f"Writing emotion split '{split}' to {output_path}")
        _write_jsonl(iter_records(), output_path)


def preprocess_topic(
    raw_dir: Path,
    processed_dir: Path,
    cleaner: BasicTextCleaner,
    val_ratio: float,
    seed: int,
) -> None:
    if not raw_dir.exists():
        print(f"Skipping topic preprocessing (missing directory: {raw_dir})")
        return

    def locate(*names: str) -> Path | None:
        for name in names:
            candidate = raw_dir / name
            if candidate.exists():
                return candidate
        return None

    train_path = locate("train.jsonl", "train.csv")
    if train_path is None:
        print(f"Skipping topic preprocessing (missing train split in {raw_dir})")
        return

    assert train_path is not None

    def load_topic_rows(path: Path) -> list[Tuple[str, str]]:
        rows: list[Tuple[str, str]] = []
        if path.suffix == ".jsonl":
            for record in _read_jsonl(path):
                text = str(record.get("text") or record.get("content") or "")
                topic = record.get("topic") or record.get("label")
                cleaned_text = cleaner.transform([text])[0]
                rows.append((cleaned_text, str(topic).strip()))
        else:
            with path.open("r", encoding="utf-8", newline="") as handle:
                reader = csv.DictReader(handle)
                for row in reader:
                    topic = row.get("Class Index") or row.get("topic") or row.get("label")
                    title = str(row.get("Title") or "")
                    description = str(row.get("Description") or row.get("text") or "")
                    text = " ".join(filter(None, (title, description)))
                    cleaned_text = cleaner.transform([text])[0]
                    rows.append((cleaned_text, str(topic).strip()))
        return rows

    train_rows = load_topic_rows(train_path)
    if not train_rows:
        print("No topic training rows found; skipping topic preprocessing.")
        return

    texts = [row[0] for row in train_rows]
    topics = [row[1] for row in train_rows]

    validation_path = locate("val.jsonl", "validation.jsonl", "val.csv", "validation.csv")
    has_validation = validation_path is not None

    if has_validation and validation_path:
        val_rows = load_topic_rows(validation_path)
        train_records = train_rows
    else:
        train_texts, val_texts, train_topics, val_topics = train_test_split(
            texts,
            topics,
            test_size=val_ratio,
            random_state=seed,
            stratify=topics,
        )
        train_records = list(zip(train_texts, train_topics))
        val_rows = list(zip(val_texts, val_topics))

    def to_records(pairs: Sequence[Tuple[str, str]]) -> Iterator[Dict[str, object]]:
        for text, topic in pairs:
            yield {"text": text, "topic": topic}

    print(f"Writing topic train split to {processed_dir / 'train.jsonl'}")
    _write_jsonl(to_records(train_records), processed_dir / "train.jsonl")
    print(f"Writing topic val split to {processed_dir / 'val.jsonl'}")
    _write_jsonl(to_records(val_rows), processed_dir / "val.jsonl")

    test_path = locate("test.jsonl", "test.csv")
    if test_path is not None:
        test_rows = load_topic_rows(test_path)
        print(f"Writing topic test split to {processed_dir / 'test.jsonl'}")
        _write_jsonl(to_records(test_rows), processed_dir / "test.jsonl")
    else:
        print(f"Skipping topic test split (missing test split in {raw_dir})")


def main() -> None:
    args = parse_args()
    config = load_yaml(args.config).data

    raw_cfg = config.get("raw", {})
    processed_cfg = config.get("processed", {})

    books_raw = Path(raw_cfg.get("books", "data/raw/books"))
    summarization_raw = Path(raw_cfg.get("summarization", "data/raw/summarization"))
    emotion_raw = Path(raw_cfg.get("emotion", "data/raw/emotion"))
    topic_raw = Path(raw_cfg.get("topic", "data/raw/topic"))

    books_processed = Path(processed_cfg.get("books", "data/processed/books"))
    summarization_processed = Path(processed_cfg.get("summarization", "data/processed/summarization"))
    emotion_processed = Path(processed_cfg.get("emotion", "data/processed/emotion"))
    topic_processed = Path(processed_cfg.get("topic", "data/processed/topic"))

    cleaner = BasicTextCleaner()

    preprocess_books(books_raw, books_processed, cleaner)
    preprocess_summarization(summarization_raw, summarization_processed)
    preprocess_emotion(emotion_raw, emotion_processed, cleaner)
    preprocess_topic(topic_raw, topic_processed, cleaner, val_ratio=args.val_ratio, seed=args.seed)

    print("Preprocessing complete.")


if __name__ == "__main__":
    main()
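Each processed split is line-delimited JSON. The per-task record schemas written by the functions above, shown as Python literals (values illustrative, book name hypothetical):

```python
summarization_record = {"source": "article text ...", "summary": "highlights ..."}
emotion_record = {"text": "cleaned text ...", "emotions": ["joy"]}
topic_record = {"text": "cleaned title + description ...", "topic": "1"}
# Book segments additionally carry provenance and size metadata:
book_record = {
    "book": "example_book",  # file stem; hypothetical name
    "title": "Example Book",
    "paragraph_id": 0,
    "text": "original paragraph ...",
    "clean_text": "cleaned paragraph ...",
    "token_count": 42,
    "char_count": 250,
}
```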
scripts/test_gpu.py
DELETED
@@ -1,27 +0,0 @@
# test_gpu.py
import torch

print("=" * 50)
print("GPU Information")
print("=" * 50)

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9

    print(f"✅ GPU: {gpu_name}")
    print(f"✅ Memory: {gpu_memory:.2f} GB")

    # Test tensor creation
    x = torch.randn(1000, 1000, device='cuda')
    y = torch.randn(1000, 1000, device='cuda')
    z = x @ y

    print("✅ CUDA operations working!")
    print(f"✅ Current memory allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
    print(f"✅ Max memory allocated: {torch.cuda.max_memory_allocated(0) / 1e9:.2f} GB")
else:
    print("❌ CUDA not available!")
    print("Using CPU - training will be slow!")

print("=" * 50)
scripts/train.py
CHANGED
@@ -1,8 +1,219 @@
-
-from
-
+"""End-to-end training entrypoint for the LexiMind multitask model."""
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+from typing import Dict, Sequence
+
+import torch
+
+PROJECT_ROOT = Path(__file__).resolve().parents[1]
+if str(PROJECT_ROOT) not in sys.path:
+    sys.path.insert(0, str(PROJECT_ROOT))
+
+from src.data.dataloader import (
+    build_emotion_dataloader,
+    build_summarization_dataloader,
+    build_topic_dataloader,
+)
+from src.data.dataset import (
+    EmotionDataset,
+    SummarizationDataset,
+    TopicDataset,
+    load_emotion_jsonl,
+    load_summarization_jsonl,
+    load_topic_jsonl,
+)
+from src.data.tokenization import Tokenizer, TokenizerConfig
+from src.models.factory import build_multitask_model, load_model_config
+from src.training.trainer import Trainer, TrainerConfig
+from src.training.utils import set_seed
+from src.utils.config import load_yaml
+from src.utils.io import save_state
+from src.utils.labels import LabelMetadata, save_label_metadata
+
+SplitExamples = Dict[str, list]
+
+SPLIT_ALIASES: Dict[str, Sequence[str]] = {
+    "train": ("train",),
+    "val": ("val", "validation"),
+    "test": ("test",),
+}
+
+
+def _read_examples(data_dir: Path, loader) -> SplitExamples:
+    splits: SplitExamples = {}
+    for canonical, aliases in SPLIT_ALIASES.items():
+        found = False
+        for alias in aliases:
+            for extension in ("jsonl", "json"):
+                candidate = data_dir / f"{alias}.{extension}"
+                if candidate.exists():
+                    splits[canonical] = loader(str(candidate))
+                    found = True
+                    break
+            if found:
+                break
+        if not found:
+            raise FileNotFoundError(f"Missing {canonical} split under {data_dir}")
+    return splits
+
+
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Train the LexiMind multitask transformer")
+    parser.add_argument("--data-config", default="configs/data/datasets.yaml", help="Path to data configuration YAML.")
+    parser.add_argument("--training-config", default="configs/training/default.yaml", help="Path to training hyperparameter YAML.")
+    parser.add_argument("--model-config", default="configs/model/base.yaml", help="Path to model architecture YAML.")
+    parser.add_argument("--checkpoint-out", default="checkpoints/best.pt", help="Where to store the trained checkpoint.")
+    parser.add_argument("--labels-out", default="artifacts/labels.json", help="Where to persist label vocabularies.")
+    parser.add_argument("--history-out", default="outputs/training_history.json", help="Where to write training history.")
+    parser.add_argument("--device", default="cpu", help="Training device identifier (cpu or cuda).")
+    parser.add_argument("--seed", type=int, default=17, help="Random seed for reproducibility.")
+    return parser.parse_args()
+
+
+def main() -> None:
+    args = parse_args()
+    set_seed(args.seed)
+
+    data_cfg = load_yaml(args.data_config).data
+    training_cfg = load_yaml(args.training_config).data
+    model_cfg = load_model_config(args.model_config)
+
+    summarization_dir = Path(data_cfg["processed"]["summarization"])
+    emotion_dir = Path(data_cfg["processed"]["emotion"])
+    topic_dir = Path(data_cfg["processed"]["topic"])
+
+    summarization_splits = _read_examples(summarization_dir, load_summarization_jsonl)
+    emotion_splits = _read_examples(emotion_dir, load_emotion_jsonl)
+    topic_splits = _read_examples(topic_dir, load_topic_jsonl)
+
+    tokenizer_section = data_cfg.get("tokenizer", {})
+    tokenizer_config = TokenizerConfig(
+        pretrained_model_name=tokenizer_section.get("pretrained_model_name", "facebook/bart-base"),
+        max_length=int(tokenizer_section.get("max_length", 512)),
+        lower=bool(tokenizer_section.get("lower", False)),
+    )
+    tokenizer = Tokenizer(tokenizer_config)
+
+    summarization_train = SummarizationDataset(summarization_splits["train"])
+    summarization_val = SummarizationDataset(summarization_splits["val"])
+
+    emotion_train = EmotionDataset(emotion_splits["train"])
+    emotion_val = EmotionDataset(emotion_splits["val"], binarizer=emotion_train.binarizer)
+
+    topic_train = TopicDataset(topic_splits["train"])
+    topic_val = TopicDataset(topic_splits["val"], encoder=topic_train.encoder)
+
+    dataloader_args = training_cfg.get("dataloader", {})
+    batch_size = int(dataloader_args.get("batch_size", 8))
+    shuffle = bool(dataloader_args.get("shuffle", True))
+    max_length = tokenizer.config.max_length
+
+    train_loaders = {
+        "summarization": build_summarization_dataloader(
+            summarization_train,
+            tokenizer,
+            batch_size=batch_size,
+            shuffle=shuffle,
+            max_source_length=max_length,
+            max_target_length=max_length,
+        ),
+        "emotion": build_emotion_dataloader(
+            emotion_train,
+            tokenizer,
+            batch_size=batch_size,
+            shuffle=shuffle,
+            max_length=max_length,
+        ),
+        "topic": build_topic_dataloader(
+            topic_train,
+            tokenizer,
+            batch_size=batch_size,
+            shuffle=shuffle,
+            max_length=max_length,
+        ),
+    }
+
+    val_loaders = {
+        "summarization": build_summarization_dataloader(
+            summarization_val,
+            tokenizer,
+            batch_size=batch_size,
+            shuffle=False,
+            max_source_length=max_length,
+            max_target_length=max_length,
+        ),
+        "emotion": build_emotion_dataloader(
+            emotion_val,
+            tokenizer,
+            batch_size=batch_size,
+            shuffle=False,
+            max_length=max_length,
+        ),
+        "topic": build_topic_dataloader(
+            topic_val,
+            tokenizer,
+            batch_size=batch_size,
+            shuffle=False,
+            max_length=max_length,
+        ),
+    }
+
+    device = torch.device(args.device)
+    model = build_multitask_model(
+        tokenizer,
+        num_emotions=len(emotion_train.emotion_classes),
+        num_topics=len(topic_train.topic_classes),
+        config=model_cfg,
+    ).to(device)
+
+    optimizer_cfg = training_cfg.get("optimizer", {})
+    lr = float(optimizer_cfg.get("lr", 3.0e-5))
+    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
+
+    trainer_cfg = training_cfg.get("trainer", {})
+    trainer = Trainer(
+        model=model,
+        optimizer=optimizer,
+        config=TrainerConfig(
+            max_epochs=int(trainer_cfg.get("max_epochs", 1)),
+            gradient_clip_norm=float(trainer_cfg.get("gradient_clip_norm", 1.0)),
+            logging_interval=int(trainer_cfg.get("logging_interval", 50)),
+            task_weights=trainer_cfg.get("task_weights"),
+        ),
+        device=device,
+        tokenizer=tokenizer,
+    )
+
+    history = trainer.fit(train_loaders, val_loaders)
+
+    checkpoint_path = Path(args.checkpoint_out)
+    checkpoint_path.parent.mkdir(parents=True, exist_ok=True)
+    save_state(model, str(checkpoint_path))
+
+    labels_path = Path(args.labels_out)
+    save_label_metadata(
+        LabelMetadata(
+            emotion=emotion_train.emotion_classes,
+            topic=topic_train.topic_classes,
+        ),
+        labels_path,
+    )

+    history_path = Path(args.history_out)
+    history_path.parent.mkdir(parents=True, exist_ok=True)
+    with history_path.open("w", encoding="utf-8") as handle:
+        json.dump(history, handle, indent=2)
+
+    print(f"Training complete. Checkpoint saved to {checkpoint_path}")
+    print(f"Label metadata saved to {labels_path}")
+    print(f"History saved to {history_path}")
+
 
 if __name__ == "__main__":
-
-    trainer = Trainer(config)
-    trainer.train()
+    main()
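For reference, these are the keys the new entrypoint reads from the training YAML, with the in-code fallbacks shown as an equivalent Python dict; configs/training/default.yaml may override any of them:

```python
# Fallbacks applied by scripts/train.py when a key is missing from the YAML.
training_cfg_defaults = {
    "dataloader": {"batch_size": 8, "shuffle": True},
    "optimizer": {"lr": 3.0e-5},
    "trainer": {
        "max_epochs": 1,
        "gradient_clip_norm": 1.0,
        "logging_interval": 50,
        "task_weights": None,  # forwarded to TrainerConfig unchanged
    },
}
```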
setup.py
CHANGED
@@ -7,13 +7,23 @@ setup(
     package_dir={"": "src"},
     install_requires=[
         "torch>=2.0.0",
-        "transformers>=4.
+        "transformers>=4.40.0",
+        "scikit-learn>=1.4.0",
+        "numpy>=1.24.0",
+        "pandas>=2.0.0",
     ],
+    extras_require={
+        "web": [
+            "streamlit>=1.25.0",
+            "plotly>=5.18.0",
+        ],
+        "api": [
+            "fastapi>=0.110.0",
+        ],
+        "all": [
+            "streamlit>=1.25.0",
+            "plotly>=5.18.0",
+            "fastapi>=0.110.0",
+        ],
     },
 )
src/__init__.py
CHANGED
@@ -0,0 +1 @@
+"""LexiMind core package."""
src/api/__init__.py
CHANGED
@@ -0,0 +1 @@
+"""API surface for LexiMind."""
src/api/app.py
ADDED
@@ -0,0 +1,10 @@
"""FastAPI application entrypoint."""
from fastapi import FastAPI

from .routes import router


def create_app() -> FastAPI:
    app = FastAPI(title="LexiMind")
    app.include_router(router)
    return app
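A minimal sketch for serving the app locally, assuming `uvicorn` is installed separately (it is not pinned in the dependency lists in this commit):

```python
# Local dev server sketch; uvicorn is an assumed extra dependency.
import uvicorn

from src.api.app import create_app

if __name__ == "__main__":
    uvicorn.run(create_app(), host="127.0.0.1", port=8000)
```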
src/api/dependencies.py
ADDED
@@ -0,0 +1,42 @@
"""Dependency providers for the FastAPI application."""
from __future__ import annotations

from functools import lru_cache
from pathlib import Path

from fastapi import HTTPException, status

from ..inference.factory import create_inference_pipeline
from ..inference.pipeline import InferencePipeline
from ..utils.logging import get_logger

logger = get_logger(__name__)


@lru_cache(maxsize=1)
def get_pipeline() -> InferencePipeline:
    """Lazily construct and cache the inference pipeline for the API."""

    checkpoint = Path("checkpoints/best.pt")
    labels = Path("artifacts/labels.json")
    model_config = Path("configs/model/base.yaml")

    try:
        pipeline, _ = create_inference_pipeline(
            checkpoint_path=checkpoint,
            labels_path=labels,
            model_config_path=model_config,
        )
    except FileNotFoundError as exc:
        logger.exception("Pipeline initialization failed: missing artifact")
        raise HTTPException(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
            detail="Service temporarily unavailable",
        ) from exc
    except Exception as exc:  # noqa: BLE001 - surface initialization issues to the caller
        logger.exception("Pipeline initialization failed")
        raise HTTPException(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
            detail="Service temporarily unavailable",
        ) from exc
    return pipeline
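Design note: because `get_pipeline` is wrapped in `lru_cache(maxsize=1)`, the pipeline is built once per process and reused across requests; and since `functools.lru_cache` does not cache raised exceptions, a request that fails with 503 (missing artifacts) will retry the build on the next call.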
src/api/inference/__init__.py
DELETED
@@ -1,7 +0,0 @@
"""
API inference module for LexiMind.
"""

from .inference import load_models, summarize_text, classify_emotion, topic_for_text

__all__ = ["load_models", "summarize_text", "classify_emotion", "topic_for_text"]
src/api/inference/inference.py
DELETED
@@ -1,133 +0,0 @@
"""Minimal inference helpers that rely on the custom transformer stack."""

from __future__ import annotations

from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple

import torch

from ...data.preprocessing import TextPreprocessor, TransformerTokenizer
from ...models.multitask import MultiTaskModel


def _load_tokenizer(tokenizer_path: Path) -> TransformerTokenizer:
    if not tokenizer_path.exists():
        raise FileNotFoundError(f"tokenizer file '{tokenizer_path}' not found")
    return TransformerTokenizer.load(tokenizer_path)


def load_models(config: Dict[str, Any]) -> Dict[str, Any]:
    """Load MultiTaskModel together with the tokenizer-driven preprocessor."""

    device = torch.device(config.get("device", "cpu"))
    tokenizer_path = config.get("tokenizer_path")
    if tokenizer_path is None:
        raise ValueError("'tokenizer_path' missing in config")

    tokenizer = _load_tokenizer(Path(tokenizer_path))
    preprocessor = TextPreprocessor(
        max_length=int(config.get("max_length", 512)),
        tokenizer=tokenizer,
        min_freq=int(config.get("min_freq", 1)),
        lowercase=bool(config.get("lowercase", True)),
    )

    encoder_kwargs = dict(config.get("encoder", {}))
    decoder_kwargs = dict(config.get("decoder", {}))

    encoder = preprocessor.build_encoder(**encoder_kwargs)
    decoder = preprocessor.build_decoder(**decoder_kwargs)
    model = MultiTaskModel(encoder=encoder, decoder=decoder)

    checkpoint_path = config.get("checkpoint_path")
    if checkpoint_path:
        state = torch.load(checkpoint_path, map_location=device)
        if isinstance(state, dict) and "state_dict" in state:
            state = state["state_dict"]
        model.load_state_dict(state, strict=False)

    model.to(device)

    return {
        "loaded": True,
        "device": device,
        "mt": model,
        "preprocessor": preprocessor,
    }


def summarize_text(
    text: str,
    compression: float = 0.25,
    collect_attn: bool = False,
    models: Optional[Dict[str, Any]] = None,
) -> Tuple[str, Optional[Dict[str, torch.Tensor]]]:
    if models is None or not models.get("loaded"):
        raise RuntimeError("Models must be loaded via load_models before summarize_text is called")

    model: MultiTaskModel = models["mt"]
    preprocessor: TextPreprocessor = models["preprocessor"]
    device: torch.device = models["device"]

    batch = preprocessor.batch_encode([text])
    tokenizer = preprocessor.tokenizer
    encoder = model.encoder
    decoder = model.decoder
    if tokenizer is None or encoder is None or decoder is None:
        raise RuntimeError("Encoder, decoder, and tokenizer must be configured before summarization")
    input_ids = batch.input_ids.to(device)
    memory = encoder(input_ids)
    src_len = batch.lengths[0]
    max_tgt = max(4, int(src_len * compression))
    generated = decoder.greedy_decode(
        memory,
        max_len=min(preprocessor.max_length, max_tgt),
        start_token_id=tokenizer.bos_id,
        end_token_id=tokenizer.eos_id,
    )
    summary = tokenizer.decode(generated[0].tolist(), skip_special_tokens=True)
    return summary.strip(), None if not collect_attn else {}


def classify_emotion(text: str, models: Optional[Dict[str, Any]] = None) -> Tuple[List[float], List[str]]:
    if models is None or not models.get("loaded"):
        raise RuntimeError("Models must be loaded via load_models before classify_emotion is called")

    model: MultiTaskModel = models["mt"]
    preprocessor: TextPreprocessor = models["preprocessor"]
    device: torch.device = models["device"]

    batch = preprocessor.batch_encode([text])
    input_ids = batch.input_ids.to(device)
    result = model.forward("emotion", {"input_ids": input_ids})
    logits = result[1] if isinstance(result, tuple) else result
    scores = torch.sigmoid(logits).squeeze(0).detach().cpu().tolist()
    labels = models.get("emotion_labels") or [
        "joy",
        "sadness",
        "anger",
        "fear",
        "surprise",
        "disgust",
    ]
    return scores, labels[: len(scores)]


def topic_for_text(text: str, models: Optional[Dict[str, Any]] = None) -> Tuple[int, List[str]]:
    if models is None or not models.get("loaded"):
        raise RuntimeError("Models must be loaded via load_models before topic_for_text is called")

    model: MultiTaskModel = models["mt"]
    preprocessor: TextPreprocessor = models["preprocessor"]
    device: torch.device = models["device"]

    batch = preprocessor.batch_encode([text])
    input_ids = batch.input_ids.to(device)
    encoder = model.encoder
    if encoder is None:
        raise RuntimeError("Encoder must be configured before topic_for_text is called")
    memory = encoder(input_ids)
    embedding = memory.mean(dim=1).detach().cpu()
    _ = embedding  # placeholder for downstream clustering hook
    return 0, ["topic_stub"]
src/api/routes.py
ADDED
@@ -0,0 +1,34 @@
"""API routes."""
from typing import cast

from fastapi import APIRouter, Depends, HTTPException, status

from ..inference import EmotionPrediction, InferencePipeline, TopicPrediction
from .dependencies import get_pipeline
from .schemas import SummaryRequest, SummaryResponse

router = APIRouter()


@router.post("/summarize", response_model=SummaryResponse)
def summarize(payload: SummaryRequest, pipeline: InferencePipeline = Depends(get_pipeline)) -> SummaryResponse:
    try:
        outputs = pipeline.batch_predict([payload.text])
    except Exception as exc:  # noqa: BLE001 - surface inference error to client
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=str(exc),
        ) from exc
    summaries = cast(list[str], outputs["summaries"])
    emotion_preds = cast(list[EmotionPrediction], outputs["emotion"])
    topic_preds = cast(list[TopicPrediction], outputs["topic"])

    emotion = emotion_preds[0]
    topic = topic_preds[0]
    return SummaryResponse(
        summary=summaries[0],
        emotion_labels=emotion.labels,
        emotion_scores=emotion.scores,
        topic=topic.label,
        topic_confidence=topic.confidence,
    )
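A minimal in-process exercise of the endpoint, assuming the checkpoint and label artifacts referenced by `get_pipeline` exist on disk and `httpx` (required by FastAPI's TestClient) is installed:

```python
from fastapi.testclient import TestClient

from src.api.app import create_app

client = TestClient(create_app())
response = client.post("/summarize", json={"text": "Some article text."})
print(response.status_code, response.json())
```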
src/api/schemas.py
ADDED
@@ -0,0 +1,14 @@
"""API schemas."""
from pydantic import BaseModel


class SummaryRequest(BaseModel):
    text: str


class SummaryResponse(BaseModel):
    summary: str
    emotion_labels: list[str]
    emotion_scores: list[float]
    topic: str
    topic_confidence: float
src/data/__init__.py
CHANGED
@@ -0,0 +1 @@
+"""Data utilities for LexiMind."""
src/data/dataloader.py
ADDED
@@ -0,0 +1,117 @@
"""Task-aware DataLoader builders for the LexiMind multitask suite."""
from __future__ import annotations

from typing import List

import torch
from torch.utils.data import DataLoader

from .dataset import (
    EmotionDataset,
    EmotionExample,
    SummarizationDataset,
    SummarizationExample,
    TopicDataset,
    TopicExample,
)
from .tokenization import Tokenizer


class SummarizationCollator:
    """Prepare encoder-decoder batches for abstractive summarization."""

    def __init__(
        self,
        tokenizer: Tokenizer,
        *,
        max_source_length: int | None = None,
        max_target_length: int | None = None,
    ) -> None:
        self.tokenizer = tokenizer
        self.max_source_length = max_source_length
        self.max_target_length = max_target_length

    def __call__(self, batch: List[SummarizationExample]) -> dict[str, torch.Tensor]:
        sources = [example.source for example in batch]
        targets = [example.summary for example in batch]

        source_enc = self.tokenizer.batch_encode(sources, max_length=self.max_source_length)
        target_enc = self.tokenizer.batch_encode(targets, max_length=self.max_target_length)

        labels = target_enc["input_ids"].clone()
        decoder_input_ids = self.tokenizer.prepare_decoder_inputs(target_enc["input_ids"])
        labels[target_enc["attention_mask"] == 0] = -100

        return {
            "src_ids": source_enc["input_ids"],
            "src_mask": source_enc["attention_mask"],
            "tgt_ids": decoder_input_ids,
            "labels": labels,
        }


class EmotionCollator:
    """Prepare batches for multi-label emotion classification."""

    def __init__(self, tokenizer: Tokenizer, dataset: EmotionDataset, *, max_length: int | None = None) -> None:
        self.tokenizer = tokenizer
        self.binarizer = dataset.binarizer
        self.max_length = max_length

    def __call__(self, batch: List[EmotionExample]) -> dict[str, torch.Tensor]:
        texts = [example.text for example in batch]
        encoded = self.tokenizer.batch_encode(texts, max_length=self.max_length)
        label_array = self.binarizer.transform([example.emotions for example in batch])
        labels = torch.as_tensor(label_array, dtype=torch.float32)
        return {
            "input_ids": encoded["input_ids"],
            "attention_mask": encoded["attention_mask"],
            "labels": labels,
        }


class TopicCollator:
    """Prepare batches for topic classification using the projection head."""

    def __init__(self, tokenizer: Tokenizer, dataset: TopicDataset, *, max_length: int | None = None) -> None:
        self.tokenizer = tokenizer
        self.encoder = dataset.encoder
        self.max_length = max_length

    def __call__(self, batch: List[TopicExample]) -> dict[str, torch.Tensor]:
        texts = [example.text for example in batch]
        encoded = self.tokenizer.batch_encode(texts, max_length=self.max_length)
        labels = torch.as_tensor(self.encoder.transform([example.topic for example in batch]), dtype=torch.long)
        return {
            "input_ids": encoded["input_ids"],
            "attention_mask": encoded["attention_mask"],
            "labels": labels,
        }


def build_summarization_dataloader(
    dataset: SummarizationDataset,
    tokenizer: Tokenizer,
    *,
    batch_size: int,
    shuffle: bool = True,
    max_source_length: int | None = None,
    max_target_length: int | None = None,
) -> DataLoader:
    collator = SummarizationCollator(
        tokenizer,
        max_source_length=max_source_length,
        max_target_length=max_target_length,
    )
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, collate_fn=collator)


def build_emotion_dataloader(
    dataset: EmotionDataset,
    tokenizer: Tokenizer,
    *,
    batch_size: int,
    shuffle: bool = True,
    max_length: int | None = None,
) -> DataLoader:
    collator = EmotionCollator(tokenizer, dataset, max_length=max_length)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, collate_fn=collator)


def build_topic_dataloader(
    dataset: TopicDataset,
    tokenizer: Tokenizer,
    *,
    batch_size: int,
    shuffle: bool = True,
    max_length: int | None = None,
) -> DataLoader:
    collator = TopicCollator(tokenizer, dataset, max_length=max_length)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, collate_fn=collator)
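A minimal sketch of building one summarization batch and inspecting the tensors the collator produces, assuming the processed splits from scripts/preprocess_data.py exist on disk:

```python
from src.data.dataloader import build_summarization_dataloader
from src.data.dataset import SummarizationDataset, load_summarization_jsonl
from src.data.tokenization import Tokenizer, TokenizerConfig

tokenizer = Tokenizer(TokenizerConfig(pretrained_model_name="facebook/bart-base"))
dataset = SummarizationDataset(
    load_summarization_jsonl("data/processed/summarization/validation.jsonl")
)
loader = build_summarization_dataloader(
    dataset,
    tokenizer,
    batch_size=4,
    shuffle=False,
    max_source_length=512,
    max_target_length=512,
)
batch = next(iter(loader))
# Keys produced by SummarizationCollator; all are (batch, seq_len) tensors.
print({key: tuple(value.shape) for key, value in batch.items()})
```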
src/data/dataset.py
ADDED
|
@@ -0,0 +1,229 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Dataset definitions for the LexiMind multitask training pipeline."""
|
| 2 |
+
from __future__ import annotations
|
| 3 |
+
|
| 4 |
+
import json
|
| 5 |
+
from dataclasses import dataclass
|
| 6 |
+
from pathlib import Path
|
| 7 |
+
from typing import Callable, Iterable, List, Sequence, TypeVar
|
| 8 |
+
|
| 9 |
+
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
|
| 10 |
+
from torch.utils.data import Dataset
|
| 11 |
+
|
| 12 |
+
|
| 13 |
+
@dataclass(slots=True)
|
| 14 |
+
class SummarizationExample:
|
| 15 |
+
"""Container for abstractive summarization samples."""
|
| 16 |
+
|
| 17 |
+
source: str
|
| 18 |
+
summary: str
|
| 19 |
+
|
| 20 |
+
|
| 21 |
+
@dataclass(slots=True)
|
| 22 |
+
class EmotionExample:
|
| 23 |
+
"""Container for multi-label emotion classification samples."""
|
| 24 |
+
|
| 25 |
+
text: str
|
| 26 |
+
emotions: Sequence[str]
|
| 27 |
+
|
| 28 |
+
|
| 29 |
+
@dataclass(slots=True)
|
| 30 |
+
class TopicExample:
|
| 31 |
+
"""Container for topic clustering / classification samples."""
|
| 32 |
+
|
| 33 |
+
text: str
|
| 34 |
+
topic: str
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
class SummarizationDataset(Dataset[SummarizationExample]):
|
| 38 |
+
"""Dataset yielding encoder-decoder training pairs."""
|
| 39 |
+
|
| 40 |
+
def __init__(self, examples: Iterable[SummarizationExample]) -> None:
|
| 41 |
+
self._examples = list(examples)
|
| 42 |
+
|
| 43 |
+
def __len__(self) -> int:
|
| 44 |
+
return len(self._examples)
|
| 45 |
+
|
| 46 |
+
def __getitem__(self, index: int) -> SummarizationExample:
|
| 47 |
+
return self._examples[index]
|
| 48 |
+
|
| 49 |
+
|
| 50 |
+
class EmotionDataset(Dataset[EmotionExample]):
|
| 51 |
+
"""Dataset that owns a scikit-learn MultiLabelBinarizer for emissions."""
|
| 52 |
+
|
| 53 |
+
def __init__(
|
| 54 |
+
self,
|
| 55 |
+
examples: Iterable[EmotionExample],
|
| 56 |
+
*,
|
| 57 |
+
binarizer: MultiLabelBinarizer | None = None,
|
| 58 |
+
) -> None:
|
| 59 |
+
self._examples = list(examples)
|
| 60 |
+
all_labels = [example.emotions for example in self._examples]
|
| 61 |
+
        if binarizer is None:
            self._binarizer = MultiLabelBinarizer()
            self._binarizer.fit(all_labels)
        else:
            self._binarizer = binarizer
            if not hasattr(self._binarizer, "classes_"):
                raise ValueError(
                    "Provided MultiLabelBinarizer must be pre-fitted with 'classes_' attribute."
                )

    def __len__(self) -> int:
        return len(self._examples)

    def __getitem__(self, index: int) -> EmotionExample:
        return self._examples[index]

    @property
    def binarizer(self) -> MultiLabelBinarizer:
        return self._binarizer

    @property
    def emotion_classes(self) -> List[str]:
        return list(self._binarizer.classes_)


class TopicDataset(Dataset[TopicExample]):
    """Dataset that owns a LabelEncoder for topic ids."""

    def __init__(
        self,
        examples: Iterable[TopicExample],
        *,
        encoder: LabelEncoder | None = None,
    ) -> None:
        self._examples = list(examples)
        topics = [example.topic for example in self._examples]
        if encoder is None:
            self._encoder = LabelEncoder().fit(topics)
        else:
            self._encoder = encoder
            if not hasattr(self._encoder, "classes_"):
                raise ValueError(
                    "Provided LabelEncoder must be pre-fitted with 'classes_' attribute."
                )

    def __len__(self) -> int:
        return len(self._examples)

    def __getitem__(self, index: int) -> TopicExample:
        return self._examples[index]

    @property
    def encoder(self) -> LabelEncoder:
        return self._encoder

    @property
    def topic_classes(self) -> List[str]:
        return list(self._encoder.classes_)


T = TypeVar("T")


def _safe_json_load(handle, path: Path) -> object:
    try:
        return json.load(handle)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Failed to parse JSON in '{path}': {exc}") from exc


def _safe_json_loads(data: str, path: Path, line_number: int) -> object:
    try:
        return json.loads(data)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Failed to parse JSON in '{path}' at line {line_number}: {exc}") from exc


def _validate_keys(
    payload: dict,
    required_keys: Sequence[str],
    position: int,
    *,
    path: Path,
    is_array: bool = False,
) -> None:
    missing = [key for key in required_keys if key not in payload]
    if missing:
        keys = ", ".join(sorted(missing))
        location = "index" if is_array else "line"
        raise KeyError(f"Missing required keys ({keys}) at {location} {position} of '{path}'")


def _load_jsonl_generic(
    path: str,
    constructor: Callable[[dict], T],
    required_keys: Sequence[str],
) -> List[T]:
    data_path = Path(path)
    if not data_path.exists():
        raise FileNotFoundError(f"Dataset file '{data_path}' does not exist")
    if not data_path.is_file():
        raise ValueError(f"Dataset path '{data_path}' is not a file")

    items: List[T] = []
    with data_path.open("r", encoding="utf-8") as handle:
        first_non_ws = ""
        while True:
            pos = handle.tell()
            char = handle.read(1)
            if not char:
                break
            if not char.isspace():
                first_non_ws = char
                handle.seek(pos)
                break
        if not first_non_ws:
            raise ValueError(f"Dataset file '{data_path}' is empty or contains only whitespace")

        if first_non_ws == "[":
            payloads = _safe_json_load(handle, data_path)
            if not isinstance(payloads, list):
                raise ValueError(f"Expected a JSON array in '{data_path}' but found {type(payloads).__name__}")
            for idx, payload in enumerate(payloads):
                if not isinstance(payload, dict):
                    raise ValueError(
                        f"Expected objects in array for '{data_path}', found {type(payload).__name__} at index {idx}"
                    )
                _validate_keys(payload, required_keys, idx, path=data_path, is_array=True)
                items.append(constructor(payload))
        else:
            handle.seek(0)
            line_number = 0
            for line in handle:
                line_number += 1
                if not line.strip():
                    continue
                payload = _safe_json_loads(line, data_path, line_number)
                if not isinstance(payload, dict):
                    raise ValueError(
                        f"Expected JSON object per line in '{data_path}', found {type(payload).__name__} at line {line_number}"
                    )
                _validate_keys(payload, required_keys, line_number, path=data_path)
                items.append(constructor(payload))

    return items


def load_summarization_jsonl(path: str) -> List[SummarizationExample]:
    return _load_jsonl_generic(
        path,
        lambda payload: SummarizationExample(source=payload["source"], summary=payload["summary"]),
        required_keys=("source", "summary"),
    )


def load_emotion_jsonl(path: str) -> List[EmotionExample]:
    return _load_jsonl_generic(
        path,
        lambda payload: EmotionExample(text=payload["text"], emotions=payload.get("emotions", [])),
        required_keys=("text",),
    )


def load_topic_jsonl(path: str) -> List[TopicExample]:
    return _load_jsonl_generic(
        path,
        lambda payload: TopicExample(text=payload["text"], topic=payload["topic"]),
        required_keys=("text", "topic"),
    )
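For orientation, a hypothetical end-to-end use of the loaders and dataset wrappers above. It assumes the repo root is on `PYTHONPATH` (so `src` imports as a package), that the JSONL file exists, and that `EmotionDataset` takes the example list as its first argument, mirroring `TopicDataset`.

```python
# Hypothetical usage; file path and record contents are illustrative.
# Each JSONL line is expected to look like:
#   {"text": "I loved this book", "emotions": ["joy"]}
from src.data.dataset import EmotionDataset, load_emotion_jsonl

examples = load_emotion_jsonl("data/processed/emotion/train.jsonl")
dataset = EmotionDataset(examples)  # fits a MultiLabelBinarizer over all labels seen

print(len(dataset))
print(dataset.emotion_classes)  # sorted label vocabulary from the fitted binarizer
```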
src/data/download.py
CHANGED
@@ -1,66 +1,45 @@
"""Dataset download helpers."""

import socket
from pathlib import Path
from subprocess import CalledProcessError, run
from urllib.error import URLError
from urllib.request import urlopen

DOWNLOAD_TIMEOUT = 60


def kaggle_download(dataset: str, output_dir: str) -> None:
    target = Path(output_dir)
    target.mkdir(parents=True, exist_ok=True)
    try:
        run([
            "kaggle",
            "datasets",
            "download",
            "-d",
            dataset,
            "-p",
            str(target),
            "--unzip",
        ], check=True)
    except CalledProcessError as error:
        raise RuntimeError(
            "Kaggle download failed. Verify that the Kaggle CLI is authenticated,"
            " you have accepted the dataset terms on kaggle.com, and your kaggle.json"
            " credentials are located in %USERPROFILE%/.kaggle."
        ) from error


def gutenberg_download(url: str, output_path: str) -> None:
    target = Path(output_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    try:
        with urlopen(url, timeout=DOWNLOAD_TIMEOUT) as response, target.open("wb") as handle:
            chunk = response.read(8192)
            while chunk:
                handle.write(chunk)
                chunk = response.read(8192)
    except (URLError, socket.timeout, OSError) as error:
        raise RuntimeError(f"Failed to download '{url}' to '{target}': {error}") from error
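A minimal driver for the two helpers above; the Kaggle slug and target directory are ones the replaced module hard-coded, and Gutenberg id 1342 (Pride and Prejudice) matches its old URL pattern. Running this requires an authenticated Kaggle CLI and network access, and assumes `src` is importable from the repo root.

```python
# Usage sketch; the dataset slug and paths are examples, not values this module fixes.
from src.data.download import gutenberg_download, kaggle_download

kaggle_download("amananandrai/ag-news-classification-dataset", "data/raw/topic")
gutenberg_download(
    "https://www.gutenberg.org/files/1342/1342-0.txt",
    "data/raw/books/pride_and_prejudice.txt",
)
```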
src/data/preprocessing.py
CHANGED
@@ -1,260 +1,130 @@
"""Text preprocessing utilities built around Hugging Face tokenizers."""
from __future__ import annotations

import re
from dataclasses import dataclass, replace
from typing import Iterable, List, Sequence

import torch
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

from .tokenization import Tokenizer, TokenizerConfig


class BasicTextCleaner(BaseEstimator, TransformerMixin):
    """Minimal text cleaner following scikit-learn conventions."""

    def __init__(self, lowercase: bool = True, strip: bool = True) -> None:
        self.lowercase = lowercase
        self.strip = strip

    def fit(self, texts: Iterable[str], y: Iterable[str] | None = None):
        return self

    def transform(self, texts: Iterable[str]) -> List[str]:
        return [self._clean_text(text) for text in texts]

    def _clean_text(self, text: str) -> str:
        item = text.strip() if self.strip else text
        if self.lowercase:
            item = item.lower()
        return " ".join(item.split())


@dataclass(slots=True)
class Batch:
    """Bundle of tensors returned by the text preprocessor."""

    input_ids: torch.Tensor
    attention_mask: torch.Tensor
    lengths: List[int]


class TextPreprocessor:
    """Coordinate lightweight text cleaning and tokenization.

    When supplying an already-initialized tokenizer instance, its configuration is left
    untouched. If a differing ``max_length`` is requested, a ``ValueError`` is raised to
    avoid mutating shared tokenizer state.
    """

    def __init__(
        self,
        tokenizer: Tokenizer | None = None,
        *,
        tokenizer_config: TokenizerConfig | None = None,
        tokenizer_name: str = "facebook/bart-base",
        max_length: int | None = None,
        lowercase: bool = True,
        remove_stopwords: bool = False,
        sklearn_transformer: TransformerMixin | None = None,
    ) -> None:
        self.cleaner = BasicTextCleaner(lowercase=lowercase, strip=True)
        self.lowercase = lowercase
        if remove_stopwords:
            raise ValueError(
                "Stop-word removal is not supported because it conflicts with subword tokenizers; "
                "clean the text externally before initializing TextPreprocessor."
            )
        self._stop_words = None
        self._sklearn_transformer = sklearn_transformer

        if tokenizer is None:
            cfg = tokenizer_config or TokenizerConfig(pretrained_model_name=tokenizer_name)
            if max_length is not None:
                cfg = replace(cfg, max_length=max_length)
            self.tokenizer = Tokenizer(cfg)
        else:
            self.tokenizer = tokenizer
            if max_length is not None and max_length != tokenizer.config.max_length:
                raise ValueError(
                    "Provided tokenizer config.max_length does not match requested max_length; "
                    "initialise the tokenizer with desired settings before passing it in."
                )

        self.max_length = max_length or self.tokenizer.config.max_length

    def clean_text(self, text: str) -> str:
        item = self.cleaner.transform([text])[0]
        return self._normalize_tokens(item)

    def _normalize_tokens(self, text: str) -> str:
        """Apply token-level normalization and optional stop-word filtering."""
        # Note: Pre-tokenization word-splitting is incompatible with subword tokenizers.
        # Stop-word filtering should be done post-tokenization or not at all for transformers.
        return text

    def _apply_sklearn_transform(self, texts: List[str]) -> List[str]:
        if self._sklearn_transformer is None:
            return texts

        transform = getattr(self._sklearn_transformer, "transform", None)
        if transform is None:
            raise AttributeError("Provided sklearn transformer must implement a 'transform' method")
        transformed = transform(texts)
        if isinstance(transformed, list):
            return transformed  # assume downstream type is already list[str]
        if hasattr(transformed, "tolist"):
            transformed = transformed.tolist()

        result = list(transformed)
        if not all(isinstance(item, str) for item in result):
            result = [str(item) for item in result]
        return result

    def _prepare_texts(self, texts: Sequence[str]) -> List[str]:
        cleaned = self.cleaner.transform(texts)
        normalized = [self._normalize_tokens(text) for text in cleaned]
        return self._apply_sklearn_transform(normalized)

    def batch_encode(self, texts: Sequence[str]) -> Batch:
        cleaned = self._prepare_texts(texts)
        encoded = self.tokenizer.batch_encode(cleaned, max_length=self.max_length)
        input_ids: torch.Tensor = encoded["input_ids"]
        attention_mask: torch.Tensor = encoded["attention_mask"].to(dtype=torch.bool)
        lengths = attention_mask.sum(dim=1).tolist()
        return Batch(input_ids=input_ids, attention_mask=attention_mask, lengths=lengths)

    def __call__(self, texts: Sequence[str]) -> Batch:
        return self.batch_encode(texts)
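A quick sketch of the round trip through `TextPreprocessor`. The default `facebook/bart-base` tokenizer is downloaded on first use, so this needs network access or a local cache; the printed shapes are illustrative.

```python
from src.data.preprocessing import TextPreprocessor

preprocessor = TextPreprocessor(max_length=128)  # builds a facebook/bart-base tokenizer
batch = preprocessor(["  An   article about MARKETS. ", "Another document."])

print(batch.input_ids.shape)       # (2, longest_sequence_in_batch)
print(batch.attention_mask.dtype)  # torch.bool
print(batch.lengths)               # unpadded token counts per text
```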
src/data/tokenization.py
ADDED
@@ -0,0 +1,122 @@
"""Tokenizer wrapper around HuggingFace models used across LexiMind."""
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, List, Sequence, cast

import torch
from transformers import AutoTokenizer, PreTrainedTokenizerBase


@dataclass(slots=True)
class TokenizerConfig:
    pretrained_model_name: str = "facebook/bart-base"
    max_length: int = 512
    padding: str = "longest"
    truncation: bool = True
    lower: bool = False


class Tokenizer:
    """Lightweight façade over a HuggingFace tokenizer."""

    def __init__(self, config: TokenizerConfig | None = None) -> None:
        cfg = config or TokenizerConfig()
        self.config = cfg
        self._tokenizer: PreTrainedTokenizerBase = AutoTokenizer.from_pretrained(cfg.pretrained_model_name)
        self._pad_token_id = self._resolve_id(self._tokenizer.pad_token_id)
        self._bos_token_id = self._resolve_id(
            self._tokenizer.bos_token_id if self._tokenizer.bos_token_id is not None else self._tokenizer.cls_token_id
        )
        self._eos_token_id = self._resolve_id(
            self._tokenizer.eos_token_id if self._tokenizer.eos_token_id is not None else self._tokenizer.sep_token_id
        )

    @property
    def tokenizer(self) -> PreTrainedTokenizerBase:
        return self._tokenizer

    @property
    def pad_token_id(self) -> int:
        return self._pad_token_id

    @property
    def bos_token_id(self) -> int:
        return self._bos_token_id

    @property
    def eos_token_id(self) -> int:
        return self._eos_token_id

    @property
    def vocab_size(self) -> int:
        vocab = getattr(self._tokenizer, "vocab_size", None)
        if vocab is None:
            raise RuntimeError("Tokenizer must expose vocab_size")
        return int(vocab)

    @staticmethod
    def _resolve_id(value) -> int:
        if value is None:
            raise ValueError("Tokenizer is missing required special token ids")
        if isinstance(value, (list, tuple)):
            value = value[0]
        return int(value)

    def encode(self, text: str) -> List[int]:
        content = text.lower() if self.config.lower else text
        return self._tokenizer.encode(
            content,
            max_length=self.config.max_length,
            truncation=self.config.truncation,
            padding=self.config.padding,
        )

    def encode_batch(self, texts: Sequence[str]) -> List[List[int]]:
        normalized = (text.lower() if self.config.lower else text for text in texts)
        encoded = self._tokenizer.batch_encode_plus(
            list(normalized),
            max_length=self.config.max_length,
            padding=self.config.padding,
            truncation=self.config.truncation,
            return_attention_mask=False,
            return_tensors=None,
        )
        return cast(List[List[int]], encoded["input_ids"])

    def batch_encode(self, texts: Sequence[str], *, max_length: int | None = None) -> dict[str, torch.Tensor]:
        normalized = [text.lower() if self.config.lower else text for text in texts]
        encoded = self._tokenizer(
            normalized,
            padding=self.config.padding,
            truncation=self.config.truncation,
            max_length=max_length or self.config.max_length,
            return_tensors="pt",
        )
        input_ids = cast(torch.Tensor, encoded["input_ids"])
        attention_mask = cast(torch.Tensor, encoded["attention_mask"])
        if input_ids.dtype != torch.long:
            input_ids = input_ids.to(dtype=torch.long)
        if attention_mask.dtype != torch.bool:
            attention_mask = attention_mask.to(dtype=torch.bool)
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
        }

    def decode(self, token_ids: Iterable[int]) -> str:
        return self._tokenizer.decode(list(token_ids), skip_special_tokens=True)

    def decode_batch(self, sequences: Sequence[Sequence[int]]) -> List[str]:
        prepared = [list(seq) for seq in sequences]
        return self._tokenizer.batch_decode(prepared, skip_special_tokens=True)

    def prepare_decoder_inputs(self, labels: torch.Tensor) -> torch.Tensor:
        """Shift decoder labels to create input ids prefixed by BOS."""

        bos = self.bos_token_id
        pad = self.pad_token_id
        decoder_inputs = torch.full_like(labels, pad)
        decoder_inputs[:, 0] = bos
        decoder_inputs[:, 1:] = labels[:, :-1]
        return decoder_inputs
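The shifting convention in `prepare_decoder_inputs` is easy to verify by hand. A standalone illustration with placeholder ids (the real `bos`/`pad` values come from the loaded tokenizer):

```python
import torch

labels = torch.tensor([[10, 11, 12, 13]])
bos, pad = 0, 1  # placeholder special-token ids

# Same shift the method performs: BOS first, labels moved one step right.
decoder_inputs = torch.full_like(labels, pad)
decoder_inputs[:, 0] = bos
decoder_inputs[:, 1:] = labels[:, :-1]

print(decoder_inputs)  # tensor([[ 0, 10, 11, 12]])
```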
src/inference/__init__.py
CHANGED
@@ -1,7 +1,12 @@
"""Inference tools for LexiMind."""

from .factory import create_inference_pipeline
from .pipeline import EmotionPrediction, InferenceConfig, InferencePipeline, TopicPrediction

__all__ = [
    "InferencePipeline",
    "InferenceConfig",
    "EmotionPrediction",
    "TopicPrediction",
    "create_inference_pipeline",
]
src/inference/baseline_summarizer.py
DELETED
@@ -1,41 +0,0 @@
"""Thin wrapper around the custom transformer summarizer."""

from __future__ import annotations
from typing import Any, Dict, Optional, Tuple
import torch
from ..api.inference import load_models


class TransformerSummarizer:
    def __init__(self, config: Optional[Dict[str, Any]] = None) -> None:
        models = load_models(config or {})
        if not models.get("loaded"):
            raise RuntimeError("load_models returned an unloaded model; check configuration")
        self.model = models["mt"]
        self.preprocessor = models["preprocessor"]
        self.device = models["device"]

    def summarize(
        self,
        text: str,
        compression: float = 0.25,
        collect_attn: bool = False,
    ) -> Tuple[str, Optional[Dict[str, torch.Tensor]]]:
        batch = self.preprocessor.batch_encode([text])
        tokenizer = self.preprocessor.tokenizer
        encoder = self.model.encoder
        decoder = self.model.decoder
        if tokenizer is None or encoder is None or decoder is None:
            raise RuntimeError("Model components are missing; ensure encoder, decoder, and tokenizer are set")
        input_ids = batch.input_ids.to(self.device)
        memory = encoder(input_ids)
        src_len = batch.lengths[0]
        target_len = max(4, int(src_len * compression))
        generated = decoder.greedy_decode(
            memory,
            max_len=min(self.preprocessor.max_length, target_len),
            start_token_id=tokenizer.bos_id,
            end_token_id=tokenizer.eos_id,
        )
        summary = tokenizer.decode(generated[0].tolist(), skip_special_tokens=True)
        return summary.strip(), None if not collect_attn else {}
src/inference/factory.py
ADDED
@@ -0,0 +1,75 @@
"""Helpers to assemble an inference pipeline from saved artifacts."""
from __future__ import annotations

from pathlib import Path
from typing import Tuple

import torch

from ..data.tokenization import Tokenizer, TokenizerConfig
from ..models.factory import ModelConfig, build_multitask_model, load_model_config
from ..utils.io import load_state
from ..utils.labels import LabelMetadata, load_label_metadata
from .pipeline import InferenceConfig, InferencePipeline


def create_inference_pipeline(
    checkpoint_path: str | Path,
    labels_path: str | Path,
    *,
    tokenizer_config: TokenizerConfig | None = None,
    tokenizer_dir: str | Path | None = None,
    model_config_path: str | Path | None = None,
    device: str | torch.device = "cpu",
    summary_max_length: int | None = None,
) -> Tuple[InferencePipeline, LabelMetadata]:
    """Build an :class:`InferencePipeline` from saved model and label metadata."""

    checkpoint = Path(checkpoint_path)
    if not checkpoint.exists():
        raise FileNotFoundError(f"Checkpoint not found: {checkpoint}")

    labels = load_label_metadata(labels_path)

    resolved_tokenizer_config = tokenizer_config
    if resolved_tokenizer_config is None:
        default_dir = Path(__file__).resolve().parent.parent.parent / "artifacts" / "hf_tokenizer"
        chosen_dir = Path(tokenizer_dir) if tokenizer_dir is not None else default_dir
        local_tokenizer_dir = chosen_dir
        if local_tokenizer_dir.exists():
            resolved_tokenizer_config = TokenizerConfig(pretrained_model_name=str(local_tokenizer_dir))
        else:
            raise ValueError(
                "No tokenizer configuration provided and default tokenizer directory "
                f"'{local_tokenizer_dir}' not found. Please provide tokenizer_config parameter or set tokenizer_dir."
            )

    tokenizer = Tokenizer(resolved_tokenizer_config)
    model_config = load_model_config(model_config_path)
    model = build_multitask_model(
        tokenizer,
        num_emotions=labels.emotion_size,
        num_topics=labels.topic_size,
        config=model_config,
    )
    load_state(model, str(checkpoint))

    if isinstance(device, torch.device):
        device_str = str(device)
    else:
        device_str = device

    if summary_max_length is not None:
        pipeline_config = InferenceConfig(summary_max_length=summary_max_length, device=device_str)
    else:
        pipeline_config = InferenceConfig(device=device_str)

    pipeline = InferencePipeline(
        model=model,
        tokenizer=tokenizer,
        config=pipeline_config,
        emotion_labels=labels.emotion,
        topic_labels=labels.topic,
        device=device,
    )
    return pipeline, labels
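A hypothetical invocation of the factory. The artifact paths below are examples only; they depend on where training and the label-metadata export wrote their outputs.

```python
from src.inference import create_inference_pipeline

pipeline, labels = create_inference_pipeline(
    checkpoint_path="artifacts/checkpoints/best.pt",  # example path
    labels_path="artifacts/labels.json",              # example path
    device="cpu",
    summary_max_length=96,
)
print(labels.emotion_size, labels.topic_size)
```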
src/inference/generation.py
ADDED
@@ -0,0 +1,14 @@
"""Generation helpers."""

import torch


def greedy_decode(model: torch.nn.Module, input_ids: torch.Tensor, max_length: int) -> torch.Tensor:
    """Run greedy decoding with ``model.generate`` and return generated token ids."""

    return model.generate(
        input_ids,
        max_length=max_length,
        do_sample=False,
        num_beams=1,
    )
src/inference/pipeline.py
ADDED
@@ -0,0 +1,166 @@
"""Inference helpers for multitask LexiMind models."""
from __future__ import annotations

from dataclasses import dataclass, fields, replace
from typing import Iterable, List, Sequence

import torch
import torch.nn.functional as F

from ..data.preprocessing import Batch, TextPreprocessor
from ..data.tokenization import Tokenizer


@dataclass(slots=True)
class InferenceConfig:
    """Configuration knobs for the inference pipeline."""

    summary_max_length: int = 128
    emotion_threshold: float = 0.5
    device: str | None = None


@dataclass(slots=True)
class EmotionPrediction:
    labels: List[str]
    scores: List[float]


@dataclass(slots=True)
class TopicPrediction:
    label: str
    confidence: float


class InferencePipeline:
    """Run summarization, emotion, and topic heads through a unified interface."""

    def __init__(
        self,
        model: torch.nn.Module,
        tokenizer: Tokenizer,
        *,
        preprocessor: TextPreprocessor | None = None,
        emotion_labels: Sequence[str] | None = None,
        topic_labels: Sequence[str] | None = None,
        config: InferenceConfig | None = None,
        device: torch.device | str | None = None,
    ) -> None:
        self.model = model
        self.tokenizer = tokenizer
        self.config = config or InferenceConfig()
        chosen_device = device or self.config.device
        if chosen_device is None:
            first_param = next(model.parameters(), None)
            chosen_device = first_param.device if first_param is not None else "cpu"
        self.device = torch.device(chosen_device)
        self.model.to(self.device)
        self.model.eval()

        self.preprocessor = preprocessor or TextPreprocessor(tokenizer=tokenizer)
        self.emotion_labels = list(emotion_labels) if emotion_labels is not None else None
        self.topic_labels = list(topic_labels) if topic_labels is not None else None

    def summarize(self, texts: Sequence[str], *, max_length: int | None = None) -> List[str]:
        if not texts:
            return []
        batch = self._batch_to_device(self.preprocessor.batch_encode(texts))
        src_ids = batch.input_ids
        src_mask = batch.attention_mask
        max_len = max_length or self.config.summary_max_length

        if not hasattr(self.model, "encoder") or not hasattr(self.model, "decoder"):
            raise RuntimeError("Model must expose encoder and decoder attributes for summarization.")

        with torch.inference_mode():
            encoder_mask = src_mask.unsqueeze(1) & src_mask.unsqueeze(2) if src_mask is not None else None
            memory = self.model.encoder(src_ids, mask=encoder_mask)
            generated = self.model.decoder.greedy_decode(
                memory=memory,
                max_len=max_len,
                start_token_id=self.tokenizer.bos_token_id,
                end_token_id=self.tokenizer.eos_token_id,
                device=self.device,
            )

        return self.tokenizer.decode_batch(generated.tolist())

    def predict_emotions(
        self,
        texts: Sequence[str],
        *,
        threshold: float | None = None,
    ) -> List[EmotionPrediction]:
        if not texts:
            return []
        if self.emotion_labels is None or not self.emotion_labels:
            raise RuntimeError("emotion_labels must be provided to decode emotion predictions")

        batch = self._batch_to_device(self.preprocessor.batch_encode(texts))
        model_inputs = self._batch_to_model_inputs(batch)
        decision_threshold = threshold or self.config.emotion_threshold

        with torch.inference_mode():
            logits = self.model.forward("emotion", model_inputs)
            probs = torch.sigmoid(logits)

        predictions: List[EmotionPrediction] = []
        for row in probs.cpu():
            pairs = [
                (label, score)
                for label, score in zip(self.emotion_labels, row.tolist())
                if score >= decision_threshold
            ]
            labels = [label for label, _ in pairs]
            scores = [score for _, score in pairs]
            predictions.append(EmotionPrediction(labels=labels, scores=scores))
        return predictions

    def predict_topics(self, texts: Sequence[str]) -> List[TopicPrediction]:
        if not texts:
            return []
        if self.topic_labels is None or not self.topic_labels:
            raise RuntimeError("topic_labels must be provided to decode topic predictions")

        batch = self._batch_to_device(self.preprocessor.batch_encode(texts))
        model_inputs = self._batch_to_model_inputs(batch)

        with torch.inference_mode():
            logits = self.model.forward("topic", model_inputs)
            probs = F.softmax(logits, dim=-1)

        results: List[TopicPrediction] = []
        for row in probs.cpu():
            scores = row.tolist()
            best_index = int(row.argmax().item())
            results.append(TopicPrediction(label=self.topic_labels[best_index], confidence=scores[best_index]))
        return results

    def batch_predict(self, texts: Iterable[str]) -> dict[str, object]:
        text_list = list(texts)
        if self.emotion_labels is None or not self.emotion_labels:
            raise RuntimeError("emotion_labels must be provided for batch predictions")
        if self.topic_labels is None or not self.topic_labels:
            raise RuntimeError("topic_labels must be provided for batch predictions")
        return {
            "summaries": self.summarize(text_list),
            "emotion": self.predict_emotions(text_list),
            "topic": self.predict_topics(text_list),
        }

    def _batch_to_device(self, batch: Batch) -> Batch:
        tensor_updates: dict[str, torch.Tensor] = {}
        for item in fields(batch):
            value = getattr(batch, item.name)
            if torch.is_tensor(value):
                tensor_updates[item.name] = value.to(self.device)
        if not tensor_updates:
            return batch
        return replace(batch, **tensor_updates)

    @staticmethod
    def _batch_to_model_inputs(batch: Batch) -> dict[str, torch.Tensor]:
        inputs: dict[str, torch.Tensor] = {"input_ids": batch.input_ids}
        if batch.attention_mask is not None:
            inputs["attention_mask"] = batch.attention_mask
        return inputs
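A sketch of the unified entry point, reusing the factory from the previous file; the artifact paths are again illustrative.

```python
from src.inference import create_inference_pipeline

pipeline, _ = create_inference_pipeline(
    "artifacts/checkpoints/best.pt", "artifacts/labels.json", device="cpu"
)

results = pipeline.batch_predict(
    ["The central bank raised rates again.", "What a wonderful surprise!"]
)
print(results["summaries"])          # list[str]
print(results["emotion"][1].labels)  # emotions scoring above the 0.5 default threshold
print(results["topic"][0].label, results["topic"][0].confidence)
```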
src/inference/postprocessing.py
ADDED
@@ -0,0 +1,6 @@
"""Output cleaning helpers."""
from typing import List


def strip_whitespace(texts: List[str]) -> List[str]:
    return [text.strip() for text in texts]
src/models/factory.py
ADDED
@@ -0,0 +1,105 @@
"""Factory helpers to assemble multitask models for inference/training."""
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Optional

from ..data.tokenization import Tokenizer
from ..utils.config import load_yaml
from .decoder import TransformerDecoder
from .encoder import TransformerEncoder
from .heads import ClassificationHead, LMHead
from .multitask import MultiTaskModel


@dataclass(slots=True)
class ModelConfig:
    """Configuration describing the transformer architecture."""

    d_model: int = 512
    num_encoder_layers: int = 6
    num_decoder_layers: int = 6
    num_attention_heads: int = 8
    ffn_dim: int = 2048
    dropout: float = 0.1

    def __post_init__(self):
        if self.d_model % self.num_attention_heads != 0:
            raise ValueError(
                f"d_model ({self.d_model}) must be divisible by num_attention_heads ({self.num_attention_heads})"
            )
        if not 0 <= self.dropout <= 1:
            raise ValueError(f"dropout must be in [0, 1], got {self.dropout}")
        if self.d_model <= 0 or self.num_encoder_layers <= 0 or self.num_decoder_layers <= 0:
            raise ValueError("Model dimensions must be positive")
        if self.num_attention_heads <= 0 or self.ffn_dim <= 0:
            raise ValueError("Model dimensions must be positive")


def load_model_config(path: Optional[str | Path]) -> ModelConfig:
    """Load a model configuration from YAML with sane defaults."""

    if path is None:
        return ModelConfig()

    data = load_yaml(str(path)).data
    return ModelConfig(
        d_model=int(data.get("d_model", 512)),
        num_encoder_layers=int(data.get("num_encoder_layers", 6)),
        num_decoder_layers=int(data.get("num_decoder_layers", 6)),
        num_attention_heads=int(data.get("num_attention_heads", 8)),
        ffn_dim=int(data.get("ffn_dim", 2048)),
        dropout=float(data.get("dropout", 0.1)),
    )


def build_multitask_model(
    tokenizer: Tokenizer,
    *,
    num_emotions: int,
    num_topics: int,
    config: ModelConfig | None = None,
) -> MultiTaskModel:
    """Construct the multitask transformer with heads for the three tasks."""

    cfg = config or ModelConfig()
    if not isinstance(num_emotions, int) or num_emotions <= 0:
        raise ValueError("num_emotions must be a positive integer")
    if not isinstance(num_topics, int) or num_topics <= 0:
        raise ValueError("num_topics must be a positive integer")
    encoder = TransformerEncoder(
        vocab_size=tokenizer.vocab_size,
        d_model=cfg.d_model,
        num_layers=cfg.num_encoder_layers,
        num_heads=cfg.num_attention_heads,
        d_ff=cfg.ffn_dim,
        dropout=cfg.dropout,
        max_len=tokenizer.config.max_length,
        pad_token_id=tokenizer.pad_token_id,
    )
    decoder = TransformerDecoder(
        vocab_size=tokenizer.vocab_size,
        d_model=cfg.d_model,
        num_layers=cfg.num_decoder_layers,
        num_heads=cfg.num_attention_heads,
        d_ff=cfg.ffn_dim,
        dropout=cfg.dropout,
        max_len=tokenizer.config.max_length,
        pad_token_id=tokenizer.pad_token_id,
    )

    model = MultiTaskModel(encoder=encoder, decoder=decoder, decoder_outputs_logits=True)
    model.add_head(
        "summarization",
        LMHead(d_model=cfg.d_model, vocab_size=tokenizer.vocab_size, tie_embedding=decoder.embedding),
    )
    model.add_head(
        "emotion",
        ClassificationHead(d_model=cfg.d_model, num_labels=num_emotions, pooler="mean", dropout=cfg.dropout),
    )
    model.add_head(
        "topic",
        ClassificationHead(d_model=cfg.d_model, num_labels=num_topics, pooler="mean", dropout=cfg.dropout),
    )
    return model
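A sketch of assembling the model directly. The head sizes are illustrative (AG News has four topics; the emotion count depends on the dataset actually used), and the default tokenizer download needs network access.

```python
from src.data.tokenization import Tokenizer
from src.models.factory import ModelConfig, build_multitask_model

tokenizer = Tokenizer()  # defaults to facebook/bart-base
model = build_multitask_model(
    tokenizer,
    num_emotions=6,  # illustrative
    num_topics=4,    # e.g. AG News
    config=ModelConfig(d_model=512, num_attention_heads=8),
)
print(sorted(model.heads))  # ['emotion', 'summarization', 'topic']
```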
src/models/multitask.py
CHANGED
@@ -39,17 +39,28 @@ class MultiTaskModel(nn.Module):
     mt = MultiTaskModel(encoder=enc, decoder=dec)
     mt.add_head("summarize", LMHead(...))
     logits = mt.forward("summarize", {"src_ids": src_ids, "tgt_ids": tgt_ids})
+
+    Args:
+        encoder: optional encoder backbone.
+        decoder: optional decoder backbone.
+        decoder_outputs_logits: set True when ``decoder.forward`` already returns vocabulary logits;
+            set False if the decoder produces hidden states that must be projected by the LM head.
     """
 
     def __init__(
         self,
         encoder: Optional[TransformerEncoder] = None,
         decoder: Optional[TransformerDecoder] = None,
+        *,
+        decoder_outputs_logits: bool = True,
     ):
         super().__init__()
         self.encoder = encoder
         self.decoder = decoder
         self.heads: Dict[str, nn.Module] = {}
+        # When True, decoder.forward(...) is expected to return logits already projected to the vocabulary space.
+        # When False, decoder outputs hidden states that must be passed through the registered LM head.
+        self.decoder_outputs_logits = decoder_outputs_logits
 
     def add_head(self, name: str, module: nn.Module) -> None:
         """Register a head under a task name."""
@@ -99,9 +110,15 @@ class MultiTaskModel(nn.Module):
             raise RuntimeError("Encoder is required for encoder-side heads")
         # accept either input_ids or embeddings
         if "input_ids" in inputs:
-            enc_out = self.encoder(inputs["input_ids"])
+            encoder_mask = None
+            if "attention_mask" in inputs:
+                encoder_mask = self._expand_attention_mask(inputs["attention_mask"], inputs["input_ids"].device)
+            enc_out = self.encoder(inputs["input_ids"], mask=encoder_mask)
         elif "embeddings" in inputs:
-            enc_out = self.encoder(inputs["embeddings"])
+            encoder_mask = inputs.get("attention_mask")
+            if encoder_mask is not None:
+                encoder_mask = self._expand_attention_mask(encoder_mask, inputs["embeddings"].device)
+            enc_out = self.encoder(inputs["embeddings"], mask=encoder_mask)
         else:
             raise ValueError("inputs must contain 'input_ids' or 'embeddings' for encoder tasks")
         logits = head(enc_out)
@@ -120,10 +137,20 @@ class MultiTaskModel(nn.Module):
             raise RuntimeError("Both encoder and decoder are required for LM-style heads")
 
         # Build encoder memory
+        src_mask = inputs.get("src_mask")
+        if src_mask is None:
+            src_mask = inputs.get("attention_mask")
+        encoder_mask = None
+        reference_tensor = inputs.get("src_ids")
+        if reference_tensor is None:
+            reference_tensor = inputs.get("src_embeddings")
+        if src_mask is not None and reference_tensor is not None:
+            encoder_mask = self._expand_attention_mask(src_mask, reference_tensor.device)
+
         if "src_ids" in inputs:
-            memory = self.encoder(inputs["src_ids"])
+            memory = self.encoder(inputs["src_ids"], mask=encoder_mask)
         elif "src_embeddings" in inputs:
-            memory = self.encoder(inputs["src_embeddings"])
+            memory = self.encoder(inputs["src_embeddings"], mask=encoder_mask)
         else:
             raise ValueError("inputs must contain 'src_ids' or 'src_embeddings' for seq2seq tasks")
 
@@ -137,12 +164,13 @@ class MultiTaskModel(nn.Module):
         # Here we don't attempt to generate when labels not provided.
             raise ValueError("Seq2seq tasks require 'tgt_ids' or 'tgt_embeddings' for training forward")
 
-        # Run decoder. Decoder returns logits shaped (B, T, vocab) in this codebase.
         decoder_out = self.decoder(decoder_inputs, memory)
 
+        if self.decoder_outputs_logits:
+            if not isinstance(decoder_out, torch.Tensor):
+                raise TypeError(
+                    "Decoder is configured to return logits, but forward returned a non-tensor value."
+                )
             logits = decoder_out
         else:
             logits = head(decoder_out)
@@ -195,4 +223,15 @@ class MultiTaskModel(nn.Module):
             return F.cross_entropy(logits, labels.long())
 
         # If we can't determine, raise
         raise RuntimeError("Cannot compute loss for unknown head type")
+
+    @staticmethod
+    def _expand_attention_mask(mask: torch.Tensor, device: torch.device) -> torch.Tensor:
+        if mask is None:
+            return None  # type: ignore[return-value]
+        bool_mask = mask.to(device=device, dtype=torch.bool)
+        if bool_mask.dim() == 2:
+            return bool_mask.unsqueeze(1) & bool_mask.unsqueeze(2)
+        if bool_mask.dim() in (3, 4):
+            return bool_mask
+        raise ValueError("Attention mask must be 2D, 3D, or 4D tensor")
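The 2D-to-3D expansion in `_expand_attention_mask` is worth seeing concretely: a per-token padding mask becomes a pairwise (query, key) visibility mask via broadcasting. Plain PyTorch, no repo imports:

```python
import torch

mask = torch.tensor([[True, True, False]])        # one sequence, last position is padding
pairwise = mask.unsqueeze(1) & mask.unsqueeze(2)  # broadcast to shape (1, 3, 3)
print(pairwise[0].int())
# tensor([[1, 1, 0],
#         [1, 1, 0],
#         [0, 0, 0]])
```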
src/training/__init__.py
CHANGED
@@ -0,0 +1 @@
"""Training utilities for LexiMind."""
src/training/callbacks.py
ADDED

```python
"""Callback hooks for training."""

from pathlib import Path
from typing import Any, Dict, Optional

import torch


def save_checkpoint(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    epoch: int,
    output_path: str,
    *,
    metrics: Optional[Dict[str, Any]] = None,
) -> None:
    """Persist model and optimizer state for resuming training."""

    checkpoint = {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "epoch": int(epoch),
    }
    if metrics:
        checkpoint["metrics"] = metrics

    # Write atomically: serialize to a sibling temp file, then rename over
    # the target so a crash never leaves a half-written checkpoint behind.
    target = Path(output_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    temp_path = target.parent / f"{target.name}.tmp"
    try:
        torch.save(checkpoint, temp_path)
        temp_path.replace(target)
    finally:
        # A failed save can leave the temp file behind; remove it.
        if temp_path.exists():
            temp_path.unlink(missing_ok=True)
```
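A usage sketch for the checkpoint helper; the toy model, optimizer, epoch, and paths below are placeholders, and the import assumes the repository root is on `PYTHONPATH`:

```python
import torch
from src.training.callbacks import save_checkpoint

model = torch.nn.Linear(8, 2)  # stand-in for the real multi-task model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

save_checkpoint(
    model,
    optimizer,
    epoch=3,
    output_path="checkpoints/epoch_3.pt",
    metrics={"val_loss": 1.234},
)

# Resuming later: the checkpoint is a plain dict with the keys written above.
state = torch.load("checkpoints/epoch_3.pt")
model.load_state_dict(state["model_state_dict"])
optimizer.load_state_dict(state["optimizer_state_dict"])
start_epoch = state["epoch"] + 1
```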
src/training/losses.py
ADDED

```python
"""Loss helpers."""
import torch


def multitask_loss(losses: dict[str, torch.Tensor]) -> torch.Tensor:
    """Average a dict of per-task losses into a single scalar objective."""
    iterator = iter(losses.values())
    try:
        total = next(iterator).clone()
    except StopIteration:
        raise ValueError("losses is empty") from None
    for value in iterator:
        total = total + value
    return total / len(losses)
```
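Note that this helper weights every task uniformly, so any learnable task weighting has to be applied before the losses reach it. A toy invocation, with made-up task names and values:

```python
import torch
from src.training.losses import multitask_loss

losses = {
    "summarization": torch.tensor(2.1, requires_grad=True),
    "emotion": torch.tensor(0.7, requires_grad=True),
}
total = multitask_loss(losses)  # (2.1 + 0.7) / 2 = 1.4
total.backward()                # gradients flow back to every task's loss
```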
src/training/metrics.py
ADDED

```python
"""Metric helpers used during training and evaluation."""
from __future__ import annotations

from typing import Sequence

import torch


def accuracy(predictions: Sequence[int], targets: Sequence[int]) -> float:
    """Fraction of exact matches; empty inputs score 0.0."""
    matches = sum(int(pred == target) for pred, target in zip(predictions, targets))
    return matches / max(1, len(predictions))


def multilabel_f1(predictions: torch.Tensor, targets: torch.Tensor) -> float:
    """Sample-averaged F1 over binary {0, 1} label matrices of shape (batch, num_labels)."""
    preds = predictions.float()
    gold = targets.float()
    true_positive = (preds * gold).sum(dim=1)
    precision = true_positive / preds.sum(dim=1).clamp(min=1.0)
    recall = true_positive / gold.sum(dim=1).clamp(min=1.0)
    f1 = (2 * precision * recall) / (precision + recall).clamp(min=1e-8)
    return float(f1.mean().item())


def rouge_like(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Cheap ROUGE-1-recall proxy: unigram overlap divided by reference length."""
    if not predictions or not references:
        return 0.0
    scores = []
    for pred, ref in zip(predictions, references):
        pred_tokens = pred.split()
        ref_tokens = ref.split()
        if not ref_tokens:
            scores.append(0.0)
            continue
        overlap = len(set(pred_tokens) & set(ref_tokens))
        scores.append(overlap / len(ref_tokens))
    return sum(scores) / len(scores)
```
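Worked values for the three metrics, using small hand-checkable inputs (everything here refers only to the functions above):

```python
import torch
from src.training.metrics import accuracy, multilabel_f1, rouge_like

print(accuracy([1, 0, 2], [1, 1, 2]))  # 2 of 3 match -> 0.666...

preds = torch.tensor([[1, 0, 1], [0, 1, 0]])
gold = torch.tensor([[1, 1, 0], [0, 1, 0]])
# Sample 1: tp=1, precision=0.5, recall=0.5, F1=0.5; sample 2: F1=1.0
print(multilabel_f1(preds, gold))  # mean -> 0.75

# 3 of the 4 reference unigrams appear in the prediction -> 0.75
print(rouge_like(["the cat sat"], ["the cat sat down"]))
```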