A stateful multimodal architecture that understands human emotions at a biological level, retains structured memory inside the model weights, and generates true intent — not just responses.
Transformer ↔ Mamba2 SSM ↔ Transformer · Stateful · Multimodal · Multilingual
Current models process text. Humans communicate through voice, tone, context, history, and biological state simultaneously. No existing architecture unifies these signals.
User says "I'm fine" — every model believes them. We don't. We check the voice, the memory, the context. We find the truth beneath the words.
GPT-4, Claude, Gemini — all stateless. They forget everything when the context window ends. We remember. Persistently. Inside the model state itself.
When someone masks pain, current models generate "Great to hear!" — the worst possible response. divAIne generates comfort. Because it sees through.
A Transformer Encoder → Mamba2 State Space Model → Transformer Decoder pipeline, with a Biological Emotion Intelligence head running in parallel.
Multi-head self-attention across all modality tokens. Learns which modality to trust when they conflict. When voice says "sad" but text says "fine", this layer learns to weight the body signal higher.
A Selective State Space Model that maintains a persistent hidden state. The state is partitioned into 5 cognitive lanes, each with its own gate, so every lane decides independently what to remember and what to forget.
Cross-attends to both encoder output AND Hippocampus memory state. Generates intent classification, response tokens, and updated BEI state — all grounded in emotion, memory, and context.
Parallel emotion extraction from voice and text, producing a 10-dimensional biological emotion vector. Detects cross-modal conflicts per dimension.
SSM state partitioned into cognitively structured lanes. Each lane has independent gating — a σ(W·[x, h]) gate that decides what to keep vs. overwrite.
Transformer handles attention (what's relevant). SSM handles state (what to remember). Second Transformer reasons over both. No single architecture can do all three.
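Below is a minimal PyTorch sketch of that three-stage composition. It is an illustration under assumptions, not the released code: an `nn.GRUCell` stands in for the Mamba2 block, `d_lane = 64` is a placeholder, and mean pooling is used wherever a pooling choice is needed.

```python
import torch
import torch.nn as nn

class PipelineSketch(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_lanes=5, d_lane=64,
                 n_intents=9, bei_dims=10):
        super().__init__()
        enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.normalizer = nn.TransformerEncoder(enc, num_layers=2)     # attention: what's relevant
        self.hippocampus = nn.GRUCell(d_model, n_lanes * d_lane)       # stand-in for the Mamba2 state block
        dec = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.reasoner = nn.TransformerDecoder(dec, num_layers=2)       # reasons over features + memory
        self.mem_proj = nn.Linear(n_lanes * d_lane, d_model)
        self.intent_head = nn.Linear(d_model, n_intents)
        self.bei_head = nn.Linear(d_model, bei_dims)

    def forward(self, fused_tokens, tgt_tokens, memory_state):
        # fused_tokens: (B, S, d_model), tgt_tokens: (B, T, d_model),
        # memory_state: (B, n_lanes * d_lane) carried over from the last conversation.
        enc_out = self.normalizer(fused_tokens)                         # what's relevant
        new_state = self.hippocampus(enc_out.mean(dim=1), memory_state) # what to remember
        memory_token = self.mem_proj(new_state).unsqueeze(1)            # memory as one extra token
        dec_out = self.reasoner(tgt_tokens, torch.cat([enc_out, memory_token], dim=1))
        pooled = dec_out.mean(dim=1)
        return self.intent_head(pooled), torch.tanh(self.bei_head(pooled)), new_state
```

Passing the previous conversation's `memory_state` in and keeping the returned `new_state` is what makes even this sketch stateful across sessions.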
We model the biological state of the human nervous system across 10 continuous dimensions. Not 6 discrete labels — a full biological manifold.
Pleasant ↔ Unpleasant
Calm ↔ Intense
Overwhelmed ↔ Balanced
Threat ↔ Opportunity
Withdraw ↔ Approach
Freeze ↔ Fight
Spike ↔ Sustained
Hopeless ↔ Hopeful
Hypervigilant ↔ Safe
Helpless ↔ Empowered
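For illustration, here is one way to carry that 10-dimensional vector with named axes in Python. The shorthand axis names below are assumptions chosen to match the poles above; they are not the project's official labels.

```python
import torch

# Assumed shorthand names for the 10 poles listed above (not official labels).
BEI_AXES = [
    "valence",        # Pleasant <-> Unpleasant
    "arousal",        # Calm <-> Intense
    "regulation",     # Overwhelmed <-> Balanced
    "appraisal",      # Threat <-> Opportunity
    "approach",       # Withdraw <-> Approach
    "mobilization",   # Freeze <-> Fight
    "temporality",    # Spike <-> Sustained
    "hope",           # Hopeless <-> Hopeful
    "safety",         # Hypervigilant <-> Safe
    "agency",         # Helpless <-> Empowered
]

def bei_to_dict(vec: torch.Tensor) -> dict:
    """Name the entries of a 10-dim BEI vector (values in [-1, 1], matching the tanh head below)."""
    assert vec.shape[-1] == len(BEI_AXES)
    return dict(zip(BEI_AXES, vec.tolist()))

print(bei_to_dict(torch.zeros(10)))   # a neutral reading on every axis
```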
When voice and text BEI vectors disagree, the model detects per-dimension conflicts:
"I'm fine bro, chill"
Trembling, low energy, long pauses
→ Normalizer attention weights: α(voice) = 0.75, α(text) = 0.25.
→ Intent: GENTLE_PRESENCE, not "Great to hear you're
fine!"
The SSM hidden state h_t is partitioned into 5 cognitive lanes (TRUTH, VALUE, SELF, SOCIAL, MEANING). Each lane has its own gating function and decides independently what to remember and what to forget.
What actually happened? Facts, events, verified information. "Breakup happened yesterday."
What matters to them? Moral judgments, core values. "Loyalty matters more than success."
How do they feel about it? Emotional trajectory, identity shifts. "Grieving, masking pain."
What does it mean to them? Personal significance. "Career defines self-worth."
State Space Update (per lane):
g_i = σ(W_gate · [x_t, h_{t-1}[lane_i]]) # gate ∈ [0, 1]
h_t[lane_i] = g_i · (A_i · h_{t-1}[lane_i] + B_i · x_t) + (1 - g_i) · h_{t-1}[lane_i]   # gated update
Key: h_t is SAVED after each conversation.
Next conversation starts with h_t as initial state.
Memory = State. No database. No RAG. Just math.
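The same per-lane update, written as a runnable PyTorch sketch. The explicit W_gate, A_i, and B_i parameters and d_lane = 64 are illustrative stand-ins for what the Mamba2 block computes internally.

```python
import torch
import torch.nn as nn

class GatedLaneUpdate(nn.Module):
    def __init__(self, d_in: int = 128, n_lanes: int = 5, d_lane: int = 64):
        super().__init__()
        self.n_lanes = n_lanes
        self.gates = nn.ModuleList([nn.Linear(d_in + d_lane, 1) for _ in range(n_lanes)])
        self.A = nn.ParameterList([nn.Parameter(0.9 * torch.eye(d_lane)) for _ in range(n_lanes)])
        self.B = nn.ModuleList([nn.Linear(d_in, d_lane, bias=False) for _ in range(n_lanes)])

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        # x_t: (B, d_in) fused input features; h_prev: (B, n_lanes, d_lane) previous state.
        lanes = []
        for i in range(self.n_lanes):
            h_i = h_prev[:, i]                                              # h_{t-1}[lane_i]
            g_i = torch.sigmoid(self.gates[i](torch.cat([x_t, h_i], -1)))   # gate in [0, 1]
            update = h_i @ self.A[i].T + self.B[i](x_t)                     # A_i·h + B_i·x
            lanes.append(g_i * update + (1.0 - g_i) * h_i)                  # gated mix
        return torch.stack(lanes, dim=1)                                    # h_t, same shape as h_prev
```

A gate that stays near zero leaves its lane untouched, which is how one lane can hold a fact across many turns while the other lanes keep updating.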
v0.1 was trained on a Tesla T4 GPU on Kaggle. Here are the actual results, with no cherry-picking.
| Epoch | Train Loss | Val Loss | Intent Acc | BEI MAE |
|---|---|---|---|---|
| 1 | 7.22 | 4.19 | 29.4% | 0.416 |
| 3 | 2.50 | 2.28 | 70.1% | 0.324 |
| 6 | 1.60 | 1.59 | 100.0% | 0.248 |
| 15 | 1.52 | 1.53 | 100.0% | 0.143 |
| 30 | 1.512 | 1.523 | 100.0% | 0.130 |
"I'm fine" → Positive sentiment → "Great to hear you're doing well!"
The worst possible response to someone masking pain.
"I'm fine" + sad voice + breakup memory → "Main yahan hoon, bro. Koi rush nahi hai."
Gentle presence. No forced positivity. Respects the mask while holding space.
The Normalizer catches the voice-text conflict. The Hippocampus provides breakup context from memory. The Reasoner generates intent that respects all three signals.
divAIne M1 is not another LLM. It's an intent-and-emotion architecture that sits alongside LLMs.
| Dimension | LLMs (GPT, Claude, Gemini) | Emotion AI (Hume, Affectiva) | divAIne M1 |
|---|---|---|---|
| Core Unit | Token | Discrete label | Intent + 10D Emotion |
| Memory | Context window (resets) | None | Persistent 5-lane state |
| Emotion Model | Text sentiment | 6 basic categories | 10-dim biological vector |
| Conflict Detection | Cannot | Single-modal | Cross-modal per-dimension |
| Architecture | Transformer only | CNN/RNN classifier | T-SSM-T hybrid |
| Output | Next token | Emotion label | True intent + emotion + state |
| Learning | Frozen at training | Frozen at training | State evolves per interaction |
LLMs predict the most likely next token. When a user says "I'm fine", the most likely continuation is positive. They have no mechanism to check against voice, memory, or biological state. This is an architectural limitation, not a fine-tuning problem.
divAIne doesn't replace LLMs — it provides the missing intelligence layer. It reads true state, detects intent, and feeds this to any LLM. The LLM has knowledge. divAIne provides wisdom about how and when to use it.
The complete data flow as mathematical operations.
x_voice = VoiceEncoder(audio) + ModToken("voice") ∈ ℝ^(S1 × d)
x_text = TextEncoder(tokens) + ModToken("text") ∈ ℝ^(S2 × d)
x_ctx = ContextEncoder(t, loc) + ModToken("ctx") ∈ ℝ^(S3 × d)
e_bei = BEI_Head(voice, text) ∈ ℝ^10
X_fused = GatedConcat(x_voice, x_text, x_ctx, Proj(e_bei)) ∈ ℝ^(S × d)
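A hedged sketch of the GatedConcat step under one possible design: a learned scalar gate per stream, with the 10-dim BEI vector projected to d_model and appended as a single token. Only the inputs come from the equations above; the gating scheme itself is an assumption.

```python
import torch
import torch.nn as nn

class GatedConcatSketch(nn.Module):
    """Fuse the modality token streams plus the projected BEI vector into one sequence."""
    def __init__(self, d_model: int = 128, n_streams: int = 4):
        super().__init__()
        self.bei_proj = nn.Linear(10, d_model)                    # Proj(e_bei)
        self.stream_gate = nn.Parameter(torch.zeros(n_streams))   # one learned gate per stream

    def forward(self, x_voice, x_text, x_ctx, e_bei):
        # x_*: (B, S_i, d_model); e_bei: (B, 10)
        streams = [x_voice, x_text, x_ctx, self.bei_proj(e_bei).unsqueeze(1)]
        g = torch.sigmoid(self.stream_gate)                       # gates in (0, 1)
        gated = [g[i] * s for i, s in enumerate(streams)]
        return torch.cat(gated, dim=1)                            # X_fused: (B, S1+S2+S3+1, d_model)
```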
For each layer l = 1..L₁:
Q, K, V = X·W_Q, X·W_K, X·W_V
Attn = softmax(Q·Kᵀ / √d_k) · V
X = LayerNorm(X + Attn)
X = LayerNorm(X + FFN(X))
Key: attention weights learn α(voice→output) >> α(text→output)
when they conflict. "The body does not lie."
Selective State Space Model:
h_t = Ā · h_{t-1} + B̄ · x_t (state update)
y_t = C · h_t + D · x_t (output)
h_t ∈ ℝ^(5 × d_lane) = [TRUTH | VALUE | SELF | SOCIAL | MEANING]
Per-Lane Gating:
g_i = σ(W_gate_i · [x_t, h_{t-1}[lane_i]])                                              (gate ∈ [0, 1])
h_t[lane_i] = g_i · (A_i · h_{t-1}[lane_i] + B_i · x_t) + (1 - g_i) · h_{t-1}[lane_i]   (gated mix)
h_t is SAVED. Next conversation loads h_t as h_0.
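A minimal sketch of that save/load cycle, assuming one serialized state per user via torch.save and torch.load. The directory layout and keying by user_id are illustrative.

```python
import os
import torch

STATE_DIR = "hippocampus_states"     # illustrative storage location

def save_state(h_t: torch.Tensor, user_id: str) -> None:
    """Persist the SSM state at the end of a conversation."""
    os.makedirs(STATE_DIR, exist_ok=True)
    torch.save(h_t.detach().cpu(), os.path.join(STATE_DIR, f"{user_id}.pt"))

def load_state(user_id: str, n_lanes: int = 5, d_lane: int = 64) -> torch.Tensor:
    """Load the previous conversation's h_t as h_0, or start from a blank state."""
    path = os.path.join(STATE_DIR, f"{user_id}.pt")
    if os.path.exists(path):
        return torch.load(path)
    return torch.zeros(1, n_lanes, d_lane)    # first conversation: empty memory
```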
For each layer l = 1..L₃:
Y = LayerNorm(Y + MHSA(Y, Y, Y))                        (autoregressive self-attention)
Y = LayerNorm(Y + MHCA(Y, [Enc_out ; Mamba_state]))     (cross-attend to memory + features)
Y = LayerNorm(Y + FFN(Y))
Output Heads:
intent = softmax(W_intent · Pool(Y)) ∈ ℝ^9 (what to DO)
tokens = softmax(W_lm · Y) ∈ ℝ^vocab (what to SAY)
bei_out = tanh(W_bei · Pool(Y)) ∈ ℝ^10 (how they FEEL)
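The three heads as a runnable PyTorch sketch. The vocabulary size and the use of mean pooling for Pool(Y) are assumptions; logits are returned raw, with softmax applied at loss or inference time.

```python
import torch
import torch.nn as nn

class OutputHeads(nn.Module):
    def __init__(self, d_model: int = 128, n_intents: int = 9,
                 vocab_size: int = 8000, bei_dims: int = 10):
        super().__init__()
        self.intent = nn.Linear(d_model, n_intents)   # what to DO
        self.lm = nn.Linear(d_model, vocab_size)      # what to SAY
        self.bei = nn.Linear(d_model, bei_dims)       # how they FEEL

    def forward(self, decoder_out: torch.Tensor):
        # decoder_out: (B, T, d_model) = Y from the Reasoner
        pooled = decoder_out.mean(dim=1)              # Pool(Y), assumed to be mean pooling
        intent_logits = self.intent(pooled)           # softmax over 9 intents at inference
        token_logits = self.lm(decoder_out)           # per-position vocabulary logits
        bei_out = torch.tanh(self.bei(pooled))        # 10-dim BEI in [-1, 1]
        return intent_logits, token_logits, bei_out
```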
L = w₁·L_intent (cross-entropy: 9-class intent classification)
+ w₂·L_response (cross-entropy: next-token prediction)
+ w₃·L_bei (MSE: 10-dim BEI supervision)
+ w₄·L_conflict (contrastive: voice > text when mismatch)
+ w₅·L_memory (MSE: state consistency across save/load)
5 loss components, jointly optimized. Each teaches a different skill.
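A hedged sketch of how those five terms could be combined. The weights, the 0.2 contrastive margin, and the exact form of the memory term are assumptions; only the five components themselves come from the breakdown above.

```python
import torch
import torch.nn.functional as F

def total_loss(intent_logits, intent_tgt,           # (B, 9), (B,)
               token_logits, token_tgt,             # (B, T, vocab), (B, T)
               bei_pred, bei_tgt,                   # (B, 10), (B, 10)
               alpha_voice, alpha_text, conflict,   # (B,) attention weights + conflict indicator
               state_saved, state_reloaded,         # SSM state before/after a save-load cycle
               w=(1.0, 1.0, 1.0, 0.5, 0.5), margin=0.2):
    l_intent   = F.cross_entropy(intent_logits, intent_tgt)
    l_response = F.cross_entropy(token_logits.transpose(1, 2), token_tgt)
    l_bei      = F.mse_loss(bei_pred, bei_tgt)
    # Contrastive term: when modalities conflict, push the voice weight above the text weight.
    l_conflict = (conflict * F.relu(margin - (alpha_voice - alpha_text))).mean()
    l_memory   = F.mse_loss(state_saved, state_reloaded)   # state consistency across save/load
    return (w[0] * l_intent + w[1] * l_response + w[2] * l_bei
            + w[3] * l_conflict + w[4] * l_memory)
```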
The model doesn't just classify — it chooses the right response strategy based on emotion, memory, and context.
The ✅ row is the trained configuration that exists and runs today; the larger rows are the planned scaling path, not measured results.
| Version | Params | d_model | Layers (E/M/D) | Data | Purpose |
|---|---|---|---|---|---|
| v0.1 ✅ | 1.7M | 128 | 2/1/2 | 2K synthetic | Prove architecture |
| v0.5 | ~45M | 512 | 4/2/4 | 50K real | Demo quality |
| v1.0 | ~150M | 768 | 6/4/6 | 500K real | Production prototype |
| v2.0 | ~500M | 1024 | 12/8/12 | 1M+ real | Production-ready |
Novel contribution · Beyond Russell's 2D circumplex: 10 dimensions grounded in Polyvagal Theory, Appraisal Theory, and Agency Theory. Continuous, not categorical.
Novel contribution · SSM state partitioned into 5 cognitively structured lanes with independent gating. No prior work structures memory this way.
Novel contribution · Per-dimension conflict detection between voice and text BEI signals, with learned attention weights that resolve who to trust.
Novel contribution · Mamba2 hidden state as never-resetting memory. True continual learning via state updates, not weight updates. No catastrophic forgetting.
Novel contribution · Intent = f(BEI, Memory, Context). Three synced signals determine action, not text alone. A new formulation of intent generation in AI.