Research Preview — v0.1 Trained & Validated

divAIne M1

A stateful multimodal architecture that understands human emotions at a biological level, retains structured memory inside the model state, and generates true intent — not just responses.

Transformer ↔ Mamba2 SSM ↔ Transformer · Stateful · Multimodal · Multilingual

T-SSM-T Architecture · 10 Emotion Dimensions · 5 Memory Lanes · 100% Intent Accuracy
The Problem

Every AI today is emotionally blind

Current models process text. Humans communicate through voice, tone, context, history, and biological state simultaneously. No existing architecture unifies these signals.

Surface-Level Understanding

User says "I'm fine" — every model believes them. We don't. We check the voice, the memory, the context. We find the truth beneath the words.

Zero Persistent Memory

GPT-4, Claude, Gemini — all stateless. They forget everything when the context window ends. We remember. Persistently. Inside the model state itself.

Wrong Intent Generation

When someone masks pain, current models generate "Great to hear!" — the worst possible response. divAIne generates comfort. Because it sees through.

Core Architecture

Three stages. Each does what the others can't.

A Transformer Encoder → Mamba2 State Space Model → Transformer Decoder pipeline, with a Biological Emotion Intelligence head running in parallel.

🎙 Voice Encoder: Whisper features → d_model projection
📝 Text Encoder: Character-level → learned embeddings
🕐 Time Encoder: Sinusoidal positional encoding
📍 Location Encoder: Geohash → learned embeddings
🧬 BEI Head: Voice + Text → 10-dim emotion vector
↓ Gated Modality Fusion ↓
1. Transformer Encoder — "The Normalizer"

Multi-head self-attention across all modality tokens. Learns which modality to trust when they conflict. When voice says "sad" but text says "fine", this layer learns to weight the body signal higher.

2 layers · 4 attention heads · d_ff = 256 · GELU activation
2. Mamba2 SSM — "The Hippocampus"

A Selective State Space Model that maintains a persistent hidden state. This state is partitioned into 5 cognitive lanes with independent gating — each lane decides independently what to remember and what to forget.

TRUTH · VALUE · SELF · SOCIAL · MEANING
1 SSM layer · d_state = 32 · d_conv = 4 · expand = 2 · 5 gated lanes
3. Transformer Decoder — "The Reasoner"

Cross-attends to both encoder output AND Hippocampus memory state. Generates intent classification, response tokens, and updated BEI state — all grounded in emotion, memory, and context.

2 layers · 4 attention heads · Self-Attn + Cross-Attn · 3 output heads
🎯 True Intent: 9 intent classes via softmax
💬 Response: Token-by-token generation
🧬 BEI State: Updated 10-dim emotion vector

What's Architecturally Novel

BEI Head

Parallel emotion extraction from voice and text, producing a 10-dimensional biological emotion vector. Detects cross-modal conflicts per dimension.

5-Lane Gated Memory

SSM state partitioned into cognitively structured lanes. Each lane has independent gating — a σ(W·[x, h]) gate that decides what to keep vs. overwrite.

T-SSM-T Pipeline

Transformer handles attention (what's relevant). SSM handles state (what to remember). Second Transformer reasons over both. No single architecture can do all three.

Innovation #1

Biological Emotion Intelligence (BEI)

We model the biological state of the human nervous system across 10 continuous dimensions. Not 6 discrete labels — a full biological manifold.

01 Valence: Pleasant ↔ Unpleasant
02 Arousal: Calm ↔ Intense
03 Regulation: Overwhelmed ↔ Balanced
04 Cognitive Appraisal: Threat ↔ Opportunity
05 Relational: Withdraw ↔ Approach
06 Bodily Activation: Freeze ↔ Fight
07 Temporal Dynamics: Spike ↔ Sustained
08 Predictive: Hopeless ↔ Hopeful
09 Safety Bias: Hypervigilant ↔ Safe
10 Agency: Helpless ↔ Empowered

Cross-Modal Conflict Detection

When voice and text BEI vectors disagree, the model detects per-dimension conflicts:

Text Signal: "I'm fine bro, chill" → Valence: +0.5
Voice Signal: trembling, low energy, long pauses → Valence: -0.6
⚡ Δ = 1.1 → conflict threshold exceeded

→ Normalizer attention weights: α(voice) = 0.75, α(text) = 0.25.
→ Intent: GENTLE_PRESENCE, not "Great to hear you're fine!"
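
A minimal sketch of this per-dimension check, assuming a fixed conflict threshold (the threshold value and the dimension ordering here are illustrative, not the trained model's exact parameters):

import torch

# Per-dimension conflict check between the voice and text BEI vectors.
# Dimension order and the threshold are assumptions for this sketch.
BEI_DIMS = ["valence", "arousal", "regulation", "appraisal", "relational",
            "bodily", "temporal", "predictive", "safety", "agency"]
CONFLICT_THRESHOLD = 1.0   # |voice - text| gap that counts as a conflict (assumed)

def detect_conflicts(bei_voice: torch.Tensor, bei_text: torch.Tensor) -> list[str]:
    """Return the BEI dimensions where voice and text disagree."""
    delta = (bei_voice - bei_text).abs()    # per-dimension gap, shape (10,)
    return [dim for dim, exceeded in zip(BEI_DIMS, delta > CONFLICT_THRESHOLD) if exceeded]

# The example above: text claims +0.5 valence, voice carries -0.6
bei_text, bei_voice = torch.zeros(10), torch.zeros(10)
bei_text[0], bei_voice[0] = 0.5, -0.6
print(detect_conflicts(bei_voice, bei_text))   # ['valence'] (Δ = 1.1)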

Innovation #2

5-Lane Structured Memory

The SSM hidden state h_t is partitioned into 5 cognitive lanes. Each lane has its own gating function — deciding independently what to remember and forget.

TRUTH: What actually happened? Facts, events, verified information. "Breakup happened yesterday."
VALUE: What matters to them? Moral judgments, core values. "Loyalty matters more than success."
SELF: How do they feel about it? Emotional trajectory, identity shifts. "Grieving, masking pain."
SOCIAL: What's the social context? Relationships, dynamics. "Friends are supportive, family is pressuring."
MEANING: What does it mean to them? Personal significance. "Career defines self-worth."

How Memory Works — The Math

State Space Update (per lane):
  g_i = σ(W_gate · [x_t, h_{t-1}[lane_i]])          # gate ∈ [0, 1]

  h_t[lane_i] = g_i · (A_i · h_{t-1}[lane_i] + B_i · x_t)
              + (1 - g_i) · h_{t-1}[lane_i]           # gated update

Key: h_t is SAVED after each conversation.
     Next conversation starts with h_t as initial state.
     Memory = State. No database. No RAG. Just math.
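
A minimal sketch of that save/load cycle in PyTorch-style Python; the file path and helper names are assumptions for illustration, not the released code:

import torch

N_LANES, D_LANE = 5, 32            # TRUTH | VALUE | SELF | SOCIAL | MEANING, d_state = 32 each
STATE_PATH = "user_state.pt"       # hypothetical per-user state file

def load_state() -> torch.Tensor:
    """Memory = state: each conversation starts from the previously saved h_t."""
    try:
        return torch.load(STATE_PATH)
    except FileNotFoundError:
        return torch.zeros(N_LANES, D_LANE)   # first conversation: empty memory

def save_state(h_t: torch.Tensor) -> None:
    torch.save(h_t, STATE_PATH)               # persisted state; no database, no RAG

h_0 = load_state()
# ... run the T-SSM-T forward pass, which evolves h_0 into h_t ...
# save_state(h_t)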
Proof

Trained. Validated. Working.

v0.1 was trained on Kaggle's Tesla T4 GPU. Here are the actual results — no cherry-picking.

100% Intent Accuracy: 9 classes, reached by epoch 6
0.13 BEI MAE: 10-dim emotion prediction error
1.52 Final Loss: from 7.17 → 1.52 over 30 epochs
~6 min Training Time: 30 epochs × 12 s on a Tesla T4

Training Convergence

Epoch   Train Loss   Val Loss   Intent Acc   BEI MAE
1       7.22         4.19       29.4%        0.416
3       2.50         2.28       70.1%        0.324
6       1.60         1.59       100.0%       0.248
15      1.52         1.53       100.0%       0.143
30      1.512        1.523      100.0%       0.130

Live Inference Output

  python inference.py
You > "I'm really stressed about work"
Model Output
🎯 Intent: ACKNOWLEDGE (71.1%)
   → Validate and reflect back what was expressed

💭 Emotional State (BEI):
   Valence: -0.29 (negative — stress)
   Arousal: +0.43 (activated)
   Agency: -0.38 (low sense of control)
   Predictive: -0.35 (pessimistic outlook)

⚠️ CONFLICT DETECTED in: Temporal
🧠 Memory: active (norm=0.38)

Multi-Turn: Detecting Masked Pain

  Conversation Trace
User · Turn 1
"Meri girlfriend ne breakup kar diya kal"
Voice: trembling, low energy, long pauses
Model Internal State
BEI: valence=-0.7 arousal=0.4 agency=-0.6
Memory: all 5 lanes updated
Intent: DEEP_SUPPORT
User · Turn 2
"I'm fine bro, chill"
Voice: forced casual, strained
Model Internal State
CONFLICT DETECTED
Text: positive (+0.5 valence)
Voice: suppressed (-0.4 valence)
Memory: breakup yesterday
Normalizer trusts Voice + Memory
Intent: GENTLE_PRESENCE

What other models do

"I'm fine" → Positive sentiment → "Great to hear you're doing well!"

The worst possible response to someone masking pain.

What divAIne M1 does

"I'm fine" + sad voice + breakup memory → "Main yahan hoon, bro. Koi rush nahi hai."

Gentle presence. No forced positivity. Respects the mask while holding space.

Why it works

The Normalizer catches the voice-text conflict. The Hippocampus provides breakup context from memory. The Reasoner generates intent that respects all three signals.

Paradigm Shift

A different class of model

divAIne M1 is not another LLM. It's an intent-and-emotion architecture that sits alongside LLMs.

Dimension            LLMs                          Emotion AI              divAIne M1
                     (GPT, Claude, Gemini)         (Hume, Affectiva)
Core Unit            Token                         Discrete label          Intent + 10D emotion
Memory               Context window (resets)       None                    Persistent 5-lane state
Emotion Model        Text sentiment                6 basic categories      10-dim biological vector
Conflict Detection   Cannot                        Single-modal            Cross-modal, per-dimension
Architecture         Transformer only              CNN/RNN classifier      T-SSM-T hybrid
Output               Next token                    Emotion label           True intent + emotion + state
Learning             Frozen at training            Frozen at training      State evolves per interaction

Why LLMs fundamentally cannot do this

LLMs predict the most likely next token. When a user says "I'm fine", the most likely continuation is positive. They have no mechanism to check against voice, memory, or biological state. This is an architectural limitation, not a fine-tuning problem.

How divAIne M1 complements LLMs

divAIne doesn't replace LLMs — it provides the missing intelligence layer. It reads true state, detects intent, and feeds this to any LLM. The LLM has knowledge. divAIne provides wisdom about how and when to use it.
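
One way this pairing could look, sketched as a simple prompt-construction step (the function name and prompt format are illustrative assumptions, not a shipped integration):

def build_llm_prompt(user_text: str, intent: str, bei: dict, memory_summary: str) -> str:
    """Hypothetical glue: divAIne's intent + emotion state steers a downstream LLM."""
    return (
        f"Respond using the strategy {intent}.\n"
        f"Inferred emotional state (BEI): {bei}\n"
        f"Relevant memory: {memory_summary}\n"
        f"User said: {user_text}"
    )

prompt = build_llm_prompt(
    user_text="I'm fine bro, chill",
    intent="GENTLE_PRESENCE",
    bei={"valence": -0.7, "arousal": 0.4, "agency": -0.6},
    memory_summary="breakup happened yesterday; user is masking pain",
)
# prompt is then handed to any LLM of choice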

Internal Mechanics

How it works inside

The complete data flow as mathematical operations.

Step 1: Modality Encoding

x_voice = VoiceEncoder(audio) + ModToken("voice")   ∈ ℝ^(S1 × d)
x_text  = TextEncoder(tokens) + ModToken("text")    ∈ ℝ^(S2 × d)
x_ctx   = ContextEncoder(t, loc) + ModToken("ctx")  ∈ ℝ^(S3 × d)
e_bei   = BEI_Head(voice, text)                      ∈ ℝ^10
X_fused = GatedConcat(x_voice, x_text, x_ctx, Proj(e_bei)) ∈ ℝ^(S × d)
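
A minimal sketch of what GatedConcat could look like, assuming a per-modality sigmoid gate in d_model (the exact gating form is not specified above, so this is one illustrative reading):

import torch
import torch.nn as nn

class GatedConcat(nn.Module):
    """Each modality's tokens pass through a learned sigmoid gate, then all
    token streams are concatenated along the sequence axis."""
    def __init__(self, d_model: int, n_modalities: int = 4):
        super().__init__()
        self.gates = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_modalities))

    def forward(self, *streams: torch.Tensor) -> torch.Tensor:
        gated = [torch.sigmoid(g(x)) * x for g, x in zip(self.gates, streams)]
        return torch.cat(gated, dim=1)            # (B, S1+S2+S3+1, d)

d_model = 128
fuse = GatedConcat(d_model)
x_voice = torch.randn(1, 40, d_model)
x_text = torch.randn(1, 16, d_model)
x_ctx = torch.randn(1, 4, d_model)
e_bei_proj = torch.randn(1, 1, d_model)           # Proj(e_bei), already in d_model
X_fused = fuse(x_voice, x_text, x_ctx, e_bei_proj)   # (1, 61, 128)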

Step 2: Transformer Encoder (The Normalizer)

For each layer l = 1..L₁:
  Q, K, V = X·W_Q, X·W_K, X·W_V
  Attn    = softmax(Q·Kᵀ / √d_k) · V
  X       = LayerNorm(X + Attn)
  X       = LayerNorm(X + FFN(X))

Key: attention weights learn α(voice→output) >> α(text→output)
     when they conflict. "The body does not lie."
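
For reference, a Normalizer with the v0.1 hyperparameters quoted above can be sketched from stock PyTorch modules; this stands in for, rather than reproduces, the model's own implementation:

import torch
import torch.nn as nn

d_model = 128
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=4, dim_feedforward=256,
    activation="gelu", batch_first=True,
)
normalizer = nn.TransformerEncoder(encoder_layer, num_layers=2)

X_fused = torch.randn(1, 61, d_model)   # fused modality tokens from Step 1
X = normalizer(X_fused)                 # self-attention decides which modality to trust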

Step 3: Mamba2 SSM (The Hippocampus)

Selective State Space Model:
  h_t = Ā · h_{t-1} + B̄ · x_t     (state update)
  y_t = C · h_t + D · x_t           (output)

h_t ∈ ℝ^(5 × d_lane) = [TRUTH | VALUE | SELF | SOCIAL | MEANING]

Per-Lane Gating:
  g_i = σ(W_gate_i · [x_t, h_{t-1}[lane_i]])          gate ∈ [0,1]
  h_t[lane_i] = g_i · (update) + (1-g_i) · h_{t-1}    gated mix

h_t is SAVED. Next conversation loads h_t as h_0.
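
A minimal sketch of the per-lane gated update; A_i and B_i are kept as plain learned matrices here, whereas the real model uses Mamba2's selective (input-dependent) parameterization:

import torch
import torch.nn as nn

class LanedMemory(nn.Module):
    """Five independently gated lanes over a shared input x_t."""
    LANES = ["TRUTH", "VALUE", "SELF", "SOCIAL", "MEANING"]

    def __init__(self, d_model: int = 128, d_lane: int = 32):
        super().__init__()
        n = len(self.LANES)
        self.A = nn.Parameter(torch.randn(n, d_lane, d_lane) * 0.01)
        self.B = nn.Parameter(torch.randn(n, d_lane, d_model) * 0.01)
        self.gates = nn.ModuleList(nn.Linear(d_model + d_lane, d_lane) for _ in range(n))

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        """x_t: (d_model,), h_prev: (5, d_lane) -> h_t: (5, d_lane)."""
        h_t = []
        for i in range(len(self.LANES)):
            g_i = torch.sigmoid(self.gates[i](torch.cat([x_t, h_prev[i]])))  # gate ∈ [0, 1]
            update = self.A[i] @ h_prev[i] + self.B[i] @ x_t                 # A_i·h + B_i·x
            h_t.append(g_i * update + (1 - g_i) * h_prev[i])                 # gated mix
        return torch.stack(h_t)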

Step 4: Transformer Decoder (The Reasoner)

For each layer l = 1..L₃:
  Y = LayerNorm(Y + MHSA(Y, Y, Y))                       (autoregressive self-attention)
  Y = LayerNorm(Y + MHCA(Y, [Enc_out ; Mamba_state]))    (cross-attention: memory + features)
  Y = LayerNorm(Y + FFN(Y))

Output Heads:
  intent  = softmax(W_intent · Pool(Y))     ∈ ℝ^9      (what to DO)
  tokens  = softmax(W_lm · Y)               ∈ ℝ^vocab  (what to SAY)
  bei_out = tanh(W_bei · Pool(Y))           ∈ ℝ^10     (how they FEEL)
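
The three heads map directly onto small linear layers; a sketch follows, where the mean-pooling choice and the character-level vocabulary size are assumptions:

import torch
import torch.nn as nn

class OutputHeads(nn.Module):
    """Intent (what to DO), tokens (what to SAY), BEI (how they FEEL)."""
    def __init__(self, d_model: int = 128, n_intents: int = 9,
                 vocab_size: int = 256, n_bei: int = 10):
        super().__init__()
        self.w_intent = nn.Linear(d_model, n_intents)
        self.w_lm = nn.Linear(d_model, vocab_size)
        self.w_bei = nn.Linear(d_model, n_bei)

    def forward(self, Y: torch.Tensor):
        pooled = Y.mean(dim=1)                                  # Pool(Y)
        intent = torch.softmax(self.w_intent(pooled), dim=-1)   # (B, 9)
        token_logits = self.w_lm(Y)                             # (B, S, vocab)
        bei = torch.tanh(self.w_bei(pooled))                    # (B, 10), each dim in [-1, 1]
        return intent, token_logits, bei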

Multi-Task Training Loss

L = w₁·L_intent      (cross-entropy: 9-class intent classification)
  + w₂·L_response    (cross-entropy: next-token prediction)
  + w₃·L_bei         (MSE: 10-dim BEI supervision)
  + w₄·L_conflict    (contrastive: voice > text when mismatch)
  + w₅·L_memory      (MSE: state consistency across save/load)

5 loss components, jointly optimized. Each teaches a different skill.
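
A sketch of how the five terms combine; the weights and the concrete forms of the conflict and memory terms are placeholders, since only their roles are specified above:

import torch.nn.functional as F

def total_loss(out: dict, target: dict, w=(1.0, 1.0, 1.0, 0.5, 0.5)):
    l_intent = F.cross_entropy(out["intent_logits"], target["intent"])    # what to DO
    l_response = F.cross_entropy(out["token_logits"].flatten(0, 1),       # what to SAY
                                 target["tokens"].flatten())
    l_bei = F.mse_loss(out["bei"], target["bei"])                         # how they FEEL
    l_conflict = out["conflict_loss"]   # contrastive term: trust voice over text on mismatch
    l_memory = out["memory_loss"]       # consistency of state across save/load
    return (w[0] * l_intent + w[1] * l_response + w[2] * l_bei
            + w[3] * l_conflict + w[4] * l_memory)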
Output Space

9 Intent Classes

The model doesn't just classify — it chooses the right response strategy based on emotion, memory, and context.

DEEP_SUPPORT
Provide emotional validation and deep empathy
GENTLE_PRESENCE
Be present without fixing; respect the emotional mask
ACKNOWLEDGE
Validate and reflect back what was expressed
CHALLENGE
Gently push back on negative self-talk with facts
CELEBRATE
Amplify and match positive emotional energy
INFORM
Share relevant information or perspective
REDIRECT
Guide conversation toward healthier framing
BOUNDARY
Set or reinforce healthy boundaries
CHECK_IN
Follow up on previous emotional state (uses memory)
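
For reference, the nine classes as they would be decoded from the intent head's argmax; the index order here is an assumption, not the trained mapping:

from enum import Enum

class Intent(Enum):
    DEEP_SUPPORT = 0
    GENTLE_PRESENCE = 1
    ACKNOWLEDGE = 2
    CHALLENGE = 3
    CELEBRATE = 4
    INFORM = 5
    REDIRECT = 6
    BOUNDARY = 7
    CHECK_IN = 8

# predicted = Intent(intent_probs.argmax().item())   # intent_probs: softmax output, shape (9,)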
v0.1 Specifications

What was actually built

Trained configuration. No aspirational numbers — only what exists and runs.

Parameters
1.72M
Trainable (proof config)
Model Dimension
128
d_model across all layers
Architecture
2-1-2
Enc / SSM / Dec layers
BEI Dimensions
10
Biological emotion axes
Memory Lanes
5
d_state = 32 per lane
Languages
3
Hindi, English, Hinglish
Training Data
2,044
Turns from 12 scenario templates
Training Hardware
T4
Kaggle free GPU, 6 minutes

Scale Roadmap

Version   Params   d_model   Layers (E/M/D)   Data           Purpose
v0.1 ✅   1.7M     128       2/1/2            2K synthetic   Prove architecture
v0.5      ~45M     512       4/2/4            50K real       Demo quality
v1.0      ~150M    768       6/4/6            500K real      Production prototype
v2.0      ~500M    1024      12/8/12          1M+ real       Production-ready
Proprietary Innovations

Five things that don't exist elsewhere

BEI: 10-Dim Biological Emotion

Beyond Russell's 2D circumplex. 10 dimensions grounded in Polyvagal Theory, Appraisal Theory, and Agency Theory. Continuous, not categorical.

5-Lane Cognitive Memory

SSM state partitioned into 5 cognitively structured lanes with independent gating. No prior work structures memory this way.

Cross-Modal Conflict Detection

Per-dimension conflict detection between voice and text BEI signals, with learned attention weights that resolve who to trust.

Persistent-State Continual Learning

Mamba2 hidden state as never-resetting memory. True continual learning via state updates, not weight updates. No catastrophic forgetting.

Intent-from-Sync Framework

Intent = f(BEI, Memory, Context). Three synced signals determine action, not text alone. A new formulation of intent generation in AI.