| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by resters 620 days ago

I'm working on this: Abstract:

This paper presents a novel framework for multi-stream tokenization, which extends traditional NLP tokenization by generating simultaneous, multi-layered token representations that integrate subword embeddings, logical forms, referent tracking, scope management, and world distinctions. Unlike conventional language models that tokenize based solely on surface linguistic features (e.g., subword units) and infer relationships through deep contextual embeddings, our system outputs a rich, structured token stream. These streams include logical expressions (e.g., `∃x (John(x) ∧ Loves(x, Mary))`), referent identifiers (`ref_1`, `ref_2`), and world scopes (`world_1`, `world_2`) in parallel, enabling precise handling of referential continuity, modal logic, temporal reasoning, and ambiguity resolution across multiple passages and genres, including mathematical texts, legal documents, and natural language narratives.

This approach leverages symbolic logic and neural embeddings in a hybrid architecture, enhancing the model’s capacity for reasoning and referential disambiguation in contexts where linguistic and logical complexity intertwine. For instance, tokens for modal logic are generated concurrently with referential tokens, allowing expressions such as "If John had gone to the store, Mary would have stayed home" to be dynamically represented across possible worlds (`world_1`, `world_2`) with embedded logical dependencies (`If(Go(John, Store), Stay(Mary, Home))`).

We explore how each token stream (e.g., subword, referent, logical, scope, world) interacts in real time within a transformer-based architecture, employing distinct embedding spaces for each type. The referent space (`ref_n`) facilitates consistent entity tracking, even across ambiguous or coreferential contexts, while scope spaces (`scope_n`) manage logical boundaries such as conditional or nested clauses. Additionally, ambiguity tokens (`AMBIGUOUS(A,B)`) are introduced to capture multiple possible meanings, ensuring that referents like "bank" (financial institution or riverbank) can be resolved as more context is processed.

By extending the capabilities of existing neuro-symbolic models (e.g., Neural Theorem Provers and Hybrid NLP Systems) and integrating them with modern transformer architectures (Vaswani et al., 2017), this system addresses key limitations in current models, particularly in their handling of complex logical structures and referent disambiguation. This work sets the foundation for a new class of multi-dimensional language models that are capable of performing logical reasoning and context-sensitive disambiguation across diverse textual domains, opening new avenues for NLP applications in fields like law, mathematics, and advanced AI reasoning systems.