Parse boolean expressions semantically. Intended for semantic equivalence checking and SMT solving.


alrie 0e08b7df9f Refactored project, added ir spec and serialization		2026-07-12 13:39:42 +02:00
docs	Refactored project, added ir spec and serialization	2026-07-12 13:39:42 +02:00
examples	Refactored project, added ir spec and serialization	2026-07-12 13:39:42 +02:00
semparse	Refactored project, added ir spec and serialization	2026-07-12 13:39:42 +02:00
tests	Refactored project, added ir spec and serialization	2026-07-12 13:39:42 +02:00
.gitignore	Implemented L1 fully, fixed bugs, implemented testing	2026-05-31 18:26:09 +02:00
LICENSE	added license	2026-05-23 23:30:43 +02:00
pyproject.toml	Refactored project, added ir spec and serialization	2026-07-12 13:39:42 +02:00
README.md	Refactored project, added ir spec and serialization	2026-07-12 13:39:42 +02:00
requirements-dev.txt	Implemented L1 fully, fixed bugs, implemented testing	2026-05-31 18:26:09 +02:00

README.md

semparse

Canonicalization and semantic analysis of predicate expressions (boolean / comparison expressions like abs(x) < 5 and status in ['ERROR','DONE']) evaluated elementwise over long columnar data vectors: x > 5 means x[i] > 5 is evaluated at each index i.

Two jobs:

Dedup: collapse a large collection of expressions to their unique semantic meanings, so a downstream JIT compiles each meaning once. canonicalize gives a hashable canonical form; digest gives a stable cross-process key.
Regions: for each meaning, produce border functions, i.e. computable curves that delimit where the expression is true, for plotting / insight.

Overriding rule: never merge two inequivalent expressions (a false merge is catastrophic — one compiled function would serve two meanings). Failing to merge equivalents is merely wasteful, so the design is biased toward refusing to merge unless it can prove equivalence; unsupported constructs are kept opaque.
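
The bias toward refusing to merge can be illustrated with a toy dedup loop. This is a minimal sketch, not semparse's implementation: toy_key and dedup are hypothetical helpers that only understand name > name and name < name, and keep everything else opaque by keying on the exact syntax tree.

```python
import ast
from collections import defaultdict

def toy_key(expr: str) -> str:
    """Hypothetical toy canonicalizer (NOT semparse's): proves only `a > b` == `b < a`."""
    tree = ast.parse(expr, mode="eval").body
    if (
        isinstance(tree, ast.Compare)
        and len(tree.ops) == 1
        and isinstance(tree.ops[0], (ast.Gt, ast.Lt))
        and isinstance(tree.left, ast.Name)
        and isinstance(tree.comparators[0], ast.Name)
    ):
        lhs, rhs = tree.left.id, tree.comparators[0].id
        if isinstance(tree.ops[0], ast.Lt):
            lhs, rhs = rhs, lhs  # rewrite `a < b` as `b > a`
        return f"gt({lhs},{rhs})"
    # Unsupported construct: key on the exact syntax tree, so it can only
    # merge with a syntactically identical expression.
    return "opaque:" + ast.dump(tree)

def dedup(exprs):
    """Group expressions by key; each group would be compiled once."""
    groups = defaultdict(list)
    for e in exprs:
        groups[toy_key(e)].append(e)
    return dict(groups)

groups = dedup(["a > b", "b < a", "f(a) > b", "b < f(a)"])
for key, members in groups.items():
    print(members)
```

The last two expressions are equivalent, but the toy cannot prove it, so they stay in separate groups. That costs one extra compilation and never produces a false merge.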

Quickstart

import semparse as sp

# canonical form: equivalent expressions compare equal
sp.canonicalize("2*x > 10") == sp.canonicalize("x + x > 10")     # True

# stable, cross-process dedup key (use this, not hash())
sp.digest(sp.canonicalize("abs(x) < 5 and y > z"))              # 64-hex sha256

# JSON wire format for the JIT backend (integer-only, canonical bytes)
sp.dumps(sp.canonicalize("a*a + b*b < r*r"))

# border functions for a region, evaluated on data
import numpy as np
t = np.linspace(0.0, 2.0 * np.pi, 200)                           # sample grid (example values)
reg = sp.extract_region(sp.canonicalize("a > b and a > c"), focus="a")
b_min, b_max = reg.envelope({"b": np.sin(t), "c": np.cos(t)})    # b_min = max(b, c)

Run the tour: python examples/demo.py.
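
To make the last Quickstart line concrete: for a > b and a > c with focus a, the region is everything above the pointwise maximum of the two lower-bound curves. The sketch below reproduces that idea in plain Python without semparse. tightest_lower_bound is a hypothetical helper, and representing the missing upper border as +inf is an assumption, not necessarily what reg.envelope returns.

```python
import math

def tightest_lower_bound(curves):
    """Pointwise max of several lower-bound curves (lists of equal length)."""
    return [max(vals) for vals in zip(*curves)]

t = [i * math.pi / 4 for i in range(8)]   # small sample grid (example values)
b = [math.sin(ti) for ti in t]
c = [math.cos(ti) for ti in t]

b_min = tightest_lower_bound([b, c])      # a must exceed both b and c
b_max = [math.inf] * len(t)               # assumption: no upper bound on a

# Spot-check: any a slightly above b_min satisfies the original predicate.
a = [lo + 0.1 for lo in b_min]
inside = all(ai > bi and ai > ci for ai, bi, ci in zip(a, b, c))
print(inside)
```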

What works

Feature	Status
Booleans, all six comparisons, chaining, constant folding, membership (`in`/`not in`)	done
`abs` / `min` / `max` (by exhaustive case-split)	done
Variable arithmetic (`+ - * / // % **`), opaque `np.*` / `math.*` calls	done
Region / border-function extraction (envelopes, per-t intervals)	done
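
As an illustration of the case-split row above, abs and min can each be removed by splitting on the sign or ordering of their arguments. The sketch below writes two such splits out by hand and checks them against the originals on exact rational samples. It shows the idea only; the function names are hypothetical and the form semparse actually emits may differ.

```python
from fractions import Fraction

def abs_original(x):
    return abs(x) < 5

def abs_split(x):
    # Branch 1: x >= 0, where abs(x) is x.  Branch 2: x < 0, where abs(x) is -x.
    return (x >= 0 and x < 5) or (x < 0 and -x < 5)

def min_original(x, y):
    return min(x, y) < 3

def min_split(x, y):
    # Branch 1: x <= y, where min is x.  Branch 2: x > y, where min is y.
    return (x <= y and x < 3) or (x > y and y < 3)

samples = [Fraction(n, 2) for n in range(-24, 25)]   # -12 .. 12 in steps of 1/2
abs_agrees = all(abs_original(x) == abs_split(x) for x in samples)
min_agrees = all(min_original(x, y) == min_split(x, y)
                 for x in samples for y in samples)
print(abs_agrees, min_agrees)
```

The branches are exhaustive and mutually exclusive, which is why the split preserves meaning at every point rather than only on the samples.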

The engine is exact and fast: a purpose-built rational-polynomial ring (polyring) canonicalizes the algebra in microseconds, and region borders are exact closed-form roots evaluated on the data — no general computer-algebra system is required for this fixed-degree, rational-arithmetic work.
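
A rough picture of what canonicalizing the algebra means: move everything to one side, collect like terms over the rationals, divide by the leading coefficient so the polynomial is monic, and flip the comparison if that coefficient was negative. The sketch below does this for polynomials given as (monomial, coefficient) pairs. normalize and its data layout are hypothetical, chosen for brevity, and are not polyring's actual representation.

```python
from fractions import Fraction

FLIP = {"<": ">", ">": "<", "=": "="}

def normalize(terms, op):
    """Normalize `sum(terms) op 0` to (op, monic polynomial).

    terms: iterable of (monomial, coefficient); a monomial is a tuple of
    variable names such as ("x",) or ("x", "y"), and () is the constant term.
    """
    poly = {}
    for mono, coeff in terms:
        key = tuple(sorted(mono))
        poly[key] = poly.get(key, Fraction(0)) + Fraction(coeff)
    poly = {m: c for m, c in poly.items() if c != 0}
    non_const = [m for m in poly if m]
    if not non_const:
        raise ValueError("constant comparison: fold it instead")
    # Leading monomial: highest degree first, then lexicographic order.
    lead = max(non_const, key=lambda m: (len(m), m))
    scale = poly[lead]
    if scale < 0:
        op = FLIP[op]  # dividing by a negative number reverses the inequality
    monic = tuple(sorted((m, c / scale) for m, c in poly.items()))
    return op, monic

k1 = normalize([(("x",), 2), ((), -10)], ">")              # 2*x > 10
k2 = normalize([(("x",), 1), (("x",), 1), ((), -10)], ">") # x + x > 10
k3 = normalize([(("x",), -2), ((), 10)], "<")              # -2*x < -10
k4 = normalize([(("x",), 1), ((), -4)], ">")               # x > 4
print(k1 == k2 == k3, k1 == k4)
```

The first three inputs reduce to x - 5 > 0 and share one key, while x > 4 keeps a distinct key. Exact Fraction arithmetic avoids floating-point rounding, which could otherwise make equal polynomials compare unequal or unequal ones compare equal.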

Layout

semparse/           the package
  __init__.py       public API (canonicalize, extract_region, digest, dumps, ...)
  models.py         immutable IR nodes
  polyring.py       exact multivariate polynomials over Q
  poly.py           L2 algebraic atoms (monic polynomial + sign trichotomy)
  canonicalize.py   frontend lowering + boolean/DNF layer + abs/min/max case-split
  borders.py        region / border-function extraction
  wire.py           canonical JSON serialization + stable digest
  regions.py        1D interval regions for simple (Level-1) comparisons
docs/
  architecture.md   how it fits together
  ir_spec.md        the IR / wire-format contract for the JIT backend
  design_notes.py   original level-by-level design notes
examples/demo.py    runnable API tour
tests/              unit + property-based fuzz + soundness stress (236 tests)

Downstream: JIT / SIMD backend

The canonical IR is the stable interface to a SIMD-JIT backend (target: Cranelift). The backend consumes the canonical JSON (sp.dumps) — fully specified in docs/ir_spec.md — and uses sp.digest as the compiled-artifact cache key.
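
Why canonical bytes matter for a cache key: Python's built-in hash() is salted per process for strings, whereas a SHA-256 over a deterministic serialization is the same everywhere. The sketch below shows one common way to get such bytes with the standard library. The IR shape used here is made up for illustration, and canonical_bytes / stable_digest are hypothetical stand-ins for sp.dumps / sp.digest; docs/ir_spec.md remains the authoritative contract.

```python
import hashlib
import json

def canonical_bytes(obj) -> bytes:
    """Deterministic JSON: sorted keys, no optional whitespace, ASCII only."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return text.encode("ascii")

def stable_digest(obj) -> str:
    """64-hex SHA-256 of the canonical bytes; stable across processes."""
    return hashlib.sha256(canonical_bytes(obj)).hexdigest()

# Hypothetical IR for `x > 5`; the key order differs between the two dicts.
ir_a = {"op": "gt", "args": [{"var": "x"}, {"int": 5}]}
ir_b = {"args": [{"var": "x"}, {"int": 5}], "op": "gt"}

print(canonical_bytes(ir_a).decode("ascii"))
print(stable_digest(ir_a) == stable_digest(ir_b))
```

Restricting numbers to integers, as the wire format does, sidesteps differences in how JSON implementations print floats.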

Development

python -m pytest        # 236 tests (unit + Hypothesis fuzz)

Dev deps: pytest, hypothesis, numpy (see pyproject.toml [dev]).