Hippo 🦛

pip install hippo-llm | Python 3.9+ | MIT

Run 30B models on a ¥3800 GPU at 78 tok/s. Then chain machines together when you need to go bigger.

30-second setup

hippo-pipeline serve --model qwen3-30b-a3b-q3 --mode standalone
# → OpenAI-compatible API at localhost:8000/v1/chat/completions

import openai
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
r = client.chat.completions.create(
    model="qwen3-30b-a3b-q3",
    messages=[{"role": "user", "content": "Explain pipeline parallelism"}],
    max_tokens=500
)
print(r.choices[0].message.content)
Two-machine setup
# Machine 1
hippo-pipeline serve --model gemma-3-12b --mode pipeline --rank 0

# Machine 2
hippo-pipeline serve --model gemma-3-12b --mode pipeline --rank 1 \
  --coordinator http://192.168.1.10:9000

Split the model across machines. Run what doesn't fit on one GPU.
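
Once both ranks are up, clients talk to the pipeline the same way as in standalone mode. A minimal sketch, assuming the rank-0 machine (192.168.1.10 here) exposes the OpenAI-compatible API on port 8000 as in the standalone example; check your deployment for the actual host and port.

import openai

# Assumption: rank 0 serves the OpenAI-compatible API on port 8000, as in standalone mode.
client = openai.OpenAI(base_url="http://192.168.1.10:8000/v1", api_key="none")
r = client.chat.completions.create(
    model="gemma-3-12b",
    messages=[{"role": "user", "content": "Hello from a two-machine pipeline"}],
    max_tokens=200
)
print(r.choices[0].message.content)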

The loop detection problem

Q3-quantized MoE models have a known issue: the routing network gets imprecise at 3-bit, picks the same experts over and over, and the output turns into repeating garbage. We measured this at 78% loop rate on Qwen3-30B-A3B Q3_K_M.

Nobody was catching this because repeat penalties work on tokens, not semantics. The model says the same thing in different words, and token-level samplers let it through.

Hippo's loop detector runs Jaccard similarity on a sliding window of output lines — catches meaning-level repetition, not just token matches.
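
A minimal sketch of the idea, not Hippo's actual implementation: treat each output line as a bag of words, compare it against a sliding window of recent lines with Jaccard similarity, and flag a loop once enough near-duplicates pile up. The class name, window size, and thresholds below are illustrative.

from collections import deque

def _words(line: str) -> set[str]:
    return set(line.lower().split())

class LoopDetector:
    """Flags meaning-level repetition via Jaccard similarity over recent output lines."""

    def __init__(self, window: int = 8, threshold: float = 0.7, max_hits: int = 3):
        self.recent = deque(maxlen=window)   # sliding window of recent lines, as word sets
        self.threshold = threshold           # similarity above this counts as a repeat
        self.max_hits = max_hits             # repeats in a row before we call it a loop
        self.hits = 0

    def feed(self, line: str) -> bool:
        """Feed one generated line; return True once the stream looks like a loop."""
        words = _words(line)
        if words:
            for prev in self.recent:
                if len(words & prev) / len(words | prev) >= self.threshold:
                    self.hits += 1
                    break
            else:
                self.hits = 0                # a genuinely new line resets the streak
            self.recent.append(words)
        return self.hits >= self.max_hits

Word-level Jaccard already catches paraphrased repetition that token-level repeat penalties miss; catching heavier paraphrase would need stemming or embeddings.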

Result on the same GPU, same model:

                  With detection   Without
Effective speed   78 tok/s         ~7 tok/s usable (rest is junk)
Loop rate         0%               78%

Three actions when a loop is detected: escape (inject a redirect), stop (terminate), warn (log). Zero false positives across 30+ test runs.
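
How those actions might fit together, as a sketch; the function, the stream.inject hook, and the redirect text are illustrative rather than Hippo's real API:

import logging

REDIRECT = "\n\nYou are repeating yourself. Summarize what you have said and stop."

def handle_loop(action: str, stream) -> bool:
    """React to a detected loop; return False if generation should stop."""
    if action == "escape":
        stream.inject(REDIRECT)   # hypothetical hook that pushes a redirect into the context
        return True
    if action == "stop":
        return False              # terminate generation
    logging.warning("loop detected, continuing")  # "warn": log it and carry on
    return True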

Benchmarks

RTX 5060 Ti 16GB, llama.cpp backend:

Model            Quant     VRAM     tok/s
Gemma4-E4B       Q4_K_M    9.6GB    90
Qwen3-30B-A3B    Q3_K_M    14GB     78
Qwen3-8B         Q4_K_M    5.2GB    71
Qwen3-14B        Q4_K_M    9.3GB    41

Cloud equivalent: renting a GPU that can serve a 30B model runs about $2/hr, so a 5060 Ti breaks even at roughly 1,900 hours of use.

What else it does

  • Pipeline parallelism — split any HF model across N machines (Mac + PC mixed)
  • DFlash — speculative decoding for Apple Silicon
  • Auto memory budget — calculates shard splits from available VRAM (see the sketch after this list)
  • OpenAI-compatible API — point existing tools at localhost
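
A rough sketch of the memory-budget idea, assuming free VRAM per machine is already known; Hippo's real splitter will also have to account for KV cache, activations, and quantization overhead:

def split_layers(n_layers: int, free_vram_gb: list[float]) -> list[int]:
    """Assign transformer layers to machines in proportion to free VRAM."""
    total = sum(free_vram_gb)
    shares = [int(n_layers * v / total) for v in free_vram_gb]
    shares[-1] += n_layers - sum(shares)   # hand rounding leftovers to the last rank
    return shares

# Example: 48 layers split across a 16GB desktop GPU and a 24GB Mac
print(split_layers(48, [16.0, 24.0]))      # -> [19, 29]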

When to use what

Situation                      Mode
Model fits on one GPU          standalone
Model doesn't fit              pipeline (2+ machines)
Mac, want raw speed            dflash
You're fine with cloud APIs    this isn't for you

Safety positioning

Hippo's loop detector and output controls are L1 safety measures — they constrain behavior, not intent. This means:

  • ✅ We can detect and stop repetitive/degenerate outputs
  • ✅ We can escape loops and retry with different parameters
  • ❌ We cannot guarantee the model "understands" your safety requirements
  • ❌ We cannot prevent a sufficiently capable model from generating harmful content if prompted

This is an honest limitation. Hippo makes local inference usable (catching the 78% loop rate that makes Q3 MoE models practically useless). It does not make models safe in any philosophical sense.

If you need production-grade content safety, layer a content filter on top of Hippo.
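
One way to do that, as a sketch: wrap calls to Hippo's endpoint and run each completion through whatever moderation you already trust before returning it. The is_allowed check below is a placeholder, not something Hippo ships.

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def is_allowed(text: str) -> bool:
    # Placeholder: call your moderation model or rules engine here.
    return "forbidden" not in text.lower()

def safe_chat(messages, model="qwen3-30b-a3b-q3"):
    r = client.chat.completions.create(model=model, messages=messages, max_tokens=500)
    text = r.choices[0].message.content
    return text if is_allowed(text) else "[response withheld by content filter]"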

License

MIT
