• brucethemoose@lemmy.world

      Completely depends on your laptop hardware, but generally:

      • TabbyAPI (exllamav2/exllamav3)
      • ik_llama.cpp and its OpenAI-compatible server (client sketch after this list)
      • kobold.cpp (or its ROCm fork, or croco.cpp, depending on your hardware)
      • An MLX host with one of the new distillation quantizations
      • text-generation-webui (slow, but supports a lot of samplers and some exotic quantizations)
      • SGLang (extremely fast for parallel calls, if that's what you want)
      • Aphrodite Engine (lots of samplers, and fast at the expense of some VRAM usage)

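      Whichever you pick, nearly all of these (TabbyAPI, ik_llama.cpp's server, kobold.cpp, SGLang, Aphrodite, and text-generation-webui's API mode) expose an OpenAI-compatible endpoint, so one client works against any of them. A minimal sketch, assuming a server already running on localhost; the port, API key, and model name below are placeholders that vary by server:

      ```python
      # Minimal sketch: chat with any OpenAI-compatible local server.
      # Port, api_key, and model are placeholders -- check your server's docs
      # (TabbyAPI generates real API keys; llama.cpp-style servers usually
      # accept anything).
      from openai import OpenAI

      client = OpenAI(
          base_url="http://127.0.0.1:8080/v1",  # placeholder port
          api_key="not-needed-locally",         # placeholder key
      )

      response = client.chat.completions.create(
          model="local-model",  # many local servers just use whatever model is loaded
          messages=[{"role": "user", "content": "One sentence about llamas, please."}],
          max_tokens=64,
      )
      print(response.choices[0].message.content)
      ```

      That also means you can swap backends without touching your client code: just point base_url at whichever server you're running.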
      I use text-generation-webui at the moment only because TabbyAPI is a little broken with exllamav3 (which is utterly awesome for Qwen3); otherwise I'd almost always stick with TabbyAPI.

      Tell me (vaguely) what your system has, and I can be more specific.