Which Fikra API model is right for my production workload?

Fikra API abstracts the complexity of hardware provisioning by routing your requests to highly optimized, open-weights inference models powered by Groq infrastructure. The model registry allows you to explicitly declare which engine should handle your JSON payload based on your latency, reasoning, and context window requirements.


Model Family Overview

All Fikra API models share the same flat price (2M tokens per $1 USD) and are served from the same `/v1/chat/completions` endpoint. To select a model, set the `model` parameter of your request body to the exact string ID.
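As a rough illustration of the flat rate, the sketch below converts a token count into an estimated cost. The helper name `estimateCostUsd` is hypothetical, and the code assumes prompt and completion tokens are billed at the same rate, which the pricing statement implies but does not spell out.

```javascript
// Flat rate stated above: 2,000,000 tokens per 1 USD, for every model.
const TOKENS_PER_USD = 2_000_000;

// Hypothetical helper: estimates spend for one request.
// Assumption: prompt and completion tokens are billed at the same rate.
function estimateCostUsd(promptTokens, completionTokens) {
  const totalTokens = promptTokens + completionTokens;
  return totalTokens / TOKENS_PER_USD;
}

// 1,500 prompt tokens + 500 completion tokens = 2,000 tokens.
console.log(estimateCostUsd(1500, 500)); // 0.001
```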

| Model String ID | Context Window | Relative Latency | Target Use Case |
| --- | --- | --- | --- |
| fikra-fast-8b | 8,192 tokens | Ultra-Low (< 200ms) | Real-time chat, sentiment analysis, simple routing. |
| fikra-pro-20b | 32,768 tokens | Low (< 500ms) | Structured JSON extraction, agentic tasks, RAG pipelines. |
| fikra-pro-120b | 128,000 tokens | Moderate (< 1.5s) | Complex reasoning, long-document analysis, deep coding. |
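One way to apply the context window column in code is to pick the smallest model that can hold both the prompt and the completion you want. The sketch below is a minimal example under that assumption. `MODEL_REGISTRY` and `pickModel` are hypothetical names, and the helper only considers context size, not the reasoning or latency trade-offs listed above.

```javascript
// Context windows copied from the model overview; ordered smallest first.
const MODEL_REGISTRY = [
  { id: "fikra-fast-8b", contextWindow: 8192 },
  { id: "fikra-pro-20b", contextWindow: 32768 },
  { id: "fikra-pro-120b", contextWindow: 128000 },
];

// Hypothetical helper: returns the ID of the smallest model whose context
// window fits the prompt plus the desired completion budget, or null if
// no listed model is large enough.
function pickModel(promptTokens, maxCompletionTokens) {
  const needed = promptTokens + maxCompletionTokens;
  const match = MODEL_REGISTRY.find((m) => m.contextWindow >= needed);
  return match ? match.id : null;
}

console.log(pickModel(7000, 4096));   // fikra-pro-20b
console.log(pickModel(200000, 1024)); // null
```

Treat the result as a lower bound: a task that needs deeper reasoning may still call for a larger model than the context size alone requires.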

Detailed Specs: fikra-fast-8b

The fikra-fast-8b model is our baseline engine, optimized purely for maximum throughput and minimal Time-To-First-Token (TTFT). It is heavily instruction-tuned for following direct, concise system prompts without conversational drift.

Detailed Specs: fikra-pro-20b

The fikra-pro-20b is the recommended default for production workloads. It balances high-speed inference against complex reasoning capability, and it handles nuanced instructions and multi-step logic significantly better than the 8B variant.

Detailed Specs: fikra-pro-120b

The fikra-pro-120b is our flagship heavy-reasoning engine. Its TTFT is higher than that of the 8B and 20B models, but its output quality, logical deduction, and coding capabilities rival proprietary frontier models.

How do I declare the model in my code?

You must explicitly pass the model string ID in the body of every request. If you pass an invalid or deprecated model string, the API returns a 400 Bad Request.

Model Selection (Node.js)
```javascript
const response = await client.chat.completions.create({
  // Explicitly declare the model from the registry
  model: "fikra-pro-120b",
  messages: [
    { role: "system", content: "You are an expert systems architect." },
    { role: "user", content: "Design a microservices schema for..." }
  ],
  // Optional: Cap the output tokens to prevent runaway generation
  max_tokens: 4096
});
```
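Because an unrecognized model string is rejected with a 400 Bad Request, you may want to catch typos before the request leaves your process. The following is a minimal sketch of such a guard. `KNOWN_MODELS` and `assertKnownModel` are hypothetical names, and the hard-coded list only reflects the three IDs documented on this page, so it would need updating if the registry changes.

```javascript
// Model IDs documented on this page. Assumption: this list is maintained
// by hand and may go stale when models are added or deprecated.
const KNOWN_MODELS = new Set([
  "fikra-fast-8b",
  "fikra-pro-20b",
  "fikra-pro-120b",
]);

// Hypothetical guard: returns the ID unchanged if it is known,
// otherwise throws before any network call is made.
function assertKnownModel(model) {
  if (!KNOWN_MODELS.has(model)) {
    const expected = [...KNOWN_MODELS].join(", ");
    throw new RangeError(`Unknown model ID "${model}"; expected one of: ${expected}`);
  }
  return model;
}

console.log(assertKnownModel("fikra-pro-20b")); // fikra-pro-20b
```

You could then write `model: assertKnownModel("fikra-pro-20b")` in the request body, so a misspelled ID fails locally with a descriptive error.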

Understanding Token Counting and Context Limits

The context window represents the total number of tokens allowed in a single request. This is the sum of your input (Prompt Tokens) plus the model's output (Completion Tokens).

If you send a 7,000-token prompt to the fikra-fast-8b model, only 1,192 tokens (8,192 − 7,000) remain for the response. Once the model reaches that hard limit, the output is truncated. Always select a model with a context window significantly larger than your input payload.
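The budget arithmetic above can be sketched as a small helper that caps `max_tokens` at whatever the context window leaves free. The function names are hypothetical. The prompt token count is assumed to come from your own tokenizer or from a previous response, since this page does not describe how to count tokens.

```javascript
// Tokens left for the completion once the prompt is accounted for.
function remainingCompletionTokens(contextWindow, promptTokens) {
  return Math.max(0, contextWindow - promptTokens);
}

// Hypothetical helper: clamps the requested max_tokens to the room
// actually available in the model's context window.
function safeMaxTokens(contextWindow, promptTokens, desiredMaxTokens) {
  const remaining = remainingCompletionTokens(contextWindow, promptTokens);
  return Math.min(desiredMaxTokens, remaining);
}

// fikra-fast-8b (8,192 tokens) with a 7,000-token prompt:
console.log(remainingCompletionTokens(8192, 7000)); // 1192
console.log(safeMaxTokens(8192, 7000, 4096));       // 1192

// fikra-pro-20b (32,768 tokens) with the same prompt:
console.log(safeMaxTokens(32768, 7000, 4096));      // 4096
```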

