Which Fikra API model is right for my production workload?
Fikra API abstracts the complexity of hardware provisioning by routing your requests to highly optimized, open-weights inference models powered by Groq infrastructure. The model registry allows you to explicitly declare which engine should handle your JSON payload based on your latency, reasoning, and context window requirements.
Model Family Overview
All models in the Fikra API ecosystem share the same flat pricing structure (2M tokens per $1 USD) and are accessed via the identical `/v1/chat/completions` endpoint. To select a model, inject the exact String ID into the `model` parameter of your request body.
| Model String ID | Context Window | Relative Latency | Target Use Case |
|---|---|---|---|
| fikra-fast-8b | 8,192 tokens | Ultra-Low (< 200ms) | Real-time chat, sentiment analysis, simple routing. |
| fikra-pro-20b | 32,768 tokens | Low (< 500ms) | Structured JSON extraction, agentic tasks, RAG pipelines. |
| fikra-pro-120b | 128,000 tokens | Moderate (< 1.5s) | Complex reasoning, long-document analysis, deep coding. |
Detailed Specs: fikra-fast-8b
The fikra-fast-8b model is our baseline engine, optimized purely for maximum throughput and minimal Time-To-First-Token (TTFT). It is heavily instruction-tuned for following direct, concise system prompts without conversational drift.
- Best For: High-volume automated tasks (e.g., classifying thousands of user reviews), customer service chatbots requiring instant responses, and basic translation.
- Context Limit Constraint: It can process approximately 12 pages of standard text before hitting its 8,192 token limit. It is not recommended for full-document ingestion.
- JSON Mode: Highly capable of adhering to strict JSON output when prompted, making it ideal for API-to-API middleware logic.
Detailed Specs: fikra-pro-20b
The fikra-pro-20b is the recommended default for production workloads. It strikes the optimal balance between high-speed inference and complex reasoning capabilities. It understands nuanced instructions and multi-step logic significantly better than the 8B variant.
- Best For: Retrieval-Augmented Generation (RAG) pipelines, advanced text summarization, data extraction from semi-structured formats, and serving as the "brain" for autonomous agents.
- Context Expansion: The 32k window comfortably handles medium-length documents, large codebases, and extended conversational history.
Detailed Specs: fikra-pro-120b
The fikra-pro-120b is our flagship heavy-reasoning engine. While it has a slightly higher TTFT compared to the 8B and 20B models, its output quality, logical deduction, and coding capabilities rival proprietary frontier models.
- Best For: Legal contract analysis, complex software architecture design, creative writing, and any task requiring deep analytical thought over massive datasets.
- Massive Context: The 128k context window allows you to drop entire books, extensive API documentations, or massive log files directly into the prompt without chunking or vectorizing.
How do I declare the model in my code?
You must explicitly pass the model string ID in your client configuration. If you pass an invalid or deprecated model string, the API will return a 400 Bad Request.
Understanding Token Counting and Context Limits
The context window represents the total number of tokens allowed in a single request. This is the sum of your input (Prompt Tokens) plus the model's output (Completion Tokens).
If you send a 7,000 token prompt to the fikra-fast-8b model, the model will only have 1,192 tokens left to generate its response before it hits the hard limit and truncates the output. Always select a model with a context window significantly larger than your input payload.