Which Fikra API model should I choose?

Choose fikra-fast-8b for low-latency chat, fikra-pro-20b for balanced production workflows, and fikra-pro-120b for complex reasoning and large context windows.

What is the context window for fikra-pro-120b?

The fikra-pro-120b model supports a massive context window of 128,000 tokens, ideal for document analysis and long-form code generation.

Do all Fikra models cost the same?

Yes. Fikra API uses flat pricing of 2 million tokens per $1 USD, regardless of which model you route your request to.

Which Fikra API model is right for my production workload?

Fikra API abstracts the complexity of hardware provisioning by routing your requests to highly optimized, open-weights inference models powered by Groq infrastructure. The model registry allows you to explicitly declare which engine should handle your JSON payload based on your latency, reasoning, and context window requirements.

Model Family Overview

All models in the Fikra API ecosystem share the same flat pricing structure (2M tokens per $1 USD) and are accessed through the same `/v1/chat/completions` endpoint. To select a model, pass its exact string ID in the `model` parameter of your request body.

| Model String ID | Context Window | Relative Latency | Target Use Case |
| --- | --- | --- | --- |
| fikra-fast-8b | 8,192 tokens | Ultra-low (< 200 ms) | Real-time chat, sentiment analysis, simple routing. |
| fikra-pro-20b | 32,768 tokens | Low (< 500 ms) | Structured JSON extraction, agentic tasks, RAG pipelines. |
| fikra-pro-120b | 128,000 tokens | Moderate (< 1.5 s) | Complex reasoning, long-document analysis, deep coding. |
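
As a rough illustration of how the table can drive model choice, the sketch below picks the smallest model whose context window fits a request and estimates cost from the flat rate. The helper names `pickModel` and `estimateCostUsd` are hypothetical and not part of the Fikra API. The sketch also assumes that prompt and completion tokens are billed at the same rate, which the text does not state explicitly.

```javascript
// Context windows copied from the model table.
const CONTEXT_WINDOWS = {
  "fikra-fast-8b": 8192,
  "fikra-pro-20b": 32768,
  "fikra-pro-120b": 128000,
};

// Hypothetical helper: return the smallest model whose window fits the
// prompt plus the desired output budget, or null if none fits.
function pickModel(promptTokens, maxOutputTokens) {
  const needed = promptTokens + maxOutputTokens;
  const candidates = Object.entries(CONTEXT_WINDOWS)
    .sort(([, a], [, b]) => a - b);
  for (const [id, windowSize] of candidates) {
    if (needed <= windowSize) return id;
  }
  return null;
}

// Flat pricing from the text: 2,000,000 tokens per 1 USD.
// Assumption: input and output tokens count equally toward the total.
function estimateCostUsd(totalTokens) {
  return totalTokens / 2_000_000;
}

console.log(pickModel(1000, 500));    // fikra-fast-8b
console.log(pickModel(20000, 4096));  // fikra-pro-20b
console.log(pickModel(127000, 4096)); // null
console.log(estimateCostUsd(500000)); // 0.25
```

Because pricing is the same for every model, a selector like this trades off only latency against context size, not cost.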

Detailed Specs: fikra-fast-8b

The fikra-fast-8b model is our baseline engine, optimized purely for maximum throughput and minimal Time-To-First-Token (TTFT). It is heavily instruction-tuned for following direct, concise system prompts without conversational drift.

Best For: High-volume automated tasks (e.g., classifying thousands of user reviews), customer service chatbots requiring instant responses, and basic translation.
Context Limit Constraint: It can process approximately 12 pages of standard text before hitting its 8,192-token limit, so it is not recommended for full-document ingestion.
JSON Mode: Highly capable of adhering to strict JSON output when prompted, making it ideal for API-to-API middleware logic.
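
The middleware use case above could look something like the following sketch: a request body that asks for JSON-only output, plus a defensive parser for the reply. The helper names, the prompt wording, and the `sentiment` field are illustrative assumptions; the only parts taken from this page are the `model`, `messages`, and `max_tokens` fields.

```javascript
const ALLOWED_SENTIMENTS = ["positive", "negative", "neutral"];

// Hypothetical helper: build a request body that asks for JSON-only output.
function buildSentimentRequest(review) {
  return {
    model: "fikra-fast-8b",
    messages: [
      {
        role: "system",
        content:
          'Classify the review. Reply with JSON only, shaped as ' +
          '{"sentiment": "positive" | "negative" | "neutral"}.',
      },
      { role: "user", content: review },
    ],
    // A short cap is enough for a one-field JSON object.
    max_tokens: 32,
  };
}

// Parse the model's text reply. Returns null instead of throwing when the
// reply is not valid JSON or does not carry an expected sentiment value.
function parseSentiment(replyText) {
  try {
    const parsed = JSON.parse(replyText);
    return ALLOWED_SENTIMENTS.includes(parsed.sentiment)
      ? parsed.sentiment
      : null;
  } catch {
    return null;
  }
}

console.log(parseSentiment('{"sentiment": "positive"}')); // positive
console.log(parseSentiment("Sure! The review is positive.")); // null
```

Even with a model that follows JSON instructions well, validating the reply before passing it downstream keeps a malformed response from breaking the calling service.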

Detailed Specs: fikra-pro-20b

The fikra-pro-20b is the recommended default for production workloads. It strikes the optimal balance between high-speed inference and complex reasoning capabilities. It understands nuanced instructions and multi-step logic significantly better than the 8B variant.

Best For: Retrieval-Augmented Generation (RAG) pipelines, advanced text summarization, data extraction from semi-structured formats, and serving as the "brain" for autonomous agents.
Context Expansion: The 32,768-token window comfortably handles medium-length documents, small-to-medium codebases, and extended conversational history.
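
For a RAG pipeline, the 32,768-token window has to hold the retrieved context, the user's question, and the answer. The sketch below shows one way to budget for that. It assumes a rough heuristic of about 4 characters per token for English text (the real tokenizer will differ), and `estimateTokens` and `packChunks` are hypothetical helpers, not Fikra API functions.

```javascript
// Rough token estimate. Assumption: about 4 characters per token.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Hypothetical helper: keep retrieved chunks, in ranked order, until the
// next one would exceed the token budget reserved for context.
function packChunks(chunks, budgetTokens) {
  const selected = [];
  let used = 0;
  for (const chunk of chunks) {
    const cost = estimateTokens(chunk);
    if (used + cost > budgetTokens) break;
    selected.push(chunk);
    used += cost;
  }
  return { selected, used };
}

// Reserve room in the window for the question and the generated answer.
const CONTEXT_WINDOW = 32768;
const reservedForQuestion = 500;
const reservedForAnswer = 2048;
const contextBudget = CONTEXT_WINDOW - reservedForQuestion - reservedForAnswer;

console.log(contextBudget); // 30220
```

Stopping at the first chunk that does not fit preserves the retriever's ranking. A variant could skip oversized chunks and keep scanning for smaller ones instead.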

Detailed Specs: fikra-pro-120b

The fikra-pro-120b is our flagship heavy-reasoning engine. While it has a slightly higher TTFT compared to the 8B and 20B models, its output quality, logical deduction, and coding capabilities rival proprietary frontier models.

Best For: Legal contract analysis, complex software architecture design, creative writing, and any task requiring deep analytical thought over massive datasets.
Massive Context: The 128k context window allows you to drop entire books, extensive API documentation, or massive log files directly into the prompt without chunking or vectorizing.

How do I declare the model in my code?

You must explicitly pass the model string ID in the `model` field of every request. If you pass an invalid or deprecated model string, the API returns a 400 Bad Request error.

Model Selection (Node.js)

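// Assumes `client` is an already-configured chat completions client; its setup is not shown in this section.
const response = await client.chat.completions.create({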
  // Explicitly declare the model from the registry
  model: "fikra-pro-120b",
  messages: [
    { role: "system", content: "You are an expert systems architect." },
    { role: "user", content: "Design a microservices schema for..." }
  ],
  // Optional: Cap the output tokens to prevent runaway generation
  max_tokens: 4096
});
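
Because an invalid model string results in a 400 Bad Request, it can help to check the ID locally before sending. The following sketch wraps the call shown above. `assertKnownModel` and `createChat` are hypothetical helpers, and reading the HTTP status from `err.status` is an assumption about the client library, not documented Fikra API behavior.

```javascript
// Model IDs listed in the model table on this page.
const KNOWN_MODELS = new Set([
  "fikra-fast-8b",
  "fikra-pro-20b",
  "fikra-pro-120b",
]);

// Fail fast on typos before any network call is made.
function assertKnownModel(model) {
  if (!KNOWN_MODELS.has(model)) {
    throw new Error(`Unknown model ID: ${model}`);
  }
}

// Hypothetical wrapper around a chat completions client.
async function createChat(client, params) {
  assertKnownModel(params.model);
  try {
    return await client.chat.completions.create(params);
  } catch (err) {
    // Assumption: the client exposes the HTTP status as `err.status`.
    if (err && err.status === 400) {
      throw new Error(
        `Request rejected with 400; check the model ID and payload (model: ${params.model})`
      );
    }
    throw err;
  }
}
```

A hard-coded allow-list goes stale when models are added or deprecated, so the server-side 400 remains the authoritative check. The local guard only catches obvious mistakes earlier.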

Understanding Token Counting and Context Limits

The context window represents the total number of tokens allowed in a single request. This is the sum of your input (Prompt Tokens) plus the model's output (Completion Tokens).

If you send a 7,000-token prompt to the fikra-fast-8b model, the model has only 1,192 tokens left (8,192 minus 7,000) to generate its response before it hits the hard limit and truncates the output. Always select a model with a context window significantly larger than your input payload.
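
The arithmetic above can be written as a small helper. This is a minimal sketch: `remainingCompletionTokens` and `clampMaxTokens` are hypothetical names, and the prompt token count is assumed to be known already (for example, from a tokenizer or an estimate).

```javascript
// Tokens left for the completion once the prompt has been counted.
function remainingCompletionTokens(contextWindow, promptTokens) {
  return Math.max(0, contextWindow - promptTokens);
}

// Hypothetical helper: cap a requested max_tokens at what the window allows.
function clampMaxTokens(contextWindow, promptTokens, requested) {
  return Math.min(
    requested,
    remainingCompletionTokens(contextWindow, promptTokens)
  );
}

console.log(remainingCompletionTokens(8192, 7000)); // 1192
console.log(clampMaxTokens(8192, 7000, 4096));      // 1192
console.log(clampMaxTokens(128000, 7000, 4096));    // 4096
```

If the clamped value is much smaller than the output you need, that is a signal to switch to a model with a larger window rather than accept a truncated response.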
