Gemini 3.5 Flash API Model Guide
TL;DR
- Official capabilities include a 1M-token context window, up to 64K output tokens, function calling, structured output, search tools, and code execution.
- Compared with classic lightweight Flash tiers, it is better suited to complex Q&A, coding, long-document handling, and agent workflows.
- The recommended integration path is the OpenAI-compatible Chat Completions endpoint, so existing SDKs, SSE streaming, and retry middleware can be reused.
Core Capabilities
- High intelligence with Flash latency: It is not just a low-cost fallback tier; it aims to handle deeper understanding, planning, and generation while staying responsive.
- 1M long context: Well suited to long-document summarization, codebase analysis, knowledge synthesis, large prompts, and extended conversations.
- Structured and tool-based output: Function calling, structured JSON, search tools, and code execution make it a stronger fit for agentic workflows.
- Coding and technical generation: Useful for code explanation, unit test generation, refactor proposals, SDK wrappers, and technical drafts.
- OpenAI-compatible migration path: Chat Completions-style payloads reduce switching friction from GPT- or Claude-oriented stacks.
- Streaming interaction: SSE streaming supports chat UIs, terminal assistants, and IDE-style progressive rendering.
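As a rough illustration of the streaming bullet, the sketch below reassembles text from SSE lines. It assumes the stream follows the OpenAI chunk convention (`data: {...}` lines carrying `choices[].delta.content`, terminated by `data: [DONE]`); the guide itself only states that SSE chunks are returned, so verify the exact chunk shape against real traffic. `iter_sse_text` and `sample` are illustrative names.

```python
import json
from typing import Iterable, Iterator


def iter_sse_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines (assumed chunk shape)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines standing in for a real HTTP stream.
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_text(sample)))
```

A chat UI or terminal assistant would render each yielded fragment as it arrives instead of joining them at the end.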
When to Use
- When you need more reasoning and coding strength than a classic lightweight model without giving up too much responsiveness.
- When handling long-context tasks such as document Q&A, codebase analysis, long-session memory, or retrieval-grounded answers.
- When you need structured JSON, function calling, or search/tool integration for agents.
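For the function-calling case, a minimal sketch of a tool definition and of reading tool calls back out of a reply might look like the following. It assumes the endpoint accepts the OpenAI tools convention (`tools[].function` with a JSON Schema under `parameters`, and `message.tool_calls[].function.arguments` returned as a JSON string). The `lookup_order` tool, its fields, and the sample reply are hypothetical.

```python
import json


def make_tool(name, description, properties, required):
    """Build one tool entry in the (assumed) OpenAI-compatible shape."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Hypothetical tool used only for illustration.
lookup_order = make_tool(
    "lookup_order",
    "Fetch an order by id.",
    {"order_id": {"type": "string"}},
    ["order_id"],
)

payload = {
    "model": "gemini-3.5-flash",
    "messages": [{"role": "user", "content": "Where is order A-17?"}],
    "tools": [lookup_order],
}


def extract_tool_calls(message):
    """Return (name, parsed_arguments) pairs from an assistant message."""
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


# Hand-written reply standing in for a real response message.
sample_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_1",
            "type": "function",
            "function": {"name": "lookup_order", "arguments": '{"order_id": "A-17"}'},
        }
    ],
}
print(extract_tool_calls(sample_message))
```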
When Not to Use
- For ultra-simple, ultra-cheap bulk templating tasks where higher intelligence is unnecessary.
- For pure image or video generation workloads; use dedicated media models instead.
Runtime Behavior
- The recommended path is POST /v1/chat/completions using an OpenAI-compatible request shape.
- stream=true returns SSE chunks, while stream=false returns a standard full completion object.
- If your workflow depends on structured output or tools, validate schemas, tool choice, and fallback behavior on a small slice first.
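One way to check structured output on a small slice is to parse each reply and confirm the keys your application needs before trusting it. The sketch below is only a client-side check under that assumption; `parse_structured` and the `summary`/`risk` keys are hypothetical, and a full JSON Schema validator may suit stricter needs.

```python
import json


def parse_structured(content, required_keys):
    """Return (data, None) on success or (None, reason) on failure."""
    try:
        data = json.loads(content)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "top-level value is not an object"
    missing = [key for key in required_keys if key not in data]
    if missing:
        return None, "missing keys: " + ", ".join(missing)
    return data, None


# Hypothetical replies: one usable, one that should trigger the fallback path.
good, good_err = parse_structured('{"summary": "ok", "risk": "low"}', ["summary", "risk"])
bad, bad_err = parse_structured('{"summary": "ok"}', ["summary", "risk"])
print(good, good_err)
print(bad, bad_err)
```

When the second element is not `None`, the caller can re-prompt, fall back to plain text, or log the reply for review.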
Minimal Request
{
  "model": "gemini-3.5-flash",
  "messages": [
    {
      "role": "system",
      "content": "You are a senior full-stack engineer. Summarize the approach first, then provide the smallest runnable code."
    },
    {
      "role": "user",
      "content": "Write a Node.js function that reads a CSV, removes duplicate emails, and returns summary stats."
    }
  ],
  "temperature": 0.2,
  "max_tokens": 500,
  "stream": false
}
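A request like the one above could be assembled with the Python standard library roughly as follows. This is a sketch, not an official client: `BASE_URL` is a placeholder host (the guide does not state one), `build_chat_request` is an illustrative helper, and the request is only constructed here, not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder; substitute your provider's base URL


def build_chat_request(api_key, payload, base_url=BASE_URL):
    """Build a POST /v1/chat/completions request with bearer authentication."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = {
    "model": "gemini-3.5-flash",
    "messages": [{"role": "user", "content": "Summarize this changelog in three bullets."}],
    "temperature": 0.2,
    "max_tokens": 500,
    "stream": False,
}
request = build_chat_request("YOUR_API_KEY", payload)
print(request.get_method(), request.full_url)

# To actually send it (requires a real host and key):
# with urllib.request.urlopen(request, timeout=60) as resp:
#     result = json.load(resp)
```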
Minimal Response
{
  "id": "chatcmpl_xxxxxxxx",
  "object": "chat.completion",
  "created": 1747699200,
  "model": "gemini-3.5-flash",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 120,
    "completion_tokens": 260,
    "total_tokens": 380
  }
}
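Reading the fields of a completion object shaped like the one above can be sketched as below. `summarize_completion` is an illustrative helper; treating `finish_reason == "length"` as "output was cut off by max_tokens" follows the OpenAI convention and is an assumption for this endpoint.

```python
def summarize_completion(response):
    """Pull the reply text, a truncation flag, and token usage from a completion."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "text": choice["message"]["content"],
        # Assumed convention: "length" means the reply hit the max_tokens cap.
        "truncated": choice.get("finish_reason") == "length",
        "total_tokens": usage.get("total_tokens"),
    }


# Same shape as the sample response, trimmed to the fields used here.
sample_response = {
    "id": "chatcmpl_xxxxxxxx",
    "object": "chat.completion",
    "model": "gemini-3.5-flash",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "..."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 120, "completion_tokens": 260, "total_tokens": 380},
}
print(summarize_completion(sample_response))
```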
Key Parameters
| Parameter | Type | Required | Default | Range | Description |
|---|---|---|---|---|---|
| model | string | Yes | gemini-3.5-flash | - | Model name. Use gemini-3.5-flash. |
| messages | object[] | Yes | - | - | Conversation messages in chronological order, commonly using system, user, and assistant roles. |
| max_tokens | integer | No | - | >=1, practical cap should be set by your app | Maximum output tokens. The model can reach 64K output officially, but real applications should enforce scenario-specific limits. |
| stream | boolean | No | false | - | Whether to enable SSE streaming output. |
| temperature | number | No | 1 | 0-2 | Sampling temperature. Lower values are steadier; higher values are more diverse. |
| top_p | number | No | 1 | 0-1 | Nucleus sampling threshold; avoid aggressively tuning it together with temperature. |
| tools | object[] | No | - | - | Tool or function definitions for tool-calling and agent orchestration. |
| Authorization | HTTP Header | Yes | - | - | Bearer authentication: Authorization: Bearer <YOUR_API_KEY>. |
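The ranges in the table can be checked client-side before a request is sent, which avoids a round trip that would end in a 400. The sketch below encodes only the constraints listed in the table; `validate_params` is an illustrative helper, not part of any SDK.

```python
def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_params(payload):
    """Return a list of problems found in a chat payload (empty if none)."""
    errors = []
    if not isinstance(payload.get("model"), str) or not payload["model"]:
        errors.append("model must be a non-empty string")
    messages = payload.get("messages")
    if not isinstance(messages, list) or not messages:
        errors.append("messages must be a non-empty list")
    if "max_tokens" in payload:
        value = payload["max_tokens"]
        if not isinstance(value, int) or isinstance(value, bool) or value < 1:
            errors.append("max_tokens must be an integer >= 1")
    if "temperature" in payload:
        value = payload["temperature"]
        if not _is_number(value) or not 0 <= value <= 2:
            errors.append("temperature must be a number in 0-2")
    if "top_p" in payload:
        value = payload["top_p"]
        if not _is_number(value) or not 0 <= value <= 1:
            errors.append("top_p must be a number in 0-1")
    if "stream" in payload and not isinstance(payload["stream"], bool):
        errors.append("stream must be a boolean")
    return errors


ok_payload = {
    "model": "gemini-3.5-flash",
    "messages": [{"role": "user", "content": "hi"}],
    "temperature": 0.2,
    "max_tokens": 500,
    "stream": False,
}
print(validate_params(ok_payload))
print(validate_params({"model": "gemini-3.5-flash", "messages": [], "temperature": 3}))
```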
Common Errors
| HTTP | Code | Trigger | Fix Action | Retry Policy |
|---|---|---|---|---|
| 400 | invalid_request_error | Missing required fields, malformed messages, or invalid parameter types in the payload. | Validate model, messages, max_tokens, and any tools/schema JSON before retrying. | Retry only after fixing the payload; avoid blind retries. |
| 401 | authentication_error | Missing Authorization header, malformed bearer token, or invalid API key. | Verify Authorization: Bearer <YOUR_API_KEY> format and key validity. | Retry after authentication is fixed. |
| 429 | rate_limit_error | Request rate, concurrency, or quota usage has hit upstream rate limiting. | Apply exponential backoff and inspect concurrency, context size, and current quota consumption. | Use 1s/2s/4s backoff with jitter; if it persists, reduce concurrency or shrink the workload (smaller context, lower max_tokens). |
| 500 | internal_error | Transient upstream instability, tool execution failure, or internal processing issues. | Capture request id and a compact context summary, then retry; escalate if failures persist. | Retry 2-3 times with short delays. |
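The retry policies in the table can be sketched as a small wrapper: retry 429 and 500 with 1s/2s/4s backoff plus jitter, and return 400 and 401 immediately because they need a fix, not a retry. `send` is a hypothetical callable that performs one request and returns `(status, body)`; the 25% jitter factor is an illustrative choice, not a documented value.

```python
import random
import time

RETRYABLE = {429, 500}  # 400 and 401 need a payload or credential fix instead


def backoff_delays(attempts=3, base=1.0, jitter=0.25, rng=random.random):
    """Delays of roughly 1s, 2s, 4s, each stretched by up to `jitter`."""
    return [base * (2 ** i) * (1 + jitter * rng()) for i in range(attempts)]


def call_with_retry(send, attempts=3, sleep=time.sleep):
    """Call `send()` once, then retry up to `attempts` times on retryable statuses."""
    status, body = send()
    for delay in backoff_delays(attempts):
        if status not in RETRYABLE:
            break
        sleep(delay)
        status, body = send()
    return status, body


# Demo with a scripted sender and a no-op sleep so nothing actually waits.
scripted = iter([(429, "rate limited"), (500, "upstream error"), (200, "ok")])
print(call_with_retry(lambda: next(scripted), sleep=lambda seconds: None))
```

Injecting `sleep` keeps the wrapper testable; production code would leave the default `time.sleep` in place and log the request id on each failed attempt.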
FAQ
- What is Gemini 3.5 Flash best for?
It is best for text and coding assistant tasks that need strong intelligence without giving up throughput and response speed.
- How is it different from a classic Flash model?
The main difference is a higher intelligence ceiling, making it more suitable for complex Q&A, long context, and tool-enabled workflows.
- What is the fastest way to integrate it?
Use the OpenAI-compatible Chat Completions path first so you can reuse existing messages, streaming, and retry middleware.
- When should max_tokens be constrained carefully?
Constrain max_tokens carefully for long-context, coding, and structured-output tasks to avoid bloated responses, cost drift, or timeouts.
Related APIs