Gemini 3.5 Flash API Model Guide

TL;DR

Official capabilities include a 1M-token context window, up to 64K output tokens, plus function calling, structured output, search tools, and code execution.
Compared with classic lightweight Flash tiers, it is better suited to complex Q&A, coding, long-document handling, and agent workflows.
The recommended integration path remains OpenAI-compatible chat so existing SDKs, SSE streaming, and retry middleware can be reused.

Core Capabilities

High intelligence with Flash latency: It is not just a low-cost fallback tier; it aims to handle deeper understanding, planning, and generation while staying responsive.
1M long context: Well suited to long-document summarization, codebase analysis, knowledge synthesis, large prompts, and extended conversations.
Structured and tool-based output: Function calling, structured JSON, search tools, and code execution make it a stronger fit for agentic workflows.
Coding and technical generation: Useful for code explanation, unit test generation, refactor proposals, SDK wrappers, and technical drafts.
OpenAI-compatible migration path: Chat Completions-style payloads reduce switching friction from GPT- or Claude-oriented stacks.
Streaming interaction: SSE streaming supports chat UIs, terminal assistants, and IDE-style progressive rendering.

When to Use

When you need more reasoning and coding strength than a classic lightweight model without giving up too much responsiveness.
When handling long-context tasks such as document Q&A, codebase analysis, long-session memory, or retrieval-grounded answers.
When you need structured JSON, function calling, or search/tool integration for agents.
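
For the agent case in the last item above, a tool definition and a local dispatcher might look roughly like the sketch below. It assumes the endpoint accepts the OpenAI-style `tools` shape (`type: "function"` with a JSON Schema under `parameters`) and returns tool calls with `function.name` and a JSON-string `function.arguments`; this guide does not confirm either detail. The `get_weather` tool and its stub are hypothetical and exist only for illustration.

```python
import json

# Hypothetical tool; the name and schema are illustrative only.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def get_weather(city: str) -> dict:
    # Stub implementation so the sketch runs offline.
    return {"city": city, "forecast": "unknown"}


LOCAL_TOOLS = {"get_weather": get_weather}


def dispatch_tool_call(tool_call: dict) -> dict:
    """Run one tool call, assuming OpenAI-style function.name / function.arguments."""
    name = tool_call["function"]["name"]
    if name not in LOCAL_TOOLS:
        raise KeyError(f"model requested unknown tool: {name}")
    # Arguments arrive as a JSON string in the OpenAI convention; parse before use.
    args = json.loads(tool_call["function"]["arguments"])
    return LOCAL_TOOLS[name](**args)
```

Rejecting unknown tool names up front keeps a malformed model response from silently calling the wrong function.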

When Not to Use

For ultra-simple, ultra-cheap bulk templating tasks where higher intelligence is unnecessary.
For pure image or video generation workloads; use dedicated media models instead.

Runtime Behavior

The recommended path is POST /v1/chat/completions using an OpenAI-compatible request shape.
stream=true returns SSE chunks, while stream=false returns a standard full completion object.
If your workflow depends on structured output or tools, validate schemas, tool choice, and fallback behavior on a small slice first.
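
When stream=true, the client has to reassemble text from the SSE chunks. The sketch below shows one way to do that, assuming the chunks follow the OpenAI convention: each event is a `data: {json}` line, text arrives under `choices[].delta.content`, and the stream ends with `data: [DONE]`. That chunk shape is an assumption here, so check it against real responses before relying on it.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate delta content from SSE lines (assumed OpenAI-style chunks)."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                parts.append(piece)
    return "".join(parts)
```

In a chat UI you would render each piece as it arrives instead of joining at the end; the parsing logic stays the same.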

Minimal Request

{
  "model": "gemini-3.5-flash",
  "messages": [
    {
      "role": "system",
      "content": "You are a senior full-stack engineer. Summarize the approach first, then provide the smallest runnable code."
    },
    {
      "role": "user",
      "content": "Write a Node.js function that reads a CSV, removes duplicate emails, and returns summary stats."
    }
  ],
  "temperature": 0.2,
  "max_tokens": 500,
  "stream": false
}
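
A minimal sketch of sending a payload like the one above from Python, using only the standard library. `BASE_URL` is a placeholder, since this guide does not state the host; the `build_chat_request` and `send_chat_request` helper names are likewise illustrative.

```python
import json
import urllib.request

# Assumption: replace with your provider's base URL; the guide does not state one.
BASE_URL = "https://api.example.com"


def build_chat_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/chat/completions request."""
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat_request(req: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON body.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Splitting construction from sending makes the request easy to inspect or log before any network call happens.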

Minimal Response

{
  "id": "chatcmpl_xxxxxxxx",
  "object": "chat.completion",
  "created": 1747699200,
  "model": "gemini-3.5-flash",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 120,
    "completion_tokens": 260,
    "total_tokens": 380
  }
}
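
Reading a full (non-streaming) response like the one above usually comes down to a few fields. The sketch below pulls them out; the `summarize_completion` name is illustrative, and treating `finish_reason == "length"` as a truncation signal follows the OpenAI convention rather than anything this guide states.

```python
def summarize_completion(resp: dict) -> dict:
    """Pull the assistant text, finish reason, and token usage from a full completion."""
    choice = resp["choices"][0]
    usage = resp.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        # Assumed OpenAI-style signal that output was cut off by max_tokens.
        "truncated": choice["finish_reason"] == "length",
        "total_tokens": usage.get("total_tokens"),
    }
```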

Key Parameters

Parameter	Type	Required	Default	Range	Description
model	string	Yes	gemini-3.5-flash	-	Model name. Use `gemini-3.5-flash`.
messages	object[]	Yes	-	-	Conversation messages in chronological order, commonly using system, user, and assistant roles.
max_tokens	integer	No	-	>=1; set a practical cap in your app	Maximum output tokens. The official limit is up to 64K output tokens, but real applications should enforce scenario-specific limits.
stream	boolean	No	false	-	Whether to enable SSE streaming output.
temperature	number	No	1	0-2	Sampling temperature. Lower values are steadier; higher values are more diverse.
top_p	number	No	1	0-1	Nucleus sampling threshold; avoid aggressively tuning it together with temperature.
tools	object[]	No	-	-	Tool or function definitions for tool-calling and agent orchestration.
Authorization	HTTP Header	Yes	-	-	Bearer authentication: Authorization: Bearer <YOUR_API_KEY>.
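
The ranges in the table above can be checked client-side before a request is sent, which avoids a round trip that would end in a 400. The following is a minimal sketch; `validate_payload` and the default cap of 4096 are illustrative choices, not values from this guide.

```python
from typing import List


def validate_payload(payload: dict, max_tokens_cap: int = 4096) -> List[str]:
    """Return a list of problems; an empty list means the payload passes these checks."""
    problems = []

    model = payload.get("model")
    if not isinstance(model, str) or not model:
        problems.append("model must be a non-empty string")

    messages = payload.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("messages must be a non-empty list")
    elif not all(isinstance(m, dict) and "role" in m for m in messages):
        problems.append("each message must be an object with a role")

    max_tokens = payload.get("max_tokens")
    if max_tokens is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(max_tokens, bool) or not isinstance(max_tokens, int) or max_tokens < 1:
            problems.append("max_tokens must be an integer >= 1")
        elif max_tokens > max_tokens_cap:
            problems.append(f"max_tokens exceeds the application cap of {max_tokens_cap}")

    # Ranges taken from the parameter table: temperature 0-2, top_p 0-1.
    for name, low, high in (("temperature", 0, 2), ("top_p", 0, 1)):
        value = payload.get(name)
        if value is None:
            continue
        if isinstance(value, bool) or not isinstance(value, (int, float)) or not low <= value <= high:
            problems.append(f"{name} must be a number in [{low}, {high}]")

    stream = payload.get("stream")
    if stream is not None and not isinstance(stream, bool):
        problems.append("stream must be a boolean")
    return problems
```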

Common Errors

HTTP	Code	Trigger	Fix Action	Retry Policy
400	invalid_request_error	Missing required fields, malformed messages, or invalid parameter types in the payload.	Validate model, messages, max_tokens, and any tools/schema JSON before retrying.	Retry only after fixing the payload; avoid blind retries.
401	authentication_error	Missing Authorization header, malformed bearer token, or invalid API key.	Verify Authorization: Bearer <YOUR_API_KEY> format and key validity.	Retry after authentication is fixed.
429	rate_limit_error	Request rate, concurrency, or quota usage has hit upstream rate limiting.	Apply exponential backoff and inspect concurrency, context size, and current quota consumption.	Use 1s/2s/4s backoff with jitter; if it persists, reduce concurrency or shrink the workload (for example, send less context per request).
500	internal_error	Transient upstream instability, tool execution failure, or internal processing issues.	Capture request id and a compact context summary, then retry; escalate if failures persist.	Retry 2-3 times with short delays.
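
The retry policies in the table can be folded into one small wrapper. The sketch below retries only 429 and 500, returns 400 and 401 immediately, and uses the 1s/2s/4s schedule with jitter for both retryable codes as a simplification. The `call` argument is a hypothetical function that performs one request and returns `(status, body)`; the 25% jitter factor is an illustrative choice.

```python
import random
import time
from typing import Callable, Tuple

RETRYABLE = {429, 500}


def backoff_delay(attempt: int, base: float = 1.0, jitter: float = 0.25) -> float:
    """Delay before retry number `attempt` (0-based): 1s, 2s, 4s, plus up to 25% jitter."""
    delay = base * (2 ** attempt)
    return delay + random.uniform(0, jitter * delay)


def call_with_retry(
    call: Callable[[], Tuple[int, dict]],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, dict]:
    """Invoke `call` until it returns a non-retryable status or retries run out."""
    attempt = 0
    while True:
        status, body = call()
        if status not in RETRYABLE or attempt >= max_retries:
            return status, body
        sleep(backoff_delay(attempt))
        attempt += 1
```

Passing `sleep` as a parameter lets tests substitute a no-op so the backoff schedule can be verified without waiting.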

FAQ

What is Gemini 3.5 Flash best for?
It is best for text and coding assistant tasks that need strong intelligence without giving up throughput and response speed.
How is it different from a classic Flash model?
The main difference is a higher intelligence ceiling, making it more suitable for complex Q&A, long context, and tool-enabled workflows.
What is the fastest way to integrate it?
Use the OpenAI-compatible Chat Completions path first so you can reuse existing messages, streaming, and retry middleware.
When should max_tokens be constrained carefully?
Constrain max_tokens carefully for long-context, coding, and structured-output tasks to avoid bloated responses, cost drift, or timeouts.
