Gemini 3.5 Flash Full Guide (Markdown)

Gemini 3.5 Flash API Model Guide

TL;DR

  • Official capabilities include a 1M context window, up to 64K output, plus function calling, structured output, search tools, and code execution.
  • Compared with classic lightweight Flash tiers, it is better suited to complex Q&A, coding, long-document handling, and agent workflows.
  • The recommended integration path is the OpenAI-compatible Chat Completions endpoint, so existing SDKs, SSE streaming, and retry middleware can be reused.

Core Capabilities

  • High intelligence with Flash latency: It is not just a low-cost fallback tier; it aims to handle deeper understanding, planning, and generation while staying responsive.
  • 1M long context: Well suited to long-document summarization, codebase analysis, knowledge synthesis, large prompts, and extended conversations.
  • Structured and tool-based output: Function calling, structured JSON, search tools, and code execution make it a stronger fit for agentic workflows.
  • Coding and technical generation: Useful for code explanation, unit test generation, refactor proposals, SDK wrappers, and technical drafts.
  • OpenAI-compatible migration path: Chat Completions-style payloads reduce switching friction from GPT- or Claude-oriented stacks.
  • Streaming interaction: SSE streaming supports chat UIs, terminal assistants, and IDE-style progressive rendering.
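
The streaming bullet above can be illustrated with a small parser. This is a minimal sketch, assuming the stream follows the OpenAI-style convention of `data: {json}` lines, with text under `choices[].delta.content` and a `data: [DONE]` terminator. The guide does not spell out the chunk shape, so confirm it against real responses. `iter_sse_content` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from SSE lines (chunk shape is an assumption)."""
    for raw in lines:
        line = raw.strip()
        # Blank lines separate SSE events; lines starting with ":" are comments.
        if not line or line.startswith(":"):
            continue
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # "delta" may be missing or null on some chunks.
            delta = (choice.get("delta") or {}).get("content")
            if delta:
                yield delta
```

A chat UI or terminal assistant would typically append each yielded piece to the display as it arrives rather than waiting for the full completion.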

When to Use

  • When you need more reasoning and coding strength than a classic lightweight model without giving up too much responsiveness.
  • When handling long-context tasks such as document Q&A, codebase analysis, long-session memory, or retrieval-grounded answers.
  • When you need structured JSON, function calling, or search/tool integration for agents.

When Not to Use

  • For ultra-simple, ultra-cheap bulk templating tasks where higher intelligence is unnecessary.
  • For pure image or video generation workloads; use dedicated media models instead.

Runtime Behavior

  • The recommended path is POST /v1/chat/completions using an OpenAI-compatible request shape.
  • stream=true returns SSE chunks, while stream=false returns a standard full completion object.
  • If your workflow depends on structured output or tools, validate schemas, tool choice, and fallback behavior on a small slice first.
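
For the structured-output case, one way to validate on a small slice is to parse each reply and check it before anything downstream consumes it. The sketch below is illustrative only: `parse_structured` and the required-key check are hypothetical, and a real application may prefer a full JSON Schema validator.

```python
import json
from typing import Any, Dict, Iterable, Optional


def parse_structured(content: str, required_keys: Iterable[str]) -> Optional[Dict[str, Any]]:
    """Return the parsed object, or None so the caller can trigger its fallback."""
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        return None  # model returned prose or malformed JSON
    if not isinstance(data, dict):
        return None  # valid JSON, but not an object
    if any(key not in data for key in required_keys):
        return None  # object is missing a field the app depends on
    return data
```

When the function returns `None`, the caller can retry with a stricter prompt, route to a fallback handler, or log the sample for review.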

Minimal Request

{
  "model": "gemini-3.5-flash",
  "messages": [
    {
      "role": "system",
      "content": "You are a senior full-stack engineer. Summarize the approach first, then provide the smallest runnable code."
    },
    {
      "role": "user",
      "content": "Write a Node.js function that reads a CSV, removes duplicate emails, and returns summary stats."
    }
  ],
  "temperature": 0.2,
  "max_tokens": 500,
  "stream": false
}
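
As a rough illustration, the request above could be sent with Python's standard library as follows. `BASE_URL` is a placeholder, not a real endpoint, and `build_chat_request` and `send_chat_request` are hypothetical helper names. Only the path, the bearer header, and the payload fields come from this guide.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "https://api.example.com"  # placeholder; substitute your provider's host

MINIMAL_PAYLOAD: Dict[str, Any] = {
    "model": "gemini-3.5-flash",
    "messages": [
        {
            "role": "system",
            "content": "You are a senior full-stack engineer. Summarize the "
                       "approach first, then provide the smallest runnable code.",
        },
        {
            "role": "user",
            "content": "Write a Node.js function that reads a CSV, removes "
                       "duplicate emails, and returns summary stats.",
        },
    ],
    "temperature": 0.2,
    "max_tokens": 500,
    "stream": False,
}


def build_chat_request(api_key: str, payload: Dict[str, Any],
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/chat/completions."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat_request(request: urllib.request.Request,
                      timeout: float = 60.0) -> Dict[str, Any]:
    """Send the request and decode the JSON body (non-streaming only)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping construction separate from sending makes the payload and headers easy to inspect or unit-test without network access.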

Minimal Response

{
  "id": "chatcmpl_xxxxxxxx",
  "object": "chat.completion",
  "created": 1747699200,
  "model": "gemini-3.5-flash",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 120,
    "completion_tokens": 260,
    "total_tokens": 380
  }
}
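
A small helper can pull out the fields most applications need from a response of this shape. This is a sketch under assumptions: `summarize_completion` is a hypothetical name, and treating `finish_reason == "length"` as a truncation signal follows the OpenAI convention rather than anything stated in this guide.

```python
from typing import Any, Dict


def summarize_completion(completion: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the reply text, finish reason, and token usage from a completion."""
    choice = completion["choices"][0]
    usage = completion.get("usage", {})
    finish_reason = choice.get("finish_reason")
    return {
        "text": choice["message"]["content"],
        "finish_reason": finish_reason,
        # Assumed convention: "length" means the max_tokens cap cut the output.
        "truncated": finish_reason == "length",
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "total_tokens": usage.get("total_tokens"),
    }
```

Logging the usage fields per request is a simple way to watch for cost drift when prompts or output caps change.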

Key Parameters

| Parameter | Type | Required | Default | Range | Description |
| --- | --- | --- | --- | --- | --- |
| model | string | Yes | gemini-3.5-flash | - | Model name. Use gemini-3.5-flash. |
| messages | object[] | Yes | - | - | Conversation messages in chronological order, commonly using system, user, and assistant roles. |
| max_tokens | integer | No | - | >=1, practical cap should be set by your app | Maximum output tokens. The model can reach 64K output officially, but real applications should enforce scenario-specific limits. |
| stream | boolean | No | false | - | Whether to enable SSE streaming output. |
| temperature | number | No | 1 | 0-2 | Sampling temperature. Lower values are steadier; higher values are more diverse. |
| top_p | number | No | 1 | 0-1 | Nucleus sampling threshold; avoid aggressively tuning it together with temperature. |
| tools | object[] | No | - | - | Tool or function definitions for tool-calling and agent orchestration. |
| Authorization | HTTP Header | Yes | - | - | Bearer authentication: Authorization: Bearer <YOUR_API_KEY>. |
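
The types and ranges listed above can be checked client-side before a request is sent, which helps avoid 400 responses. The following is a minimal sketch: `validate_payload` is a hypothetical helper, and requiring a non-empty `messages` list is an assumption that goes beyond the "Required" column.

```python
from typing import Any, Dict, List


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_payload(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means these checks passed."""
    errors: List[str] = []

    model = payload.get("model")
    if not isinstance(model, str) or not model:
        errors.append("model: required, non-empty string")

    messages = payload.get("messages")
    if not isinstance(messages, list) or not messages:
        errors.append("messages: required, non-empty list")

    if "max_tokens" in payload:
        value = payload["max_tokens"]
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            errors.append("max_tokens: integer >= 1")

    if "stream" in payload and not isinstance(payload["stream"], bool):
        errors.append("stream: boolean")

    # Upper bounds taken from the Range column: temperature 0-2, top_p 0-1.
    for name, upper in (("temperature", 2), ("top_p", 1)):
        if name in payload:
            value = payload[name]
            if not _is_number(value) or not 0 <= value <= upper:
                errors.append(f"{name}: number in [0, {upper}]")

    if "tools" in payload and not isinstance(payload["tools"], list):
        errors.append("tools: list of objects")

    return errors
```

Returning every problem at once, rather than raising on the first, makes it easier to fix a payload in one pass.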

Common Errors

| HTTP | Code | Trigger | Fix Action | Retry Policy |
| --- | --- | --- | --- | --- |
| 400 | invalid_request_error | Missing required fields, malformed messages, or invalid parameter types in the payload. | Validate model, messages, max_tokens, and any tools/schema JSON before retrying. | Retry only after fixing the payload; avoid blind retries. |
| 401 | authentication_error | Missing Authorization header, malformed bearer token, or invalid API key. | Verify Authorization: Bearer <YOUR_API_KEY> format and key validity. | Retry after authentication is fixed. |
| 429 | rate_limit_error | Request rate, concurrency, or quota usage has hit upstream rate limiting. | Apply exponential backoff and inspect concurrency, context size, and current quota consumption. | Use 1s/2s/4s backoff with jitter; reduce concurrency or downgrade workload shape if it persists. |
| 500 | internal_error | Transient upstream instability, tool execution failure, or internal processing issues. | Capture request id and a compact context summary, then retry; escalate if failures persist. | Retry 2-3 times with short delays. |
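
The retry policies above can be sketched as a small wrapper. It is illustrative only: `call_with_retry` and `backoff_delay` are hypothetical names, `send` is assumed to be a caller-supplied function returning an `(http_status, body)` tuple, and for simplicity the same 1s/2s/4s schedule is applied to both 429 and 500. The 400 and 401 cases are returned immediately because they need a fix, not a retry.

```python
import random
import time
from typing import Any, Callable, Dict, Tuple

RETRYABLE_STATUSES = {429, 500}  # 400/401 need a payload or auth fix instead

Response = Tuple[int, Dict[str, Any]]


def backoff_delay(attempt: int, base: float = 1.0, jitter: float = 0.25,
                  rng: Callable[[], float] = random.random) -> float:
    """Exponential backoff (1s, 2s, 4s, ...) stretched by up to `jitter` fraction."""
    return base * (2 ** attempt) * (1.0 + jitter * rng())


def call_with_retry(send: Callable[[], Response], max_retries: int = 3,
                    sleep: Callable[[float], None] = time.sleep,
                    rng: Callable[[], float] = random.random) -> Response:
    """Call `send`, retrying only on retryable statuses, up to `max_retries` times."""
    status, body = send()
    for attempt in range(max_retries):
        if status not in RETRYABLE_STATUSES:
            break
        sleep(backoff_delay(attempt, rng=rng))
        status, body = send()
    return status, body
```

Passing `sleep` and `rng` as parameters keeps the wrapper testable without real delays. If 429 responses persist after the final attempt, the caller still receives the 429 and can reduce concurrency, as the table suggests.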

FAQ

  1. What is Gemini 3.5 Flash best for?
    It is best for text and coding assistant tasks that need strong intelligence without giving up throughput and response speed.
  2. How is it different from a classic Flash model?
    The main difference is a higher intelligence ceiling, making it more suitable for complex Q&A, long context, and tool-enabled workflows.
  3. What is the fastest way to integrate it?
    Use the OpenAI-compatible Chat Completions path first so you can reuse existing messages, streaming, and retry middleware.
  4. When should max_tokens be constrained carefully?
    Constrain max_tokens carefully for long-context, coding, and structured-output tasks to avoid bloated responses, cost drift, or timeouts.
