fix(llm): disable Gemini thinking to prevent MAX_TOKENS on structured output
Gemini 2.5 Flash (gemini-flash-latest) enables thinking by default. Thinking tokens count toward max_output_tokens, leaving ~150 tokens for the actual JSON output and causing MAX_TOKENS truncation. Disable thinking centrally in call_gemini via ThinkingConfig(thinking_budget=0).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
parent 092fd87606
commit ada5f2a93b

1 changed file with 7 additions and 0 deletions
@@ -46,6 +46,13 @@ def call_gemini(
     with suppress(Exception):
         user_input = str(contents)
 
+    # Disable thinking by default — Gemini 2.5 Flash thinking tokens count toward
+    # max_output_tokens, leaving too little room for actual JSON output.
+    if config.thinking_config is None:
+        config = config.model_copy(
+            update={"thinking_config": genai_types.ThinkingConfig(thinking_budget=0)}
+        )
+
     start = time.monotonic()
     success, error_detail, response, finish_reason = True, None, None, None
     try:
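The pattern the commit applies is "fill in a safe default only when the caller left the field unset", so an explicitly configured thinking budget is never overridden. A minimal self-contained sketch of that guard, using plain dataclasses as simplified stand-ins for the google-genai ThinkingConfig and GenerateContentConfig types (the real SDK types are pydantic models and use model_copy instead of dataclasses.replace):

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ThinkingConfig:
    # Stand-in for genai_types.ThinkingConfig; a budget of 0 disables thinking.
    thinking_budget: int


@dataclass(frozen=True)
class GenerateContentConfig:
    # Stand-in for genai_types.GenerateContentConfig.
    max_output_tokens: int
    thinking_config: Optional[ThinkingConfig] = None


def ensure_thinking_disabled(config: GenerateContentConfig) -> GenerateContentConfig:
    # Mirror the diff's `if config.thinking_config is None` guard: only set a
    # zero thinking budget when the caller has not chosen one, so the whole
    # max_output_tokens budget is available for the JSON response.
    if config.thinking_config is None:
        config = replace(config, thinking_config=ThinkingConfig(thinking_budget=0))
    return config
```

Applying the default centrally in the call helper (rather than at every call site) means new callers get MAX_TOKENS-safe behavior for free, while a caller that deliberately passes its own ThinkingConfig keeps it untouched.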