Annotations abstraction for responses that are not just a stream of plain text #716
Comments
I had thought that attachments would be the way to handle this, but they only work for audio/image outputs - the thing where Claude and DeepSeek can return annotated spans of text feels different.
Here's an extract from that Claude citations example:

{
"id": "msg_01P3zs4aYz2Baebumm4Fejoi",
"content": [
{
"text": "Based on the document, here are the key trends in AI/LLMs from 2024:\n\n1. Breaking the GPT-4 Barrier:\n",
"type": "text"
},
{
"citations": [
{
"cited_text": "I’m relieved that this has changed completely in the past twelve months. 18 organizations now have models on the Chatbot Arena Leaderboard that rank higher than the original GPT-4 from March 2023 (GPT-4-0314 on the board)—70 models in total.\n\n",
"document_index": 0,
"document_title": "My Document",
"end_char_index": 531,
"start_char_index": 288,
"type": "char_location"
}
],
"text": "The GPT-4 barrier was completely broken, with 18 organizations now having models that rank higher than the original GPT-4 from March 2023, with 70 models in total surpassing it.",
"type": "text"
},
{
"text": "\n\n2. Increased Context Lengths:\n",
"type": "text"
},
{
"citations": [
{
"cited_text": "Gemini 1.5 Pro also illustrated one of the key themes of 2024: increased context lengths. Last year most models accepted 4,096 or 8,192 tokens, with the notable exception of Claude 2.1 which accepted 200,000. Today every serious provider has a 100,000+ token model, and Google’s Gemini series accepts up to 2 million.\n\n",
"document_index": 0,
"document_title": "My Document",
"end_char_index": 1680,
"start_char_index": 1361,
"type": "char_location"
}
],
"text": "A major theme was increased context lengths. While last year most models accepted 4,096 or 8,192 tokens (with Claude 2.1 accepting 200,000), today every serious provider has a 100,000+ token model, and Google's Gemini series accepts up to 2 million.",
"type": "text"
},

And from the DeepSeek reasoner streamed response (pretty-printed here). First a reasoning content chunk:

{
"id": "2cf23b27-2ba6-41dd-b484-358c486a1405",
"object": "chat.completion.chunk",
"created": 1737480272,
"model": "deepseek-reasoner",
"system_fingerprint": "fp_1c5d8833bc",
"choices": [
{
"index": 0,
"delta": {
"content": null,
"reasoning_content": "Okay"
},
"logprobs": null,
"finish_reason": null
}
]
}

Text content chunk:

{
"id": "2cf23b27-2ba6-41dd-b484-358c486a1405",
"object": "chat.completion.chunk",
"created": 1737480272,
"model": "deepseek-reasoner",
"system_fingerprint": "fp_1c5d8833bc",
"choices": [
{
"index": 0,
"delta": {
"content": " waves",
"reasoning_content": null
},
"logprobs": null,
"finish_reason": null
}
]
}

Meanwhile OpenAI audio responses look like this (truncated). I'm not sure if these can mix in text output as well, but in this case the audio does at least include a "transcript" key:

{
"id": "chatcmpl-At42uKzhIMJfzGOwypiS9mMH3oaFG",
"object": "chat.completion",
"created": 1737686956,
"model": "gpt-4o-audio-preview-2024-12-17",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": null,
"refusal": null,
"audio": {
"id": "audio_6792ffad12f48190abab9d6b7d1a1bf7",
"data": "UklGRkZLAABXQVZFZ...",
"expires_at": 1737690557,
"transcript": "Hi"
}
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 22,
"completion_tokens": 13,
"total_tokens": 35,
"prompt_tokens_details": {
"cached_tokens": 0,
"audio_tokens": 0,
"text_tokens": 22,
"image_tokens": 0
},
"completion_tokens_details": {
"reasoning_tokens": 0,
"audio_tokens": 8,
"accepted_prediction_tokens": 0,
"rejected_prediction_tokens": 0,
"text_tokens": 5
}
},
"service_tier": "default",
"system_fingerprint": "fp_58887f9c5a"
}

I think a combination of a pydantic object with some sort of templating language would work. E.g. for the Claude example you could have this object:

from pydantic import BaseModel, Field
from typing import List, Optional, Literal

class TextRange(BaseModel):
    start: int
    end: int

class Citation(BaseModel):
    sourceDocument: str = Field(alias="document_title")
    documentIndex: int = Field(alias="document_index")
    textRange: TextRange = Field(...)
    citedText: str = Field(alias="cited_text")
    type: Literal["char_location"]

class ContentBlock(BaseModel):
    blockType: Literal["text", "heading"] = Field(alias="type")
    content: str = Field(alias="text")
    hasCitation: bool = Field(default=False)
    citation: Optional[Citation] = None
    headingLevel: Optional[int] = None

class Message(BaseModel):
    messageId: str = Field(alias="id")
    contentBlocks: List[ContentBlock] = Field(alias="content")

and then you define a message template:
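The template itself isn't included above, so here's a rough sketch of the idea (assuming Jinja2; the template and the render_message helper are my illustration, not part of the proposal):

```python
from jinja2 import Template

# Hypothetical template: render each content block, followed by its
# citation (if any) as an indented source line.
MESSAGE_TEMPLATE = Template(
    "{% for block in message.contentBlocks %}"
    "{{ block.content }}"
    "{% if block.citation %}\n"
    "    [source: {{ block.citation.sourceDocument }}, "
    "chars {{ block.citation.textRange.start }}-{{ block.citation.textRange.end }}]\n"
    "{% endif %}"
    "{% endfor %}"
)

def render_message(message) -> str:
    # message is an instance of the Message model defined above
    return MESSAGE_TEMPLATE.render(message=message)
```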
You could then create similar objects and templates for different model types. These could also be exposed to users to customize how data is shown for any model. Also - pydantic now supports partial validation, so to the extent any of the JSON responses are streamed, this model should still work.
Thinking about this a bit more, if you want to go down this road or something similar, it would be great to have it as a separate package. This would let it be a plugin to this library, but also usable in others. I could definitely use something like this in some of my projects where I use LiteLLM, which lets me switch models easily, so it would be great to have output templates that I could define like this. Not sure how hard this would be, but I could probably contribute.
Cohere is a bit outside the top tier models but probably worth considering their citation format as well when designing this: https://docs.cohere.com/docs/documents-and-citations |
That Cohere example is really interesting. It looks like they decided to have citations as a separate top-level key and then reference which bits of text the citations correspond to using start/end indexes:

# response.message.content
[AssistantMessageResponseContentItem_Text(text='The tallest penguins are the Emperor penguins. They only live in Antarctica.', type='text')]
# response.message.citations
[Citation(start=29,
end=46,
text='Emperor penguins.',
sources=[Source_Document(id='doc:0:0',
document={'id': 'doc:0:0',
'snippet': 'Emperor penguins are the tallest.',
'title': 'Tall penguins'},
type='document')]),
Citation(start=65,
end=76,
text='Antarctica.',
sources=[Source_Document(id='doc:0:1',
document={'id': 'doc:0:1',
'snippet': 'Emperor penguins only live in Antarctica.',
'title': 'Penguin habitats'},
type='document')])]

Note how that first citation is in a separate data structure and flags 29-46 - the text "Emperor penguins." - as the attachment point. This might actually be a way to solve the general problem: I could take the Claude citations format and turn that into a separate stored piece of information, referring back to the original text using those indexes. That way I could still store a string of text in the database / output that in the API, but additional annotations against that stream of text could be stored elsewhere. For the DeepSeek reasoner case this would mean having a start-end indexed chunk of text that is labelled as coming from the reasoning_content field. I don't think this approach works for returning audio though - there's no text segment to attach that audio to, though I guess I could say "index 55:55 is where the audio chunk came in".
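To make that idea concrete, here's a rough sketch (my own illustration, not library code) of flattening Claude-style content blocks into a single text string plus a separate list of index-based annotations:

```python
def flatten_claude_blocks(blocks):
    """Turn Claude-style content blocks into (text, annotations).

    Each annotation records start/end character indexes into the combined
    text, plus the original citation dict as its payload.
    """
    text = ""
    annotations = []
    for block in blocks:
        start = len(text)
        text += block["text"]
        end = len(text)
        for citation in block.get("citations") or []:
            annotations.append({"start": start, "end": end, "data": citation})
    return text, annotations

# e.g. text, annotations = flatten_claude_blocks(claude_response["content"])
```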
I'm going to call this annotations for the moment - where an annotation is additional metadata attached to a portion of the text returned by an LLM. The three things to consider are:
I think I'll treat audio/image responses separately from annotations - I'll use an expanded version of the existing attachments mechanism for that, including the existing code at lines 181 to 194 in 656d8fa.
I'll probably add a |
After brainstorming with Claude I think a solution to the terminal representation challenge could be to add markers around the annotated spans of text and then display those annotations below. One neat option here is corner brackets - 「 and 」- for example:
So the spans of text that have annotations are wrapped in 「 and 」 and the annotations themselves are then displayed below. Here's what that looks like in a macOS terminal window (screenshot omitted).
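As a minimal sketch of that rendering (assuming annotations are dicts with start/end character indexes and non-overlapping spans, per the design above):

```python
def render_with_markers(text, annotations):
    """Wrap annotated spans in 「 」 and list the annotations below."""
    out = text
    # Insert markers from the end of the string backwards so earlier
    # insertions don't shift the character indexes of later spans.
    for ann in sorted(annotations, key=lambda a: a["start"], reverse=True):
        out = out[: ann["end"]] + "」" + out[ann["end"]:]
        out = out[: ann["start"]] + "「" + out[ann["start"]:]
    lines = [out, ""]
    for n, ann in enumerate(annotations, 1):
        lines.append(f"{n}. {text[ann['start']:ann['end']]}: {ann['data']}")
    return "\n".join(lines)

print(render_with_markers(
    "The tallest penguins are the Emperor penguins. They only live in Antarctica.",
    [{"start": 29, "end": 46, "data": {"title": "Tall penguins"}}],
))
```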
For DeepSeek reasoner that might look like this:
In this case I'd have to do some extra post-processing to combine all of those short token snippets into a single annotation, de-duping the |
For the Python layer this might look like so:

response = llm.prompt("prompt goes here")
print(response.text()) # outputs the plain text
print(response.annotations)
# Outputs annotations, see below
for annotated in response.text_with_annotations():
    print(annotated.text, annotated.annotations)

The annotations themselves might look something like this:

[
Annotation(start=0, end=5, data={"this": "is a dictionary of stuff"}),
Annotation(start=55, end=58, data={"this": "is more stuff"}),
]
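Here's a sketch of how that text_with_annotations() method could work, slicing the text at annotation boundaries (the names are illustrative, not the final API):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Annotation:
    # Mirrors the Annotation(start=..., end=..., data=...) shape shown above
    start: int
    end: int
    data: Dict[str, Any]

@dataclass
class AnnotatedSegment:
    text: str
    annotations: List[Annotation] = field(default_factory=list)

def text_with_annotations(text: str, annotations: List[Annotation]) -> List[AnnotatedSegment]:
    """Split text at annotation boundaries, attaching annotations to their spans."""
    # Collect every boundary index, including the start and end of the text
    boundaries = sorted({0, len(text), *[a.start for a in annotations], *[a.end for a in annotations]})
    segments = []
    for start, end in zip(boundaries, boundaries[1:]):
        covering = [a for a in annotations if a.start <= start and end <= a.end]
        segments.append(AnnotatedSegment(text[start:end], covering))
    return segments
```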
Then the SQL table design is pretty simple:

CREATE TABLE [response_annotations] (
[id] INTEGER PRIMARY KEY,
[response_id] TEXT REFERENCES [responses]([id]),
[start_index] INTEGER,
[end_index] INTEGER,
[annotation] TEXT -- JSON
);
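Writing rows into that table is then straightforward; a sketch using the stdlib sqlite3 module and the Annotation objects from above (the helper name is mine):

```python
import json
import sqlite3

def log_annotations(db_path, response_id, annotations):
    """Write one row per annotation, storing the arbitrary payload as JSON."""
    conn = sqlite3.connect(db_path)
    with conn:
        conn.executemany(
            "INSERT INTO response_annotations (response_id, start_index, end_index, annotation) "
            "VALUES (?, ?, ?, ?)",
            [(response_id, a.start, a.end, json.dumps(a.data)) for a in annotations],
        )
    conn.close()
```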
It bothers me very slightly that this design allows for exact positioning of annotations in a text stream response (with a start and end index) but doesn't support that for recording the position at which an image or audio clip was returned. I think the fix for that is to have an optional single |
asking the obvious question, why not use the academic paper style of using |
The problem with using That said, maybe this could work (I also added text wrapping):
This would also mean I could omit the quoted truncated extract entirely:
Or with numbers at the start:
This came up again today thanks to Claude 3.7 Sonnet exposing thinking tokens. Based on that (and how it works in the Claude streaming API - see example at https://gist.github.com/simonw/c5e369753e8dbc9b045c514bb4fee987) I'm now thinking the annotations mechanism should cover thinking blocks as well. Maybe tool usage can benefit from this too?
Lots of these in the new OpenAI Responses API https://platform.openai.com/docs/api-reference/responses/create
The OpenAI web search stuff needs this too. Example from https://platform.openai.com/docs/guides/tools-web-search?api-mode=chat&lang=curl#output-and-citations:

[
{
"index": 0,
"message": {
"role": "assistant",
"content": "the model response is here...",
"refusal": null,
"annotations": [
{
"type": "url_citation",
"url_citation": {
"end_index": 985,
"start_index": 764,
"title": "Page title...",
"url": "https://..."
}
}
]
},
"finish_reason": "stop"
}
]
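Those url_citation entries map onto the start/end annotation idea from earlier fairly directly; a sketch of the conversion (my illustration, treating the message as a plain dict):

```python
def openai_url_citations_to_annotations(message):
    """Convert chat-completions url_citation annotations into the generic
    start/end annotation shape discussed above."""
    annotations = []
    for item in message.get("annotations", []):
        if item.get("type") != "url_citation":
            continue
        cite = item["url_citation"]
        annotations.append({
            "start": cite["start_index"],
            "end": cite["end_index"],
            "data": {"title": cite["title"], "url": cite["url"]},
        })
    return annotations
```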
OpenAI example (including streaming) here: |
Here's a challenge: in streaming mode OpenAI only returns the annotations at the very end - but I'll already have printed the text out to the screen by the time that arrives, so I won't be able to use the fancy inline 「 」 display. But some APIs like DeepSeek or Claude Thinking CAN return inline annotations. So the design needs to handle both cases. This will be particularly tricky at the Python API layer. If you call a method that's documented as streaming a sequence of chunks with optional annotations attached, what should that method do for the OpenAI case where actually the annotations were only visible at the end?
Let's look at what Anthropic does for streaming citations. Without streaming:

curl https://api.anthropic.com/v1/messages \
-H "content-type: application/json" \
-H "x-api-key: $(llm keys get anthropic)" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "claude-3-7-sonnet-20250219",
"max_tokens": 1024,
"messages": [
{
"role": "user",
"content": [
{
"type": "document",
"source": {
"type": "text",
"media_type": "text/plain",
"data": "The grass is green. The sky is blue."
},
"title": "My Document",
"context": "This is a trustworthy document.",
"citations": {"enabled": true}
},
{
"type": "text",
"text": "What color is the grass and sky?"
}
]
}
]
}' | jq

Returns:

{
"id": "msg_016NSoAFZagmYi29wfZ72wN2",
"type": "message",
"role": "assistant",
"model": "claude-3-7-sonnet-20250219",
"content": [
{
"type": "text",
"text": "Based on the document you've provided:\n\n"
},
{
"type": "text",
"text": "The grass is green.",
"citations": [
{
"type": "char_location",
"cited_text": "The grass is green. ",
"document_index": 0,
"document_title": "My Document",
"start_char_index": 0,
"end_char_index": 20
}
]
},
{
"type": "text",
"text": " "
},
{
"type": "text",
"text": "The sky is blue.",
"citations": [
{
"type": "char_location",
"cited_text": "The sky is blue.",
"document_index": 0,
"document_title": "My Document",
"start_char_index": 20,
"end_char_index": 36
}
]
}
],
"stop_reason": "end_turn",
"stop_sequence": null,
"usage": {
"input_tokens": 610,
"cache_creation_input_tokens": 0,
"cache_read_input_tokens": 0,
"output_tokens": 54
}
}

But with streaming:

curl https://api.anthropic.com/v1/messages \
-H "content-type: application/json" \
-H "x-api-key: $(llm keys get anthropic)" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "claude-3-7-sonnet-20250219",
"max_tokens": 1024,
"stream": true,
"messages": [
{
"role": "user",
"content": [
{
"type": "document",
"source": {
"type": "text",
"media_type": "text/plain",
"data": "The grass is green. The sky is blue."
},
"title": "My Document",
"context": "This is a trustworthy document.",
"citations": {"enabled": true}
},
{
"type": "text",
"text": "What color is the grass and sky?"
}
]
}
]
}'

It returns this:
Claude DID interleave citations among regular text, with blocks that look like this:
I pushed my prototype so far - the one dodgy part of it is that I got Claude to rewrite the |
Current TODO list:
If we did start optionally yielding llm.Chunk() objects, then in terms of display we could teach the llm CLI to render annotations inline as they arrive. What about the case with OpenAI where the annotations only become available at the end of the stream? For that we cannot show them inline. Instead, we could fall back on that earlier idea from #716 (comment) to show them like this:
That would work pretty well for this edge-case I think. I guess the Python API then becomes something like this:

seen_annotations = False
for chunk in response.chunks():
    print(chunk, end="")
    if chunk.annotation:
        seen_annotations = True
        print(chunk.annotation)
if not seen_annotations and response.annotations:
    # Must have been some annotations at the end that we missed
    print(response.annotations)

Or encapsulate that logic into a helper.
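For example, that logic could live in a hypothetical helper generator (not the final API) that yields chunks as they stream, followed by any annotations that only arrived at the end:

```python
def chunks_then_annotations(response):
    """Yield each chunk as it streams; afterwards, yield any annotations
    that only became available once the stream had finished."""
    seen_annotations = False
    for chunk in response.chunks():
        yield chunk
        if chunk.annotation:
            seen_annotations = True
    if not seen_annotations:
        # The OpenAI case: annotations only arrived with the final event
        yield from response.annotations
```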
I think I like I'll leave it as |
This feature may be the point at which I need a --json option. Something like this:

llm -m gpt-4o-mini-search-preview 'what happened on march 1 2025' --json

Outputs:
Or for things where annotations come at the end maybe it ends with:
This would effectively be the debug tool version of |
The various "thinking" blocks I want to support don't actually include start and end indexes in their APIs, so I'll need a utility mechanism to keep track of those automatically for logging to the database. Noteworthy that the Claude citations streaming API does include start and end indices:
But the thinking blocks from Claude do not:

curl https://api.anthropic.com/v1/messages \
--header "x-api-key: $(llm keys get anthropic)" \
--header "anthropic-version: 2023-06-01" \
--header "content-type: application/json" \
--data \
'{
"model": "claude-3-7-sonnet-20250219",
"stream": true,
"max_tokens": 2048,
"thinking": {"type": "enabled", "budget_tokens": 1024},
"messages": [
{"role": "user", "content": "Think about poetry"}
]
}' Outputs (truncated):
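Since those thinking deltas don't carry character positions, the index-tracking utility mentioned above could be a small accumulator along these lines (a sketch; the names are mine):

```python
class AnnotationTracker:
    """Accumulates streamed text and records start/end indexes for spans
    (such as thinking blocks) whose APIs don't report positions."""

    def __init__(self):
        self.text = ""
        self.annotations = []
        self._span_start = None
        self._span_label = None

    def append(self, fragment):
        # Call this for every streamed text fragment
        self.text += fragment

    def start_span(self, label):
        self._span_start = len(self.text)
        self._span_label = label

    def end_span(self, data=None):
        self.annotations.append({
            "start": self._span_start,
            "end": len(self.text),
            "label": self._span_label,
            "data": data or {},
        })
        self._span_start = None
        self._span_label = None
```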
It turns out the new OpenAI responses API does stream annotations within the main stream of returned events. Here's an example: https://gist.github.com/simonw/47b043f0851c54eae85e0bd961d2e198#file-recent_john_gruber-py-L587-L612

I ran it and the stream included events like these:

{'content_index': 0,
'delta': ' ',
'item_id': 'msg_67db4ae7a2908192b010032f387583890c691f65bff4763f',
'output_index': 1,
'type': 'response.output_text.delta'}
{'content_index': 0,
'delta': '([macrumors.com](https://www.macrumors.com/2025/03/12/gruber-says-something-is-rotten-at-apple/?utm_source=openai))',
'item_id': 'msg_67db4ae7a2908192b010032f387583890c691f65bff4763f',
'output_index': 1,
'type': 'response.output_text.delta'}
{'annotation': {'end_index': 596,
'start_index': 481,
'title': "John Gruber Says 'Something is Rotten' at Apple - "
'MacRumors',
'type': 'url_citation',
'url': 'https://www.macrumors.com/2025/03/12/gruber-says-something-is-rotten-at-apple/?utm_source=openai'},
'annotation_index': 0,
'content_index': 0,
'item_id': 'msg_67db4ae7a2908192b010032f387583890c691f65bff4763f',
'output_index': 1,
'type': 'response.output_text.annotation.added'}
{'content_index': 0,
'delta': '\n',
'item_id': 'msg_67db4ae7a2908192b010032f387583890c691f65bff4763f',
'output_index': 1,
'type': 'response.output_text.delta'}

The annotations are also shown in full at the end of the streaming response.
https://platform.openai.com/docs/api-reference/responses-streaming/response/output_text/annotation shows the event that is output in stream mode for an annotation:
That annotation is a different shape from the web search one.
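Putting those two event types together, consuming the stream could look roughly like this - the event field names come from the examples above, the rest is a sketch treating events as plain dicts:

```python
def consume_output_text_events(events):
    """events: an iterable of dicts shaped like the streamed examples above."""
    text = ""
    annotations = []
    for event in events:
        if event["type"] == "response.output_text.delta":
            text += event["delta"]
        elif event["type"] == "response.output_text.annotation.added":
            # start_index / end_index point back into the accumulated text
            annotations.append(event["annotation"])
    return text, annotations
```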
I ran through their tutorial on this page and then did:

import json
response = client.responses.create(
model="gpt-4o-mini",
input="What is deep research by OpenAI?",
tools=[{
"type": "file_search",
"vector_store_ids": [vector_store.id]
}],
stream=True,
)
for event in response:
    print(json.dumps(event.dict(), indent=2))

Here's the list of JSON events I got back: https://gist.github.com/simonw/7d93036f2a0a9b8b2bf20c452abe9f06

Again it included annotations returned during the stream: https://gist.github.com/simonw/7d93036f2a0a9b8b2bf20c452abe9f06#file-events-txt-L1943-L1983
It looks to me like that file annotation isn't attached to a range within the response, it's attached to a single index where it was output - I can't quite figure out what the |
Side note: here's the text that OpenAI store in the vector store for that PDF: https://cdn.openai.com/API/docs/deep_research_blog.pdf

The text portion is 10,007 tokens. I got that with:

data = client.vector_stores.files.content(
file_id='file-QDeY5qs4SjfyYarQ2onMK6',
vector_store_id=vector_store.id
).model_dump()

This:

client.vector_stores.files.retrieve(file_id='file-QDeY5qs4SjfyYarQ2onMK6', vector_store_id=vector_store.id)

Gave me this:

{
"id": "file-QDeY5qs4SjfyYarQ2onMK6",
"created_at": 1742426135,
"last_error": null,
"object": "vector_store.file",
"status": "completed",
"usage_bytes": 66539,
"vector_store_id": "vs_67db4ffa373c81918dead92b7a593921",
"attributes": {},
"chunking_strategy": {
"static": {
"chunk_overlap_tokens": 400,
"max_chunk_size_tokens": 800
},
"type": "static"
}
}
Ooh, adding include=["file_search_call.results"] to the call gets the search results back too:

import json
response = client.responses.create(
model="gpt-4o-mini",
input="What is deep research by OpenAI?",
tools=[{
"type": "file_search",
"vector_store_ids": [vector_store.id]
}],
stream=True,
include = ["file_search_call.results"],
)
for event in response:
    print(json.dumps(event.dict(), indent=2))

That added a huge JSON event part way through when it ran the search. That event started like this:

{
"item": {
"id": "fs_67db545e5e488192873a64ea54abcb480b079e72bb72a98e",
"queries": [
"deep research by OpenAI",
"What is deep research?",
"OpenAI deep research concept"
],
"status": "completed",
"type": "file_search_call",
"results": [
{
"attributes": {},
"file_id": "file-QDeY5qs4SjfyYarQ2onMK6",
"filename": "deep_research_blog.pdf",
"score": 0.981739387423489,
"text": "Introducing deep research | OpenAI..." Does this count as an annotation? Not entirely clear - the fact that it was output at the start of the response isn't really that interesting, and it's included a second time in that final |
Refs #716 - describes a yield llm.Chunk() mechanism that does not yet exist.
I wrote the plugin author documentation for the new feature, including a description of how |
Idea: if annotations have a clear ID we can include that and then use it to deduplicate in the case where an annotation is both streamed and then repeated at the end.
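A sketch of that deduplication, assuming each annotation dict carries a stable "id" key:

```python
def merge_annotations(streamed, final):
    """Combine annotations seen during streaming with those repeated at the
    end, keeping a single copy of anything with the same id."""
    seen_ids = {a["id"] for a in streamed if "id" in a}
    merged = list(streamed)
    for annotation in final:
        # Annotations without an id can't be deduplicated, so keep them
        if annotation.get("id") is None or annotation["id"] not in seen_ids:
            merged.append(annotation)
    return merged
```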
I'm going to implement Claude citations as part of this, to help test the new mechanism:
LLM currently assumes that all responses from a model come in the form of a stream of text.
This assumption no longer holds!
And that's just variants of text - multi-modal models need consideration as well. OpenAI have a model that can return snippets of audio already, and models that return images (from OpenAI and Gemini) are becoming available very soon too.