PoC: `InferenceClient` is also a `MCPClient` #2986
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
mcp_client.py (Outdated)

```python
self.exit_stack = AsyncExitStack()
self.available_tools: List[ChatCompletionInputTool] = []

async def add_mcp_server(self, command: str, args: List[str]):
```
Why not name this method `connect_to_server`? We can't add multiple MCP servers to a single instance of the client, can we?
yes we can
I mean we would need to store a map of sessions, but there's nothing preventing us from doing it, conceptually.
Let me know if the following question is out of scope.

Question: how would I connect to multiple MCP servers? Would it look like option 1 or option 2?

Option 1:

```python
client1 = MCPClient()
client1.add_mcp_server()
client2 = MCPClient()
client2.add_mcp_server()
```

or

Option 2:

```python
client = MCPClient()
client.add_mcp_server(server1)
client.add_mcp_server(server2)
```
Another design question: `class MCPClient(AsyncInferenceClient)` vs `class AsyncInferenceClient(...args, mcp_clients: MCPClient[])`
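To make the two shapes concrete, here is a rough, purely hypothetical sketch of what each design could look like (class and parameter names are illustrative, not taken from this PR):

```python
# Purely illustrative sketch of the two designs being discussed
# (class and parameter names are hypothetical, not this PR's API).
from typing import Any, List, Sequence

from huggingface_hub import AsyncInferenceClient


# Design A: inheritance -- the MCP client *is* an AsyncInferenceClient.
class MCPClient(AsyncInferenceClient):
    async def add_mcp_server(self, command: str, args: List[str]) -> None:
        """Connect to an MCP server and register its tools on this client."""
        ...


# Design B: composition -- the inference client *receives* already-built MCP clients.
class AsyncInferenceClientWithMCP(AsyncInferenceClient):
    def __init__(self, *args: Any, mcp_clients: Sequence[MCPClient] = (), **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)
        self.mcp_clients = list(mcp_clients)  # their tools would be exposed during chat completion
```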
Sorry to chime in unannounced, but from a very removed external-user standpoint, I find this all very confusing - I just don't think what you coded should be called `MCPClient` 😄

When I came to this PR I was fully expecting `MCPClient` to be passed as a parameter to `InferenceClient`, though I hear @Wauplin above, so why not a wrapper. But the end result is really more of an `InferenceClientWithEmbeddedMCP` to me, not an `MCPClient`.

That being said, it's just about semantics, but I'm kind of a semantics extremist, sorry about that (and feel free to completely disregard this message, as is very likely XD)
> I was fully expecting `MCPClient` to be passed as a parameter to `InferenceClient`

What do you mean as a parameter? Do you have an example signature?
second option of #2986 (comment)
ah yes, sure, we can probably add this I guess
ah actually with the async/await stuff i'm not so sure.
A few comments off the top of my head; SSE support would not be so hard to add and could really be a nice addition.
mcp_client.py (Outdated)

```python
self.exit_stack = AsyncExitStack()
self.available_tools: List[ChatCompletionInputTool] = []

async def add_mcp_server(self, command: str, args: List[str], env: Dict[str, str]):
```
You would need to lighten the requirements on your args a bit if you want to make it work with SSE, or is the intent just to support stdio? I see the rest seems to focus on stdio, so maybe it's by design.
For now, just stdio, but in the future Streaming HTTP should be the way to go, from what I've understood?
Yes, this is the new spec, but it is backward compatible, and at the level you are working with in this PR I wouldn't expect much to change: the internals of the client will probably change, but the client interface would remain the same. Which means that if today you do something like `add_mcp_server(StdioParameters | dict)`, with the dict being the arguments of the `sse_client` from the Python SDK, you could already support all the SSE servers, plus potentially future Streaming HTTP servers, with minor adjustments at most.
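As a rough sketch of that suggestion (hypothetical code, assuming the `mcp` Python SDK's `StdioServerParameters`, `stdio_client` and `sse_client` helpers; not what this PR implements):

```python
from contextlib import AsyncExitStack
from typing import Dict, Union

from mcp import ClientSession, StdioServerParameters
from mcp.client.sse import sse_client
from mcp.client.stdio import stdio_client


class MCPClientSketch:
    """Illustrative only: one method accepting stdio parameters *or* SSE kwargs."""

    def __init__(self) -> None:
        self.exit_stack = AsyncExitStack()

    async def add_mcp_server(self, params: Union[StdioServerParameters, Dict]) -> ClientSession:
        if isinstance(params, StdioServerParameters):
            # Local server spawned as a subprocess (stdio transport).
            transport = stdio_client(params)
        else:
            # Remote server over SSE; `params` holds the sse_client kwargs (url, headers, ...).
            transport = sse_client(**params)
        read, write = await self.exit_stack.enter_async_context(transport)
        session = await self.exit_stack.enter_async_context(ClientSession(read, write))
        await session.initialize()
        return session
```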
mcp_client.py (Outdated)

```python
"function": {
    "name": tool.name,
    "description": tool.description,
    "parameters": tool.inputSchema,
```
Just a note that I have seen some MCP servers with jsonref in their description, which sometimes confuses the model. In mcpadapt I had to resolve the jsonref before passing it to the model; might be minor for now.
Confused, or sometimes plain unsupported by the model SDK, like Google GenAI...
Interesting, does the spec mention anything about whether jsonref is allowed or not?
I don't think the spec mentions it; however, it gets auto-generated if you use pydantic models with the official MCP Python SDK using the FastMCP syntax. I had the case for one of the MCP servers I use to test things: https://github.com/grll/pubmedmcp
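For illustration, resolving those references before handing the schema to the model could look like this minimal hand-rolled sketch (a hypothetical helper that assumes local, non-recursive `#/...` references such as pydantic's `$defs`; one could equally use the `jsonref` package, as mcpadapt does):

```python
def resolve_local_refs(schema: dict) -> dict:
    """Inline local '#/...' JSON Schema references (e.g. pydantic-generated '$defs')
    so the tool schema passed to the model no longer contains '$ref' pointers.
    Assumes references are local and non-recursive."""

    def lookup(ref: str) -> dict:
        node = schema
        for part in ref[2:].split("/"):  # drop the leading '#/'
            node = node[part]
        return node

    def walk(node):
        if isinstance(node, dict):
            if isinstance(node.get("$ref"), str) and node["$ref"].startswith("#/"):
                return walk(lookup(node["$ref"]))
            return {key: walk(value) for key, value in node.items()}
        if isinstance(node, list):
            return [walk(item) for item in node]
        return node

    resolved = walk(schema)
    resolved.pop("$defs", None)  # the inlined definitions are no longer needed
    return resolved
```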
Thanks for the review @grll!
FYI, on what I was saying about sse_client being easy to add: this is a current proposal for a streaming HTTP client. While not compatible and not officially approved or commented on yet, you can see that it takes the same arguments as before, even falls back to the SSE client if detected, and yields the same as before (a read and write stream). Same as the stdio client.
@grll indeed, thanks. I think I'll merge this PR almost as-is, but we'd welcome further PRs to add SSE/HTTP and other stuff!
## Required reading

https://modelcontextprotocol.io/quickstart/client

TL;DR: MCP is a standard API to expose sets of Tools that can be hooked to LLMs

## Summary of how to use this

You can either use McpClient, or you can run an example Agent directly:

# Tiny Agent

We now have a tiny Agent (a while loop, really) in this PR, built on top of the MCP Client. You can run it like this:

```bash
cd packages/mcp-client
pnpm run agent
```

# `McpClient` i.e. the underlying class

```ts
const client = new McpClient({
  provider: "together",
  model: "Qwen/Qwen2.5-72B-Instruct",
  apiKey: process.env.HF_TOKEN,
});

await client.addMcpServer({
  // Filesystem "official" mcp-server with access to your Desktop
  command: "npx",
  args: ["-y", "@modelcontextprotocol/server-filesystem", join(homedir(), "Desktop")],
});
```

## Variant where we call a custom, local MCP server

```ts
await client.addMcpServer(
  "node",
  ["--disable-warning=ExperimentalWarning", join(homedir(), "Desktop/hf-mcp/index.ts")],
  {
    HF_TOKEN: process.env.HF_TOKEN,
  }
);

const response = await client.processQuery(`
  find an app that generates 3D models from text,
  and also get the best paper about transformers
`);
```

#### Where to find the MCP Server used here as an example

https://gist.github.com/julien-c/0500ba922e1b38f2dc30447fb81f7dc6

(Note that you can replace it with any MCP Server, from this doc for instance: https://modelcontextprotocol.io/examples)

## Python version

Python version will be implemented in `huggingface_hub` in this PR: huggingface/huggingface_hub#2986

Contributions are welcome!

---------

Co-authored-by: Eliott C. <[email protected]>
Co-authored-by: SBrandeis <[email protected]>
Co-authored-by: Simon Brandeis <[email protected]>
I just pushed changes to clean this PR. See PR description for the list of changes #2986 (comment). It should now be very close to the JS SDK (no SSE / Streaming-HTTP server support, though it can be easily added later). Example:

```python
import os

from huggingface_hub import ChatCompletionInputMessage, ChatCompletionStreamOutput, MCPClient


async def main():
    async with MCPClient(
        provider="nebius",
        model="Qwen/Qwen2.5-72B-Instruct",
        api_key=os.environ["HF_TOKEN"],
    ) as client:
        await client.add_mcp_server(
            command="python",
            args=["hf_mcp.py"],  # https://gist.github.com/Wauplin/b6e5f1a39db843eedfa00738e4998589
            env={"HF_TOKEN": os.environ["HF_TOKEN"]},
        )
        messages = [
            {
                "role": "user",
                "content": "what are the most recently updated models from nvidia? Explain your reasoning.",
            }
        ]
        async for chunk in client.process_single_turn_with_tools(messages):
            # Log messages
            if isinstance(chunk, ChatCompletionStreamOutput):
                delta = chunk.choices[0].delta
                if delta.content:
                    print(delta.content, end="")
            # Or tool calls
            elif isinstance(chunk, ChatCompletionInputMessage):
                print(
                    f"\nCalled tool '{chunk.name}'. Result: '{chunk.content if len(chunk.content) < 100 else chunk.content[:100] + '...'}'"
                )


if __name__ == "__main__":
    import asyncio

    asyncio.run(main())
```

Producing:
This should now be ready for review :)
🔥
I tested the PR too, it works as expected!
agree to add SSE / Streaming server support in a following PR 💯
```python
# List available tools
response = await session.list_tools()
logger.debug("Connected to server with tools:", [tool.name for tool in response.tools])
```
I'm getting a silent error `TypeError: not all arguments converted during string formatting` because the second positional argument is treated as a `%`-style formatting argument:

```diff
- logger.debug("Connected to server with tools:", [tool.name for tool in response.tools])
+ logger.debug("Connected to server with tools: %s", [tool.name for tool in response.tools])
```
```python
tools = [*exit_loop_tools, *self.available_tools]

# Create the streaming request
async with self.client:
```
Is the `async with` mandatory here? We already open `self.client` in `__aenter__`.
See my comment above: I proposed keeping the `with` block here and removing the `self.client` from the `__aenter__` above.

The reason is that `self.client` stores an HTTP response object each time it makes a request: so better to clean up this resource after the request/response has been handled (here, in this part of the code), instead of keeping all responses if multiple calls to `process_single_turn_with_tools` are made.
Indeed, I forgot to remove this part. See my comment here #2986 (comment).

> The reason is that self.client stores an HTTP response object each time it makes a request: so better to clean up this resource after the request/response has been handled (here, in this part of the code), instead of keeping all responses if multiple calls to process_single_turn_with_tools are made.

Currently, if `process_single_turn_with_tools` is called in parallel you might get unexpected behaviors because of a used connection being closed abruptly. In practice, the InferenceClient won't really have many unclosed sessions in parallel. They are already well garbage-collected, except in the case of a non-awaited streaming response, which shouldn't occur.
cc @evalstate
lgtm, feel free to merge and release whenever you want!
```python
warnings.warn(
    "'MCPClient' is experimental and might be subject to breaking changes in the future without prior notice.",
    UserWarning,
)
```
Personally I think we could omit this, but no strong opinion.
Moved it to the docstring.
Pinging @albertvillanova for a quick review too if you have time!
Tested with plaintext, image generator and audio generator (returning as EmbeddedResource). LGTM.
I would love to have a look at it this weekend.
Thanks a lot for this addition! It is great that we finally have a Hugging Face MCP client: this will definitely make experimentation and integration much smoother.
I know this has already been discussed and there's no strong consensus, but I'd still like to share my opinion here for the record: as we've talked about privately, I feel that this is more than a simple MCP client: in my humble view, it already constitutes a single-step agent due to the use of an LLM. A multi-step agent could then be built by looping over this single step. That said, I realize this is mostly a matter of terminology in a space that's evolving really quickly, so it's not a big deal at all, just a remark worth surfacing.
On a more concrete note, I'd love to suggest that we consider supporting use cases that don't involve an LLM. For example, I'm thinking of scenarios where a user wants to build a code agent rather than a tool-calling agent, similar to what we do in `smolagents`, where tools are passed as Python functions and the model generates code that uses them.
With this in mind, do you think it would make sense to also provide a sync version of the client? That could make it much easier to support tools exposed as synchronous Python functions and to integrate into existing sync workflows.
Curious to hear your thoughts. And thanks again for pushing this forward!
```python
for tool in response.tools:
    if tool.name in self.sessions:
        logger.warning(f"Tool '{tool.name}' already defined by another server. Skipping.")
        continue
```
I think we should eventually consider supporting tools with the same name coming from different servers. One possible approach could be to prepend the server name to the tool name to ensure uniqueness.
This could be explored in a future PR.
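As a small illustration of that idea (a hypothetical naming scheme, not part of this PR), the registry key could simply be the server-qualified tool name; the client would then advertise the qualified name to the LLM and strip the prefix off again when routing the call to the right session:

```python
# Hypothetical sketch: avoid collisions by registering tools under a server-qualified name.
def qualified_tool_name(server_name: str, tool_name: str) -> str:
    """E.g. ('filesystem', 'read_file') -> 'filesystem__read_file'."""
    return f"{server_name}__{tool_name}"
```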
Yes, agree 👍 (it's been mentioned above, but for now postponed to a later PR, as you said).
```python
async def __aenter__(self):
    """Enter the context manager"""
    await self.client.__aenter__()
```
What is the point of entering the client context manager here? It is only used within the `process_single_turn_with_tools` method, and the context manager is properly entered and exited there using a `with` block.
```python
await self.client.__aenter__()
```
Good catch! I initially intended to delete the `async with self.client:` below but forgot about it. The issue with including `async with self.client:` within `process_single_turn_with_tools` is that concurrent usage of `process_single_turn_with_tools` may result in unexpected behaviors if one instance terminates the sessions for all others. By relocating the logic to `__aenter__`, the responsibility shifts to the end user, who can handle the client lifecycle themselves.
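A rough sketch of the lifecycle described here (not the PR's exact code): the user-facing `async with MCPClient(...)` opens the underlying client once, and every subsequent call reuses it until the user exits the block.

```python
# Rough lifecycle sketch (not the PR's exact code): the outer `async with`
# owns the underlying AsyncInferenceClient and the MCP server connections.
from contextlib import AsyncExitStack

from huggingface_hub import AsyncInferenceClient


class MCPClientLifecycleSketch:
    def __init__(self, **client_kwargs):
        self.client = AsyncInferenceClient(**client_kwargs)
        self.exit_stack = AsyncExitStack()  # will hold the MCP server sessions

    async def __aenter__(self):
        await self.client.__aenter__()  # open the HTTP client once, up front
        return self

    async def __aexit__(self, exc_type, exc_value, exc_tb):
        await self.client.__aexit__(exc_type, exc_value, exc_tb)  # close the HTTP client
        await self.exit_stack.aclose()  # close all MCP server connections
```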
updated in b273cba
I disagree: the official example of an MCP client does have an LLM: https://modelcontextprotocol.io/quickstart/client. Without an LLM, a bare MCP client would just be a standard JSON-over-RPC call; IMO MCP is LLM-centric.
For me Agent === multi-step, otherwise it's just an LLM with tool calling.
I addressed all the comments above and fixed the resource lifecycle (the user decides when to open/close connections). Re: should the MCP client include an LLM: I do think so, and I would argue that this is what an HF-based MCP client is for (i.e. the goal is to have minimal code to glue MCP servers to Inference Providers). Regarding exposing the MCP tools as Python functions for a code agent, that can be explored separately I think (either in a separate class, or in this class with a separate method). I'm still unfamiliar with code agents and what they would expect from an MCP client. TL;DR: are we finally ready to merge this? 😄
I like self-approvals.
🔥 🔥 🔥
EDIT from @Wauplin:

I took over the work from @julien-c to integrate it in `huggingface_hub` more officially: `huggingface_hub.MCPClient`. The class is flagged as "experimental", meaning we can introduce breaking changes without prior notice.

- Async context manager usage (`async with MCPClient(...) as client:`) => takes care of cleaning up
- `add_mcp_server` kept as close to the JS client as possible. Still todo: handle SSE and Streaming servers (in a later PR)
- Renamed `process_query` to `process_single_turn_with_tools` => now takes a list of messages as input
- Made `process_single_turn_with_tools` an async generator, keeping the logic close to the JS SDK
- `InferenceClient.chat_completion` accepts `messages: List[Union[Dict, ChatCompletionInputMessage]]` (in practice already the case, just an additional type hint now)

That's about it. See comment below for example + result: #2986 (comment)

End of @Wauplin addition.
Required reading
https://modelcontextprotocol.io/quickstart/client
TL;DR: MCP is a standard API to expose sets of Tools that can be hooked to LLMs
Summary of how to use this
EDIT: see #2986 (comment)
Open questions

- Should this be implemented on `(Async)InferenceClient` directly? Or on a distinct class, like here?

Where to find the MCP Server used here as an example
Note that you can replace it with any MCP Server, from this doc for instance: https://modelcontextprotocol.io/examples
https://gist.github.com/julien-c/0500ba922e1b38f2dc30447fb81f7dc6
EDIT: and https://gist.github.com/Wauplin/b6e5f1a39db843eedfa00738e4998589
Script output
Generation from LLM with tools
3D Model Generation from Text
Here are some of the best apps that can generate 3D models from text:
Shap-E
Hunyuan3D-1
LGM
3D-Adapter
Fictiverse-Voxel_XL_Lora
Best Paper on Transformers
One of the most influential and highly cited papers on transformers is:
If you are looking for more recent advancements, here are a few other notable papers:
Title: "RoFormer: Enhanced Transformer with Rotary Position Embedding"
Link: huggingface.co/papers/2104.09864
Description: This paper introduces RoFormer, which improves the Transformer by using rotary position embeddings.
Title: "Performer: Generalized Attention with Gaussian Kernels for Sequence Modeling"
Link: huggingface.co/papers/2009.14794
Description: This paper presents Performer, a more efficient and scalable version of the Transformer.
Title: "Longformer: The Long-Document Transformer"
Link: huggingface.co/papers/2004.05150
Description: This paper introduces Longformer, which extends the Transformer to handle very long documents.
These resources should provide you with a solid foundation for both generating 3D models from text and understanding the latest advancements in transformer models.