feat(route/thinkingmachines): add news route for Thinking Machines Lab by w3nhao · Pull Request #21609 · DIYgod/RSSHub

w3nhao · 2026-04-03T00:27:16Z

Involved Issue / 该 PR 相关 Issue

N/A - New route

Example for the Proposed Route(s) / 路由地址示例

/thinkingmachines/news

New RSS Route Checklist / 新 RSS 路由检查表

New Route / 新的路由
- Follows Script Standard / 跟随路由规范
Anti-bot or rate limit / 反爬/频率限制
- If yes, do your code reflect this sign? / 如果有, 是否有对应的措施? (No anti-bot detected)
Date and time / 日期和时间
- Parsed / 可以解析 (using parseDate with MMM D, YYYY format)
- Correct time zone / 时区正确
New package added / 添加了新的包
Puppeteer

Note / 说明

Add RSS feed for Thinking Machines Lab news page. Founded by Mira Murati (ex-OpenAI CTO), the lab publishes news about their AI research, products (Tinker), and partnerships.

The route scrapes the news listing page using cheerio, extracting article titles, dates, and links from <time class="desktop-time"> and <div class="post-title"> elements. Full article content is fetched and cached for each entry.

Add route for Thinking Machines Lab (thinkingmachines.ai) news page. Founded by Mira Murati (ex-OpenAI CTO), the lab publishes news about their AI research and products. Closes #0

lib/routes/thinkingmachines/news.ts

github-actions · 2026-04-03T00:33:45Z

Successfully generated as following:

http://localhost:1200/thinkingmachines/news - Success ✔️

<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
  <channel>
    <title>Thinking Machines Lab - News</title>
    <link>https://thinkingmachines.ai/news/</link>
    <atom:link href="http://localhost:1200/thinkingmachines/news" rel="self" type="application/rss+xml"></atom:link>
    <description>Thinking Machines Lab - News - Powered by RSSHub</description>
    <generator>RSSHub</generator>
    <webMaster>contact@rsshub.app (RSSHub)</webMaster>
    <language>en</language>
    <lastBuildDate>Fri, 03 Apr 2026 00:33:44 GMT</lastBuildDate>
    <ttl>5</ttl>
    <item>
      <title>Training LLMs to Predict World Events (Guest Post with Mantic)</title>
      <description>&lt;div class=&quot;post-heading news-post&quot;&gt;
        &lt;h1 class=&quot;post-title&quot;&gt;Training LLMs to Predict World Events (Guest Post with Mantic)&lt;/h1&gt;
        &lt;div class=&quot;publish-metadata&quot;&gt;
        &lt;span class=&quot;author&quot;&gt;
        &lt;a href=&quot;https://enjeeneer.io/&quot; rel=&quot;me noopener&quot; target=&quot;_blank&quot;&gt;Scott Jeen&lt;/a&gt; and &lt;a href=&quot;https://www.linkedin.com/in/matthew-aitchison-16aa8799/&quot; rel=&quot;me noopener&quot; target=&quot;_blank&quot;&gt;Matthew Aitchison&lt;/a&gt; in collaboration with others at &lt;a href=&quot;https://www.mantic.com/&quot; rel=&quot;me noopener&quot; target=&quot;_blank&quot;&gt;Mantic&lt;/a&gt;
        &lt;/span&gt;
        &lt;span&gt;
        Mar 19, 2026
        &lt;/span&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;div class=&quot;post-cover&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/cover.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;/div&gt;
        &lt;article class=&quot;content &quot;&gt;
        &lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://www.mantic.com/&quot;&gt;Mantic&lt;/a&gt; have been using Tinker since it launched. This guest post is a technical deep dive on what they have built so far.&lt;/em&gt;&lt;/p&gt;
        &lt;p&gt;The top AI forecasting systems are approaching superforecaster-level accuracy on geopolitics and current affairs.&lt;label for=&quot;sn1&quot; class=&quot;sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn1&quot; class=&quot;margin-toggle&quot;&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;a href=&quot;https://news.polymarket.com/p/its-over&quot;&gt;It’s over&lt;/a&gt; (Polymarket, 2026)&lt;/span&gt; This is exciting because scalable, automated forecasting could significantly improve the quality of decision-making across the economy and in government.&lt;/p&gt;
        &lt;p&gt;To date, the most successful recipe in forecasting tournaments has been to combine an off-the-shelf LLM (like Gemini 3 or GPT-5) with forecasting-specific context-gathering. These models, to our knowledge, have not been explicitly trained for forecasting. Can we improve the recipe by replacing them with models fine-tuned specifically for forecasting?&lt;/p&gt;
        &lt;p&gt;We target “judgmental forecasting”: prediction problems that require human-like research and reasoning. Judgmental forecasting is needed for domains like geopolitics, politics, technology, business, and economic policy, where there often isn’t enough data for a standard statistical approach like time-series extrapolation. It was popularized by the book &lt;a href=&quot;https://books.google.com/books?hl=en&amp;amp;lr=&amp;amp;id=_lMPDAAAQBAJ&amp;amp;oi=fnd&amp;amp;pg=PA1&amp;amp;ots=5avwaBT55R&amp;amp;sig=B2i7FYispHz5PtB6KVu2Ndt58cE&quot;&gt;Superforecasting&lt;/a&gt;, and now prediction markets like Polymarket and Kalshi.&lt;/p&gt;
        &lt;p&gt;In this post, we show it’s possible to significantly improve the forecasting performance of gpt-oss-120b using reinforcement learning. With Tinker, we fine-tune a model on around 10,000 binary questions of the form &lt;em&gt;“Will [event] occur before [date]?”&lt;/em&gt;. We reward the model for putting greater probability on the correct real-world outcome.&lt;/p&gt;
        &lt;p&gt;In a head-to-head contest, the fine-tuned model achieves marginally superior performance to the frontier LLMs (see Figure 1), despite much lower initial performance. We find that providing forecast-specific context increases the gains from fine-tuning.&lt;/p&gt;
        &lt;p&gt;In the optimal &lt;em&gt;ensemble&lt;/em&gt; of different models (which outperforms any single model), Grok 4 and our fine-tuned model are the most important contributors. The fine-tuned model learns a forecasting policy that is as accurate as the frontier LLMs, yet decorrelated from them.&lt;/p&gt;
        &lt;p&gt;Together, the results demonstrate that on-task training can extend the state-of-the-art in AI forecasting.&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_1.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;RL fine-tuning makes gpt-oss-120b competitive with the frontier LLMs on questions from the Metaculus AI Benchmark Q2 2025. Naively predicting 50% on every question would get a score of 0, and perfect foresight would get a score of 100, per the construction of the Metaculus “baseline score”. Naively predicting 18.8% on every question (the rate at which the equivalent questions resolved “yes” in the previous tournament, Q1 2025) yields a score of 22.3, which we use to truncate the Y-axis. Fine-tuning improves gpt-oss-120b‘s score from 38.6 to 45.8, on par with the best general models.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;h2 id=&quot;the-best-existing-recipe-uses-off-the-shelf-llms&quot;&gt;The best existing recipe uses off-the-shelf LLMs&lt;/h2&gt;
        &lt;p&gt;The past two years have seen considerable progress in AI judgmental forecasting capabilities. In the Metaculus Cup, a major tournament for amateur and professional forecasters, the best AI systems now rival the top humans (Figure 2).&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_2.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Human and AI scores in the Metaculus Cup, a premier forecasting competition. Scores from the top 5 AI forecasters have been steadily improving since they first entered in the Summer of 2024. Mantic first entered in the Summer of 2025 and then in Fall 2025 beat the community prediction and the majority of professional forecasters. These results were without fine-tuning.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;The trend has been driven by more capable off-the-shelf LLMs, and accelerated by more sophisticated forecasting architectures. Our architecture – which has performed well in recent Metaculus tournaments – consists of two standard phases: (1) a research phase, and (2) a prediction phase (Figure 3).&lt;label for=&quot;sn3&quot; class=&quot;sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn3&quot; class=&quot;margin-toggle&quot;&gt;&lt;span class=&quot;sidenote&quot;&gt;This two-phase process appears in early work on AI forecasting. See: &lt;a href=&quot;https://proceedings.neurips.cc/paper_files/paper/2024/hash/5a5acfd0876c940d81619c1dc60e7748-Abstract-Conference.html&quot;&gt;Approaching human-level forecasting with language models&lt;/a&gt; (Halawi et al., 2024); &lt;a href=&quot;https://arxiv.org/abs/2206.15474&quot;&gt;Forecasting Future World Events with Neural Networks&lt;/a&gt; (Zou et al, 2022).&lt;/span&gt;&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_3.png&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Mantic’s architecture. The research phase takes the forecasting question as input and performs deep research to collect information relevant to the question which goes into the prompt for the prediction LLM. The prediction LLM outputs chain-of-thought reasoning and specifies a probability distribution using specialized tools.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;The research phase is conducted by deep research agents that collect the context needed to make a good prediction. For example, for the question “Will the United States attack Venezuela before 2026?”, search agents will find information about military buildup in the Caribbean, statements from President Trump, the health of the Venezuelan economy, and so on. The collected research is summarized into a prompt for the prediction phase.&lt;/p&gt;
        &lt;p&gt;The model’s task at the prediction phase is to use our specialized tools to output a probability distribution. In this post, we consider a canonical type of forecasting question: “Will [event] occur before [date]?”. We instruct the LLM to parameterize a mixture model for when the event will next occur – illustrated in Figure 4. The mixture model defines a cumulative distribution function, and from that we can read off the probability of the event occurring before the date specified in the original question.&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_4.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Illustrative mixture model. The LLM selects: the number of components in the mixture, their parameters, and their respective weights. The LLM is prompted to select components capturing different scenarios that could lead to the event occurring. The final prediction is a weighted combination of the components.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;In past tournaments, we’ve used off-the-shelf models as the prediction LLM. Existing literature has shown promising results from RL fine-tuning small models using a simple architecture.&lt;label for=&quot;sn4&quot; class=&quot;sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn4&quot; class=&quot;margin-toggle&quot;&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;a href=&quot;https://scholar.google.com/citations?view_op=view_citation&amp;amp;hl=en&amp;amp;user=oQuvQ0sAAAAJ&amp;amp;citation_for_view=oQuvQ0sAAAAJ:u5HHmVD_uO8C&quot;&gt;Outcome-based Reinforcement Learning to Predict the Future&lt;/a&gt; (Turtel et al, 2025)&lt;/span&gt; Can we improve frontier AI forecasting systems through on-task fine-tuning?&lt;/p&gt;
        &lt;h2 id=&quot;training-details&quot;&gt;Training details&lt;/h2&gt;
        &lt;h3 id=&quot;datasets&quot;&gt;Datasets&lt;/h3&gt;
        &lt;p&gt;We train the prediction LLM on ~10k questions about whether an event will happen by a given date. The questions are from August 2024 to December 2025 and the model’s knowledge cutoff is prior to that, so the resolution is known to us but not to the model. We generated the training set using an LLM pipeline similar to existing work.&lt;label for=&quot;sn5&quot; class=&quot;sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn5&quot; class=&quot;margin-toggle&quot;&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;a href=&quot;https://arxiv.org/abs/2601.22444&quot;&gt;Automating Forecasting Question Generation and Resolution for AI Evaluation&lt;/a&gt; (Bosse et al, 2026); &lt;a href=&quot;https://arxiv.org/abs/2601.06336&quot;&gt;Future-as-Label: Scalable Supervision from Real-World Outcomes&lt;/a&gt; (Turtel et al, 2026)&lt;/span&gt; Before training, we run the research phase for each question and store static prompts for the prediction LLM.&lt;/p&gt;
        &lt;p&gt;We test on unseen questions from the Q2 2025 Metaculus AI Benchmark Tournament.&lt;label for=&quot;sn6&quot; class=&quot;sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn6&quot; class=&quot;margin-toggle&quot;&gt;&lt;span class=&quot;sidenote&quot;&gt; The Fall 2025 iteration would have been a more obvious choice (for being more recent) but contains lower quality questions, indicated by less performance differentiated between strong and weak forecasters.&lt;/span&gt; We compare models whose knowledge cutoff is before this tournament’s start date. The full list of questions can be accessed on &lt;a href=&quot;https://gist.github.com/enjeeneer/86e24a52e6041a3d78e333bcab16984d&quot;&gt;Github&lt;/a&gt;.&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_5.png&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Three example binary event questions from the Metaculus Q2 2025 AI Benchmark.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;We evaluate using the baseline score, following the Metaculus platform. This is log scoring, i.e. ln(probability assigned to true outcome), rescaled such that 100 is the maximum possible score and 0 is the score for a uniform prediction (in our setting, 50%).&lt;/p&gt;
        &lt;h3 id=&quot;implementation&quot;&gt;Implementation&lt;/h3&gt;
        &lt;p&gt;We run the experiments on &lt;a href=&quot;https://thinkingmachines.ai/tinker/&quot;&gt;Tinker&lt;/a&gt;. Of the models available through the API, we choose to train &lt;a href=&quot;https://huggingface.co/openai/gpt-oss-120b&quot;&gt;gpt-oss-120b&lt;/a&gt; because of its strong initial performance — second only to &lt;a href=&quot;https://huggingface.co/moonshotai/Kimi-K2.5&quot;&gt;Kimi K2.5&lt;/a&gt; — while being cheaper and faster.&lt;/p&gt;
        &lt;p&gt;We use a standard policy gradient algorithm with &lt;a href=&quot;https://arxiv.org/abs/2402.03300&quot;&gt;GRPO-style advantage normalisation&lt;/a&gt; and &lt;a href=&quot;https://arxiv.org/abs/2503.20783&quot;&gt;no division by the standard deviation&lt;/a&gt;. For rewards we use the Brier score which is &lt;a href=&quot;https://sites.stat.washington.edu/raftery/Research/PDF/Gneiting2007jasa.pdf&quot;&gt;strictly proper&lt;/a&gt;. We found that the Brier score leads to more stable training than the log score, even though the log score is also strictly proper. This could be because the Brier score is bounded in [0, 1] and so produces lower-variance policy gradient estimates.&lt;/p&gt;
        &lt;p&gt;Open-source RL packages often use &lt;a href=&quot;https://arxiv.org/abs/2309.06180&quot;&gt;vLLM&lt;/a&gt; for the sampling backend and &lt;a href=&quot;https://arxiv.org/abs/2304.11277&quot;&gt;FSDP&lt;/a&gt; for the training backend. These can disagree on token probabilities produced by identical policies, which biases policy gradient estimates and destabilises training. We found these discrepancies to be lower on Tinker’s integrated infrastructure, but further mitigate them with &lt;a href=&quot;https://fengyao.notion.site/off-policy-rl&quot;&gt;an importance sampling correction on the advantages&lt;/a&gt;.&lt;/p&gt;
        &lt;p&gt;The Brier score reward function is strictly monotonic in the predicted probability for a fixed outcome, so different rollouts almost always produce different rewards. This makes within-group reward ties extremely unlikely and lets us train with a relatively small group size (8) without needing to break ties or induce variance. We use a batch size of 64, as we find that &lt;a href=&quot;https://thinkingmachines.ai/blog/lora/&quot;&gt;larger batch sizes tend to destabilise training&lt;/a&gt;.&lt;/p&gt;
        &lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;
        &lt;h3 id=&quot;fine-tuning-elevates-gpt-oss-120b-to-frontier-llm-performance&quot;&gt;Fine-tuning elevates gpt-oss-120b to frontier LLM performance&lt;/h3&gt;
        &lt;p&gt;The model’s test set score improves through training (Figure 6), moving from an initial score of 38.6 mean baseline points per question (below all frontier models) to a final score of 45.8 mean baseline points per question (marginally above). This demonstrates that on-task forecasting training can provide a large performance uplift.&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_6.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Test set baseline score of gpt-oss-120b with and without Mantic research and tools. In the model-only setup, test set performance improves, but never reaches the initial score of gpt-oss-120b with Mantic research and tools. With Mantic research and tools, gpt-oss-120b climbs 7 points through training and marginally exceeds the performance of Gemini 3 Pro. Training continued for further steps but performance no longer improved.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;The LLM fine-tuned without the benefit of our pre-generated research phase, and without the tools to construct mixture models, gains only 3 points from training instead of 7. This suggests these prerequisites positively influence the optimization dynamics, in addition to improving initialization.&lt;/p&gt;
        &lt;h3 id=&quot;the-fine-tuned-model-is-an-important-member-of-the-optimal-ensemble&quot;&gt;The fine-tuned model is an important member of the optimal ensemble&lt;/h3&gt;
        &lt;p&gt;In human forecasting, there is a well-known “wisdom of the crowd” effect: aggregate forecasts from multiple people often outperforms any one individual. This effect, in part, explains the &lt;a href=&quot;https://polymarket.com/accuracy&quot;&gt;impressive accuracy of prediction markets&lt;/a&gt;. Can we get the same benefit from ensembling the predictions of different LLMs?&lt;label for=&quot;sn7&quot; class=&quot;sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn1&quot; class=&quot;margin-toggle&quot;&gt;&lt;span class=&quot;sidenote&quot;&gt;See also &lt;a href=&quot;https://arxiv.org/abs/2402.19379&quot;&gt;Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy&lt;/a&gt; (Schoenegger et al, 2024)&lt;/span&gt;&lt;/p&gt;
        &lt;p&gt;To contribute to an ensemble, an LLM must be sufficiently accurate on its own and also decorrelated from other LLMs in the ensemble. Predictions from most frontier LLMs, while accurate, contribute little diversity to the top-performing model (in our case, Gemini 3 Pro) – Figure 7. Among the frontier LLMs, Grok 4 is the exception: its predictions score well whilst correlating less with other frontier LLMs.&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_7.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Mean baseline score on binary event questions from the Metaculus Q2 AIB Benchmark with respect to their Jensson Shannon divergence from Gemini 3 Pro for a suite of closed source and open source LLMs. Marker colour indicates each model’s weight in the optimal 5-sample ensemble. The optimal ensemble consists of fine-tuned gpt-oss-120b (40%), gemini 3 pro (20%), gpt-5 (20%) and grok 4 (20%).&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;The optimal ensemble, on a budget of 5 samples, is 2 samples from our fine-tuned gpt-oss-120b plus 1 sample from each of Gemini 3 Pro, Grok 4, and GPT-5. We can test each model’s contribution by removing it, recomputing the optimal ensemble, and seeing how much the score degrades. We find that Grok 4 is the least replaceable, with fine-tuned gpt-oss-120b in second place. Other models can be replaced with little to no performance degradation (Figure 8).&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_8.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;When selecting an ensemble of frontier and open-source models, Grok 4 and fine-tuned gpt-oss-120b are the least replaceable. Model replaceability is defined as the reduction in score incurred when removing a model from the optimal ensemble. By definition, if a model is not included in the optimal ensemble,there is no cost to removing it.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;GPT-5 and Gemini 3 Pro make similar predictions, and thus don’t benefit much from ensembling with each other. Both models improve from mixing their predictions with either Grok 4 or the fine-tuned gpt-oss-120b, and Grok 4 also benefits most from mixing with the fine-tuned model. In either optimal 3-way ensemble from Figure 9, the fine-tuned model gets about half of the total weight.&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_9.png&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Three-way ensembles. Left: The optimal three-way ensemble between the fine-tuned gpt-oss-120b, Gemini 3 Pro and GPT-5 is weighted 56%, 26% and 18% respectively, depicted by the black star. Right: The optimal three-way ensemble between the fine-tuned gpt-oss-120b, Gemini 3 Pro and Grok 4 is weighted 48, 26% and 26% respectively, depicted by the white star.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;h2 id=&quot;conclusions-and-next-steps&quot;&gt;Conclusions and next steps&lt;/h2&gt;
        &lt;p&gt;We have shown that we can elevate the forecasting performance of gpt-oss-120b to match frontier LLMs with RL fine-tuning. This work can be extended in many ways, some of which we have already begun exploring:&lt;/p&gt;
        &lt;ol&gt;
        &lt;li&gt;&lt;strong&gt;Training larger models.&lt;/strong&gt; Tinker enables training larger models with higher initial performance than gpt-oss, specifically Kimi K2.5.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Training on all question formats.&lt;/strong&gt; We are already training models on numerical questions such as economic indicators and multiple choice questions such as election results.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Improved question sets.&lt;/strong&gt; As the models become stronger forecasters, we need more challenging forecasting questions.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Information retrieval inside the loop.&lt;/strong&gt; We could give the prediction LLM tools for information retrieval and include this in the training loop.&lt;/li&gt;
        &lt;/ol&gt;
        &lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;
        &lt;p&gt;Please cite this work as:&lt;/p&gt;
        &lt;pre tabindex=&quot;0&quot;&gt;&lt;code&gt;Jeen, Scott; Aitchison, Matthew; and Mantic, &quot;Training LLMs to Predict World Events&quot;,
        Thinking Machines Lab: News, Mar 2026.
        &lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Or use the BibTeX citation:&lt;/p&gt;
        &lt;pre tabindex=&quot;0&quot;&gt;&lt;code&gt;@article{scott2026forecasting,
        author = {Scott Jeen, Matthew Aitchison, and Mantic},
        title = {Training LLMs to Predict World Events},
        journal = {Thinking Machines Lab: News},
        year = {2026},
        note = {https://thinkingmachines.ai/news/training-llms-to-predict-world-events/}
        }
        &lt;/code&gt;&lt;/pre&gt;
        &lt;/article&gt;
        &lt;div class=&quot;paginator&quot;&gt;
        &lt;a id=&quot;post-prev-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/nvidia-partnership/&quot; title=&quot;Thinking Machines Lab and NVIDIA Announce Long-Term Gigawatt-Scale Strategic Partnership&quot;&gt;prev&lt;/a&gt;
        &lt;a href=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/#&quot; id=&quot;back-to-top&quot; class=&quot;link&quot; title=&quot;Go to top&quot;&gt;back to top&lt;/a&gt;
        &lt;a id=&quot;post-next-link&quot;&gt;&lt;/a&gt;
        &lt;/div&gt;</description>
      <link>https://thinkingmachines.ai/news/training-llms-to-predict-world-events/</link>
      <guid isPermaLink="false">https://thinkingmachines.ai/news/training-llms-to-predict-world-events/</guid>
      <pubDate>Wed, 18 Mar 2026 16:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Thinking Machines Lab and NVIDIA Announce Long-Term Gigawatt-Scale Strategic Partnership</title>
      <description>&lt;div class=&quot;post-heading news-post&quot;&gt;
        &lt;h1 class=&quot;post-title&quot;&gt;Thinking Machines Lab and NVIDIA Announce Long-Term Gigawatt-Scale Strategic Partnership&lt;/h1&gt;
        &lt;div class=&quot;publish-metadata date-only&quot;&gt;
        &lt;span&gt;
        Mar 10, 2026
        &lt;/span&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;div class=&quot;post-cover compact&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/nvidia-partnership/images/jensen-huang-mira-murati-nvidia-debs-3127-2.jpg&quot; alt=&quot;Cover image for Thinking Machines Lab and NVIDIA Announce Long-Term Gigawatt-Scale Strategic Partnership&quot; class=&quot;cover-image&quot; fetchpriority=&quot;high&quot; decoding=&quot;async&quot; loading=&quot;eager&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;p class=&quot;post-cover-caption&quot;&gt;Jensen Huang (NVIDIA) and Mira Murati (Thinking Machines)&lt;/p&gt;
        &lt;/div&gt;
        &lt;article class=&quot;content &quot;&gt;
        &lt;p&gt;Thinking Machines Lab and NVIDIA announced today a multi-year strategic partnership to deploy at least one gigawatt of next-generation NVIDIA Vera Rubin systems to support Thinking Machines’ frontier model training and platforms delivering customizable AI at scale. Deployment on NVIDIA’s Vera Rubin platform is targeted for early next year.&lt;/p&gt;
        &lt;p&gt;The partnership also includes an effort to design training and serving systems for NVIDIA architectures and broaden access to frontier AI and open models for enterprises, research institutions, and the scientific community.&lt;/p&gt;
        &lt;p&gt;NVIDIA has also made a significant investment in Thinking Machines Lab to support the company’s long-term growth.&lt;/p&gt;
        &lt;p&gt;“AI is the most powerful knowledge discovery instrument in human history,” said Jensen Huang, founder and CEO of NVIDIA. “Thinking Machines has brought together a world-class team to advance the frontier of AI. We are thrilled to partner with Thinking Machines to realize their exciting vision for the future of AI.”&lt;/p&gt;
        &lt;p&gt;“NVIDIA’s technology is the foundation on which the entire field is built,” said Mira Murati, cofounder and CEO of Thinking Machines. “This partnership accelerates our capacity to build AI that people can shape and make their own, as it shapes human potential in turn.”&lt;/p&gt;
        &lt;p&gt;Building powerful AI systems that are understandable, customizable, and collaborative demands advances in research, design, and infrastructure at scale. This partnership provides that foundation, with the shared aim of ensuring that the most transformative technology of our time expands human capability.&lt;/p&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/nvidia-partnership/cover.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;/article&gt;
        &lt;div class=&quot;paginator&quot;&gt;
        &lt;a id=&quot;post-prev-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/tinker-general-availability/&quot; title=&quot;Tinker: General Availability and Vision Input&quot;&gt;prev&lt;/a&gt;
        &lt;a href=&quot;https://thinkingmachines.ai/news/nvidia-partnership/#&quot; id=&quot;back-to-top&quot; class=&quot;link&quot; title=&quot;Go to top&quot;&gt;back to top&lt;/a&gt;
        &lt;a id=&quot;post-next-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/&quot; title=&quot;Training LLMs to Predict World Events (Guest Post with Mantic)&quot;&gt;next&lt;/a&gt;
        &lt;/div&gt;</description>
      <link>https://thinkingmachines.ai/news/nvidia-partnership/</link>
      <guid isPermaLink="false">https://thinkingmachines.ai/news/nvidia-partnership/</guid>
      <pubDate>Mon, 09 Mar 2026 16:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Tinker: General Availability and Vision Input</title>
      <description>&lt;div class=&quot;post-heading news-post&quot;&gt;
        &lt;h1 class=&quot;post-title&quot;&gt;Tinker: General Availability and Vision Input&lt;/h1&gt;
        &lt;div class=&quot;publish-metadata date-only&quot;&gt;
        &lt;span&gt;
        Dec 12, 2025
        &lt;/span&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;article class=&quot;content &quot;&gt;
        &lt;p&gt;Today we are announcing four updates to Tinker:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;No more waitlist&lt;/li&gt;
        &lt;li&gt;New reasoning model: Kimi K2 Thinking&lt;/li&gt;
        &lt;li&gt;New inference interface that is compatible with the OpenAI API&lt;/li&gt;
        &lt;li&gt;Vision input support with Qwen3-VL&lt;/li&gt;
        &lt;/ul&gt;
        &lt;h2 id=&quot;general-availability&quot;&gt;General availability&lt;/h2&gt;
        &lt;p&gt;The waitlist is over! Everybody can use Tinker now; &lt;a href=&quot;https://auth.thinkingmachines.ai/sign-up&quot;&gt;sign up here&lt;/a&gt; to get started. See the &lt;a href=&quot;https://thinkingmachines.ai/tinker/&quot;&gt;Tinker homepage&lt;/a&gt; for available models and pricing, and check out the &lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-cookbook&quot;&gt;Tinker cookbook&lt;/a&gt; for code examples.&lt;/p&gt;
        &lt;h2 id=&quot;more-reasoning-with-kimi-k2-thinking&quot;&gt;More reasoning with Kimi K2 Thinking&lt;/h2&gt;
        &lt;p&gt;Users can now fine-tune Kimi K2 Thinking on Tinker. With a trillion parameters, Kimi K2 is the largest model in our lineup so far. It is built for long chains of reasoning and tool use.&lt;/p&gt;
        &lt;h2 id=&quot;openai-api-compatible-sampling&quot;&gt;OpenAI API-compatible sampling&lt;/h2&gt;
        &lt;p&gt;Tinker has a standard function for inference:&lt;/p&gt;
        &lt;div class=&quot;highlight&quot;&gt;&lt;pre tabindex=&quot;0&quot; class=&quot;chroma&quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;n&quot;&gt;prompt&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;types&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ModelInput&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_ints&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;The capital of France is&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,))&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;types&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SamplingParams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_tokens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;temperature&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stop&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;n&quot;&gt;future&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sampling_client&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prompt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prompt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sampling_params&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;With this release, we have added OpenAI API-compatible scaffolding for quickly sampling from a model by specifying a path, even while it’s still training. This also means Tinker can now plug-and-play with any OpenAI API-compatible platform. See more information in our &lt;a href=&quot;https://tinker-docs.thinkingmachines.ai/compatible-apis/openai&quot;&gt;Tinker documentation&lt;/a&gt;.&lt;/p&gt;
        &lt;div class=&quot;highlight&quot;&gt;&lt;pre tabindex=&quot;0&quot; class=&quot;chroma&quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;n&quot;&gt;response&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;openai_client&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;completions&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;create&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;tinker://0034d8c9-0a88-52a9-b2b7-bce7cb1e6fef:train:0/sampler_weights/000080&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;prompt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;The capital of France is&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;max_tokens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;temperature&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;stop&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&quot;vision-input-with-qwen3-vl&quot;&gt;Vision input with Qwen3-VL&lt;/h2&gt;
        &lt;p&gt;We’ve added two vision models to Tinker: Qwen3-VL-30B-A3B-Instruct and Qwen3-VL-235B-A22B-Instruct. With these, users can process pictures, screenshots, and diagrams for a variety of applications.&lt;/p&gt;
        &lt;p&gt;To input images, just interleave together an &lt;a href=&quot;https://tinker-docs.thinkingmachines.ai/api-reference/types#imagechunk-objects&quot;&gt;ImageChunk&lt;/a&gt; – consisting of your image, saved as bytes – with text chunks. For example:&lt;/p&gt;
        &lt;div class=&quot;highlight&quot;&gt;&lt;pre tabindex=&quot;0&quot; class=&quot;chroma&quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;n&quot;&gt;model_input&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tinker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ModelInput&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;chunks&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;tinker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;types&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ImageChunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;image_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;png&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;tinker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;types&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;EncodedTextChunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;What is this?&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;These vision inputs can be used in a variety of applications out-of-the-box, including SFT and RL finetuning.&lt;/p&gt;
        &lt;p&gt;To demonstrate vision understanding in action, we are sharing &lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-cookbook/tree/main/tinker_cookbook/recipes/vlm_classifier&quot;&gt;a new cookbook recipe for fine-tuning VLMs as image classifiers&lt;/a&gt;. Qwen3-VL-235B-A22B-Instruct obtains reasonable accuracy even given just one example per class; performance improves with more labeled data.&lt;/p&gt;
        &lt;h2 id=&quot;training-image-classifiers-with-tinker&quot;&gt;Training image classifiers with Tinker&lt;/h2&gt;
        &lt;p&gt;To showcase Tinker’s new vision capabilities, we finetuned &lt;a href=&quot;https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct&quot;&gt;Qwen3-VL-235B-A22B-Instruct&lt;/a&gt; to classify images on four classic datasets:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;https://data.caltech.edu/records/mzrjq-6wc02&quot;&gt;Caltech 101&lt;/a&gt;, a dataset of 101 general object categories.&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://www.kaggle.com/datasets/eduardo4jesus/stanford-cars-dataset&quot;&gt;Stanford Cars&lt;/a&gt;, a dataset of car makes, models, and years.&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://www.robots.ox.ac.uk/~vgg/data/flowers/102/&quot;&gt;Oxford Flowers&lt;/a&gt;, a dataset of flower species.&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://www.robots.ox.ac.uk/~vgg/data/pets/&quot;&gt;Oxford Pets&lt;/a&gt;, a dataset of pet breeds.&lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;Since Qwen3-VL is a language model, we frame classification as text generation: given an image, the model outputs the class name. We compare this approach against a traditional vision baseline of finetuning a vision-only model — DINOv2-base. &lt;a href=&quot;https://arxiv.org/pdf/2304.07193&quot;&gt;DINOv2&lt;/a&gt; is a self-supervised vision transformer that was trained to encode images, and is commonly used as a backbone for pure computer vision tasks. For DINOv2, we add a classification head that predicts a distribution over all N classes. Both models are fine-tuned with LoRA.&lt;/p&gt;
        &lt;p&gt;Labeled image data is scarce for many real-world use cases, so data efficiency is the primary measure we look at. We show the classification accuracy when sweeping across the number of labeled examples per class, starting with just a single one.&lt;/p&gt;
        &lt;figure id=&quot;fig:qwen-v-dino&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/tinker-general-availability/images/vlm-graphs.png&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Comparison of fine-tuned Qwen3-VL-235-A22B and DINOv2 performance on simple image classification tasks.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;In the limited-data regime, Qwen3-VL-235-A22B outperforms DINOv2. Not only is it a bigger model, but as a VLM, it also comes with language knowledge out-of-the-box (i.e. what a “golden retriever” or “sunflower” is). This general language-and-vision capability of Qwen3-VL makes it readily available for vision tasks beyond classification.&lt;/p&gt;
        &lt;h2 id=&quot;happy-holidays&quot;&gt;Happy Holidays&lt;/h2&gt;
        &lt;p&gt;Tinker exists to enable builders and researchers to train and customize state-of-the-art models. As always, we look forward to seeing what you build with Tinker. Happy holidays!&lt;/p&gt;
        &lt;/article&gt;
        &lt;div class=&quot;paginator&quot;&gt;
        &lt;a id=&quot;post-prev-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/call-for-community-projects/&quot; title=&quot;Tinker: Call for Community Projects&quot;&gt;prev&lt;/a&gt;
        &lt;a href=&quot;https://thinkingmachines.ai/news/tinker-general-availability/#&quot; id=&quot;back-to-top&quot; class=&quot;link&quot; title=&quot;Go to top&quot;&gt;back to top&lt;/a&gt;
        &lt;a id=&quot;post-next-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/nvidia-partnership/&quot; title=&quot;Thinking Machines Lab and NVIDIA Announce Long-Term Gigawatt-Scale Strategic Partnership&quot;&gt;next&lt;/a&gt;
        &lt;/div&gt;</description>
      <link>https://thinkingmachines.ai/news/tinker-general-availability/</link>
      <guid isPermaLink="false">https://thinkingmachines.ai/news/tinker-general-availability/</guid>
      <pubDate>Thu, 11 Dec 2025 16:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Tinker: Call for Community Projects</title>
      <description>&lt;div class=&quot;post-heading news-post&quot;&gt;
        &lt;h1 class=&quot;post-title&quot;&gt;Tinker: Call for Community Projects&lt;/h1&gt;
        &lt;div class=&quot;publish-metadata date-only&quot;&gt;
        &lt;span&gt;
        Nov 7, 2025
        &lt;/span&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;article class=&quot;content &quot;&gt;
        &lt;p&gt;We launched &lt;a href=&quot;https://thinkingmachines.ai/tinker/&quot;&gt;Tinker&lt;/a&gt; to enable builders and researchers to train models their own way, whether they’re conducting studies or customizing models for new applications. We plan to publish regular roundups of the coolest projects from the Tinker community, and &lt;strong&gt;we invite you to submit what you’ve been Tinkering on to be featured on our blog&lt;/strong&gt;.&lt;/p&gt;
        &lt;p&gt;Below are some broad suggestions for what we hope to see from the Tinker featured projects, and some specific research directions we would particularly love to see pursued.&lt;/p&gt;
        &lt;h2 id=&quot;guidelines-for-tinker-featured-projects&quot;&gt;Guidelines for Tinker Featured Projects&lt;/h2&gt;
        &lt;p&gt;We’re interested in featuring ML research projects, AI-enabled research in other domains, custom models, and other contributions. Some examples:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;A reimplementation of a research project or tech report using Tinker, such as papers that compare algorithmic recipes or datasets.&lt;/li&gt;
        &lt;li&gt;Original research in machine learning, such as exploring new approaches to training or optimization or applying novel benchmarks and evaluations.&lt;/li&gt;
        &lt;li&gt;Research in a non-AI field that uses fine-tuned models, such as the work on mathematical theorem provers and chemistry models we &lt;a href=&quot;https://thinkingmachines.ai/news/announcing-tinker/#:~:text=Groups%20at%20Princeton%2C%20Stanford&quot;&gt;highlighted previously&lt;/a&gt;.&lt;/li&gt;
        &lt;li&gt;Product prototypes built with Tinker, demoing a model that does something fresh or delightful.&lt;/li&gt;
        &lt;li&gt;Novel datasets and task environments for training models.&lt;/li&gt;
        &lt;li&gt;High-level libraries built on top of Tinker that enable less experienced practitioners to perform fine-tuning effectively.&lt;/li&gt;
        &lt;li&gt;Infrastructure contributions, such as a clean self-hosted implementation of the Tinker training API.&lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;Your submission should include a write-up and, preferably, an open-source release of your code. We encourage you to focus on rigor and clear evaluation in your write-ups: crisp charts, raw output examples, clear comparisons to alternative approaches or models on relevant benchmarks and metrics. Tinkering is experimenting — we want to feature diligent work and transparent results over novelty or hype.&lt;/p&gt;
        &lt;p&gt;Please send your projects and any related questions to &lt;a href=&quot;https://thinkingmachines.ai/cdn-cgi/l/email-protection#2c58454247495e6c584445424745424b414d4f444542495f024d45135f594e46494f58116a494d58595e4948091e1c7c5e4346494f58091e1c&quot;&gt;&lt;span class=&quot;__cf_email__&quot; data-cfemail=&quot;295d4047424c5b695d4140474240474e44484a4140474c5a074840&quot;&gt;[email&amp;nbsp;protected]&lt;/span&gt;&lt;/a&gt; with “Featured Project” in the subject line.&lt;/p&gt;
        &lt;h2 id=&quot;suggested-research-projects&quot;&gt;Suggested research projects&lt;/h2&gt;
        &lt;p&gt;Here are some research directions that we would personally love to see explored and that Tinker can enable real progress on. We have &lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas&quot;&gt;created a repo&lt;/a&gt; with detailed motivation and guidelines for each; we’ll be adding more resources to it over time to help researchers get started. We expect most project ideas to surprise us, but this short list could serve as inspiration.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/replicate-cai-with-base-models.md&quot;&gt;Replicating Constitutional AI, starting from the base model.&lt;/a&gt;&lt;/strong&gt; Though RLAIF is widely used, it’s most often bootstrapped from existing instruction-tuned models. This makes it difficult to separate the impact of the constitution from the impact of the data-generating model that interprets it. A study of Constitutional AI with and without instruction-tuned models in the pipeline would shed light on the use of constitutions and RLAIF.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/noisy-student.md&quot;&gt;RLVR with Noisy student.&lt;/a&gt;&lt;/strong&gt; Noisy student self-distillation was a popular technique in an earlier era of machine learning for making use of large unlabeled datasets, but it hasn’t been adapted widely to LLMs. One possible adaptation is to start RLVR with a small labeled training set and a large unlabeled one, then have the student apply labels to the latter set after each RL run and iterate.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/on-policy-context-distillation.md&quot;&gt;On-Policy Context Distillation.&lt;/a&gt;&lt;/strong&gt; Context distillation trains a student model with empty context on a teacher model with long and detailed context. Prior work used off-policy distillation — fine-tuning on teacher samples. We have found that &lt;a href=&quot;https://thinkingmachines.ai/blog/on-policy-distillation/&quot;&gt;on-policy distillation&lt;/a&gt; is often much more effective; it would be useful to compare the two approaches for context distillation in particular.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/memorization-empirical-study.md&quot;&gt;RL memory test.&lt;/a&gt;&lt;/strong&gt; Our &lt;a href=&quot;https://thinkingmachines.ai/blog/lora/&quot;&gt;post on LoRA&lt;/a&gt; presented theoretical arguments on the rate of information acquisition by both SFT and RL. You can set up a toy environment where RL must learn a completely random number sequence, to compare the empirical learning rate under various reward functions to the theoretical estimate.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/direct-rl-on-pairwise-judge.md&quot;&gt;Direct RL on pairwise judge.&lt;/a&gt;&lt;/strong&gt; RLHF and RLAIF use datasets of pairwise preferences, which are used to train a reward model, which is then used in RL. As an alternative “direct” approach, we can do RL using a prompted model that does pairwise comparisons, without training the reward model. It would be interesting to do experiments comparing the direct and indirect approaches.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/replicate-open-character-training.md&quot;&gt;Replicate Open Character Training.&lt;/a&gt;&lt;/strong&gt; Replicate the recent paper on &lt;a href=&quot;https://arxiv.org/abs/2511.01689&quot;&gt;Open Character Training&lt;/a&gt; using Tinker.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/gan-joke-generation.md&quot;&gt;GAN for jokes.&lt;/a&gt;&lt;/strong&gt; In domains such as humor, it is easier to curate a human-vetted set of demonstrations than to train a reliable judge or reward model. Try implementing GAN-style training for a joke evaluator and joke generator that can craft a joke with a requested subject and keywords.&lt;/p&gt;
        &lt;h2 id=&quot;tips-for-high-quality-ml-experiments&quot;&gt;Tips for high-quality ML experiments&lt;/h2&gt;
        &lt;p&gt;In closing, we want to offer a few guidelines for running quality ML studies, the same guidelines we strive to adhere to internally when running experiments and documenting the results.&lt;/p&gt;
        &lt;p&gt;We encourage researchers to apply multiple analyses for examining each result. When creating datasets or environments, we recommend training a range of models and applying different evals. When developing novel methods, we suggest comparing to simpler baseline methods and sweeping hyperparameters that performance is sensitive to, particularly learning rate.&lt;/p&gt;
        &lt;p&gt;We’d love to see your reasoning in the write-up: assumptions you made, how your approach diverges from previous reports, and what motivated each change. We hope to see examples of the raw data and model rollouts, along with the summarized results. Finally, we appreciate crisp and detailed write-ups with clean and well-labeled charts and illustrations of the inner workings of the methods used.&lt;/p&gt;
        &lt;p&gt;We are excited to see what our community creates with Tinker, and hope that our featured projects will inspire your own work.&lt;/p&gt;
        &lt;/article&gt;
        &lt;div class=&quot;paginator&quot;&gt;
        &lt;a id=&quot;post-prev-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/tinker-research-and-teaching-grants/&quot; title=&quot;Tinker: Announcing Research and Teaching Grants&quot;&gt;prev&lt;/a&gt;
        &lt;a href=&quot;https://thinkingmachines.ai/news/call-for-community-projects/#&quot; id=&quot;back-to-top&quot; class=&quot;link&quot; title=&quot;Go to top&quot;&gt;back to top&lt;/a&gt;
        &lt;a id=&quot;post-next-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/tinker-general-availability/&quot; title=&quot;Tinker: General Availability and Vision Input&quot;&gt;next&lt;/a&gt;
        &lt;/div&gt;</description>
      <link>https://thinkingmachines.ai/news/call-for-community-projects/</link>
      <guid isPermaLink="false">https://thinkingmachines.ai/news/call-for-community-projects/</guid>
      <pubDate>Thu, 06 Nov 2025 16:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Tinker: Announcing Research and Teaching Grants</title>
      <description>&lt;div class=&quot;post-heading news-post&quot;&gt;
        &lt;h1 class=&quot;post-title&quot;&gt;Tinker: Announcing Research and Teaching Grants&lt;/h1&gt;
        &lt;div class=&quot;publish-metadata date-only&quot;&gt;
        &lt;span&gt;
        Oct 29, 2025
        &lt;/span&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;article class=&quot;content &quot;&gt;
        &lt;p&gt;We launched &lt;a href=&quot;https://thinkingmachines.ai/news/announcing-tinker/&quot;&gt;Tinker&lt;/a&gt; nearly one month ago. Since then, researchers across academia and non-profits have been using Tinker to train custom models and advance their research.&lt;/p&gt;
        &lt;p&gt;Today, we’re launching research and teaching grants for Tinker access. As part of our commitment to open and collaborative science, we want to make it as easy as possible for students and scholars to use Tinker. If your research or teaching involves training open-weight LLMs, we encourage you to apply.&lt;/p&gt;
        &lt;p&gt;We’re offering two types of grants to support your work:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;
        &lt;p&gt;&lt;strong&gt;Teaching Grants:&lt;/strong&gt; We provide $250 in free credits per student for academic classes using Tinker, whether you’re integrating it into an assignment or enabling students to use Tinker for self-directed projects, this is sized to support your entire class for the duration of the course.&lt;/p&gt;
        &lt;/li&gt;
        &lt;li&gt;
        &lt;p&gt;&lt;strong&gt;Research Grants:&lt;/strong&gt; We provide grants starting at $5,000 to support research projects and open-source software that uses Tinker.&lt;/p&gt;
        &lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;A selection of early grants we have awarded:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;
        &lt;p&gt;Diyi Yang’s &lt;a href=&quot;https://web.stanford.edu/class/cs329x/&quot;&gt;Stanford class on Human-Centered LLMs&lt;/a&gt; uses Tinker to compare different approaches for training personalized LLMs that capture unique writing styles and align with user habits.&lt;/p&gt;
        &lt;/li&gt;
        &lt;li&gt;
        &lt;p&gt;Aviral Kumar and Katerina Fragkiadaki’s &lt;a href=&quot;https://cmudeeprl.github.io/703website_f25/&quot;&gt;CMU class on Deep RL&lt;/a&gt; will use Tinker to enable class projects to experiment with state-of-the-art methods for training LLM and VLM based policies via RL.&lt;/p&gt;
        &lt;/li&gt;
        &lt;li&gt;
        &lt;p&gt;&lt;a href=&quot;https://chemistry.stanford.edu/people/grant-m-rotskoff&quot;&gt;Grant Rotskoff&lt;/a&gt;’s lab at Stanford is fine-tuning small-molecule chemistry models with Tinker to help solve problems in computational chemistry.&lt;/p&gt;
        &lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;&lt;strong&gt;Instructors&lt;/strong&gt;, please apply for teaching grants &lt;a href=&quot;https://form.typeform.com/to/JgPkuMvB&quot;&gt;&lt;strong&gt;here.&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;Researchers&lt;/strong&gt;, please apply for research grants &lt;a href=&quot;https://form.typeform.com/to/E9wVFZJJ&quot;&gt;&lt;strong&gt;here.&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
        &lt;p&gt;We’re assessing applications on a rolling basis and will aim to respond within a week of your application.&lt;/p&gt;
        &lt;/article&gt;
        &lt;div class=&quot;paginator&quot;&gt;
        &lt;a id=&quot;post-prev-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/announcing-tinker/&quot; title=&quot;Announcing Tinker&quot;&gt;prev&lt;/a&gt;
        &lt;a href=&quot;https://thinkingmachines.ai/news/tinker-research-and-teaching-grants/#&quot; id=&quot;back-to-top&quot; class=&quot;link&quot; title=&quot;Go to top&quot;&gt;back to top&lt;/a&gt;
        &lt;a id=&quot;post-next-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/call-for-community-projects/&quot; title=&quot;Tinker: Call for Community Projects&quot;&gt;next&lt;/a&gt;
        &lt;/div&gt;</description>
      <link>https://thinkingmachines.ai/news/tinker-research-and-teaching-grants/</link>
      <guid isPermaLink="false">https://thinkingmachines.ai/news/tinker-research-and-teaching-grants/</guid>
      <pubDate>Tue, 28 Oct 2025 16:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Announcing Tinker</title>
      <description>&lt;div class=&quot;post-heading news-post&quot;&gt;
        &lt;h1 class=&quot;post-title&quot;&gt;Announcing Tinker&lt;/h1&gt;
        &lt;div class=&quot;publish-metadata date-only&quot;&gt;
        &lt;span&gt;
        Oct 1, 2025
        &lt;/span&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;div class=&quot;post-cover&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/announcing-tinker/svgs/tinker-cover.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;/div&gt;
        &lt;article class=&quot;content &quot;&gt;
        &lt;p class=&quot;image-caption&quot; style=&quot;text-align: center; margin-top: -1rem; margin-bottom: 2rem; font-size: 0.9rem; color: var(--fg-muted, #666);&quot;&gt;
        &lt;a href=&quot;https://www.computerhistory.org/collections/catalog/X39.81/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;TinkerToy Computer&lt;/a&gt; invented by &lt;a href=&quot;https://en.wikipedia.org/wiki/Danny_Hillis&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Daniel Hillis&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/Brian_Silverman&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Brian Silverman&lt;/a&gt;
        &lt;/p&gt;
        &lt;p&gt;Today, we are launching &lt;a href=&quot;https://thinkingmachines.ai/tinker&quot;&gt;Tinker&lt;/a&gt;, a flexible API for fine-tuning language models. It empowers researchers and hackers to experiment with models by giving them control over the algorithms and data while we handle the complexity of distributed training. Tinker advances our mission of enabling more people to do research on cutting-edge models and customize them to their needs.&lt;/p&gt;
        &lt;p&gt;Tinker lets you fine-tune a range of large and small open-weight models, including large mixture-of-experts models such as Qwen-235B-A22B. Switching from a small model to a large one is as simple as changing a single string in your Python code.&lt;/p&gt;
        &lt;p&gt;Tinker is a managed service that runs on our internal clusters and training infrastructure. We handle scheduling, resource allocation, and failure recovery. This allows you to get small or large runs started immediately, without worrying about managing infrastructure. We use LoRA so that we can share the same pool of compute between multiple training runs, lowering costs.&lt;/p&gt;
        &lt;p&gt;Tinker’s API gives you low-level primitives like &lt;code&gt;forward_backward&lt;/code&gt; and &lt;code&gt;sample&lt;/code&gt;, which can be used to express most common post-training methods. Even so, achieving good results requires getting many details right. That’s why we’re releasing an open-source library, the &lt;a href=&quot;http://github.com/thinking-machines-lab/tinker-cookbook&quot;&gt;Tinker Cookbook&lt;/a&gt;, with modern implementations of post-training methods that run on top of the Tinker API.&lt;/p&gt;
        &lt;p&gt;Groups at Princeton, Stanford, Berkeley, and Redwood Research have already been using Tinker:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;The &lt;a href=&quot;https://blog.goedel-prover.com/&quot;&gt;Princeton Goedel Team&lt;/a&gt; trained mathematical theorem provers&lt;/li&gt;
        &lt;li&gt;The &lt;a href=&quot;https://statmech.stanford.edu/&quot;&gt;Rotskoff Chemistry group&lt;/a&gt; at Stanford fine-tuned a model to complete chemistry reasoning tasks&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://sky.cs.berkeley.edu/project/skyrl/&quot;&gt;Berkeley’s SkyRL group&lt;/a&gt; ran experiments on a custom async off-policy RL training loop with multi-agents and multi-turn tool-use.&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://www.redwoodresearch.org/&quot;&gt;Redwood Re

lib/routes/thinkingmachines/news.ts

github-actions · 2026-04-03T01:22:09Z

Successfully generated as following:

http://localhost:1200/thinkingmachines/news - Success ✔️

<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
  <channel>
    <title>Thinking Machines Lab - News</title>
    <link>https://thinkingmachines.ai/news/</link>
    <atom:link href="http://localhost:1200/thinkingmachines/news" rel="self" type="application/rss+xml"></atom:link>
    <description>Thinking Machines Lab - News - Powered by RSSHub</description>
    <generator>RSSHub</generator>
    <webMaster>contact@rsshub.app (RSSHub)</webMaster>
    <language>en</language>
    <lastBuildDate>Fri, 03 Apr 2026 01:22:08 GMT</lastBuildDate>
    <ttl>5</ttl>
    <item>
      <title>Training LLMs to Predict World Events (Guest Post with Mantic)</title>
      <description>&lt;div class=&quot;post-heading news-post&quot;&gt;
        &lt;h1 class=&quot;post-title&quot;&gt;Training LLMs to Predict World Events (Guest Post with Mantic)&lt;/h1&gt;
        &lt;div class=&quot;publish-metadata&quot;&gt;
        &lt;span class=&quot;author&quot;&gt;
        &lt;a href=&quot;https://enjeeneer.io/&quot; rel=&quot;me noopener&quot; target=&quot;_blank&quot;&gt;Scott Jeen&lt;/a&gt; and &lt;a href=&quot;https://www.linkedin.com/in/matthew-aitchison-16aa8799/&quot; rel=&quot;me noopener&quot; target=&quot;_blank&quot;&gt;Matthew Aitchison&lt;/a&gt; in collaboration with others at &lt;a href=&quot;https://www.mantic.com/&quot; rel=&quot;me noopener&quot; target=&quot;_blank&quot;&gt;Mantic&lt;/a&gt;
        &lt;/span&gt;
        &lt;span&gt;
        Mar 19, 2026
        &lt;/span&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;div class=&quot;post-cover&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/cover.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;/div&gt;
        &lt;article class=&quot;content &quot;&gt;
        &lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://www.mantic.com/&quot;&gt;Mantic&lt;/a&gt; have been using Tinker since it launched. This guest post is a technical deep dive on what they have built so far.&lt;/em&gt;&lt;/p&gt;
        &lt;p&gt;The top AI forecasting systems are approaching superforecaster-level accuracy on geopolitics and current affairs.&lt;label for=&quot;sn1&quot; class=&quot;sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn1&quot; class=&quot;margin-toggle&quot;&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;a href=&quot;https://news.polymarket.com/p/its-over&quot;&gt;It’s over&lt;/a&gt; (Polymarket, 2026)&lt;/span&gt; This is exciting because scalable, automated forecasting could significantly improve the quality of decision-making across the economy and in government.&lt;/p&gt;
        &lt;p&gt;To date, the most successful recipe in forecasting tournaments has been to combine an off-the-shelf LLM (like Gemini 3 or GPT-5) with forecasting-specific context-gathering. These models, to our knowledge, have not been explicitly trained for forecasting. Can we improve the recipe by replacing them with models fine-tuned specifically for forecasting?&lt;/p&gt;
        &lt;p&gt;We target “judgmental forecasting”: prediction problems that require human-like research and reasoning. Judgmental forecasting is needed for domains like geopolitics, politics, technology, business, and economic policy, where there often isn’t enough data for a standard statistical approach like time-series extrapolation. It was popularized by the book &lt;a href=&quot;https://books.google.com/books?hl=en&amp;amp;lr=&amp;amp;id=_lMPDAAAQBAJ&amp;amp;oi=fnd&amp;amp;pg=PA1&amp;amp;ots=5avwaBT55R&amp;amp;sig=B2i7FYispHz5PtB6KVu2Ndt58cE&quot;&gt;Superforecasting&lt;/a&gt;, and now prediction markets like Polymarket and Kalshi.&lt;/p&gt;
        &lt;p&gt;In this post, we show it’s possible to significantly improve the forecasting performance of gpt-oss-120b using reinforcement learning. With Tinker, we fine-tune a model on around 10,000 binary questions of the form &lt;em&gt;“Will [event] occur before [date]?”&lt;/em&gt;. We reward the model for putting greater probability on the correct real-world outcome.&lt;/p&gt;
        &lt;p&gt;In a head-to-head contest, the fine-tuned model achieves marginally superior performance to the frontier LLMs (see Figure 1), despite much lower initial performance. We find that providing forecast-specific context increases the gains from fine-tuning.&lt;/p&gt;
        &lt;p&gt;In the optimal &lt;em&gt;ensemble&lt;/em&gt; of different models (which outperforms any single model), Grok 4 and our fine-tuned model are the most important contributors. The fine-tuned model learns a forecasting policy that is as accurate as the frontier LLMs, yet decorrelated from them.&lt;/p&gt;
        &lt;p&gt;Together, the results demonstrate that on-task training can extend the state-of-the-art in AI forecasting.&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_1.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;RL fine-tuning makes gpt-oss-120b competitive with the frontier LLMs on questions from the Metaculus AI Benchmark Q2 2025. Naively predicting 50% on every question would get a score of 0, and perfect foresight would get a score of 100, per the construction of the Metaculus “baseline score”. Naively predicting 18.8% on every question (the rate at which the equivalent questions resolved “yes” in the previous tournament, Q1 2025) yields a score of 22.3, which we use to truncate the Y-axis. Fine-tuning improves gpt-oss-120b‘s score from 38.6 to 45.8, on par with the best general models.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;h2 id=&quot;the-best-existing-recipe-uses-off-the-shelf-llms&quot;&gt;The best existing recipe uses off-the-shelf LLMs&lt;/h2&gt;
        &lt;p&gt;The past two years have seen considerable progress in AI judgmental forecasting capabilities. In the Metaculus Cup, a major tournament for amateur and professional forecasters, the best AI systems now rival the top humans (Figure 2).&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_2.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Human and AI scores in the Metaculus Cup, a premier forecasting competition. Scores from the top 5 AI forecasters have been steadily improving since they first entered in the Summer of 2024. Mantic first entered in the Summer of 2025 and then in Fall 2025 beat the community prediction and the majority of professional forecasters. These results were without fine-tuning.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;The trend has been driven by more capable off-the-shelf LLMs, and accelerated by more sophisticated forecasting architectures. Our architecture – which has performed well in recent Metaculus tournaments – consists of two standard phases: (1) a research phase, and (2) a prediction phase (Figure 3).&lt;label for=&quot;sn3&quot; class=&quot;sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn3&quot; class=&quot;margin-toggle&quot;&gt;&lt;span class=&quot;sidenote&quot;&gt;This two-phase process appears in early work on AI forecasting. See: &lt;a href=&quot;https://proceedings.neurips.cc/paper_files/paper/2024/hash/5a5acfd0876c940d81619c1dc60e7748-Abstract-Conference.html&quot;&gt;Approaching human-level forecasting with language models&lt;/a&gt; (Halawi et al., 2024); &lt;a href=&quot;https://arxiv.org/abs/2206.15474&quot;&gt;Forecasting Future World Events with Neural Networks&lt;/a&gt; (Zou et al, 2022).&lt;/span&gt;&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_3.png&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Mantic’s architecture. The research phase takes the forecasting question as input and performs deep research to collect information relevant to the question which goes into the prompt for the prediction LLM. The prediction LLM outputs chain-of-thought reasoning and specifies a probability distribution using specialized tools.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;The research phase is conducted by deep research agents that collect the context needed to make a good prediction. For example, for the question “Will the United States attack Venezuela before 2026?”, search agents will find information about military buildup in the Caribbean, statements from President Trump, the health of the Venezuelan economy, and so on. The collected research is summarized into a prompt for the prediction phase.&lt;/p&gt;
        &lt;p&gt;The model’s task at the prediction phase is to use our specialized tools to output a probability distribution. In this post, we consider a canonical type of forecasting question: “Will [event] occur before [date]?”. We instruct the LLM to parameterize a mixture model for when the event will next occur – illustrated in Figure 4. The mixture model defines a cumulative distribution function, and from that we can read off the probability of the event occurring before the date specified in the original question.&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_4.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Illustrative mixture model. The LLM selects: the number of components in the mixture, their parameters, and their respective weights. The LLM is prompted to select components capturing different scenarios that could lead to the event occurring. The final prediction is a weighted combination of the components.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;In past tournaments, we’ve used off-the-shelf models as the prediction LLM. Existing literature has shown promising results from RL fine-tuning small models using a simple architecture.&lt;label for=&quot;sn4&quot; class=&quot;sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn4&quot; class=&quot;margin-toggle&quot;&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;a href=&quot;https://scholar.google.com/citations?view_op=view_citation&amp;amp;hl=en&amp;amp;user=oQuvQ0sAAAAJ&amp;amp;citation_for_view=oQuvQ0sAAAAJ:u5HHmVD_uO8C&quot;&gt;Outcome-based Reinforcement Learning to Predict the Future&lt;/a&gt; (Turtel et al, 2025)&lt;/span&gt; Can we improve frontier AI forecasting systems through on-task fine-tuning?&lt;/p&gt;
        &lt;h2 id=&quot;training-details&quot;&gt;Training details&lt;/h2&gt;
        &lt;h3 id=&quot;datasets&quot;&gt;Datasets&lt;/h3&gt;
        &lt;p&gt;We train the prediction LLM on ~10k questions about whether an event will happen by a given date. The questions are from August 2024 to December 2025 and the model’s knowledge cutoff is prior to that, so the resolution is known to us but not to the model. We generated the training set using an LLM pipeline similar to existing work.&lt;label for=&quot;sn5&quot; class=&quot;sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn5&quot; class=&quot;margin-toggle&quot;&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;a href=&quot;https://arxiv.org/abs/2601.22444&quot;&gt;Automating Forecasting Question Generation and Resolution for AI Evaluation&lt;/a&gt; (Bosse et al, 2026); &lt;a href=&quot;https://arxiv.org/abs/2601.06336&quot;&gt;Future-as-Label: Scalable Supervision from Real-World Outcomes&lt;/a&gt; (Turtel et al, 2026)&lt;/span&gt; Before training, we run the research phase for each question and store static prompts for the prediction LLM.&lt;/p&gt;
        &lt;p&gt;We test on unseen questions from the Q2 2025 Metaculus AI Benchmark Tournament.&lt;label for=&quot;sn6&quot; class=&quot;sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn6&quot; class=&quot;margin-toggle&quot;&gt;&lt;span class=&quot;sidenote&quot;&gt; The Fall 2025 iteration would have been a more obvious choice (for being more recent) but contains lower quality questions, indicated by less performance differentiated between strong and weak forecasters.&lt;/span&gt; We compare models whose knowledge cutoff is before this tournament’s start date. The full list of questions can be accessed on &lt;a href=&quot;https://gist.github.com/enjeeneer/86e24a52e6041a3d78e333bcab16984d&quot;&gt;Github&lt;/a&gt;.&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_5.png&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Three example binary event questions from the Metaculus Q2 2025 AI Benchmark.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;We evaluate using the baseline score, following the Metaculus platform. This is log scoring, i.e. ln(probability assigned to true outcome), rescaled such that 100 is the maximum possible score and 0 is the score for a uniform prediction (in our setting, 50%).&lt;/p&gt;
        &lt;h3 id=&quot;implementation&quot;&gt;Implementation&lt;/h3&gt;
        &lt;p&gt;We run the experiments on &lt;a href=&quot;https://thinkingmachines.ai/tinker/&quot;&gt;Tinker&lt;/a&gt;. Of the models available through the API, we choose to train &lt;a href=&quot;https://huggingface.co/openai/gpt-oss-120b&quot;&gt;gpt-oss-120b&lt;/a&gt; because of its strong initial performance — second only to &lt;a href=&quot;https://huggingface.co/moonshotai/Kimi-K2.5&quot;&gt;Kimi K2.5&lt;/a&gt; — while being cheaper and faster.&lt;/p&gt;
        &lt;p&gt;We use a standard policy gradient algorithm with &lt;a href=&quot;https://arxiv.org/abs/2402.03300&quot;&gt;GRPO-style advantage normalisation&lt;/a&gt; and &lt;a href=&quot;https://arxiv.org/abs/2503.20783&quot;&gt;no division by the standard deviation&lt;/a&gt;. For rewards we use the Brier score which is &lt;a href=&quot;https://sites.stat.washington.edu/raftery/Research/PDF/Gneiting2007jasa.pdf&quot;&gt;strictly proper&lt;/a&gt;. We found that the Brier score leads to more stable training than the log score, even though the log score is also strictly proper. This could be because the Brier score is bounded in [0, 1] and so produces lower-variance policy gradient estimates.&lt;/p&gt;
        &lt;p&gt;Open-source RL packages often use &lt;a href=&quot;https://arxiv.org/abs/2309.06180&quot;&gt;vLLM&lt;/a&gt; for the sampling backend and &lt;a href=&quot;https://arxiv.org/abs/2304.11277&quot;&gt;FSDP&lt;/a&gt; for the training backend. These can disagree on token probabilities produced by identical policies, which biases policy gradient estimates and destabilises training. We found these discrepancies to be lower on Tinker’s integrated infrastructure, but further mitigate them with &lt;a href=&quot;https://fengyao.notion.site/off-policy-rl&quot;&gt;an importance sampling correction on the advantages&lt;/a&gt;.&lt;/p&gt;
        &lt;p&gt;The Brier score reward function is strictly monotonic in the predicted probability for a fixed outcome, so different rollouts almost always produce different rewards. This makes within-group reward ties extremely unlikely and lets us train with a relatively small group size (8) without needing to break ties or induce variance. We use a batch size of 64, as we find that &lt;a href=&quot;https://thinkingmachines.ai/blog/lora/&quot;&gt;larger batch sizes tend to destabilise training&lt;/a&gt;.&lt;/p&gt;
        &lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;
        &lt;h3 id=&quot;fine-tuning-elevates-gpt-oss-120b-to-frontier-llm-performance&quot;&gt;Fine-tuning elevates gpt-oss-120b to frontier LLM performance&lt;/h3&gt;
        &lt;p&gt;The model’s test set score improves through training (Figure 6), moving from an initial score of 38.6 mean baseline points per question (below all frontier models) to a final score of 45.8 mean baseline points per question (marginally above). This demonstrates that on-task forecasting training can provide a large performance uplift.&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_6.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Test set baseline score of gpt-oss-120b with and without Mantic research and tools. In the model-only setup, test set performance improves, but never reaches the initial score of gpt-oss-120b with Mantic research and tools. With Mantic research and tools, gpt-oss-120b climbs 7 points through training and marginally exceeds the performance of Gemini 3 Pro. Training continued for further steps but performance no longer improved.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;The LLM fine-tuned without the benefit of our pre-generated research phase, and without the tools to construct mixture models, gains only 3 points from training instead of 7. This suggests these prerequisites positively influence the optimization dynamics, in addition to improving initialization.&lt;/p&gt;
        &lt;h3 id=&quot;the-fine-tuned-model-is-an-important-member-of-the-optimal-ensemble&quot;&gt;The fine-tuned model is an important member of the optimal ensemble&lt;/h3&gt;
        &lt;p&gt;In human forecasting, there is a well-known “wisdom of the crowd” effect: aggregate forecasts from multiple people often outperforms any one individual. This effect, in part, explains the &lt;a href=&quot;https://polymarket.com/accuracy&quot;&gt;impressive accuracy of prediction markets&lt;/a&gt;. Can we get the same benefit from ensembling the predictions of different LLMs?&lt;label for=&quot;sn7&quot; class=&quot;sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn1&quot; class=&quot;margin-toggle&quot;&gt;&lt;span class=&quot;sidenote&quot;&gt;See also &lt;a href=&quot;https://arxiv.org/abs/2402.19379&quot;&gt;Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy&lt;/a&gt; (Schoenegger et al, 2024)&lt;/span&gt;&lt;/p&gt;
        &lt;p&gt;To contribute to an ensemble, an LLM must be sufficiently accurate on its own and also decorrelated from other LLMs in the ensemble. Predictions from most frontier LLMs, while accurate, contribute little diversity to the top-performing model (in our case, Gemini 3 Pro) – Figure 7. Among the frontier LLMs, Grok 4 is the exception: its predictions score well whilst correlating less with other frontier LLMs.&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_7.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Mean baseline score on binary event questions from the Metaculus Q2 AIB Benchmark with respect to their Jensson Shannon divergence from Gemini 3 Pro for a suite of closed source and open source LLMs. Marker colour indicates each model’s weight in the optimal 5-sample ensemble. The optimal ensemble consists of fine-tuned gpt-oss-120b (40%), gemini 3 pro (20%), gpt-5 (20%) and grok 4 (20%).&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;The optimal ensemble, on a budget of 5 samples, is 2 samples from our fine-tuned gpt-oss-120b plus 1 sample from each of Gemini 3 Pro, Grok 4, and GPT-5. We can test each model’s contribution by removing it, recomputing the optimal ensemble, and seeing how much the score degrades. We find that Grok 4 is the least replaceable, with fine-tuned gpt-oss-120b in second place. Other models can be replaced with little to no performance degradation (Figure 8).&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_8.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;When selecting an ensemble of frontier and open-source models, Grok 4 and fine-tuned gpt-oss-120b are the least replaceable. Model replaceability is defined as the reduction in score incurred when removing a model from the optimal ensemble. By definition, if a model is not included in the optimal ensemble,there is no cost to removing it.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;GPT-5 and Gemini 3 Pro make similar predictions, and thus don’t benefit much from ensembling with each other. Both models improve from mixing their predictions with either Grok 4 or the fine-tuned gpt-oss-120b, and Grok 4 also benefits most from mixing with the fine-tuned model. In either optimal 3-way ensemble from Figure 9, the fine-tuned model gets about half of the total weight.&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_9.png&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Three-way ensembles. Left: The optimal three-way ensemble between the fine-tuned gpt-oss-120b, Gemini 3 Pro and GPT-5 is weighted 56%, 26% and 18% respectively, depicted by the black star. Right: The optimal three-way ensemble between the fine-tuned gpt-oss-120b, Gemini 3 Pro and Grok 4 is weighted 48, 26% and 26% respectively, depicted by the white star.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;h2 id=&quot;conclusions-and-next-steps&quot;&gt;Conclusions and next steps&lt;/h2&gt;
        &lt;p&gt;We have shown that we can elevate the forecasting performance of gpt-oss-120b to match frontier LLMs with RL fine-tuning. This work can be extended in many ways, some of which we have already begun exploring:&lt;/p&gt;
        &lt;ol&gt;
        &lt;li&gt;&lt;strong&gt;Training larger models.&lt;/strong&gt; Tinker enables training larger models with higher initial performance than gpt-oss, specifically Kimi K2.5.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Training on all question formats.&lt;/strong&gt; We are already training models on numerical questions such as economic indicators and multiple choice questions such as election results.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Improved question sets.&lt;/strong&gt; As the models become stronger forecasters, we need more challenging forecasting questions.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Information retrieval inside the loop.&lt;/strong&gt; We could give the prediction LLM tools for information retrieval and include this in the training loop.&lt;/li&gt;
        &lt;/ol&gt;
        &lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;
        &lt;p&gt;Please cite this work as:&lt;/p&gt;
        &lt;pre tabindex=&quot;0&quot;&gt;&lt;code&gt;Jeen, Scott; Aitchison, Matthew; and Mantic, &quot;Training LLMs to Predict World Events&quot;,
        Thinking Machines Lab: News, Mar 2026.
        &lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Or use the BibTeX citation:&lt;/p&gt;
        &lt;pre tabindex=&quot;0&quot;&gt;&lt;code&gt;@article{scott2026forecasting,
        author = {Scott Jeen, Matthew Aitchison, and Mantic},
        title = {Training LLMs to Predict World Events},
        journal = {Thinking Machines Lab: News},
        year = {2026},
        note = {https://thinkingmachines.ai/news/training-llms-to-predict-world-events/}
        }
        &lt;/code&gt;&lt;/pre&gt;
        &lt;/article&gt;
        &lt;div class=&quot;paginator&quot;&gt;
        &lt;a id=&quot;post-prev-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/nvidia-partnership/&quot; title=&quot;Thinking Machines Lab and NVIDIA Announce Long-Term Gigawatt-Scale Strategic Partnership&quot;&gt;prev&lt;/a&gt;
        &lt;a href=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/#&quot; id=&quot;back-to-top&quot; class=&quot;link&quot; title=&quot;Go to top&quot;&gt;back to top&lt;/a&gt;
        &lt;a id=&quot;post-next-link&quot;&gt;&lt;/a&gt;
        &lt;/div&gt;</description>
      <link>https://thinkingmachines.ai/news/training-llms-to-predict-world-events/</link>
      <guid isPermaLink="false">https://thinkingmachines.ai/news/training-llms-to-predict-world-events/</guid>
      <pubDate>Wed, 18 Mar 2026 16:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Thinking Machines Lab and NVIDIA Announce Long-Term Gigawatt-Scale Strategic Partnership</title>
      <description>&lt;div class=&quot;post-heading news-post&quot;&gt;
        &lt;h1 class=&quot;post-title&quot;&gt;Thinking Machines Lab and NVIDIA Announce Long-Term Gigawatt-Scale Strategic Partnership&lt;/h1&gt;
        &lt;div class=&quot;publish-metadata date-only&quot;&gt;
        &lt;span&gt;
        Mar 10, 2026
        &lt;/span&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;div class=&quot;post-cover compact&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/nvidia-partnership/images/jensen-huang-mira-murati-nvidia-debs-3127-2.jpg&quot; alt=&quot;Cover image for Thinking Machines Lab and NVIDIA Announce Long-Term Gigawatt-Scale Strategic Partnership&quot; class=&quot;cover-image&quot; fetchpriority=&quot;high&quot; decoding=&quot;async&quot; loading=&quot;eager&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;p class=&quot;post-cover-caption&quot;&gt;Jensen Huang (NVIDIA) and Mira Murati (Thinking Machines)&lt;/p&gt;
        &lt;/div&gt;
        &lt;article class=&quot;content &quot;&gt;
        &lt;p&gt;Thinking Machines Lab and NVIDIA announced today a multi-year strategic partnership to deploy at least one gigawatt of next-generation NVIDIA Vera Rubin systems to support Thinking Machines’ frontier model training and platforms delivering customizable AI at scale. Deployment on NVIDIA’s Vera Rubin platform is targeted for early next year.&lt;/p&gt;
        &lt;p&gt;The partnership also includes an effort to design training and serving systems for NVIDIA architectures and broaden access to frontier AI and open models for enterprises, research institutions, and the scientific community.&lt;/p&gt;
        &lt;p&gt;NVIDIA has also made a significant investment in Thinking Machines Lab to support the company’s long-term growth.&lt;/p&gt;
        &lt;p&gt;“AI is the most powerful knowledge discovery instrument in human history,” said Jensen Huang, founder and CEO of NVIDIA. “Thinking Machines has brought together a world-class team to advance the frontier of AI. We are thrilled to partner with Thinking Machines to realize their exciting vision for the future of AI.”&lt;/p&gt;
        &lt;p&gt;“NVIDIA’s technology is the foundation on which the entire field is built,” said Mira Murati, cofounder and CEO of Thinking Machines. “This partnership accelerates our capacity to build AI that people can shape and make their own, as it shapes human potential in turn.”&lt;/p&gt;
        &lt;p&gt;Building powerful AI systems that are understandable, customizable, and collaborative demands advances in research, design, and infrastructure at scale. This partnership provides that foundation, with the shared aim of ensuring that the most transformative technology of our time expands human capability.&lt;/p&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/nvidia-partnership/cover.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;/article&gt;
        &lt;div class=&quot;paginator&quot;&gt;
        &lt;a id=&quot;post-prev-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/tinker-general-availability/&quot; title=&quot;Tinker: General Availability and Vision Input&quot;&gt;prev&lt;/a&gt;
        &lt;a href=&quot;https://thinkingmachines.ai/news/nvidia-partnership/#&quot; id=&quot;back-to-top&quot; class=&quot;link&quot; title=&quot;Go to top&quot;&gt;back to top&lt;/a&gt;
        &lt;a id=&quot;post-next-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/&quot; title=&quot;Training LLMs to Predict World Events (Guest Post with Mantic)&quot;&gt;next&lt;/a&gt;
        &lt;/div&gt;</description>
      <link>https://thinkingmachines.ai/news/nvidia-partnership/</link>
      <guid isPermaLink="false">https://thinkingmachines.ai/news/nvidia-partnership/</guid>
      <pubDate>Mon, 09 Mar 2026 16:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Tinker: General Availability and Vision Input</title>
      <description>&lt;div class=&quot;post-heading news-post&quot;&gt;
        &lt;h1 class=&quot;post-title&quot;&gt;Tinker: General Availability and Vision Input&lt;/h1&gt;
        &lt;div class=&quot;publish-metadata date-only&quot;&gt;
        &lt;span&gt;
        Dec 12, 2025
        &lt;/span&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;article class=&quot;content &quot;&gt;
        &lt;p&gt;Today we are announcing four updates to Tinker:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;No more waitlist&lt;/li&gt;
        &lt;li&gt;New reasoning model: Kimi K2 Thinking&lt;/li&gt;
        &lt;li&gt;New inference interface that is compatible with the OpenAI API&lt;/li&gt;
        &lt;li&gt;Vision input support with Qwen3-VL&lt;/li&gt;
        &lt;/ul&gt;
        &lt;h2 id=&quot;general-availability&quot;&gt;General availability&lt;/h2&gt;
        &lt;p&gt;The waitlist is over! Everybody can use Tinker now; &lt;a href=&quot;https://auth.thinkingmachines.ai/sign-up&quot;&gt;sign up here&lt;/a&gt; to get started. See the &lt;a href=&quot;https://thinkingmachines.ai/tinker/&quot;&gt;Tinker homepage&lt;/a&gt; for available models and pricing, and check out the &lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-cookbook&quot;&gt;Tinker cookbook&lt;/a&gt; for code examples.&lt;/p&gt;
        &lt;h2 id=&quot;more-reasoning-with-kimi-k2-thinking&quot;&gt;More reasoning with Kimi K2 Thinking&lt;/h2&gt;
        &lt;p&gt;Users can now fine-tune Kimi K2 Thinking on Tinker. With a trillion parameters, Kimi K2 is the largest model in our lineup so far. It is built for long chains of reasoning and tool use.&lt;/p&gt;
        &lt;h2 id=&quot;openai-api-compatible-sampling&quot;&gt;OpenAI API-compatible sampling&lt;/h2&gt;
        &lt;p&gt;Tinker has a standard function for inference:&lt;/p&gt;
        &lt;div class=&quot;highlight&quot;&gt;&lt;pre tabindex=&quot;0&quot; class=&quot;chroma&quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;n&quot;&gt;prompt&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;types&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ModelInput&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_ints&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;The capital of France is&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,))&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;types&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SamplingParams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_tokens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;temperature&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stop&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;n&quot;&gt;future&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sampling_client&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prompt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prompt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sampling_params&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;With this release, we have added OpenAI API-compatible scaffolding for quickly sampling from a model by specifying a path, even while it’s still training. This also means Tinker can now plug-and-play with any OpenAI API-compatible platform. See more information in our &lt;a href=&quot;https://tinker-docs.thinkingmachines.ai/compatible-apis/openai&quot;&gt;Tinker documentation&lt;/a&gt;.&lt;/p&gt;
        &lt;div class=&quot;highlight&quot;&gt;&lt;pre tabindex=&quot;0&quot; class=&quot;chroma&quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;n&quot;&gt;response&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;openai_client&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;completions&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;create&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;tinker://0034d8c9-0a88-52a9-b2b7-bce7cb1e6fef:train:0/sampler_weights/000080&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;prompt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;The capital of France is&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;max_tokens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;temperature&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;stop&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&quot;vision-input-with-qwen3-vl&quot;&gt;Vision input with Qwen3-VL&lt;/h2&gt;
        &lt;p&gt;We’ve added two vision models to Tinker: Qwen3-VL-30B-A3B-Instruct and Qwen3-VL-235B-A22B-Instruct. With these, users can process pictures, screenshots, and diagrams for a variety of applications.&lt;/p&gt;
        &lt;p&gt;To input images, just interleave together an &lt;a href=&quot;https://tinker-docs.thinkingmachines.ai/api-reference/types#imagechunk-objects&quot;&gt;ImageChunk&lt;/a&gt; – consisting of your image, saved as bytes – with text chunks. For example:&lt;/p&gt;
        &lt;div class=&quot;highlight&quot;&gt;&lt;pre tabindex=&quot;0&quot; class=&quot;chroma&quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;n&quot;&gt;model_input&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tinker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ModelInput&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;chunks&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;tinker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;types&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ImageChunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;image_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;png&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;tinker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;types&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;EncodedTextChunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;What is this?&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;These vision inputs can be used in a variety of applications out-of-the-box, including SFT and RL finetuning.&lt;/p&gt;
        &lt;p&gt;To demonstrate vision understanding in action, we are sharing &lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-cookbook/tree/main/tinker_cookbook/recipes/vlm_classifier&quot;&gt;a new cookbook recipe for fine-tuning VLMs as image classifiers&lt;/a&gt;. Qwen3-VL-235B-A22B-Instruct obtains reasonable accuracy even given just one example per class; performance improves with more labeled data.&lt;/p&gt;
        &lt;h2 id=&quot;training-image-classifiers-with-tinker&quot;&gt;Training image classifiers with Tinker&lt;/h2&gt;
        &lt;p&gt;To showcase Tinker’s new vision capabilities, we finetuned &lt;a href=&quot;https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct&quot;&gt;Qwen3-VL-235B-A22B-Instruct&lt;/a&gt; to classify images on four classic datasets:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;https://data.caltech.edu/records/mzrjq-6wc02&quot;&gt;Caltech 101&lt;/a&gt;, a dataset of 101 general object categories.&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://www.kaggle.com/datasets/eduardo4jesus/stanford-cars-dataset&quot;&gt;Stanford Cars&lt;/a&gt;, a dataset of car makes, models, and years.&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://www.robots.ox.ac.uk/~vgg/data/flowers/102/&quot;&gt;Oxford Flowers&lt;/a&gt;, a dataset of flower species.&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://www.robots.ox.ac.uk/~vgg/data/pets/&quot;&gt;Oxford Pets&lt;/a&gt;, a dataset of pet breeds.&lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;Since Qwen3-VL is a language model, we frame classification as text generation: given an image, the model outputs the class name. We compare this approach against a traditional vision baseline of finetuning a vision-only model — DINOv2-base. &lt;a href=&quot;https://arxiv.org/pdf/2304.07193&quot;&gt;DINOv2&lt;/a&gt; is a self-supervised vision transformer that was trained to encode images, and is commonly used as a backbone for pure computer vision tasks. For DINOv2, we add a classification head that predicts a distribution over all N classes. Both models are fine-tuned with LoRA.&lt;/p&gt;
        &lt;p&gt;Labeled image data is scarce for many real-world use cases, so data efficiency is the primary measure we look at. We show the classification accuracy when sweeping across the number of labeled examples per class, starting with just a single one.&lt;/p&gt;
        &lt;figure id=&quot;fig:qwen-v-dino&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/tinker-general-availability/images/vlm-graphs.png&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Comparison of fine-tuned Qwen3-VL-235-A22B and DINOv2 performance on simple image classification tasks.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;In the limited-data regime, Qwen3-VL-235-A22B outperforms DINOv2. Not only is it a bigger model, but as a VLM, it also comes with language knowledge out-of-the-box (i.e. what a “golden retriever” or “sunflower” is). This general language-and-vision capability of Qwen3-VL makes it readily available for vision tasks beyond classification.&lt;/p&gt;
        &lt;h2 id=&quot;happy-holidays&quot;&gt;Happy Holidays&lt;/h2&gt;
        &lt;p&gt;Tinker exists to enable builders and researchers to train and customize state-of-the-art models. As always, we look forward to seeing what you build with Tinker. Happy holidays!&lt;/p&gt;
        &lt;/article&gt;
        &lt;div class=&quot;paginator&quot;&gt;
        &lt;a id=&quot;post-prev-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/call-for-community-projects/&quot; title=&quot;Tinker: Call for Community Projects&quot;&gt;prev&lt;/a&gt;
        &lt;a href=&quot;https://thinkingmachines.ai/news/tinker-general-availability/#&quot; id=&quot;back-to-top&quot; class=&quot;link&quot; title=&quot;Go to top&quot;&gt;back to top&lt;/a&gt;
        &lt;a id=&quot;post-next-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/nvidia-partnership/&quot; title=&quot;Thinking Machines Lab and NVIDIA Announce Long-Term Gigawatt-Scale Strategic Partnership&quot;&gt;next&lt;/a&gt;
        &lt;/div&gt;</description>
      <link>https://thinkingmachines.ai/news/tinker-general-availability/</link>
      <guid isPermaLink="false">https://thinkingmachines.ai/news/tinker-general-availability/</guid>
      <pubDate>Thu, 11 Dec 2025 16:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Tinker: Call for Community Projects</title>
      <description>&lt;div class=&quot;post-heading news-post&quot;&gt;
        &lt;h1 class=&quot;post-title&quot;&gt;Tinker: Call for Community Projects&lt;/h1&gt;
        &lt;div class=&quot;publish-metadata date-only&quot;&gt;
        &lt;span&gt;
        Nov 7, 2025
        &lt;/span&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;article class=&quot;content &quot;&gt;
        &lt;p&gt;We launched &lt;a href=&quot;https://thinkingmachines.ai/tinker/&quot;&gt;Tinker&lt;/a&gt; to enable builders and researchers to train models their own way, whether they’re conducting studies or customizing models for new applications. We plan to publish regular roundups of the coolest projects from the Tinker community, and &lt;strong&gt;we invite you to submit what you’ve been Tinkering on to be featured on our blog&lt;/strong&gt;.&lt;/p&gt;
        &lt;p&gt;Below are some broad suggestions for what we hope to see from the Tinker featured projects, and some specific research directions we would particularly love to see pursued.&lt;/p&gt;
        &lt;h2 id=&quot;guidelines-for-tinker-featured-projects&quot;&gt;Guidelines for Tinker Featured Projects&lt;/h2&gt;
        &lt;p&gt;We’re interested in featuring ML research projects, AI-enabled research in other domains, custom models, and other contributions. Some examples:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;A reimplementation of a research project or tech report using Tinker, such as papers that compare algorithmic recipes or datasets.&lt;/li&gt;
        &lt;li&gt;Original research in machine learning, such as exploring new approaches to training or optimization or applying novel benchmarks and evaluations.&lt;/li&gt;
        &lt;li&gt;Research in a non-AI field that uses fine-tuned models, such as the work on mathematical theorem provers and chemistry models we &lt;a href=&quot;https://thinkingmachines.ai/news/announcing-tinker/#:~:text=Groups%20at%20Princeton%2C%20Stanford&quot;&gt;highlighted previously&lt;/a&gt;.&lt;/li&gt;
        &lt;li&gt;Product prototypes built with Tinker, demoing a model that does something fresh or delightful.&lt;/li&gt;
        &lt;li&gt;Novel datasets and task environments for training models.&lt;/li&gt;
        &lt;li&gt;High-level libraries built on top of Tinker that enable less experienced practitioners to perform fine-tuning effectively.&lt;/li&gt;
        &lt;li&gt;Infrastructure contributions, such as a clean self-hosted implementation of the Tinker training API.&lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;Your submission should include a write-up and, preferably, an open-source release of your code. We encourage you to focus on rigor and clear evaluation in your write-ups: crisp charts, raw output examples, clear comparisons to alternative approaches or models on relevant benchmarks and metrics. Tinkering is experimenting — we want to feature diligent work and transparent results over novelty or hype.&lt;/p&gt;
        &lt;p&gt;Please send your projects and any related questions to &lt;a href=&quot;https://thinkingmachines.ai/cdn-cgi/l/email-protection#0f7b6661646a7d4f7b67666164666168626e6c6766616a7c216e66307c7a6d656a6c7b32496a6e7b7a7d6a6b2a3d3f5f7d60656a6c7b2a3d3f&quot;&gt;&lt;span class=&quot;__cf_email__&quot; data-cfemail=&quot;4f3b2621242a3d0f3b27262124262128222e2c2726212a3c612e26&quot;&gt;[email&amp;nbsp;protected]&lt;/span&gt;&lt;/a&gt; with “Featured Project” in the subject line.&lt;/p&gt;
        &lt;h2 id=&quot;suggested-research-projects&quot;&gt;Suggested research projects&lt;/h2&gt;
        &lt;p&gt;Here are some research directions that we would personally love to see explored and that Tinker can enable real progress on. We have &lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas&quot;&gt;created a repo&lt;/a&gt; with detailed motivation and guidelines for each; we’ll be adding more resources to it over time to help researchers get started. We expect most project ideas to surprise us, but this short list could serve as inspiration.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/replicate-cai-with-base-models.md&quot;&gt;Replicating Constitutional AI, starting from the base model.&lt;/a&gt;&lt;/strong&gt; Though RLAIF is widely used, it’s most often bootstrapped from existing instruction-tuned models. This makes it difficult to separate the impact of the constitution from the impact of the data-generating model that interprets it. A study of Constitutional AI with and without instruction-tuned models in the pipeline would shed light on the use of constitutions and RLAIF.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/noisy-student.md&quot;&gt;RLVR with Noisy student.&lt;/a&gt;&lt;/strong&gt; Noisy student self-distillation was a popular technique in an earlier era of machine learning for making use of large unlabeled datasets, but it hasn’t been adapted widely to LLMs. One possible adaptation is to start RLVR with a small labeled training set and a large unlabeled one, then have the student apply labels to the latter set after each RL run and iterate.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/on-policy-context-distillation.md&quot;&gt;On-Policy Context Distillation.&lt;/a&gt;&lt;/strong&gt; Context distillation trains a student model with empty context on a teacher model with long and detailed context. Prior work used off-policy distillation — fine-tuning on teacher samples. We have found that &lt;a href=&quot;https://thinkingmachines.ai/blog/on-policy-distillation/&quot;&gt;on-policy distillation&lt;/a&gt; is often much more effective; it would be useful to compare the two approaches for context distillation in particular.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/memorization-empirical-study.md&quot;&gt;RL memory test.&lt;/a&gt;&lt;/strong&gt; Our &lt;a href=&quot;https://thinkingmachines.ai/blog/lora/&quot;&gt;post on LoRA&lt;/a&gt; presented theoretical arguments on the rate of information acquisition by both SFT and RL. You can set up a toy environment where RL must learn a completely random number sequence, to compare the empirical learning rate under various reward functions to the theoretical estimate.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/direct-rl-on-pairwise-judge.md&quot;&gt;Direct RL on pairwise judge.&lt;/a&gt;&lt;/strong&gt; RLHF and RLAIF use datasets of pairwise preferences, which are used to train a reward model, which is then used in RL. As an alternative “direct” approach, we can do RL using a prompted model that does pairwise comparisons, without training the reward model. It would be interesting to do experiments comparing the direct and indirect approaches.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/replicate-open-character-training.md&quot;&gt;Replicate Open Character Training.&lt;/a&gt;&lt;/strong&gt; Replicate the recent paper on &lt;a href=&quot;https://arxiv.org/abs/2511.01689&quot;&gt;Open Character Training&lt;/a&gt; using Tinker.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/gan-joke-generation.md&quot;&gt;GAN for jokes.&lt;/a&gt;&lt;/strong&gt; In domains such as humor, it is easier to curate a human-vetted set of demonstrations than to train a reliable judge or reward model. Try implementing GAN-style training for a joke evaluator and joke generator that can craft a joke with a requested subject and keywords.&lt;/p&gt;
        &lt;h2 id=&quot;tips-for-high-quality-ml-experiments&quot;&gt;Tips for high-quality ML experiments&lt;/h2&gt;
        &lt;p&gt;In closing, we want to offer a few guidelines for running quality ML studies, the same guidelines we strive to adhere to internally when running experiments and documenting the results.&lt;/p&gt;
        &lt;p&gt;We encourage researchers to apply multiple analyses for examining each result. When creating datasets or environments, we recommend training a range of models and applying different evals. When developing novel methods, we suggest comparing to simpler baseline methods and sweeping hyperparameters that performance is sensitive to, particularly learning rate.&lt;/p&gt;
        &lt;p&gt;We’d love to see your reasoning in the write-up: assumptions you made, how your approach diverges from previous reports, and what motivated each change. We hope to see examples of the raw data and model rollouts, along with the summarized results. Finally, we appreciate crisp and detailed write-ups with clean and well-labeled charts and illustrations of the inner workings of the methods used.&lt;/p&gt;
        &lt;p&gt;We are excited to see what our community creates with Tinker, and hope that our featured projects will inspire your own work.&lt;/p&gt;
        &lt;/article&gt;
        &lt;div class=&quot;paginator&quot;&gt;
        &lt;a id=&quot;post-prev-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/tinker-research-and-teaching-grants/&quot; title=&quot;Tinker: Announcing Research and Teaching Grants&quot;&gt;prev&lt;/a&gt;
        &lt;a href=&quot;https://thinkingmachines.ai/news/call-for-community-projects/#&quot; id=&quot;back-to-top&quot; class=&quot;link&quot; title=&quot;Go to top&quot;&gt;back to top&lt;/a&gt;
        &lt;a id=&quot;post-next-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/tinker-general-availability/&quot; title=&quot;Tinker: General Availability and Vision Input&quot;&gt;next&lt;/a&gt;
        &lt;/div&gt;</description>
      <link>https://thinkingmachines.ai/news/call-for-community-projects/</link>
      <guid isPermaLink="false">https://thinkingmachines.ai/news/call-for-community-projects/</guid>
      <pubDate>Thu, 06 Nov 2025 16:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Tinker: Announcing Research and Teaching Grants</title>
      <description>&lt;div class=&quot;post-heading news-post&quot;&gt;
        &lt;h1 class=&quot;post-title&quot;&gt;Tinker: Announcing Research and Teaching Grants&lt;/h1&gt;
        &lt;div class=&quot;publish-metadata date-only&quot;&gt;
        &lt;span&gt;
        Oct 29, 2025
        &lt;/span&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;article class=&quot;content &quot;&gt;
        &lt;p&gt;We launched &lt;a href=&quot;https://thinkingmachines.ai/news/announcing-tinker/&quot;&gt;Tinker&lt;/a&gt; nearly one month ago. Since then, researchers across academia and non-profits have been using Tinker to train custom models and advance their research.&lt;/p&gt;
        &lt;p&gt;Today, we’re launching research and teaching grants for Tinker access. As part of our commitment to open and collaborative science, we want to make it as easy as possible for students and scholars to use Tinker. If your research or teaching involves training open-weight LLMs, we encourage you to apply.&lt;/p&gt;
        &lt;p&gt;We’re offering two types of grants to support your work:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;
        &lt;p&gt;&lt;strong&gt;Teaching Grants:&lt;/strong&gt; We provide $250 in free credits per student for academic classes using Tinker, whether you’re integrating it into an assignment or enabling students to use Tinker for self-directed projects, this is sized to support your entire class for the duration of the course.&lt;/p&gt;
        &lt;/li&gt;
        &lt;li&gt;
        &lt;p&gt;&lt;strong&gt;Research Grants:&lt;/strong&gt; We provide grants starting at $5,000 to support research projects and open-source software that uses Tinker.&lt;/p&gt;
        &lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;A selection of early grants we have awarded:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;
        &lt;p&gt;Diyi Yang’s &lt;a href=&quot;https://web.stanford.edu/class/cs329x/&quot;&gt;Stanford class on Human-Centered LLMs&lt;/a&gt; uses Tinker to compare different approaches for training personalized LLMs that capture unique writing styles and align with user habits.&lt;/p&gt;
        &lt;/li&gt;
        &lt;li&gt;
        &lt;p&gt;Aviral Kumar and Katerina Fragkiadaki’s &lt;a href=&quot;https://cmudeeprl.github.io/703website_f25/&quot;&gt;CMU class on Deep RL&lt;/a&gt; will use Tinker to enable class projects to experiment with state-of-the-art methods for training LLM and VLM based policies via RL.&lt;/p&gt;
        &lt;/li&gt;
        &lt;li&gt;
        &lt;p&gt;&lt;a href=&quot;https://chemistry.stanford.edu/people/grant-m-rotskoff&quot;&gt;Grant Rotskoff&lt;/a&gt;’s lab at Stanford is fine-tuning small-molecule chemistry models with Tinker to help solve problems in computational chemistry.&lt;/p&gt;
        &lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;&lt;strong&gt;Instructors&lt;/strong&gt;, please apply for teaching grants &lt;a href=&quot;https://form.typeform.com/to/JgPkuMvB&quot;&gt;&lt;strong&gt;here.&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;Researchers&lt;/strong&gt;, please apply for research grants &lt;a href=&quot;https://form.typeform.com/to/E9wVFZJJ&quot;&gt;&lt;strong&gt;here.&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
        &lt;p&gt;We’re assessing applications on a rolling basis and will aim to respond within a week of your application.&lt;/p&gt;
        &lt;/article&gt;
        &lt;div class=&quot;paginator&quot;&gt;
        &lt;a id=&quot;post-prev-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/announcing-tinker/&quot; title=&quot;Announcing Tinker&quot;&gt;prev&lt;/a&gt;
        &lt;a href=&quot;https://thinkingmachines.ai/news/tinker-research-and-teaching-grants/#&quot; id=&quot;back-to-top&quot; class=&quot;link&quot; title=&quot;Go to top&quot;&gt;back to top&lt;/a&gt;
        &lt;a id=&quot;post-next-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/call-for-community-projects/&quot; title=&quot;Tinker: Call for Community Projects&quot;&gt;next&lt;/a&gt;
        &lt;/div&gt;</description>
      <link>https://thinkingmachines.ai/news/tinker-research-and-teaching-grants/</link>
      <guid isPermaLink="false">https://thinkingmachines.ai/news/tinker-research-and-teaching-grants/</guid>
      <pubDate>Tue, 28 Oct 2025 16:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Announcing Tinker</title>
      <description>&lt;div class=&quot;post-heading news-post&quot;&gt;
        &lt;h1 class=&quot;post-title&quot;&gt;Announcing Tinker&lt;/h1&gt;
        &lt;div class=&quot;publish-metadata date-only&quot;&gt;
        &lt;span&gt;
        Oct 1, 2025
        &lt;/span&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;div class=&quot;post-cover&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/announcing-tinker/svgs/tinker-cover.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;/div&gt;
        &lt;article class=&quot;content &quot;&gt;
        &lt;p class=&quot;image-caption&quot; style=&quot;text-align: center; margin-top: -1rem; margin-bottom: 2rem; font-size: 0.9rem; color: var(--fg-muted, #666);&quot;&gt;
        &lt;a href=&quot;https://www.computerhistory.org/collections/catalog/X39.81/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;TinkerToy Computer&lt;/a&gt; invented by &lt;a href=&quot;https://en.wikipedia.org/wiki/Danny_Hillis&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Daniel Hillis&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/Brian_Silverman&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Brian Silverman&lt;/a&gt;
        &lt;/p&gt;
        &lt;p&gt;Today, we are launching &lt;a href=&quot;https://thinkingmachines.ai/tinker&quot;&gt;Tinker&lt;/a&gt;, a flexible API for fine-tuning language models. It empowers researchers and hackers to experiment with models by giving them control over the algorithms and data while we handle the complexity of distributed training. Tinker advances our mission of enabling more people to do research on cutting-edge models and customize them to their needs.&lt;/p&gt;
        &lt;p&gt;Tinker lets you fine-tune a range of large and small open-weight models, including large mixture-of-experts models such as Qwen-235B-A22B. Switching from a small model to a large one is as simple as changing a single string in your Python code.&lt;/p&gt;
        &lt;p&gt;Tinker is a managed service that runs on our internal clusters and training infrastructure. We handle scheduling, resource allocation, and failure recovery. This allows you to get small or large runs started immediately, without worrying about managing infrastructure. We use LoRA so that we can share the same pool of compute between multiple training runs, lowering costs.&lt;/p&gt;
        &lt;p&gt;Tinker’s API gives you low-level primitives like &lt;code&gt;forward_backward&lt;/code&gt; and &lt;code&gt;sample&lt;/code&gt;, which can be used to express most common post-training methods. Even so, achieving good results requires getting many details right. That’s why we’re releasing an open-source library, the &lt;a href=&quot;http://github.com/thinking-machines-lab/tinker-cookbook&quot;&gt;Tinker Cookbook&lt;/a&gt;, with modern implementations of post-training methods that run on top of the Tinker API.&lt;/p&gt;
        &lt;p&gt;Groups at Princeton, Stanford, Berkeley, and Redwood Research have already been using Tinker:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;The &lt;a href=&quot;https://blog.goedel-prover.com/&quot;&gt;Princeton Goedel Team&lt;/a&gt; trained mathematical theorem provers&lt;/li&gt;
        &lt;li&gt;The &lt;a href=&quot;https://statmech.stanford.edu/&quot;&gt;Rotskoff Chemistry group&lt;/a&gt; at Stanford fine-tuned a model to complete chemistry reasoning tasks&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://sky.cs.berkeley.edu/project/skyrl/&quot;&gt;Berkeley’s SkyRL group&lt;/a&gt; ran experiments on a custom async off-policy RL training loop with multi-agents and multi-turn tool-use.&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://www.redwoodresearch.org/&quot;&gt;Redwood Re

github-actions · 2026-04-03T01:28:05Z

Successfully generated as following:

http://localhost:1200/thinkingmachines/news - Success ✔️

<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
  <channel>
    <title>Thinking Machines Lab - News</title>
    <link>https://thinkingmachines.ai/news/</link>
    <atom:link href="http://localhost:1200/thinkingmachines/news" rel="self" type="application/rss+xml"></atom:link>
    <description>Thinking Machines Lab - News - Powered by RSSHub</description>
    <generator>RSSHub</generator>
    <webMaster>contact@rsshub.app (RSSHub)</webMaster>
    <language>en</language>
    <lastBuildDate>Fri, 03 Apr 2026 01:28:04 GMT</lastBuildDate>
    <ttl>5</ttl>
    <item>
      <title>Training LLMs to Predict World Events (Guest Post with Mantic)</title>
      <description>&lt;div class=&quot;post-heading news-post&quot;&gt;
        &lt;h1 class=&quot;post-title&quot;&gt;Training LLMs to Predict World Events (Guest Post with Mantic)&lt;/h1&gt;
        &lt;div class=&quot;publish-metadata&quot;&gt;
        &lt;span class=&quot;author&quot;&gt;
        &lt;a href=&quot;https://enjeeneer.io/&quot; rel=&quot;me noopener&quot; target=&quot;_blank&quot;&gt;Scott Jeen&lt;/a&gt; and &lt;a href=&quot;https://www.linkedin.com/in/matthew-aitchison-16aa8799/&quot; rel=&quot;me noopener&quot; target=&quot;_blank&quot;&gt;Matthew Aitchison&lt;/a&gt; in collaboration with others at &lt;a href=&quot;https://www.mantic.com/&quot; rel=&quot;me noopener&quot; target=&quot;_blank&quot;&gt;Mantic&lt;/a&gt;
        &lt;/span&gt;
        &lt;span&gt;
        Mar 19, 2026
        &lt;/span&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;div class=&quot;post-cover&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/cover.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;/div&gt;
        &lt;article class=&quot;content &quot;&gt;
        &lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://www.mantic.com/&quot;&gt;Mantic&lt;/a&gt; have been using Tinker since it launched. This guest post is a technical deep dive on what they have built so far.&lt;/em&gt;&lt;/p&gt;
        &lt;p&gt;The top AI forecasting systems are approaching superforecaster-level accuracy on geopolitics and current affairs.&lt;label for=&quot;sn1&quot; class=&quot;sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn1&quot; class=&quot;margin-toggle&quot;&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;a href=&quot;https://news.polymarket.com/p/its-over&quot;&gt;It’s over&lt;/a&gt; (Polymarket, 2026)&lt;/span&gt; This is exciting because scalable, automated forecasting could significantly improve the quality of decision-making across the economy and in government.&lt;/p&gt;
        &lt;p&gt;To date, the most successful recipe in forecasting tournaments has been to combine an off-the-shelf LLM (like Gemini 3 or GPT-5) with forecasting-specific context-gathering. These models, to our knowledge, have not been explicitly trained for forecasting. Can we improve the recipe by replacing them with models fine-tuned specifically for forecasting?&lt;/p&gt;
        &lt;p&gt;We target “judgmental forecasting”: prediction problems that require human-like research and reasoning. Judgmental forecasting is needed for domains like geopolitics, politics, technology, business, and economic policy, where there often isn’t enough data for a standard statistical approach like time-series extrapolation. It was popularized by the book &lt;a href=&quot;https://books.google.com/books?hl=en&amp;amp;lr=&amp;amp;id=_lMPDAAAQBAJ&amp;amp;oi=fnd&amp;amp;pg=PA1&amp;amp;ots=5avwaBT55R&amp;amp;sig=B2i7FYispHz5PtB6KVu2Ndt58cE&quot;&gt;Superforecasting&lt;/a&gt;, and now prediction markets like Polymarket and Kalshi.&lt;/p&gt;
        &lt;p&gt;In this post, we show it’s possible to significantly improve the forecasting performance of gpt-oss-120b using reinforcement learning. With Tinker, we fine-tune a model on around 10,000 binary questions of the form &lt;em&gt;“Will [event] occur before [date]?”&lt;/em&gt;. We reward the model for putting greater probability on the correct real-world outcome.&lt;/p&gt;
        &lt;p&gt;In a head-to-head contest, the fine-tuned model achieves marginally superior performance to the frontier LLMs (see Figure 1), despite much lower initial performance. We find that providing forecast-specific context increases the gains from fine-tuning.&lt;/p&gt;
        &lt;p&gt;In the optimal &lt;em&gt;ensemble&lt;/em&gt; of different models (which outperforms any single model), Grok 4 and our fine-tuned model are the most important contributors. The fine-tuned model learns a forecasting policy that is as accurate as the frontier LLMs, yet decorrelated from them.&lt;/p&gt;
        &lt;p&gt;Together, the results demonstrate that on-task training can extend the state-of-the-art in AI forecasting.&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_1.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;RL fine-tuning makes gpt-oss-120b competitive with the frontier LLMs on questions from the Metaculus AI Benchmark Q2 2025. Naively predicting 50% on every question would get a score of 0, and perfect foresight would get a score of 100, per the construction of the Metaculus “baseline score”. Naively predicting 18.8% on every question (the rate at which the equivalent questions resolved “yes” in the previous tournament, Q1 2025) yields a score of 22.3, which we use to truncate the Y-axis. Fine-tuning improves gpt-oss-120b‘s score from 38.6 to 45.8, on par with the best general models.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;h2 id=&quot;the-best-existing-recipe-uses-off-the-shelf-llms&quot;&gt;The best existing recipe uses off-the-shelf LLMs&lt;/h2&gt;
        &lt;p&gt;The past two years have seen considerable progress in AI judgmental forecasting capabilities. In the Metaculus Cup, a major tournament for amateur and professional forecasters, the best AI systems now rival the top humans (Figure 2).&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_2.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Human and AI scores in the Metaculus Cup, a premier forecasting competition. Scores from the top 5 AI forecasters have been steadily improving since they first entered in the Summer of 2024. Mantic first entered in the Summer of 2025 and then in Fall 2025 beat the community prediction and the majority of professional forecasters. These results were without fine-tuning.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;The trend has been driven by more capable off-the-shelf LLMs, and accelerated by more sophisticated forecasting architectures. Our architecture – which has performed well in recent Metaculus tournaments – consists of two standard phases: (1) a research phase, and (2) a prediction phase (Figure 3).&lt;label for=&quot;sn3&quot; class=&quot;sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn3&quot; class=&quot;margin-toggle&quot;&gt;&lt;span class=&quot;sidenote&quot;&gt;This two-phase process appears in early work on AI forecasting. See: &lt;a href=&quot;https://proceedings.neurips.cc/paper_files/paper/2024/hash/5a5acfd0876c940d81619c1dc60e7748-Abstract-Conference.html&quot;&gt;Approaching human-level forecasting with language models&lt;/a&gt; (Halawi et al., 2024); &lt;a href=&quot;https://arxiv.org/abs/2206.15474&quot;&gt;Forecasting Future World Events with Neural Networks&lt;/a&gt; (Zou et al, 2022).&lt;/span&gt;&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_3.png&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Mantic’s architecture. The research phase takes the forecasting question as input and performs deep research to collect information relevant to the question which goes into the prompt for the prediction LLM. The prediction LLM outputs chain-of-thought reasoning and specifies a probability distribution using specialized tools.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;The research phase is conducted by deep research agents that collect the context needed to make a good prediction. For example, for the question “Will the United States attack Venezuela before 2026?”, search agents will find information about military buildup in the Caribbean, statements from President Trump, the health of the Venezuelan economy, and so on. The collected research is summarized into a prompt for the prediction phase.&lt;/p&gt;
        &lt;p&gt;The model’s task at the prediction phase is to use our specialized tools to output a probability distribution. In this post, we consider a canonical type of forecasting question: “Will [event] occur before [date]?”. We instruct the LLM to parameterize a mixture model for when the event will next occur – illustrated in Figure 4. The mixture model defines a cumulative distribution function, and from that we can read off the probability of the event occurring before the date specified in the original question.&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_4.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Illustrative mixture model. The LLM selects: the number of components in the mixture, their parameters, and their respective weights. The LLM is prompted to select components capturing different scenarios that could lead to the event occurring. The final prediction is a weighted combination of the components.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;In past tournaments, we’ve used off-the-shelf models as the prediction LLM. Existing literature has shown promising results from RL fine-tuning small models using a simple architecture.&lt;label for=&quot;sn4&quot; class=&quot;sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn4&quot; class=&quot;margin-toggle&quot;&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;a href=&quot;https://scholar.google.com/citations?view_op=view_citation&amp;amp;hl=en&amp;amp;user=oQuvQ0sAAAAJ&amp;amp;citation_for_view=oQuvQ0sAAAAJ:u5HHmVD_uO8C&quot;&gt;Outcome-based Reinforcement Learning to Predict the Future&lt;/a&gt; (Turtel et al, 2025)&lt;/span&gt; Can we improve frontier AI forecasting systems through on-task fine-tuning?&lt;/p&gt;
        &lt;h2 id=&quot;training-details&quot;&gt;Training details&lt;/h2&gt;
        &lt;h3 id=&quot;datasets&quot;&gt;Datasets&lt;/h3&gt;
        &lt;p&gt;We train the prediction LLM on ~10k questions about whether an event will happen by a given date. The questions are from August 2024 to December 2025 and the model’s knowledge cutoff is prior to that, so the resolution is known to us but not to the model. We generated the training set using an LLM pipeline similar to existing work.&lt;label for=&quot;sn5&quot; class=&quot;sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn5&quot; class=&quot;margin-toggle&quot;&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;a href=&quot;https://arxiv.org/abs/2601.22444&quot;&gt;Automating Forecasting Question Generation and Resolution for AI Evaluation&lt;/a&gt; (Bosse et al, 2026); &lt;a href=&quot;https://arxiv.org/abs/2601.06336&quot;&gt;Future-as-Label: Scalable Supervision from Real-World Outcomes&lt;/a&gt; (Turtel et al, 2026)&lt;/span&gt; Before training, we run the research phase for each question and store static prompts for the prediction LLM.&lt;/p&gt;
        &lt;p&gt;We test on unseen questions from the Q2 2025 Metaculus AI Benchmark Tournament.&lt;label for=&quot;sn6&quot; class=&quot;sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn6&quot; class=&quot;margin-toggle&quot;&gt;&lt;span class=&quot;sidenote&quot;&gt; The Fall 2025 iteration would have been a more obvious choice (for being more recent) but contains lower quality questions, indicated by less performance differentiated between strong and weak forecasters.&lt;/span&gt; We compare models whose knowledge cutoff is before this tournament’s start date. The full list of questions can be accessed on &lt;a href=&quot;https://gist.github.com/enjeeneer/86e24a52e6041a3d78e333bcab16984d&quot;&gt;Github&lt;/a&gt;.&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_5.png&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Three example binary event questions from the Metaculus Q2 2025 AI Benchmark.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;We evaluate using the baseline score, following the Metaculus platform. This is log scoring, i.e. ln(probability assigned to true outcome), rescaled such that 100 is the maximum possible score and 0 is the score for a uniform prediction (in our setting, 50%).&lt;/p&gt;
        &lt;h3 id=&quot;implementation&quot;&gt;Implementation&lt;/h3&gt;
        &lt;p&gt;We run the experiments on &lt;a href=&quot;https://thinkingmachines.ai/tinker/&quot;&gt;Tinker&lt;/a&gt;. Of the models available through the API, we choose to train &lt;a href=&quot;https://huggingface.co/openai/gpt-oss-120b&quot;&gt;gpt-oss-120b&lt;/a&gt; because of its strong initial performance — second only to &lt;a href=&quot;https://huggingface.co/moonshotai/Kimi-K2.5&quot;&gt;Kimi K2.5&lt;/a&gt; — while being cheaper and faster.&lt;/p&gt;
        &lt;p&gt;We use a standard policy gradient algorithm with &lt;a href=&quot;https://arxiv.org/abs/2402.03300&quot;&gt;GRPO-style advantage normalisation&lt;/a&gt; and &lt;a href=&quot;https://arxiv.org/abs/2503.20783&quot;&gt;no division by the standard deviation&lt;/a&gt;. For rewards we use the Brier score which is &lt;a href=&quot;https://sites.stat.washington.edu/raftery/Research/PDF/Gneiting2007jasa.pdf&quot;&gt;strictly proper&lt;/a&gt;. We found that the Brier score leads to more stable training than the log score, even though the log score is also strictly proper. This could be because the Brier score is bounded in [0, 1] and so produces lower-variance policy gradient estimates.&lt;/p&gt;
        &lt;p&gt;Open-source RL packages often use &lt;a href=&quot;https://arxiv.org/abs/2309.06180&quot;&gt;vLLM&lt;/a&gt; for the sampling backend and &lt;a href=&quot;https://arxiv.org/abs/2304.11277&quot;&gt;FSDP&lt;/a&gt; for the training backend. These can disagree on token probabilities produced by identical policies, which biases policy gradient estimates and destabilises training. We found these discrepancies to be lower on Tinker’s integrated infrastructure, but further mitigate them with &lt;a href=&quot;https://fengyao.notion.site/off-policy-rl&quot;&gt;an importance sampling correction on the advantages&lt;/a&gt;.&lt;/p&gt;
        &lt;p&gt;The Brier score reward function is strictly monotonic in the predicted probability for a fixed outcome, so different rollouts almost always produce different rewards. This makes within-group reward ties extremely unlikely and lets us train with a relatively small group size (8) without needing to break ties or induce variance. We use a batch size of 64, as we find that &lt;a href=&quot;https://thinkingmachines.ai/blog/lora/&quot;&gt;larger batch sizes tend to destabilise training&lt;/a&gt;.&lt;/p&gt;
        &lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;
        &lt;h3 id=&quot;fine-tuning-elevates-gpt-oss-120b-to-frontier-llm-performance&quot;&gt;Fine-tuning elevates gpt-oss-120b to frontier LLM performance&lt;/h3&gt;
        &lt;p&gt;The model’s test set score improves through training (Figure 6), moving from an initial score of 38.6 mean baseline points per question (below all frontier models) to a final score of 45.8 mean baseline points per question (marginally above). This demonstrates that on-task forecasting training can provide a large performance uplift.&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_6.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Test set baseline score of gpt-oss-120b with and without Mantic research and tools. In the model-only setup, test set performance improves, but never reaches the initial score of gpt-oss-120b with Mantic research and tools. With Mantic research and tools, gpt-oss-120b climbs 7 points through training and marginally exceeds the performance of Gemini 3 Pro. Training continued for further steps but performance no longer improved.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;The LLM fine-tuned without the benefit of our pre-generated research phase, and without the tools to construct mixture models, gains only 3 points from training instead of 7. This suggests these prerequisites positively influence the optimization dynamics, in addition to improving initialization.&lt;/p&gt;
        &lt;h3 id=&quot;the-fine-tuned-model-is-an-important-member-of-the-optimal-ensemble&quot;&gt;The fine-tuned model is an important member of the optimal ensemble&lt;/h3&gt;
        &lt;p&gt;In human forecasting, there is a well-known “wisdom of the crowd” effect: aggregate forecasts from multiple people often outperforms any one individual. This effect, in part, explains the &lt;a href=&quot;https://polymarket.com/accuracy&quot;&gt;impressive accuracy of prediction markets&lt;/a&gt;. Can we get the same benefit from ensembling the predictions of different LLMs?&lt;label for=&quot;sn7&quot; class=&quot;sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn1&quot; class=&quot;margin-toggle&quot;&gt;&lt;span class=&quot;sidenote&quot;&gt;See also &lt;a href=&quot;https://arxiv.org/abs/2402.19379&quot;&gt;Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy&lt;/a&gt; (Schoenegger et al, 2024)&lt;/span&gt;&lt;/p&gt;
        &lt;p&gt;To contribute to an ensemble, an LLM must be sufficiently accurate on its own and also decorrelated from other LLMs in the ensemble. Predictions from most frontier LLMs, while accurate, contribute little diversity to the top-performing model (in our case, Gemini 3 Pro) – Figure 7. Among the frontier LLMs, Grok 4 is the exception: its predictions score well whilst correlating less with other frontier LLMs.&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_7.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Mean baseline score on binary event questions from the Metaculus Q2 AIB Benchmark with respect to their Jensson Shannon divergence from Gemini 3 Pro for a suite of closed source and open source LLMs. Marker colour indicates each model’s weight in the optimal 5-sample ensemble. The optimal ensemble consists of fine-tuned gpt-oss-120b (40%), gemini 3 pro (20%), gpt-5 (20%) and grok 4 (20%).&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;The optimal ensemble, on a budget of 5 samples, is 2 samples from our fine-tuned gpt-oss-120b plus 1 sample from each of Gemini 3 Pro, Grok 4, and GPT-5. We can test each model’s contribution by removing it, recomputing the optimal ensemble, and seeing how much the score degrades. We find that Grok 4 is the least replaceable, with fine-tuned gpt-oss-120b in second place. Other models can be replaced with little to no performance degradation (Figure 8).&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_8.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;When selecting an ensemble of frontier and open-source models, Grok 4 and fine-tuned gpt-oss-120b are the least replaceable. Model replaceability is defined as the reduction in score incurred when removing a model from the optimal ensemble. By definition, if a model is not included in the optimal ensemble,there is no cost to removing it.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;GPT-5 and Gemini 3 Pro make similar predictions, and thus don’t benefit much from ensembling with each other. Both models improve from mixing their predictions with either Grok 4 or the fine-tuned gpt-oss-120b, and Grok 4 also benefits most from mixing with the fine-tuned model. In either optimal 3-way ensemble from Figure 9, the fine-tuned model gets about half of the total weight.&lt;/p&gt;
        &lt;figure style=&quot;margin-top: 2rem; margin-bottom: 2.3rem;&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/svgs/figure_9.png&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Three-way ensembles. Left: The optimal three-way ensemble between the fine-tuned gpt-oss-120b, Gemini 3 Pro and GPT-5 is weighted 56%, 26% and 18% respectively, depicted by the black star. Right: The optimal three-way ensemble between the fine-tuned gpt-oss-120b, Gemini 3 Pro and Grok 4 is weighted 48, 26% and 26% respectively, depicted by the white star.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;h2 id=&quot;conclusions-and-next-steps&quot;&gt;Conclusions and next steps&lt;/h2&gt;
        &lt;p&gt;We have shown that we can elevate the forecasting performance of gpt-oss-120b to match frontier LLMs with RL fine-tuning. This work can be extended in many ways, some of which we have already begun exploring:&lt;/p&gt;
        &lt;ol&gt;
        &lt;li&gt;&lt;strong&gt;Training larger models.&lt;/strong&gt; Tinker enables training larger models with higher initial performance than gpt-oss, specifically Kimi K2.5.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Training on all question formats.&lt;/strong&gt; We are already training models on numerical questions such as economic indicators and multiple choice questions such as election results.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Improved question sets.&lt;/strong&gt; As the models become stronger forecasters, we need more challenging forecasting questions.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Information retrieval inside the loop.&lt;/strong&gt; We could give the prediction LLM tools for information retrieval and include this in the training loop.&lt;/li&gt;
        &lt;/ol&gt;
        &lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;
        &lt;p&gt;Please cite this work as:&lt;/p&gt;
        &lt;pre tabindex=&quot;0&quot;&gt;&lt;code&gt;Jeen, Scott; Aitchison, Matthew; and Mantic, &quot;Training LLMs to Predict World Events&quot;,
        Thinking Machines Lab: News, Mar 2026.
        &lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Or use the BibTeX citation:&lt;/p&gt;
        &lt;pre tabindex=&quot;0&quot;&gt;&lt;code&gt;@article{scott2026forecasting,
        author = {Scott Jeen, Matthew Aitchison, and Mantic},
        title = {Training LLMs to Predict World Events},
        journal = {Thinking Machines Lab: News},
        year = {2026},
        note = {https://thinkingmachines.ai/news/training-llms-to-predict-world-events/}
        }
        &lt;/code&gt;&lt;/pre&gt;
        &lt;/article&gt;
        &lt;div class=&quot;paginator&quot;&gt;
        &lt;a id=&quot;post-prev-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/nvidia-partnership/&quot; title=&quot;Thinking Machines Lab and NVIDIA Announce Long-Term Gigawatt-Scale Strategic Partnership&quot;&gt;prev&lt;/a&gt;
        &lt;a href=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/#&quot; id=&quot;back-to-top&quot; class=&quot;link&quot; title=&quot;Go to top&quot;&gt;back to top&lt;/a&gt;
        &lt;a id=&quot;post-next-link&quot;&gt;&lt;/a&gt;
        &lt;/div&gt;</description>
      <link>https://thinkingmachines.ai/news/training-llms-to-predict-world-events/</link>
      <guid isPermaLink="false">https://thinkingmachines.ai/news/training-llms-to-predict-world-events/</guid>
      <pubDate>Wed, 18 Mar 2026 16:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Thinking Machines Lab and NVIDIA Announce Long-Term Gigawatt-Scale Strategic Partnership</title>
      <description>&lt;div class=&quot;post-heading news-post&quot;&gt;
        &lt;h1 class=&quot;post-title&quot;&gt;Thinking Machines Lab and NVIDIA Announce Long-Term Gigawatt-Scale Strategic Partnership&lt;/h1&gt;
        &lt;div class=&quot;publish-metadata date-only&quot;&gt;
        &lt;span&gt;
        Mar 10, 2026
        &lt;/span&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;div class=&quot;post-cover compact&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/nvidia-partnership/images/jensen-huang-mira-murati-nvidia-debs-3127-2.jpg&quot; alt=&quot;Cover image for Thinking Machines Lab and NVIDIA Announce Long-Term Gigawatt-Scale Strategic Partnership&quot; class=&quot;cover-image&quot; fetchpriority=&quot;high&quot; decoding=&quot;async&quot; loading=&quot;eager&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;p class=&quot;post-cover-caption&quot;&gt;Jensen Huang (NVIDIA) and Mira Murati (Thinking Machines)&lt;/p&gt;
        &lt;/div&gt;
        &lt;article class=&quot;content &quot;&gt;
        &lt;p&gt;Thinking Machines Lab and NVIDIA announced today a multi-year strategic partnership to deploy at least one gigawatt of next-generation NVIDIA Vera Rubin systems to support Thinking Machines’ frontier model training and platforms delivering customizable AI at scale. Deployment on NVIDIA’s Vera Rubin platform is targeted for early next year.&lt;/p&gt;
        &lt;p&gt;The partnership also includes an effort to design training and serving systems for NVIDIA architectures and broaden access to frontier AI and open models for enterprises, research institutions, and the scientific community.&lt;/p&gt;
        &lt;p&gt;NVIDIA has also made a significant investment in Thinking Machines Lab to support the company’s long-term growth.&lt;/p&gt;
        &lt;p&gt;“AI is the most powerful knowledge discovery instrument in human history,” said Jensen Huang, founder and CEO of NVIDIA. “Thinking Machines has brought together a world-class team to advance the frontier of AI. We are thrilled to partner with Thinking Machines to realize their exciting vision for the future of AI.”&lt;/p&gt;
        &lt;p&gt;“NVIDIA’s technology is the foundation on which the entire field is built,” said Mira Murati, cofounder and CEO of Thinking Machines. “This partnership accelerates our capacity to build AI that people can shape and make their own, as it shapes human potential in turn.”&lt;/p&gt;
        &lt;p&gt;Building powerful AI systems that are understandable, customizable, and collaborative demands advances in research, design, and infrastructure at scale. This partnership provides that foundation, with the shared aim of ensuring that the most transformative technology of our time expands human capability.&lt;/p&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/nvidia-partnership/cover.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;/article&gt;
        &lt;div class=&quot;paginator&quot;&gt;
        &lt;a id=&quot;post-prev-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/tinker-general-availability/&quot; title=&quot;Tinker: General Availability and Vision Input&quot;&gt;prev&lt;/a&gt;
        &lt;a href=&quot;https://thinkingmachines.ai/news/nvidia-partnership/#&quot; id=&quot;back-to-top&quot; class=&quot;link&quot; title=&quot;Go to top&quot;&gt;back to top&lt;/a&gt;
        &lt;a id=&quot;post-next-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/training-llms-to-predict-world-events/&quot; title=&quot;Training LLMs to Predict World Events (Guest Post with Mantic)&quot;&gt;next&lt;/a&gt;
        &lt;/div&gt;</description>
      <link>https://thinkingmachines.ai/news/nvidia-partnership/</link>
      <guid isPermaLink="false">https://thinkingmachines.ai/news/nvidia-partnership/</guid>
      <pubDate>Mon, 09 Mar 2026 16:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Tinker: General Availability and Vision Input</title>
      <description>&lt;div class=&quot;post-heading news-post&quot;&gt;
        &lt;h1 class=&quot;post-title&quot;&gt;Tinker: General Availability and Vision Input&lt;/h1&gt;
        &lt;div class=&quot;publish-metadata date-only&quot;&gt;
        &lt;span&gt;
        Dec 12, 2025
        &lt;/span&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;article class=&quot;content &quot;&gt;
        &lt;p&gt;Today we are announcing four updates to Tinker:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;No more waitlist&lt;/li&gt;
        &lt;li&gt;New reasoning model: Kimi K2 Thinking&lt;/li&gt;
        &lt;li&gt;New inference interface that is compatible with the OpenAI API&lt;/li&gt;
        &lt;li&gt;Vision input support with Qwen3-VL&lt;/li&gt;
        &lt;/ul&gt;
        &lt;h2 id=&quot;general-availability&quot;&gt;General availability&lt;/h2&gt;
        &lt;p&gt;The waitlist is over! Everybody can use Tinker now; &lt;a href=&quot;https://auth.thinkingmachines.ai/sign-up&quot;&gt;sign up here&lt;/a&gt; to get started. See the &lt;a href=&quot;https://thinkingmachines.ai/tinker/&quot;&gt;Tinker homepage&lt;/a&gt; for available models and pricing, and check out the &lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-cookbook&quot;&gt;Tinker cookbook&lt;/a&gt; for code examples.&lt;/p&gt;
        &lt;h2 id=&quot;more-reasoning-with-kimi-k2-thinking&quot;&gt;More reasoning with Kimi K2 Thinking&lt;/h2&gt;
        &lt;p&gt;Users can now fine-tune Kimi K2 Thinking on Tinker. With a trillion parameters, Kimi K2 is the largest model in our lineup so far. It is built for long chains of reasoning and tool use.&lt;/p&gt;
        &lt;h2 id=&quot;openai-api-compatible-sampling&quot;&gt;OpenAI API-compatible sampling&lt;/h2&gt;
        &lt;p&gt;Tinker has a standard function for inference:&lt;/p&gt;
        &lt;div class=&quot;highlight&quot;&gt;&lt;pre tabindex=&quot;0&quot; class=&quot;chroma&quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;n&quot;&gt;prompt&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;types&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ModelInput&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_ints&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;The capital of France is&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,))&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;types&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SamplingParams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_tokens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;temperature&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stop&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;n&quot;&gt;future&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sampling_client&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prompt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prompt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sampling_params&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;With this release, we have added OpenAI API-compatible scaffolding for quickly sampling from a model by specifying a path, even while it’s still training. This also means Tinker can now plug-and-play with any OpenAI API-compatible platform. See more information in our &lt;a href=&quot;https://tinker-docs.thinkingmachines.ai/compatible-apis/openai&quot;&gt;Tinker documentation&lt;/a&gt;.&lt;/p&gt;
        &lt;div class=&quot;highlight&quot;&gt;&lt;pre tabindex=&quot;0&quot; class=&quot;chroma&quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;n&quot;&gt;response&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;openai_client&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;completions&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;create&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;tinker://0034d8c9-0a88-52a9-b2b7-bce7cb1e6fef:train:0/sampler_weights/000080&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;prompt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;The capital of France is&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;max_tokens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;temperature&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;stop&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&quot;vision-input-with-qwen3-vl&quot;&gt;Vision input with Qwen3-VL&lt;/h2&gt;
        &lt;p&gt;We’ve added two vision models to Tinker: Qwen3-VL-30B-A3B-Instruct and Qwen3-VL-235B-A22B-Instruct. With these, users can process pictures, screenshots, and diagrams for a variety of applications.&lt;/p&gt;
        &lt;p&gt;To input images, just interleave together an &lt;a href=&quot;https://tinker-docs.thinkingmachines.ai/api-reference/types#imagechunk-objects&quot;&gt;ImageChunk&lt;/a&gt; – consisting of your image, saved as bytes – with text chunks. For example:&lt;/p&gt;
        &lt;div class=&quot;highlight&quot;&gt;&lt;pre tabindex=&quot;0&quot; class=&quot;chroma&quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;n&quot;&gt;model_input&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tinker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ModelInput&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;chunks&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;tinker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;types&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ImageChunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;image_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;png&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt; &lt;span class=&quot;n&quot;&gt;tinker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;types&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;EncodedTextChunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;What is this?&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;cl&quot;&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
        &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;These vision inputs can be used in a variety of applications out-of-the-box, including SFT and RL finetuning.&lt;/p&gt;
        &lt;p&gt;To demonstrate vision understanding in action, we are sharing &lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-cookbook/tree/main/tinker_cookbook/recipes/vlm_classifier&quot;&gt;a new cookbook recipe for fine-tuning VLMs as image classifiers&lt;/a&gt;. Qwen3-VL-235B-A22B-Instruct obtains reasonable accuracy even given just one example per class; performance improves with more labeled data.&lt;/p&gt;
        &lt;h2 id=&quot;training-image-classifiers-with-tinker&quot;&gt;Training image classifiers with Tinker&lt;/h2&gt;
        &lt;p&gt;To showcase Tinker’s new vision capabilities, we finetuned &lt;a href=&quot;https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct&quot;&gt;Qwen3-VL-235B-A22B-Instruct&lt;/a&gt; to classify images on four classic datasets:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;https://data.caltech.edu/records/mzrjq-6wc02&quot;&gt;Caltech 101&lt;/a&gt;, a dataset of 101 general object categories.&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://www.kaggle.com/datasets/eduardo4jesus/stanford-cars-dataset&quot;&gt;Stanford Cars&lt;/a&gt;, a dataset of car makes, models, and years.&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://www.robots.ox.ac.uk/~vgg/data/flowers/102/&quot;&gt;Oxford Flowers&lt;/a&gt;, a dataset of flower species.&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://www.robots.ox.ac.uk/~vgg/data/pets/&quot;&gt;Oxford Pets&lt;/a&gt;, a dataset of pet breeds.&lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;Since Qwen3-VL is a language model, we frame classification as text generation: given an image, the model outputs the class name. We compare this approach against a traditional vision baseline of finetuning a vision-only model — DINOv2-base. &lt;a href=&quot;https://arxiv.org/pdf/2304.07193&quot;&gt;DINOv2&lt;/a&gt; is a self-supervised vision transformer that was trained to encode images, and is commonly used as a backbone for pure computer vision tasks. For DINOv2, we add a classification head that predicts a distribution over all N classes. Both models are fine-tuned with LoRA.&lt;/p&gt;
        &lt;p&gt;Labeled image data is scarce for many real-world use cases, so data efficiency is the primary measure we look at. We show the classification accuracy when sweeping across the number of labeled examples per class, starting with just a single one.&lt;/p&gt;
        &lt;figure id=&quot;fig:qwen-v-dino&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/tinker-general-availability/images/vlm-graphs.png&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;figcaption&gt;Comparison of fine-tuned Qwen3-VL-235-A22B and DINOv2 performance on simple image classification tasks.&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;p&gt;In the limited-data regime, Qwen3-VL-235-A22B outperforms DINOv2. Not only is it a bigger model, but as a VLM, it also comes with language knowledge out-of-the-box (i.e. what a “golden retriever” or “sunflower” is). This general language-and-vision capability of Qwen3-VL makes it readily available for vision tasks beyond classification.&lt;/p&gt;
        &lt;h2 id=&quot;happy-holidays&quot;&gt;Happy Holidays&lt;/h2&gt;
        &lt;p&gt;Tinker exists to enable builders and researchers to train and customize state-of-the-art models. As always, we look forward to seeing what you build with Tinker. Happy holidays!&lt;/p&gt;
        &lt;/article&gt;
        &lt;div class=&quot;paginator&quot;&gt;
        &lt;a id=&quot;post-prev-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/call-for-community-projects/&quot; title=&quot;Tinker: Call for Community Projects&quot;&gt;prev&lt;/a&gt;
        &lt;a href=&quot;https://thinkingmachines.ai/news/tinker-general-availability/#&quot; id=&quot;back-to-top&quot; class=&quot;link&quot; title=&quot;Go to top&quot;&gt;back to top&lt;/a&gt;
        &lt;a id=&quot;post-next-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/nvidia-partnership/&quot; title=&quot;Thinking Machines Lab and NVIDIA Announce Long-Term Gigawatt-Scale Strategic Partnership&quot;&gt;next&lt;/a&gt;
        &lt;/div&gt;</description>
      <link>https://thinkingmachines.ai/news/tinker-general-availability/</link>
      <guid isPermaLink="false">https://thinkingmachines.ai/news/tinker-general-availability/</guid>
      <pubDate>Thu, 11 Dec 2025 16:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Tinker: Call for Community Projects</title>
      <description>&lt;div class=&quot;post-heading news-post&quot;&gt;
        &lt;h1 class=&quot;post-title&quot;&gt;Tinker: Call for Community Projects&lt;/h1&gt;
        &lt;div class=&quot;publish-metadata date-only&quot;&gt;
        &lt;span&gt;
        Nov 7, 2025
        &lt;/span&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;article class=&quot;content &quot;&gt;
        &lt;p&gt;We launched &lt;a href=&quot;https://thinkingmachines.ai/tinker/&quot;&gt;Tinker&lt;/a&gt; to enable builders and researchers to train models their own way, whether they’re conducting studies or customizing models for new applications. We plan to publish regular roundups of the coolest projects from the Tinker community, and &lt;strong&gt;we invite you to submit what you’ve been Tinkering on to be featured on our blog&lt;/strong&gt;.&lt;/p&gt;
        &lt;p&gt;Below are some broad suggestions for what we hope to see from the Tinker featured projects, and some specific research directions we would particularly love to see pursued.&lt;/p&gt;
        &lt;h2 id=&quot;guidelines-for-tinker-featured-projects&quot;&gt;Guidelines for Tinker Featured Projects&lt;/h2&gt;
        &lt;p&gt;We’re interested in featuring ML research projects, AI-enabled research in other domains, custom models, and other contributions. Some examples:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;A reimplementation of a research project or tech report using Tinker, such as papers that compare algorithmic recipes or datasets.&lt;/li&gt;
        &lt;li&gt;Original research in machine learning, such as exploring new approaches to training or optimization or applying novel benchmarks and evaluations.&lt;/li&gt;
        &lt;li&gt;Research in a non-AI field that uses fine-tuned models, such as the work on mathematical theorem provers and chemistry models we &lt;a href=&quot;https://thinkingmachines.ai/news/announcing-tinker/#:~:text=Groups%20at%20Princeton%2C%20Stanford&quot;&gt;highlighted previously&lt;/a&gt;.&lt;/li&gt;
        &lt;li&gt;Product prototypes built with Tinker, demoing a model that does something fresh or delightful.&lt;/li&gt;
        &lt;li&gt;Novel datasets and task environments for training models.&lt;/li&gt;
        &lt;li&gt;High-level libraries built on top of Tinker that enable less experienced practitioners to perform fine-tuning effectively.&lt;/li&gt;
        &lt;li&gt;Infrastructure contributions, such as a clean self-hosted implementation of the Tinker training API.&lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;Your submission should include a write-up and, preferably, an open-source release of your code. We encourage you to focus on rigor and clear evaluation in your write-ups: crisp charts, raw output examples, clear comparisons to alternative approaches or models on relevant benchmarks and metrics. Tinkering is experimenting — we want to feature diligent work and transparent results over novelty or hype.&lt;/p&gt;
        &lt;p&gt;Please send your projects and any related questions to &lt;a href=&quot;https://thinkingmachines.ai/cdn-cgi/l/email-protection#56223f383d332416223e3f383d3f38313b37353e3f38332578373f692523343c3335226b10333722232433327364660624393c333522736466&quot;&gt;&lt;span class=&quot;__cf_email__&quot; data-cfemail=&quot;06726f686d637446726e6f686d6f68616b67656e6f68637528676f&quot;&gt;[email&amp;nbsp;protected]&lt;/span&gt;&lt;/a&gt; with “Featured Project” in the subject line.&lt;/p&gt;
        &lt;h2 id=&quot;suggested-research-projects&quot;&gt;Suggested research projects&lt;/h2&gt;
        &lt;p&gt;Here are some research directions that we would personally love to see explored and that Tinker can enable real progress on. We have &lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas&quot;&gt;created a repo&lt;/a&gt; with detailed motivation and guidelines for each; we’ll be adding more resources to it over time to help researchers get started. We expect most project ideas to surprise us, but this short list could serve as inspiration.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/replicate-cai-with-base-models.md&quot;&gt;Replicating Constitutional AI, starting from the base model.&lt;/a&gt;&lt;/strong&gt; Though RLAIF is widely used, it’s most often bootstrapped from existing instruction-tuned models. This makes it difficult to separate the impact of the constitution from the impact of the data-generating model that interprets it. A study of Constitutional AI with and without instruction-tuned models in the pipeline would shed light on the use of constitutions and RLAIF.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/noisy-student.md&quot;&gt;RLVR with Noisy student.&lt;/a&gt;&lt;/strong&gt; Noisy student self-distillation was a popular technique in an earlier era of machine learning for making use of large unlabeled datasets, but it hasn’t been adapted widely to LLMs. One possible adaptation is to start RLVR with a small labeled training set and a large unlabeled one, then have the student apply labels to the latter set after each RL run and iterate.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/on-policy-context-distillation.md&quot;&gt;On-Policy Context Distillation.&lt;/a&gt;&lt;/strong&gt; Context distillation trains a student model with empty context on a teacher model with long and detailed context. Prior work used off-policy distillation — fine-tuning on teacher samples. We have found that &lt;a href=&quot;https://thinkingmachines.ai/blog/on-policy-distillation/&quot;&gt;on-policy distillation&lt;/a&gt; is often much more effective; it would be useful to compare the two approaches for context distillation in particular.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/memorization-empirical-study.md&quot;&gt;RL memory test.&lt;/a&gt;&lt;/strong&gt; Our &lt;a href=&quot;https://thinkingmachines.ai/blog/lora/&quot;&gt;post on LoRA&lt;/a&gt; presented theoretical arguments on the rate of information acquisition by both SFT and RL. You can set up a toy environment where RL must learn a completely random number sequence, to compare the empirical learning rate under various reward functions to the theoretical estimate.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/direct-rl-on-pairwise-judge.md&quot;&gt;Direct RL on pairwise judge.&lt;/a&gt;&lt;/strong&gt; RLHF and RLAIF use datasets of pairwise preferences, which are used to train a reward model, which is then used in RL. As an alternative “direct” approach, we can do RL using a prompted model that does pairwise comparisons, without training the reward model. It would be interesting to do experiments comparing the direct and indirect approaches.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/replicate-open-character-training.md&quot;&gt;Replicate Open Character Training.&lt;/a&gt;&lt;/strong&gt; Replicate the recent paper on &lt;a href=&quot;https://arxiv.org/abs/2511.01689&quot;&gt;Open Character Training&lt;/a&gt; using Tinker.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/thinking-machines-lab/tinker-project-ideas/blob/main/gan-joke-generation.md&quot;&gt;GAN for jokes.&lt;/a&gt;&lt;/strong&gt; In domains such as humor, it is easier to curate a human-vetted set of demonstrations than to train a reliable judge or reward model. Try implementing GAN-style training for a joke evaluator and joke generator that can craft a joke with a requested subject and keywords.&lt;/p&gt;
        &lt;h2 id=&quot;tips-for-high-quality-ml-experiments&quot;&gt;Tips for high-quality ML experiments&lt;/h2&gt;
        &lt;p&gt;In closing, we want to offer a few guidelines for running quality ML studies, the same guidelines we strive to adhere to internally when running experiments and documenting the results.&lt;/p&gt;
        &lt;p&gt;We encourage researchers to apply multiple analyses for examining each result. When creating datasets or environments, we recommend training a range of models and applying different evals. When developing novel methods, we suggest comparing to simpler baseline methods and sweeping hyperparameters that performance is sensitive to, particularly learning rate.&lt;/p&gt;
        &lt;p&gt;We’d love to see your reasoning in the write-up: assumptions you made, how your approach diverges from previous reports, and what motivated each change. We hope to see examples of the raw data and model rollouts, along with the summarized results. Finally, we appreciate crisp and detailed write-ups with clean and well-labeled charts and illustrations of the inner workings of the methods used.&lt;/p&gt;
        &lt;p&gt;We are excited to see what our community creates with Tinker, and hope that our featured projects will inspire your own work.&lt;/p&gt;
        &lt;/article&gt;
        &lt;div class=&quot;paginator&quot;&gt;
        &lt;a id=&quot;post-prev-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/tinker-research-and-teaching-grants/&quot; title=&quot;Tinker: Announcing Research and Teaching Grants&quot;&gt;prev&lt;/a&gt;
        &lt;a href=&quot;https://thinkingmachines.ai/news/call-for-community-projects/#&quot; id=&quot;back-to-top&quot; class=&quot;link&quot; title=&quot;Go to top&quot;&gt;back to top&lt;/a&gt;
        &lt;a id=&quot;post-next-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/tinker-general-availability/&quot; title=&quot;Tinker: General Availability and Vision Input&quot;&gt;next&lt;/a&gt;
        &lt;/div&gt;</description>
      <link>https://thinkingmachines.ai/news/call-for-community-projects/</link>
      <guid isPermaLink="false">https://thinkingmachines.ai/news/call-for-community-projects/</guid>
      <pubDate>Thu, 06 Nov 2025 16:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Tinker: Announcing Research and Teaching Grants</title>
      <description>&lt;div class=&quot;post-heading news-post&quot;&gt;
        &lt;h1 class=&quot;post-title&quot;&gt;Tinker: Announcing Research and Teaching Grants&lt;/h1&gt;
        &lt;div class=&quot;publish-metadata date-only&quot;&gt;
        &lt;span&gt;
        Oct 29, 2025
        &lt;/span&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;article class=&quot;content &quot;&gt;
        &lt;p&gt;We launched &lt;a href=&quot;https://thinkingmachines.ai/news/announcing-tinker/&quot;&gt;Tinker&lt;/a&gt; nearly one month ago. Since then, researchers across academia and non-profits have been using Tinker to train custom models and advance their research.&lt;/p&gt;
        &lt;p&gt;Today, we’re launching research and teaching grants for Tinker access. As part of our commitment to open and collaborative science, we want to make it as easy as possible for students and scholars to use Tinker. If your research or teaching involves training open-weight LLMs, we encourage you to apply.&lt;/p&gt;
        &lt;p&gt;We’re offering two types of grants to support your work:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;
        &lt;p&gt;&lt;strong&gt;Teaching Grants:&lt;/strong&gt; We provide $250 in free credits per student for academic classes using Tinker, whether you’re integrating it into an assignment or enabling students to use Tinker for self-directed projects, this is sized to support your entire class for the duration of the course.&lt;/p&gt;
        &lt;/li&gt;
        &lt;li&gt;
        &lt;p&gt;&lt;strong&gt;Research Grants:&lt;/strong&gt; We provide grants starting at $5,000 to support research projects and open-source software that uses Tinker.&lt;/p&gt;
        &lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;A selection of early grants we have awarded:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;
        &lt;p&gt;Diyi Yang’s &lt;a href=&quot;https://web.stanford.edu/class/cs329x/&quot;&gt;Stanford class on Human-Centered LLMs&lt;/a&gt; uses Tinker to compare different approaches for training personalized LLMs that capture unique writing styles and align with user habits.&lt;/p&gt;
        &lt;/li&gt;
        &lt;li&gt;
        &lt;p&gt;Aviral Kumar and Katerina Fragkiadaki’s &lt;a href=&quot;https://cmudeeprl.github.io/703website_f25/&quot;&gt;CMU class on Deep RL&lt;/a&gt; will use Tinker to enable class projects to experiment with state-of-the-art methods for training LLM and VLM based policies via RL.&lt;/p&gt;
        &lt;/li&gt;
        &lt;li&gt;
        &lt;p&gt;&lt;a href=&quot;https://chemistry.stanford.edu/people/grant-m-rotskoff&quot;&gt;Grant Rotskoff&lt;/a&gt;’s lab at Stanford is fine-tuning small-molecule chemistry models with Tinker to help solve problems in computational chemistry.&lt;/p&gt;
        &lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;&lt;strong&gt;Instructors&lt;/strong&gt;, please apply for teaching grants &lt;a href=&quot;https://form.typeform.com/to/JgPkuMvB&quot;&gt;&lt;strong&gt;here.&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;Researchers&lt;/strong&gt;, please apply for research grants &lt;a href=&quot;https://form.typeform.com/to/E9wVFZJJ&quot;&gt;&lt;strong&gt;here.&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
        &lt;p&gt;We’re assessing applications on a rolling basis and will aim to respond within a week of your application.&lt;/p&gt;
        &lt;/article&gt;
        &lt;div class=&quot;paginator&quot;&gt;
        &lt;a id=&quot;post-prev-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/announcing-tinker/&quot; title=&quot;Announcing Tinker&quot;&gt;prev&lt;/a&gt;
        &lt;a href=&quot;https://thinkingmachines.ai/news/tinker-research-and-teaching-grants/#&quot; id=&quot;back-to-top&quot; class=&quot;link&quot; title=&quot;Go to top&quot;&gt;back to top&lt;/a&gt;
        &lt;a id=&quot;post-next-link&quot; class=&quot;link&quot; href=&quot;https://thinkingmachines.ai/news/call-for-community-projects/&quot; title=&quot;Tinker: Call for Community Projects&quot;&gt;next&lt;/a&gt;
        &lt;/div&gt;</description>
      <link>https://thinkingmachines.ai/news/tinker-research-and-teaching-grants/</link>
      <guid isPermaLink="false">https://thinkingmachines.ai/news/tinker-research-and-teaching-grants/</guid>
      <pubDate>Tue, 28 Oct 2025 16:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Announcing Tinker</title>
      <description>&lt;div class=&quot;post-heading news-post&quot;&gt;
        &lt;h1 class=&quot;post-title&quot;&gt;Announcing Tinker&lt;/h1&gt;
        &lt;div class=&quot;publish-metadata date-only&quot;&gt;
        &lt;span&gt;
        Oct 1, 2025
        &lt;/span&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;div class=&quot;post-cover&quot;&gt;
        &lt;img data-zoomable=&quot;&quot; src=&quot;https://thinkingmachines.ai/news/announcing-tinker/svgs/tinker-cover.svg&quot; style=&quot;display: block; width: 100%; max-width: 800px; height: auto; margin: 0 auto;&quot; loading=&quot;lazy&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;
        &lt;/div&gt;
        &lt;article class=&quot;content &quot;&gt;
        &lt;p class=&quot;image-caption&quot; style=&quot;text-align: center; margin-top: -1rem; margin-bottom: 2rem; font-size: 0.9rem; color: var(--fg-muted, #666);&quot;&gt;
        &lt;a href=&quot;https://www.computerhistory.org/collections/catalog/X39.81/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;TinkerToy Computer&lt;/a&gt; invented by &lt;a href=&quot;https://en.wikipedia.org/wiki/Danny_Hillis&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Daniel Hillis&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/Brian_Silverman&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Brian Silverman&lt;/a&gt;
        &lt;/p&gt;
        &lt;p&gt;Today, we are launching &lt;a href=&quot;https://thinkingmachines.ai/tinker&quot;&gt;Tinker&lt;/a&gt;, a flexible API for fine-tuning language models. It empowers researchers and hackers to experiment with models by giving them control over the algorithms and data while we handle the complexity of distributed training. Tinker advances our mission of enabling more people to do research on cutting-edge models and customize them to their needs.&lt;/p&gt;
        &lt;p&gt;Tinker lets you fine-tune a range of large and small open-weight models, including large mixture-of-experts models such as Qwen-235B-A22B. Switching from a small model to a large one is as simple as changing a single string in your Python code.&lt;/p&gt;
        &lt;p&gt;Tinker is a managed service that runs on our internal clusters and training infrastructure. We handle scheduling, resource allocation, and failure recovery. This allows you to get small or large runs started immediately, without worrying about managing infrastructure. We use LoRA so that we can share the same pool of compute between multiple training runs, lowering costs.&lt;/p&gt;
        &lt;p&gt;Tinker’s API gives you low-level primitives like &lt;code&gt;forward_backward&lt;/code&gt; and &lt;code&gt;sample&lt;/code&gt;, which can be used to express most common post-training methods. Even so, achieving good results requires getting many details right. That’s why we’re releasing an open-source library, the &lt;a href=&quot;http://github.com/thinking-machines-lab/tinker-cookbook&quot;&gt;Tinker Cookbook&lt;/a&gt;, with modern implementations of post-training methods that run on top of the Tinker API.&lt;/p&gt;
        &lt;p&gt;Groups at Princeton, Stanford, Berkeley, and Redwood Research have already been using Tinker:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;The &lt;a href=&quot;https://blog.goedel-prover.com/&quot;&gt;Princeton Goedel Team&lt;/a&gt; trained mathematical theorem provers&lt;/li&gt;
        &lt;li&gt;The &lt;a href=&quot;https://statmech.stanford.edu/&quot;&gt;Rotskoff Chemistry group&lt;/a&gt; at Stanford fine-tuned a model to complete chemistry reasoning tasks&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://sky.cs.berkeley.edu/project/skyrl/&quot;&gt;Berkeley’s SkyRL group&lt;/a&gt; ran experiments on a custom async off-policy RL training loop with multi-agents and multi-turn tool-use.&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://www.redwoodresearch.org/&quot;&gt;Redwood Re

route(thinkingmachines): add news route for Thinking Machines Lab

8a7a4a9

Add route for Thinking Machines Lab (thinkingmachines.ai) news page. Founded by Mira Murati (ex-OpenAI CTO), the lab publishes news about their AI research and products. Closes #0

github-actions bot added the route label Apr 3, 2026

github-advanced-security bot found potential problems Apr 3, 2026

View reviewed changes

lib/routes/thinkingmachines/news.ts Fixed Show fixed Hide fixed

lib/routes/thinkingmachines/news.ts Fixed Show fixed Hide fixed

w3nhao changed the title ~~route(thinkingmachines): add news route for Thinking Machines Lab~~ feat(route/thinkingmachines): add news route for Thinking Machines Lab Apr 3, 2026

github-actions bot added the auto: ready to review label Apr 3, 2026

fix: use import type and sort imports for oxlint

f41c1ab

github-advanced-security bot found potential problems Apr 3, 2026

View reviewed changes

lib/routes/thinkingmachines/news.ts Fixed Show fixed Hide fixed

fix: correct import sort order per simple-import-sort

7933eb2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(route/thinkingmachines): add news route for Thinking Machines Lab#21609

feat(route/thinkingmachines): add news route for Thinking Machines Lab#21609
w3nhao wants to merge 3 commits intoDIYgod:masterfrom
w3nhao:feat/thinkingmachines-news

w3nhao commented Apr 3, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Apr 3, 2026

Uh oh!

Uh oh!

github-actions bot commented Apr 3, 2026

Uh oh!

github-actions bot commented Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

w3nhao commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Involved Issue / 该 PR 相关 Issue

Example for the Proposed Route(s) / 路由地址示例

New RSS Route Checklist / 新 RSS 路由检查表

Note / 说明

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Apr 3, 2026

Uh oh!

Uh oh!

github-actions bot commented Apr 3, 2026

Uh oh!

github-actions bot commented Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

w3nhao commented Apr 3, 2026 •

edited

Loading