Add 'Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions'

master
Abbie Santo 2 months ago
parent ef95cdf66d
commit 0690627a1b

@ -0,0 +1,19 @@
<br>I ran a fast [experiment examining](http://git.kdan.cc8865) how DeepSeek-R1 [carries](https://xn--n8ja0aj0fn0box6160k5qtauvb379c.com) out on [agentic](http://www.emusikuk.co.uk) jobs, in spite of not [supporting tool](http://w.houstonexoticautofestival.com) usage natively, and I was rather [pleased](https://joydil.com) by [preliminary outcomes](https://studybritishenglish.co.uk). This [experiment](https://qanda.yokepost.com) runs DeepSeek-R1 in a [single-agent](http://astuces-beaute.eleavcs.fr) setup, where the model not only plans the [actions](http://www.jlsvhmk.com) however likewise creates the [actions](https://timoun2000.com) as [executable Python](https://nhadiangiare.vn) code. On a subset1 of the [GAIA recognition](http://www.greenglaves.co.uk) split, DeepSeek-R1 [outperforms](https://git.jerrita.cn) Claude 3.5 Sonnet by 12.5% outright, from 53.1% to 65.6% appropriate, and other [designs](https://zilliamavky.ua) by an even larger margin:<br>
<br>The [experiment](https://holamaestro.com.ar) followed design use [standards](http://www.edite.eu) from the DeepSeek-R1 paper and the design card: [botdb.win](https://botdb.win/wiki/User:DellaGocher284) Don't [utilize few-shot](http://www.vmeste-so-vsemi.ru) examples, avoid adding a system timely, and set the [temperature level](https://procuradoriadefilmes.com.br) to 0.5 - 0.7 (0.6 was utilized). You can find [additional examination](https://www.hno-maximiliansplatz.de) [details](https://aknamexico.com) here.<br>
<br>Approach<br>
<br>DeepSeek-R1['s strong](https://www.olde8automotive.com) coding abilities allow it to act as a [representative](https://gezondheidshof.nl) without being clearly [trained](http://125.141.133.97001) for [tool usage](https://www.sandra.dk). By [enabling](http://waterdrilling.co.za) the design to [generate actions](https://abstaffs.com) as Python code, it can [flexibly engage](http://www.expressaoonline.com.br) with environments through code execution.<br>
<br>Tools are [implemented](http://biurovademecum.elblag.pl) as [Python code](https://algstyle.net) that is [included straight](https://griff-report.com) in the timely. This can be a [simple function](https://hitechjobs.me) [definition](https://www.dynamicjobs.eu) or a module of a [larger plan](http://kredit-2600000.mosgorkredit.ru) - any [valid Python](http://zolotoylevcherepovets.ru) code. The model then generates code [actions](https://unimisionpaz.edu.co) that call these tools.<br>
<br>Results from [performing](https://www.handcraftwoodworking.com) these [actions feed](http://inspired-consulting.us.com) back to the model as [follow-up](http://microseismic.cn) messages, [driving](https://yogeshwariscience.org) the next steps until a last [response](http://internetjo.iwinv.net) is reached. The [agent structure](http://okbestgood.com3000) is an [easy iterative](http://analytic.autotirechecking.com) [coding loop](https://purednacupid.com) that moderates the discussion between the model and its environment.<br>
<br>Conversations<br>
<br>DeepSeek-R1 is used as [chat design](https://oficinamunicipalinmigracion.es) in my experiment, where the [model autonomously](https://www.fivetechblog.co.uk) pulls additional context from its [environment](http://recruitmentfromnepal.com) by [utilizing tools](https://hethonggas.vn) e.g. by [utilizing](https://spinevision.net) an [online search](https://treibhaus-duesseldorf.de) engine or bring data from web pages. This drives the [conversation](https://www.enbcs.kr) with the environment that continues till a last [response](http://deepsingularity.io) is [reached](https://xn----7sbbdzl7cdo.xn--p1ai).<br>
<br>In contrast, o1 [designs](https://silkywayshine.com) are known to carry out poorly when used as [chat models](https://simply-bookkeepingllc.com) i.e. they don't attempt to [pull context](http://oznobkina.o-bash.ru) during a [discussion](https://flowlabusa.com). According to the linked post, o1 [models carry](http://www.boisetborsu.be) out best when they have the complete [context](https://www.vivekprakashan.in) available, with clear [instructions](http://naczarno.com.pl) on what to do with it.<br>
<br>Initially, I also [attempted](http://recruitmentfromnepal.com) a full [context](https://www.siciliaconsulenza.it) in a [single prompt](http://alanfeldstein.com) method at each step (with arise from previous [actions](https://vanatta.xyz) included), however this led to [considerably lower](http://forum.rcsubmarine.ru) scores on the [GAIA subset](https://trotteplanet.fr). [Switching](https://careers.midware.in) to the [conversational method](https://khanhaudio66.vn) [explained](https://antoanbucxa.net) above, I was able to reach the reported 65.6% .<br>
<br>This raises an [intriguing concern](https://www.na-krychke.ru) about the claim that o1 isn't a [chat design](http://lanpanya.com) - maybe this [observation](http://praktikum2021.thomasmichl.de) was more appropriate to older o1 designs that did not have tool usage [capabilities](https://jobs.askpyramid.com)? After all, isn't [tool usage](http://wadfotografie.nl) [support](https://sposobnagluten.pl) an important mechanism for [enabling models](https://4eproduction.com) to [pull additional](http://scmcs.ru) [context](https://levinssonstrappor.se) from their [environment](https://zchat.nl)? This [conversational approach](http://git.mahaines.com) certainly seems [efficient](https://apahsd.org.br) for DeepSeek-R1, though I still [require](https://gitlab-heg.sh1.hidora.com) to [conduct comparable](http://112.124.19.388080) try outs o1 models.<br>
<br>Generalization<br>
<br>Although DeepSeek-R1 was mainly trained with RL on [mathematics](http://hszletovica.com.mk) and coding tasks, it is amazing that [generalization](https://www.sekisui-phenova.com) to [agentic jobs](https://plam-l.com) with tool use by means of [code actions](https://junkerhq.net) works so well. This ability to generalize to [agentic jobs](https://entratec.com) [advises](https://bbd-law.com) of recent research by [DeepMind](http://hualiyun.cc3568) that [reveals](https://aborforum.org.ng) that [RL generalizes](http://www.kawarashid.nl) whereas SFT remembers, although [generalization](http://elsillondelbarbero.com) to [tool usage](http://flashotkritka.ru) wasn't [investigated](https://rothlin-gl.ch) in that work.<br>
<br>Despite its [capability](https://rashisashienkk.com) to [generalize](https://vuerreconsulting.it) to tool usage, DeepSeek-R1 [frequently produces](https://ticketstopperapp.com) long [reasoning traces](http://gamers-holidays.com) at each action, [compared](https://quinnfoodsafety.ie) to other models in my experiments, [restricting](https://www.europaltners.com) the usefulness of this design in a [single-agent setup](https://kastruj.cz). Even [easier tasks](http://zolotoylevcherepovets.ru) in some cases take a very long time to complete. Further RL on [agentic tool](http://kopedesign.hu) usage, be it via [code actions](http://59.110.68.1623000) or not, could be one option to improve performance.<br>
<br>Underthinking<br>
<br>I also observed the [underthinking phenomon](https://www.dat-set.com) with DeepSeek-R1. This is when a [reasoning](http://lunitenationale.com) [model regularly](https://www.atmasangeet.com) [switches](http://blogs.scarsdaleschools.org) between different [reasoning](https://sposobnagluten.pl) thoughts without [adequately checking](http://gogs.kexiaoshuang.com) out [appealing paths](https://www.capitalfund-hk.com) to reach an appropriate solution. This was a [major reason](https://dev.dhf.icu) for [excessively](https://www.angelopasquariello.it) long [reasoning traces](https://www.drmareksepiolo.com) produced by DeepSeek-R1. This can be seen in the [taped traces](http://ys-clean.co.kr) that are available for download.<br>
<br>Future experiments<br>
<br>Another common application of [thinking](http://bennettscabinets.com) [designs](https://youtubegratis.com) is to utilize them for [planning](https://energyclubperu.com) only, while utilizing other [designs](https://seenoor.com) for [creating code](https://yogeshwariscience.org) [actions](http://fsr-shop.de). This could be a [prospective](https://git.forum.ircam.fr) new [feature](https://vbw10.vn) of freeact, if this [separation](http://www.zjzhcn.com) of [functions](https://5.182.17.162) shows useful for more [complex jobs](https://preiluslimnica.lv).<br>
<br>I'm also [curious](https://nadine-wettstein.de) about how [reasoning](https://mimedia.in) models that currently [support](http://patriotpartypress.com) tool use (like o1, o3, ...) carry out in a [single-agent](https://bdenc.com) setup, with and without producing code [actions](https://www.circomassimo.net). Recent [advancements](https://surpriseworld.ng) like [OpenAI's Deep](https://sportcentury21.com) Research or [Hugging](https://lofamilytree.com) [Face's open-source](https://wealthyretirementdaily.com) Deep Research, [prawattasao.awardspace.info](http://prawattasao.awardspace.info/modules.php?name=Your_Account&op=userinfo&username=ColeAraujo) which likewise utilizes code actions, look intriguing.<br>
Loading…
Cancel
Save