From f633f3fde3b15cd580b487de313fc18872201213 Mon Sep 17 00:00:00 2001 From: Adrian Fritzsche Date: Mon, 17 Feb 2025 07:17:30 +0300 Subject: [PATCH] Add 'Understanding DeepSeek R1' --- Understanding-DeepSeek-R1.md | 92 ++++++++++++++++++++++++++++++++++++ 1 file changed, 92 insertions(+) create mode 100644 Understanding-DeepSeek-R1.md diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md new file mode 100644 index 0000000..c9d4e01 --- /dev/null +++ b/Understanding-DeepSeek-R1.md @@ -0,0 +1,92 @@ +
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match or even surpass OpenAI's o1 model on many benchmarks, it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.
+
What makes DeepSeek-R1 especially interesting is its openness. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper.
The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
+
Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that's still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
The Essentials
+
The DeepSeek-R1 paper presented several models, but the main ones among them are R1 and R1-Zero. Following these is a series of distilled models that, while interesting, I won't discuss here.
+
DeepSeek-R1 uses two major concepts:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>...</think>` tag, before answering with a final summary.
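To make the format concrete, here's a tiny sketch (my own illustration, assuming the `<think>...</think>` convention above) of splitting an R1-style completion into its reasoning trace and final answer:

```python
import re

def split_reasoning(completion: str) -> tuple[str, str]:
    """Split an R1-style completion into (reasoning, final_answer).

    Assumes the model wraps its chain-of-thought in <think>...</think>
    and puts the final summary after the closing tag.
    """
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        return "", completion.strip()
    reasoning = match.group(1).strip()
    answer = completion[match.end():].strip()
    return reasoning, answer

example = "<think>2 + 2 is 4, and 4 * 3 is 12.</think>The answer is 12."
thoughts, answer = split_reasoning(example)
print(answer)  # -> The answer is 12.
```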
+
R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
R1-Zero attains excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both accuracy and readability.
+
It is fascinating how some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.
+
Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is immensely fascinating. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.
+
It's interesting that their training pipeline differs from the usual one:
+
The usual training approach: Pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
R1-Zero: Pretrained → RL
R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages
+
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This provides a good model to start RL from.
First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general capabilities, e.g., poor formatting and language mixing.
Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
Second RL Stage: Add more reward signals (helpfulness, harmlessness) to fine-tune the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.
+
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
The teacher is typically a larger model than the student.
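As a rough illustration of the data-generation half of distillation (not DeepSeek's actual code; the model name and prompts are placeholders), the teacher simply generates reasoning traces that the student is later fine-tuned on with ordinary SFT:

```python
# Minimal sketch: use a teacher model to generate reasoning traces that a
# smaller student model can later be fine-tuned on. Model name and prompts
# are placeholders, not DeepSeek's actual setup.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "deepseek-ai/DeepSeek-R1"  # placeholder; far too large for one GPU
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, device_map="auto")

prompts = ["What is 17 * 24?", "Prove that the sum of two even numbers is even."]

with open("distillation_data.jsonl", "w") as f:
    for prompt in prompts:
        inputs = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(teacher.device)
        output = teacher.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=0.6)
        completion = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
        # Each line becomes one SFT example for the student model.
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```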
+
Group Relative Policy Optimization (GRPO)
+
The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected formatting, and if the language of the answer matches that of the prompt.
Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
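Here's a minimal sketch of what such rule-based reward functions could look like; the specific rules and weights are my own illustration, not the ones from the paper:

```python
import re

def format_reward(completion: str) -> float:
    """Reward completions that wrap their reasoning in <think>...</think>."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """Reward completions whose final answer matches the reference.

    Here we just check that the reference string appears after the closing
    </think> tag; real verifiers parse math answers or execute code.
    """
    final_part = completion.split("</think>")[-1]
    return 1.0 if reference_answer.strip() in final_part else 0.0

def total_reward(completion: str, reference_answer: str) -> float:
    # Illustrative weighting: correctness matters more than formatting.
    return 2.0 * accuracy_reward(completion, reference_answer) + format_reward(completion)

print(total_reward("<think>6 * 7 = 42</think>The answer is 42.", "42"))  # 3.0
```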
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
+
1. For each input prompt, the model generates a group of different responses.
2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't drift too far from its original behavior.
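Here's a minimal sketch of the group-relative advantage computation from step 3; the clipped policy update and KL penalty are left out:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each response's reward against its own group.

    GRPO replaces a learned critic with this simple baseline: the advantage
    of response i is (r_i - mean(group)) / std(group).
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Four sampled responses to the same prompt, scored by rule-based rewards.
rewards = [3.0, 0.0, 1.0, 3.0]
print(group_relative_advantages(rewards))
# Responses 1 and 4 get positive advantages; responses 2 and 3 get negative ones.
```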
+
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a bonus when the model correctly uses the expected syntax, to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
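If you want to experiment yourself, TRL's `GRPOTrainer` lets you plug rule-based reward functions straight in. The sketch below follows my reading of the TRL quickstart (the model, dataset, and reward function are placeholders), so check the current TRL docs for the exact signatures:

```python
# Minimal GRPO fine-tuning sketch with TRL. Model, dataset, and reward
# function are illustrative placeholders; verify the up-to-date
# GRPOTrainer/GRPOConfig signatures against the TRL documentation.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

def reward_think_tags(completions, **kwargs):
    # Rule-based reward: +1 if the completion wraps reasoning in <think> tags.
    return [1.0 if "<think>" in c and "</think>" in c else 0.0 for c in completions]

training_args = GRPOConfig(output_dir="qwen-grpo-test", num_generations=4)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_think_tags,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```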
+
Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the methods they've presented in their paper, I'd like to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
These findings suggest that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
+
Simply put, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
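One way to build intuition for this "shaping" view is pass@k: if the base model already solves a task within k samples, RL mostly raises the chance that the very first sample is the correct one. The numbers below are made up purely for illustration:

```python
# Toy illustration of the "RL sharpens, pretraining provides the capability" view.
# p is the probability that a single sample is correct; pass@k is the chance
# that at least one of k samples is correct. All numbers are hypothetical.

def pass_at_k(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

base_p, rl_p = 0.2, 0.6  # hypothetical single-sample accuracies before/after RL

for k in (1, 4, 16, 64):
    print(f"k={k:>2}  base pass@k={pass_at_k(base_p, k):.2f}  "
          f"RL pass@k={pass_at_k(rl_p, k):.2f}")
# At k=1 RL looks far better, but at large k the base model closes most of
# the gap: the correct answers were already in its distribution.
```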
+
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
+
Running DeepSeek-R1
+
I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
+
671B via llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized by Unsloth, with a 4-bit quantized KV cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
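For reference, here's roughly what the same partial-offload setup looks like through the llama-cpp-python bindings instead of the llama.cpp CLI I used; the model path is a placeholder and the KV-cache quantization options are omitted:

```python
# Sketch of partial GPU offloading with llama-cpp-python (not the exact CLI
# invocation used for the run above). The path and context size are
# placeholders; the 4-bit KV-cache quantization is configured via llama.cpp's
# cache-type options, which are left out here.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S.gguf",  # placeholder filename
    n_gpu_layers=29,  # offload 29 layers to the H100, the rest stays on CPU
    n_ctx=8192,
)

output = llm(
    "How many r's are in the word strawberry?",
    max_tokens=512,
    temperature=0.6,
)
print(output["choices"][0]["text"])
```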
+
29 layers seemed to be the sweet spot given this configuration.
+
Performance:
+
An r/localllama user described being able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
+
As you can see, the tokens/s isn't really usable for any serious work, but it's fun to run these huge models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than for other models, but their usefulness is also usually higher.
We need to both maximize usefulness and minimize time-to-usefulness.
+
70B via Ollama
+
70.6b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
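To reproduce this kind of run programmatically, a minimal sketch with the `ollama` Python client could look like this (the `deepseek-r1:70b` tag is my assumption for the 4-bit 70B distill; check `ollama list` for the exact tag you have pulled):

```python
# Minimal sketch using the ollama Python client. The model tag is an
# assumption; verify it with `ollama list` after pulling the model.
import ollama

response = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "How many r's are in the word strawberry?"}],
)
print(response["message"]["content"])
```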
+
GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of the 671B model that I showcased above.
+
Resources
+
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
The Illustrated DeepSeek-R1 - by Jay Alammar
Explainer: What's R1 & Everything Else? - Tim Kellogg
DeepSeek R1 Explained to your grandma - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com
- GitHub - deepseek-ai/DeepSeek-R1
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
Interesting events
+
- Hong Kong University reproduces R1 results (Jan 25, '25).
- Hugging Face announces huggingface/open-r1: a fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
- An OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.
+
Liked this post? Join the newsletter.
\ No newline at end of file