From f633f3fde3b15cd580b487de313fc18872201213 Mon Sep 17 00:00:00 2001 From: Adrian Fritzsche Date: Mon, 17 Feb 2025 07:17:30 +0300 Subject: [PATCH] Add 'Understanding DeepSeek R1' --- Understanding-DeepSeek-R1.md | 92 ++++++++++++++++++++++++++++++++++++ 1 file changed, 92 insertions(+) create mode 100644 Understanding-DeepSeek-R1.md diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md new file mode 100644 index 0000000..c9d4e01 --- /dev/null +++ b/Understanding-DeepSeek-R1.md @@ -0,0 +1,92 @@ +
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match or even surpass OpenAI's o1 model on many benchmarks, it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.
+
What makes DeepSeek-R1 especially interesting is its openness. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper.
The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
+
Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that's still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
The Essentials
+
The DeepSeek-R1 paper presented several models, but the main ones among them are R1 and R1-Zero. Following these is a series of distilled models that, while interesting, I won't discuss here.
+
DeepSeek-R1 uses two major concepts:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>...</think>` tag, before answering with a final summary.
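To make the format concrete, here's a tiny sketch (my own illustration, assuming the `<think>...</think>` convention above) of splitting an R1-style completion into its reasoning trace and final answer:

```python
import re

def split_reasoning(completion: str) -> tuple[str, str]:
    """Split an R1-style completion into (reasoning, final_answer).

    Assumes the model wraps its chain-of-thought in <think>...</think>
    and puts the final summary after the closing tag.
    """
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        return "", completion.strip()
    reasoning = match.group(1).strip()
    answer = completion[match.end():].strip()
    return reasoning, answer

example = "<think>2 + 2 is 4, and 4 * 3 is 12.</think>The answer is 12."
thoughts, answer = split_reasoning(example)
print(answer)  # -> The answer is 12.
```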
+
R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
R1-Zero attains excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both accuracy and readability.
+
It is fascinating how some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.
+
Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is immensely fascinating. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.
+
It's interesting that their training pipeline differs from the usual one:
+
The usual training approach: Pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
R1-Zero: Pretrained → RL
R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages
+
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This provides a good model to start RL from.
First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general capabilities, e.g., poor formatting and language mixing.
Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
Second RL Stage: Add more reward signals (helpfulness, harmlessness) to fine-tune the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.
+
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
The teacher is typically a larger model than the student.
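As a rough illustration of the data-generation half of distillation (not DeepSeek's actual code; the model name and prompts are placeholders), the teacher simply generates reasoning traces that the student is later fine-tuned on with ordinary SFT:

```python
# Minimal sketch: use a teacher model to generate reasoning traces that a
# smaller student model can later be fine-tuned on. Model name and prompts
# are placeholders, not DeepSeek's actual setup.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "deepseek-ai/DeepSeek-R1"  # placeholder; far too large for one GPU
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, device_map="auto")

prompts = ["What is 17 * 24?", "Prove that the sum of two even numbers is even."]

with open("distillation_data.jsonl", "w") as f:
    for prompt in prompts:
        inputs = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(teacher.device)
        output = teacher.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=0.6)
        completion = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
        # Each line becomes one SFT example for the student model.
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```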
+
Group Relative Policy Optimization (GRPO)
+
The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected formatting, and if the language of the answer matches that of the prompt.
Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
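Here's a minimal sketch of what such rule-based reward functions could look like; the specific rules and weights are my own illustration, not the ones from the paper:

```python
import re

def format_reward(completion: str) -> float:
    """Reward completions that wrap their reasoning in <think>...</think>."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """Reward completions whose final answer matches the reference.

    Here we just check that the reference string appears after the closing
    </think> tag; real verifiers parse math answers or execute code.
    """
    final_part = completion.split("</think>")[-1]
    return 1.0 if reference_answer.strip() in final_part else 0.0

def total_reward(completion: str, reference_answer: str) -> float:
    # Illustrative weighting: correctness matters more than formatting.
    return 2.0 * accuracy_reward(completion, reference_answer) + format_reward(completion)

print(total_reward("<think>6 * 7 = 42</think>The answer is 42.", "42"))  # 3.0
```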
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
+
1. For each input prompt, the model generates a group of different responses.
2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't drift too far from its original behavior.
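Here's a minimal sketch of the group-relative advantage computation from step 3; the clipped policy update and KL penalty are left out:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each response's reward against its own group.

    GRPO replaces a learned critic with this simple baseline: the advantage
    of response i is (r_i - mean(group)) / std(group).
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Four sampled responses to the same prompt, scored by rule-based rewards.
rewards = [3.0, 0.0, 1.0, 3.0]
print(group_relative_advantages(rewards))
# Responses 1 and 4 get positive advantages; responses 2 and 3 get negative ones.
```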
+
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a bonus when the model correctly uses the expected syntax, to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
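If you want to experiment yourself, TRL's `GRPOTrainer` lets you plug rule-based reward functions straight in. The sketch below follows my reading of the TRL quickstart (the model, dataset, and reward function are placeholders), so check the current TRL docs for the exact signatures:

```python
# Minimal GRPO fine-tuning sketch with TRL. Model, dataset, and reward
# function are illustrative placeholders; verify the up-to-date
# GRPOTrainer/GRPOConfig signatures against the TRL documentation.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

def reward_think_tags(completions, **kwargs):
    # Rule-based reward: +1 if the completion wraps reasoning in <think> tags.
    return [1.0 if "<think>" in c and "</think>" in c else 0.0 for c in completions]

training_args = GRPOConfig(output_dir="qwen-grpo-test", num_generations=4)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_think_tags,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```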
+
Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the methods they've presented in their paper, I'd like to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
These findings suggest that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
+
Simply put, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
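One way to build intuition for this "shaping" view is pass@k: if the base model already solves a task within k samples, RL mostly raises the chance that the very first sample is the correct one. The numbers below are made up purely for illustration:

```python
# Toy illustration of the "RL sharpens, pretraining provides the capability" view.
# p is the probability that a single sample is correct; pass@k is the chance
# that at least one of k samples is correct. All numbers are hypothetical.

def pass_at_k(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

base_p, rl_p = 0.2, 0.6  # hypothetical single-sample accuracies before/after RL

for k in (1, 4, 16, 64):
    print(f"k={k:>2}  base pass@k={pass_at_k(base_p, k):.2f}  "
          f"RL pass@k={pass_at_k(rl_p, k):.2f}")
# At k=1 RL looks far better, but at large k the base model closes most of
# the gap: the correct answers were already in its distribution.
```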
+
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
+
Running DeepSeek-R1
+
I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
+
671B via llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized by Unsloth, with a 4-bit quantized KV cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
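For reference, here's roughly what the same partial-offload setup looks like through the llama-cpp-python bindings instead of the llama.cpp CLI I used; the model path is a placeholder and the KV-cache quantization options are omitted:

```python
# Sketch of partial GPU offloading with llama-cpp-python (not the exact CLI
# invocation used for the run above). The path and context size are
# placeholders; the 4-bit KV-cache quantization is configured via llama.cpp's
# cache-type options, which are left out here.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S.gguf",  # placeholder filename
    n_gpu_layers=29,  # offload 29 layers to the H100, the rest stays on CPU
    n_ctx=8192,
)

output = llm(
    "How many r's are in the word strawberry?",
    max_tokens=512,
    temperature=0.6,
)
print(output["choices"][0]["text"])
```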
+
29 layers seemed to be the sweet spot given this configuration.
+
Performance:
+
An r/localllama user described being able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
+
As you can see, the tokens/s isn't really usable for any serious work, but it's fun to run these huge models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than for other models, but their usefulness is also usually higher.
We need to both maximize usefulness and minimize time-to-usefulness.
+
70B via Ollama
+
70.6b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
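To reproduce this kind of run programmatically, a minimal sketch with the `ollama` Python client could look like this (the `deepseek-r1:70b` tag is my assumption for the 4-bit 70B distill; check `ollama list` for the exact tag you have pulled):

```python
# Minimal sketch using the ollama Python client. The model tag is an
# assumption; verify it with `ollama list` after pulling the model.
import ollama

response = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "How many r's are in the word strawberry?"}],
)
print(response["message"]["content"])
```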
+
GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of the 671B model that I showcased above.
+
Resources
+
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
The Illustrated DeepSeek-R1 - by Jay Alammar
Explainer: What's R1 & Everything Else? - Tim Kellogg
DeepSeek R1 Explained to your grandma - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com
- GitHub - deepseek-ai/DeepSeek-R1
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
Interesting events
+
- Hong Kong University reproduces R1 results (Jan 25, '25).
- Hugging Face announces huggingface/open-r1: a fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
- An OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.
+
Liked this post? Join the newsletter.
\ No newline at end of file