diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md
new file mode 100644
index 0000000..4a64bde
--- /dev/null
+++ b/Understanding-DeepSeek-R1.md
@@ -0,0 +1,92 @@
+
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match or even surpass OpenAI's o1 model in many benchmarks, but it also comes with fully MIT-licensed weights. This marks it as the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.
+
What makes DeepSeek-R1 particularly exciting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper.
+The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
+
Until ~GPT-4, the common wisdom was that better models required more data and compute. While that's still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
The Essentials
+
The DeepSeek-R1 paper presented multiple models, but the main ones among them are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.
+
DeepSeek-R1 uses two major concepts:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
+2. Group Relative Policy Optimization (GRPO), a reinforcement learning approach that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>` tag, before answering with a final summary.
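As a rough illustration, here is a minimal Python sketch of splitting such a completion into its reasoning and its final answer, assuming the `<think>...</think>` tag convention described above:

```python
import re

def split_reasoning(completion: str) -> tuple[str, str]:
    """Split an R1-style completion into (reasoning, final_answer).

    Assumes the chain-of-thought is wrapped in <think>...</think>
    before the user-facing summary.
    """
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        return "", completion.strip()          # no reasoning block found
    reasoning = match.group(1).strip()
    answer = completion[match.end():].strip()  # everything after </think>
    return reasoning, answer

reasoning, answer = split_reasoning(
    "<think>2 + 2 = 4, double-checking... yes.</think>The answer is 4."
)
print(answer)  # -> The answer is 4.
```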
+
R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
+R1-Zero attains excellent accuracy but often produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.
+
It is interesting how some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.
+
Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.
+
It's interesting that their training pipeline differs from the usual one:
+
The usual training approach: Pretraining on a big dataset (trained to predict the next word) to get the base model → supervised fine-tuning → preference tuning through RLHF
+R1-Zero: Pretrained → RL
+R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages
+
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to make sure the RL process has a decent starting point. This gives a good model to start RL from.
+First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general capabilities, e.g., poor formatting and language mixing.
+Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
+Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
+Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
+They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.
+
Model distillation is a method where you use a teacher model to improve a student model by having the teacher generate training data for the student.
+The teacher is typically a larger model than the student.
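A minimal sketch of that idea follows; `teacher_generate` and `finetune` are hypothetical stand-ins for whatever generation and SFT tooling you use, not DeepSeek's actual code:

```python
def distill(teacher_generate, finetune, student_model, prompts):
    """Knowledge distillation via teacher-generated reasoning traces.

    teacher_generate(prompt) -> completion with reasoning  (hypothetical helper)
    finetune(model, dataset) -> fine-tuned model           (hypothetical helper)
    """
    # 1. Sample reasoning traces from the larger teacher model.
    dataset = [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]
    # 2. Plain supervised fine-tuning of the smaller student on those traces;
    #    for the distilled R1 models this step involved no RL.
    return finetune(student_model, dataset)
```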
+
Group Relative Policy Optimization (GRPO)
+
The fundamental idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful responses.
+They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these criteria.
+
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
+Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
+Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected thinking/answer format, and if the language of the answer matches that of the prompt.
+Not depending on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
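To make this concrete, here is a toy rule-based reward in Python. The exact checks and weights DeepSeek used are not reproduced here; this only illustrates how cheap, verifiable criteria can replace a learned reward model:

```python
import re

def rule_based_reward(prompt: str, completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: no learned reward model involved."""
    reward = 0.0

    # Format: chain-of-thought should sit inside <think>...</think>.
    if re.search(r"<think>.+?</think>", completion, flags=re.DOTALL):
        reward += 0.5

    # Correctness: compare the text after </think> against a known answer
    # (practical for math/code tasks, where answers can be checked automatically).
    final = completion.split("</think>")[-1].strip()
    if reference_answer in final:
        reward += 1.0

    # Language consistency: crude proxy that the answer stays in the
    # prompt's script (both mostly ASCII, or both mostly not).
    if prompt.isascii() == final.isascii():
        reward += 0.25

    return reward
```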
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works (a small code sketch follows the list):
+
1. For each input prompt, the model generates a group of different responses.
+2. Each response gets a scalar reward based on factors like correctness, formatting, and language consistency.
+3. Rewards are normalized relative to the group's performance, essentially measuring how much better each response is compared to the others.
+4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't drift too far from its original behavior.
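Here is a small PyTorch-style sketch of the core computation: group-relative advantages plus a clipped, KL-penalized policy loss. The coefficients and the sequence-level simplification are illustrative, not DeepSeek's exact formulation:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards for the G responses
    sampled per prompt. Each response is standardized against its own group,
    which is what removes the need for a separate critic/value network."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_loss(logp_new, logp_old, advantages, kl_to_ref, clip_eps=0.2, kl_coef=0.04):
    """Clipped surrogate objective plus a KL penalty toward the reference policy.

    logp_new / logp_old: log-probabilities of each sampled response under the
    current and the sampling policy; kl_to_ref: KL estimate to the frozen
    reference model. Coefficients are illustrative, not DeepSeek's settings.
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    return -(surrogate - kl_coef * kl_to_ref).mean()
```

In the full algorithm the advantage is applied per token of each response and the KL term uses an unbiased estimator, but the group-relative normalization above is the piece that replaces the critic.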
+
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions (for instance, awarding a bonus when the model correctly uses the thinking syntax) to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource (a short usage sketch follows below).
+Finally, Yannic Kilcher has an excellent video explaining GRPO by going through the DeepSeekMath paper.
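For reference, a GRPO run with TRL looks roughly like the following. Argument names can shift between TRL versions, and the tiny Qwen model plus the length-based reward are just placeholders to keep the sketch self-contained:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy rule-based reward: prefer completions close to 100 characters.
    return [-abs(100 - len(c)) / 100 for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # any dataset with a "prompt" column

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-demo", num_generations=4),
    train_dataset=dataset,
)
trainer.train()
```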
+
Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the approaches they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
"These findings suggest that RL boosts the model's overall performance by rendering the output distribution more robust; in other words, it seems the improvement is attributed to boosting the correct response from TopK rather than an enhancement of fundamental capabilities."
+
Simply put, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct responses) is largely already present in the pretrained model.
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
+Consequently, while RL techniques such as PPO and GRPO can produce considerable performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
+
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
+
Running DeepSeek-R1
+
I've used DeepSeek-R1 through the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
+The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
+
671B by means of Llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
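An invocation along these lines should reproduce the setup (flag names vary across llama.cpp builds, and the GGUF path below is a placeholder for Unsloth's UD-IQ1_S shards):

```bash
./llama-cli \
    --model DeepSeek-R1-UD-IQ1_S.gguf \
    --cache-type-k q4_0 \
    --n-gpu-layers 29 \
    --threads 16 \
    --prompt "<|User|>Why is the sky blue?<|Assistant|>"
```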
+
29 layers seemed to be the sweet spot given this setup.
+
Performance:
+
A r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
+Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
+
As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these huge models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is also often higher.
+We need to both maximize usefulness and minimize time-to-usefulness.
+
70B via Ollama
+
70.6b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
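For reference, the Ollama library publishes this distill under a tag along these lines (check the library for the exact current name):

```bash
# Pulls the 4-bit (Q4_K_M) 70B distill and starts an interactive chat.
ollama run deepseek-r1:70b
```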
+
GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B showcased above.
+
Resources
+
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
+[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
+DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube).
+DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs.
+The Illustrated DeepSeek-R1 - by Jay Alammar.
+Explainer: What's R1 & Everything Else? - Tim Kellogg.
+DeepSeek R1 Explained to your grandma - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com.
+GitHub - deepseek-ai/DeepSeek-R1.
+deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
+DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It presents a detailed methodology for training such models using large-scale reinforcement learning techniques.
+DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
+DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper explores scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
+DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to improve code generation and infilling.
+DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
+DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
Interesting events
+
- Hong Kong University replicates R1 results (Jan 25, '25).
+- Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
+- OpenAI researcher confirms the DeepSeek team independently discovered and used some core ideas the OpenAI team used on the way to o1.
+
Liked this post? Join the newsletter.
\ No newline at end of file