🎷🐛

Search

❯

optimal

Oct 03, 2024, 1 min read

Policy gradient methods work by computing an estimator of the policy gradient and plugging it into a stochastic gradient ascent algorithm. The most commonly used gradient estimator has the form.

g^{τ} = E^{τ} [\nabla_{θ} lo g π_{θ} (a_{t} ∣ s_{t}) A_{t}]

Graph View

Backlinks

No backlinks found

GitHub
Discord Community

🎷🐛

Explorer

optimal

Graph View

Backlinks