# Reinforcement Learning with Action Chunking & Human-in-the-Loop in LeRobot
Juan Pizarro https://linkedin.com/in/jpizarrom/
Hacker Room Demo Day - Dec 2025
---

# [Juan Pizarro](https://linkedin.com/in/jpizarrom/)

- Deep dive and R&D focused on RL & Robotics
- Former (Selected):
  - MLE @ [Holocene](https://holocene.eu/)
  - SWE @ [FlixMobility](https://flixbus.com/)
  - Master's in AI @ [Universitat Politècnica de València](https://www.upv.es/estudios/master/muiarfid/)
  - Lead Software Engineer @ [Admetricks](https://admetricks.com/), recently acquired by Similarweb
  - Lead Software Engineer @ [A-Dedo](https://allrideapp.com/)
  - Computer Science Engineer @ [Universidad de Talca](https://www.utalca.cl/en/)

---

# LeRobot merged HIL-SERL (Jun 2025)
HIL-SERL: Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning (Oct 2024) https://arxiv.org/abs/2410.21845
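The core pattern HIL-SERL builds on can be sketched in a few lines: during online rollouts, a human operator may override the policy's action at any step, and overridden transitions are flagged so the learner can treat them as expert data. This is a minimal illustration with hypothetical `policy`, `env`, `human`, and `buffer` objects, not LeRobot's actual API:

```python
def hil_rollout(policy, env, human, buffer, max_steps=200):
    """One HIL-SERL-style episode: the human may override the policy's
    action at any step; overridden transitions are flagged so the
    learner can weight them as expert corrections."""
    obs = env.reset()
    for _ in range(max_steps):
        action = policy(obs)
        intervened = human.wants_to_intervene(obs)
        if intervened:
            action = human.action(obs)  # teleop action wins
        next_obs, reward, done = env.step(action)
        buffer.append((obs, action, reward, next_obs, done, intervened))
        obs = next_obs
        if done:
            break
    return buffer
```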
---

# Push Cube: HIL-SERL SAC
[More info](https://www.linkedin.com/posts/jpizarrom_lerobot-lerobot-hackathon-activity-7340867714963451904-4TyW?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAGa8sABWPeYPEmnbrttY0kcm7RRjf9QG_0)
--

Top 10 at Hugging Face LeRobot Worldwide Hackathon.

**Task:**
- Push cylinder into target & retract without disturbing it. Avoid excessive force on boundary tape.

**Methodology:**
- **Algorithm:** HIL-SERL (SAC) on SO-100 arms.
- **Training:** 40 episodes, 80k steps (~10k seconds).

--

**Outcome:**
- Model learned to handle physical constraints.
- Successfully completed the sequence autonomously.

---

# Flow Q-Learning
Flow Q-Learning (Feb 2025) https://arxiv.org/abs/2502.02538
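The key idea in FQL is that a one-step actor is trained against two pulls: stay close to an expressive flow-matching BC policy (distillation) while maximizing the learned Q-value. A toy scalar sketch of that objective, with hypothetical names and weighting (see the paper for the real formulation):

```python
def fql_actor_loss(one_step_action, flow_action, q_value, alpha=10.0):
    """FQL-style one-step actor objective (toy scalar version):
    distillation term keeps the actor near the flow policy's denoised
    action; the -Q term pushes it toward high-value actions."""
    distill = sum((a - b) ** 2 for a, b in zip(one_step_action, flow_action))
    return alpha * distill - q_value  # minimized by the actor
```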
---

# Reinforcement Learning with Action Chunking
Reinforcement Learning with Action Chunking (Jul 2025) https://arxiv.org/abs/2507.07969
--

# QC-FQL
Reinforcement Learning with Action Chunking (Jul 2025) https://arxiv.org/abs/2507.07969
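With action chunking, the critic evaluates a whole chunk Q(s_t, a_t ... a_{t+H-1}), so the TD backup accumulates H rewards before bootstrapping. A minimal sketch of that chunked n-step target (hypothetical helper, ignoring termination handling for brevity):

```python
def chunked_td_target(rewards, gamma, next_q):
    """Q-chunking style backup: sum the H rewards collected while the
    action chunk executes, then bootstrap with the critic's value of
    the next chunk, discounted by gamma**H."""
    target, discount = 0.0, 1.0
    for r in rewards:  # one reward per step of the chunk
        target += discount * r
        discount *= gamma
    return target + discount * next_q  # bootstrap after the chunk
```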
---

# ACFQL Pick-and-Place
https://github.com/huggingface/lerobot/pull/1818
[More info](https://www.linkedin.com/posts/jpizarrom_reinforcementlearning-machinelearning-robotics-activity-7372300995369836545-uPQ4?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAGa8sABWPeYPEmnbrttY0kcm7RRjf9QG_0)
--

**Goal:** Pick up cubes and place them in a designated area.

**Methodology:** QC-FQL (ACFQL) in LeRobot via WSRL (Offline -> Online).

**Training:**
- **Offline:** 100k optimization steps on 40 successful episodes
- **Online:** 3 phases
  - Pick one cube:
    - 12k optimization steps online at 0.7 Hz
    - 781 episodes, 126422 transitions
  - Pick two cubes:
    - 12k optimization steps online at 0.7 Hz
    - 449 episodes, 130000 transitions
  - Pick three and four cubes:
    - 12k optimization steps online at 0.7 Hz
    - 522 episodes, 127103 transitions

--

**Notes:**
- Frequent human interventions were required.
- Switched teleop from leader arm to keyboard.
- Changed gripper orientation multiple times.
- Adjusted end_effector_bounds, end_effector_step_sizes, and fixed_reset_joint_positions many times during online finetuning.

---

# ACFQL on Franka arm (8D)
Humanoid Manipulation Hackathon https://luma.com/0bwvwejk?tk=ZrmKdy
[More info](https://www.linkedin.com/posts/jpizarrom_robotics-reinforcementlearning-embodiedai-activity-7375897955939074050-VLhm?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAGa8sABWPeYPEmnbrttY0kcm7RRjf9QG_0)
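A safety fence of the kind used here (and configured via end_effector_bounds / end_effector_step_sizes in the earlier experiments) can be sketched as a simple clamp: scale the teleop or policy delta per axis, then clip the target end-effector position to the workspace bounds. This is an illustrative helper, not LeRobot's implementation:

```python
def clamp_delta_action(ee_pos, delta, bounds, step_size):
    """Virtual-fence sketch: scale each delta-action component by a
    per-axis step size, then clamp the resulting target end-effector
    position to stay inside the configured workspace bounds."""
    target = []
    for p, d, (lo, hi), s in zip(ee_pos, delta, bounds, step_size):
        target.append(min(max(p + d * s, lo), hi))
    return target
```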
--

**Event:** Humanoid Manipulation & AI Hackathon, Munich.

**Methodology:**
- **Setup:** Defined safety limits (virtual fence).
- **Training:** Overnight offline training (40 demos) + online HIL interventions.
- **Teleop:** Keyboard with delta actions.

--

**Achievement:** Integrated Franka arms into ACFQL in LeRobot in **3 days**.

**Outcome:**
- Trained a pick-and-place policy on industrial hardware within the hackathon timeframe.

---

# ACFQL + gripper rotation (7D)
[More info](https://www.linkedin.com/posts/jpizarrom_lerobot-robotics-ai-activity-7391492153967005697-NVzJ?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAGa8sABWPeYPEmnbrttY0kcm7RRjf9QG_0)
--

**Behavior:** Policy rotates the gripper to align with cube geometry.

**Configuration:**
- **Input:** 2 cameras (128x128), proprioception.
- **Teleop:** Gamepad with delta actions.

**Training:**
- **Offline:**
  - 400 successful episodes
  - 500k optimization steps
- **Online:**
  - 5k initial transitions from the offline RL policy
  - No offline data is used during online RL
  - 5k optimization steps at 1.4 Hz
[s1lent4gnt](https://github.com/s1lent4gnt) refactored ACFQL to follow the paper's code more strictly and improved training speed
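The warm-start step above ("5k initial transitions from the offline RL policy, no offline data during online RL") follows the WSRL recipe: before online training begins, a fresh replay buffer is filled with rollouts from the offline-trained policy rather than retained offline data. A minimal sketch with hypothetical `policy` and `env` objects:

```python
def warm_start_buffer(policy, env, n_transitions):
    """WSRL-style warm start: seed a fresh replay buffer with
    transitions collected by the offline-trained policy, so online RL
    can start without retaining the offline dataset."""
    buffer, obs = [], env.reset()
    while len(buffer) < n_transitions:
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        buffer.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs
    return buffer
```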
---

# Multi-Stage Pick-and-Place 20FPS
[More info](https://www.linkedin.com/posts/jpizarrom_lerobot-robotics-ai-activity-7395201262834618368-eAkn?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAGa8sABWPeYPEmnbrttY0kcm7RRjf9QG_0)
--

**Task:** Pick object -> Place in drawer -> Close drawer.

**Training:**
- **Offline:**
  - 160 successful episodes
  - Trained for 500k optimization steps
- **Online:**
  - 5k initial transitions from the offline RL policy
  - 100k optimization steps at 8.5 Hz
  - 298 episodes, 132153 transitions
  - Interventions focused on the "closing" motion and precise grasping.

---

# Pick-and-Place and Stacking
[More info](https://www.linkedin.com/posts/jpizarrom_lerobot-huggingface-robotics-activity-7398748715060137986-8L9d?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAGa8sABWPeYPEmnbrttY0kcm7RRjf9QG_0)
--

**Task:** Stack two cubes with text facing outward (40s horizon).

**Iterative Pipeline:**
- **Teleop:** 100 demos with leader SO101 arm.
- **Teleop:** 100 more demos with leader SO101 arm.
- **Offline RL:** Trained initial policy on 200 demos.
- **Online RL:**
  - Interventions with SO100 arm and gamepad teleop.
  - Generated 181 additional successful episodes.
- **Offline RL:** Combined 381 episodes.
- **Online RL:** Further refinements.
  - Interventions via gamepad teleop.

--

**Outcome:** Iterative RL - Demo -> Offline -> Online -> Offline -> Online
- Grew the dataset progressively with policy rollouts plus interventions.
- Demonstrations and initial interventions used the leader SO101 arm.
- Transitioned to gamepad teleop when interventions became simpler.

---

# VLA and RECAP-style Advantage Signals as BC
Inspired by π∗0.6: a VLA That Learns From Experience (Nov 2025) https://arxiv.org/pdf/2511.14759
[More info](https://www.linkedin.com/posts/jpizarrom_lerobot-huggingface-robotics-activity-7401974115819069440-_Dqm?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAGa8sABWPeYPEmnbrttY0kcm7RRjf9QG_0)
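The RECAP-style signal used in this experiment can be sketched as: binarize the advantage A(s,a) = Q(s,a) - V(s) into good/bad conditioning labels for the BC loss, and randomly drop a fraction of labels to null so the model also learns the unconditional action distribution (enabling CFG-style guidance). The function below is an illustrative sketch with hypothetical names, not the paper's implementation:

```python
import random

def recap_bc_labels(q_values, v_values, drop_prob=0.1, rng=random):
    """RECAP-style labels for advantage-conditioned BC (sketch):
    A(s,a) = Q(s,a) - V(s) is thresholded into a binary good/bad
    label; ~drop_prob of labels are set to None (unconditioned) so a
    classifier-free-guidance-style model can be trained."""
    labels = []
    for q, v in zip(q_values, v_values):
        label = "good" if q - v >= 0.0 else "bad"
        if rng.random() < drop_prob:
            label = None  # label dropout: unconditional sample
        labels.append(label)
    return labels
```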
--

**Experiment:** VLA model as Behavior Cloning actor in ACFQL + HIL-SERL.

**Methodology:**
- **BC Actor:** SmolVLA finetuned with **RECAP-style advantage signals**.

**Work in Progress:**
- Train a value net with return-to-go / MC returns, continuous V(s) estimates.
- Train a critic with Q-chunking: Q(s_t, a_t ... a_t+H).
- Compute advantages: A(s,a) = Q(s,a) - V(s).
- Binarize advantages for the RECAP-style BC loss.
- Label dropout: randomly set ~10% of labels to null to learn the unconditional distribution for BC/VLA.
- Guided denoising via CFG (conditional "good" path) for distilling VLA/BC into the one-step actor.
- QWR on the one-step actor with Flow Q-Learning.

**Preliminary Results:**
- **Efficiency:** Improved behavior after just **10k optimization steps**.
- **Insight:** Pretrained knowledge accelerates learning vs. learning from scratch.
- **Distillation:** VLA (BC) guided policy learning.

---

# ToDo's

- Current focus: feasibility of ACFQL + HIL-SERL learning approaches
- Next steps: enhance system reliability and robustness
- Future work: upgrade architecture for multi-task and long-horizon capabilities

---

# Learnings

**Buffer Management:**
- Single replay buffer initialized with offline policy rollouts + interventions.
- Policy autonomy maximized during warm-start (requires a strong initial model: 200-400 episodes).
- Latest experiments: successful episodes only for warm-start (vs. mixed success/failure previously).

**Online RL Session Strategy:**
- Multiple sequential sessions (max 120k-240k transitions).
- Buffer reset between sessions with a fresh warm-start from the latest policy.
- Challenge: larger buffers → delayed correction effects due to random sampling.

**Iterative Training Pipeline:**
- Demo → Offline RL → Online RL → Offline RL (incremental dataset growth).
- Successful online episodes integrated into the offline dataset.
- Reduced sequential online sessions, grew the dataset progressively.

**Intervention Evolution:**
- Early sessions: SO101 arm for complex interventions.
- Later sessions: gamepad teleop for simpler, cleaner interventions.
- Warm-start transitions increased: 5k → 10k for better initialization.

---

# References
https://github.com/huggingface/lerobot
https://huggingface.co/docs/lerobot/hilserl
HIL-SERL: Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning (Oct 2024) https://arxiv.org/abs/2410.21845
Flow Q-Learning (Feb 2025) https://arxiv.org/abs/2502.02538
Reinforcement Learning with Action Chunking (Jul 2025) https://arxiv.org/abs/2507.07969
π∗0.6: a VLA That Learns From Experience (Nov 2025) https://arxiv.org/pdf/2511.14759
WSRL: Efficient Online Reinforcement Learning Fine-Tuning Need Not Retain Offline Data https://arxiv.org/abs/2412.07762
[HIL-SERL] Add Flow Q-learning (FQL) agent with action chunking https://github.com/huggingface/lerobot/pull/1818 [refactored by s1lent4gnt to follow paper more strictly](https://github.com/s1lent4gnt)
UC Berkeley CS 285: Deep Reinforcement Learning, 2023 https://www.youtube.com/playlist?list=PL_iWQOsE6TfVYGEGiAOMaOzzv41Jfm_Ps
Robot Learning 2025: Foundational Models for Robotics and Scaling DeepRL https://www.youtube.com/playlist?list=PLMe2pHxzxHp-UJ1jd-uuGSGK7P7Phtm-f