Decoding Humanoids: Robots Driving Auto Factories
- Prathamesh Khedekar
- Jul 2, 2025
- 11 min read

Automotive, chip manufacturing, and space exploration are the three areas where humanoids are being heavily adopted today. From BYD to Tesla, BMW to Audi, and Hyundai to Mercedes, auto manufacturing plants are swarmed with humanoids. Why?
If you’ve ever visited an automotive plant, you’ll notice these factories operate on three core principles: mobility, efficiency, and repetition. These are precisely the qualities that make humanoids an ideal fit. You may ask: How?
If you take a step back and look at the anatomy of a humanoid, you'll notice that, at a high level, it is equipped with two legs that allow it to navigate diverse indoor terrain profiles—staircases, uneven surfaces, and a wide range of obstacles, resulting in an extremely high level of terrain mobility. It is also equipped with two highly dexterous arms, or what we refer to in the robotics world as spatial manipulators. They enable spatial agility, meaning our humanoid can use these robotic arms to grab an apple or thread a needle through fabric. These two features, combined with the ability to teach humanoids to perform repetitive tasks using modern AI models, add up to remarkable efficiency.
Hence, humanoids are becoming an increasingly compelling solution in the manufacturing domain. It's no surprise that NVIDIA is partnering with Foxconn to deploy humanoids in its chip manufacturing plants. Beyond that, nine automakers around the world are already embracing humanoids on their production lines. This includes BMW, Mercedes-Benz, Tesla, Hyundai, Audi, Volkswagen, Nio, and BYD.
Now, given the scale of adoption we are seeing today, it's important to understand how humanoids actually work. To do that, we need to distill their functionality down to the core systems that form their foundational layer. So what are these core systems?
Core Systems Of Humanoid Robots
If you take a look at humanoid robots today, you will notice that their design and functionality can be distilled down to seven core systems. If you understand these, you'll have a solid foundation for building a humanoid that can solve real-world problems with efficiency, precision, and reliability. So what are these seven core systems?

Core System 1 – Anthropomorphic Mechanical Design
Well, the first one is anthropomorphic mechanical design, or in simple words, a mechanical design inspired by the human body. At its core, a humanoid robot is built with a body plan inspired by our own—two legs, two arms, a torso, and a head. Each limb is made up of multiple joints, which are powered by actuators, i.e., high-power motors that allow the robot to walk, bend, jump, run, and handle objects. To keep it simple, we will distill the mechanical design into three sub-systems.
The first one is the bipedal locomotion system—essentially a fancy way to describe the two legs that each of these humanoids is equipped with. The second sub-system comprises the two spatial manipulators, or the two hands of the robot, equipped with end-effectors, or as some call them, grippers. This is the technical term for fingers with human-like dexterity. Then the third sub-system comprises the torso and neck. They play a key role in ensuring our humanoid has range and stability. This mechanical architecture is carefully engineered to balance strength, agility, endurance, precision, and reliability.
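To make that breakdown concrete, here is a minimal sketch of how such a body plan might be represented in software. The class names, joint lists, and actuator numbers below are illustrative assumptions, not the spec of any particular robot.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Joint:
    """A single actuated joint: one motor driving one axis of motion."""
    name: str
    max_torque_nm: float    # peak torque the actuator can apply (illustrative)
    max_speed_rad_s: float  # peak angular velocity (illustrative)

@dataclass
class Limb:
    """A limb is an ordered chain of joints (e.g., hip -> knee -> ankle)."""
    name: str
    joints: List[Joint] = field(default_factory=list)

    @property
    def dof(self) -> int:
        return len(self.joints)

# Illustrative body plan matching the three sub-systems described above.
humanoid = {
    "bipedal_locomotion": [
        Limb("left_leg",  [Joint("hip", 200, 6), Joint("knee", 150, 8), Joint("ankle", 100, 10)]),
        Limb("right_leg", [Joint("hip", 200, 6), Joint("knee", 150, 8), Joint("ankle", 100, 10)]),
    ],
    "spatial_manipulators": [
        Limb("left_arm",  [Joint("shoulder", 80, 6), Joint("elbow", 60, 8), Joint("wrist_gripper", 20, 12)]),
        Limb("right_arm", [Joint("shoulder", 80, 6), Joint("elbow", 60, 8), Joint("wrist_gripper", 20, 12)]),
    ],
    "torso_and_neck": [
        Limb("torso", [Joint("waist", 120, 4)]),
        Limb("neck",  [Joint("neck_pan", 10, 10)]),
    ],
}

for subsystem, limbs in humanoid.items():
    print(subsystem, "->", [(limb.name, f"{limb.dof}-DOF") for limb in limbs])
```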

Now that we understand the first core system, the mechanical design, we need to understand the second core system, which is where that design is put to work. What do we mean by that?
Core System 2 – Mobility System: Bipedal Locomotion (Legs)
Just because a humanoid robot is equipped with two mechanical legs doesn’t mean it’s going to be able to walk and retain its balance. Locomotion is a right that a humanoid earns after it learns how to retain its balance. So the question is: How does it do that? How does a humanoid robot ensure that at every step it takes, it doesn’t fall?
Well, to ensure stability, a humanoid robot must understand the terrain profile it's navigating, the slope of that terrain, and its own center of gravity, and then decide what the next step should look like. This means that at each step, the robot has to take those parameters, namely terrain, slope, center of gravity, and more, as inputs and make a calculated decision about how to craft its next step. We call this gait planning in the world of robotics. In essence, it means determining the footstep pattern for the next step based on the parameters measured at the current one.
How does it do that?

Well, there are two ways to approach this challenge. One is to simply take the data from sensors such as force sensors located on the feet, which measure how the ground profile is changing, and respond by activating the corresponding actuators, or motors, located around the legs. These actuators apply the appropriate torque, in simple words a rotational force, to ensure the robot has a firm grip on the ground. Some refer to this as a reactive approach.
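As a rough illustration of the reactive idea, the sketch below reads hypothetical foot force sensors, estimates how the load is shifting, and commands a corrective torque. The sensor values, gain, and function name are placeholders, not a real robot API.

```python
def reactive_ankle_torque(left_foot_force_n, right_foot_force_n, gain=0.05):
    """
    Toy reactive balance rule: if more weight is measured under one foot,
    command a corrective ankle torque back toward the unloaded side.
    Inputs are vertical ground-reaction forces in newtons from hypothetical
    foot force sensors; the gain is an arbitrary illustrative value.
    """
    imbalance = left_foot_force_n - right_foot_force_n  # > 0 means leaning left
    corrective_torque_nm = -gain * imbalance            # push back toward center
    return corrective_torque_nm

# One control tick: the robot is leaning left, so the torque pushes it back right.
print(reactive_ankle_torque(left_foot_force_n=420.0, right_foot_force_n=360.0))  # roughly -3 N·m
```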
The more proactive approach, on the other hand, relies on the principle of prediction, or in mathematical terms, probability. Instead of just responding to the data measured at the current step, the humanoid scans the terrain, takes the current and historical data from the past few steps, and predicts how the terrain is about to change. It then adjusts its next step, or selects an appropriate gait (type of footstep), based on these predictions.
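And here is a similarly hedged sketch of the proactive idea: extrapolate the last few terrain readings to predict the next one, then pick a gait accordingly. The thresholds and gait names are invented for illustration.

```python
def predict_next_height(recent_heights_m):
    """Naive prediction: extrapolate the recent trend in measured terrain height."""
    if len(recent_heights_m) < 2:
        return recent_heights_m[-1]
    trend = recent_heights_m[-1] - recent_heights_m[-2]
    return recent_heights_m[-1] + trend

def select_gait(recent_heights_m, step_threshold_m=0.08):
    """Pick a footstep pattern based on the predicted change in terrain height."""
    predicted = predict_next_height(recent_heights_m)
    change = predicted - recent_heights_m[-1]
    if abs(change) > step_threshold_m:
        return "stair_gait"     # large height change ahead: shorter, higher steps
    if abs(change) > 0.02:
        return "cautious_gait"  # mild slope ahead: slower, wider stance
    return "nominal_gait"       # flat ground: regular walking pattern

print(select_gait([0.00, 0.01, 0.02]))  # nominal_gait (terrain is nearly flat)
print(select_gait([0.00, 0.06, 0.15]))  # stair_gait (terrain is rising quickly)
```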
Long story short, core system 2 is fully responsible for retaining balance and ensuring the humanoid can move around factories, warehouses, and homes with the highest levels of ground grip, precision, strength, and stability. Now you might ask: that covers mobile navigation, but what if our humanoid has to lift an object or pick a fruit from a basket? How does it do that?
Core System 3 – Spatial Manipulators & End Effectors (Hands)

This is where the spatial manipulators equipped with end-effectors come in handy. A spatial manipulator is the technical term for an arm with human-like dexterity, and an end-effector is the term for the fingers or grippers that can be controlled electronically to grasp an object.
Most humanoids today are equipped with two spatial manipulators, each fitted with strong end-effectors. From a mechanical perspective, these are essentially metal limbs connected by high-torque motors that provide a range of motion. We refer to this as degrees of freedom in robotics, or DOF. So if you hear folks say 3-DOF or 4-DOF, just know that refers to the number of motor-controlled joints in each spatial manipulator. Why does that number matter?
Well, if a humanoid arm has more joints, that means each of those joints can be moved in different directions and at different speeds, and hence your humanoid can potentially perform thousands of types of movements. We humans have 7 degrees of freedom in each arm—that means we have 7 core joints. Three of them are located in the shoulder, one in the elbow, and three in the wrist. Same applies to humanoids.
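To get a feel for why the joint count matters, the sketch below lists a human-like 7-DOF arm (three shoulder joints, one elbow, three wrist joints) and shows how quickly the space of possible arm configurations grows. The joint names and the discretization are illustrative assumptions.

```python
# A human-like 7-DOF arm: three shoulder joints, one elbow joint, three wrist joints.
arm_joints = [
    "shoulder_pitch", "shoulder_roll", "shoulder_yaw",
    "elbow_flex",
    "wrist_pitch", "wrist_roll", "wrist_yaw",
]

# Even if each joint could only stop at 10 discrete angles, a 7-DOF arm would
# already have 10^7 distinct configurations, which hints at why more joints
# translate into far richer, more dexterous motion.
positions_per_joint = 10
configurations = positions_per_joint ** len(arm_joints)
print(f"{len(arm_joints)}-DOF arm with {positions_per_joint} positions per joint "
      f"-> {configurations:,} configurations")
```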
While most humanoids today have 6–7 DOFs, Tesla Optimus is one of the outliers with 22 DOFs, which offers an exceptional level of dexterity, precision, and efficiency. Now, we have covered the first three core systems, namely the mechanical design, the bipedal locomotion, and the spatial manipulators. The question is: How does our robot make decisions? How does it know how to move those spatial manipulators to pick an object? Objects come in different shapes and sizes. An apple is different from a car part that a humanoid may have to move on an assembly line. So how does it make sense of these objects in the real world? That's where we need sensors and the perception system.
Core System 4 – Sensors & Perception System (Eyes)

The fourth core system comprises sensors and the perception stack, essentially the robot’s way of understanding its surroundings. A humanoid is typically equipped with multiple sensor modalities: cameras for visual recognition, depth sensors or LiDAR for 3D geometry, inertial measurement units for orientation, and microphones for audio input.
Data captured from these sensors is processed by the humanoid's perception stack. A perception stack is simply a series of AI models, or more specifically artificial neural networks, trained to detect objects, segment scenes, estimate 3D poses, and track the motion of nearby objects over time. The output of this perception pipeline is a structured representation of the world—basically a real-time semantic map that tells the robot where things are and what they are.
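Here is a minimal sketch of what such a perception pipeline might look like in code. The detector is a stand-in stub rather than a real model; in practice each stage would be a trained neural network, and the semantic map would be far richer.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DetectedObject:
    label: str                                  # what the object is ("apple", "car_part", ...)
    position_xyz_m: Tuple[float, float, float]  # where it is, in the robot's frame (meters)
    confidence: float                           # how sure the detector is

def detect_objects(rgb_image, depth_image) -> List[DetectedObject]:
    """Stand-in for a trained detector fused with depth data (an assumption, not a real API)."""
    return [DetectedObject("car_part", (1.2, -0.3, 0.9), 0.94)]

def build_semantic_map(rgb_image, depth_image, imu_orientation):
    """
    Fuse detections into a structured world representation: a list of labeled
    objects with 3D positions that the planning system can reason about.
    """
    objects = detect_objects(rgb_image, depth_image)
    return {"robot_orientation": imu_orientation, "objects": objects}

# One perception tick with placeholder sensor data.
semantic_map = build_semantic_map(rgb_image=None, depth_image=None, imu_orientation=(0.0, 0.0, 0.0))
first = semantic_map["objects"][0]
print(first.label, "at", first.position_xyz_m, "confidence", first.confidence)
```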
This sensor and perception system is foundational to every humanoid because it is this vision-centric system that allows our humanoid to develop a coherent and high-precision understanding of its surroundings. Now, with this understanding of its environment, you might ask: How does it make decisions? How does it decide how to move its arms and legs to complete its task? That's where the next core system comes into play.
Core System 5 – Training & Planning System (Brain)

If you think about a task that our humanoid has to complete in the real world, it can be broken down into a series of actions, i.e., sub-tasks, that the robot can be trained on and optimized for independently. To better understand this principle, let's take an example.
Let’s say you have a humanoid deployed in an automobile factory. You want it to identify and collect 10 parts that are released on the conveyor belt every 10 minutes and transfer those to an inventory area in your factory. This task can be broken down into four sets of actions or sub-tasks. The first sub-task in this case is to recognize the parts, the second is to lift 10 of them, the third is to navigate to the inventory area, and the fourth is to place them there. This is obviously a simplistic way of looking at this task. The principle here is that each task is a combination of a series of actions or sub-tasks.
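That decomposition maps naturally onto a simple data structure: a task is just an ordered list of sub-tasks. The sketch below encodes the conveyor-belt example this way; the names are illustrative and not taken from any real planning framework.

```python
# The conveyor-belt example, broken down into its four sub-tasks.
task = {
    "name": "move_parts_to_inventory",
    "subtasks": [
        "recognize_parts_on_conveyor",
        "lift_ten_parts",
        "navigate_to_inventory_area",
        "place_parts_in_inventory",
    ],
}

def run_task(task):
    """Execute sub-tasks in order; a real robot would invoke a trained policy for each one."""
    for step, subtask in enumerate(task["subtasks"], start=1):
        print(f"Step {step}: executing '{subtask}'")

run_task(task)
```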
Now, if we need to train our robot to perform this task in a reliable and safe manner, we need to train it on each of these sub-tasks first and then on the task itself.
So how do we do that? Historically, we used to hand-code robots with if-else rules. Those days are long gone. Modern humanoids rely on reinforcement learning to learn these tasks. The way it's done is that for each action, or sub-task, there is a set of inputs and a required outcome. The inputs are typically the robot's sensor readings, the outputs are the control signals that drive the motors located in the arms and legs of the humanoid, and the required outcome is the completed action itself, such as picking up an object. The humanoid is trained using an artificial neural network, which runs in a loop and varies those control signals until it consistently produces the required outcome.
For every accurate outcome, this artificial neural network is rewarded, and for every inaccurate one, meaning the object falls, it is not rewarded or, as some call it, negatively rewarded. Over its training horizon, which could last a few weeks or months, the humanoid arm gets really good at performing the given action. This process is then repeated for a series of actions, and finally for the end-to-end task.
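Below is a heavily simplified sketch of that reward loop for a single sub-task, such as picking an object. The policy, environment, and update rule are toy stand-ins meant only to show the shape of the loop, not a real reinforcement learning setup.

```python
import random

def toy_policy(observation, weight):
    """Stand-in for a neural network: maps an observation to a motor command."""
    return weight * observation

def toy_environment(motor_command):
    """Stand-in for the world: +1 reward if the object was picked, -1 if it was dropped."""
    success_probability = min(max(motor_command, 0.0), 1.0)
    return 1.0 if random.random() < success_probability else -1.0

weight = 0.1          # a single "network parameter" to keep the sketch tiny
learning_rate = 0.05
for episode in range(200):
    observation = 1.0                        # placeholder sensor reading
    action = toy_policy(observation, weight)
    reward = toy_environment(action)         # reward signal from the environment
    weight += learning_rate * reward         # nudge the policy toward rewarded behavior

print(f"parameter after 200 episodes: {weight:.2f}")
```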
The output of this process is essentially an artificial neural network that is fully trained to perform a specific task. Some call it the policy. Once trained on one robot, these neural networks can be deployed to thousands of humanoids at scale. Some of you might ask, where do we train humanoids? In labs or in simulation environments?
This training can be conducted in the real world, i.e., a lab equipped with a physical humanoid and objects, or in a simulation environment. While there are pros and cons to each approach, it's this training, governed by artificial neural networks, that gives humanoids the ability to perform tasks with speed, precision, and efficiency. The same training can be used to teach a robot to walk without losing its balance, to climb stairs, to pick objects, and to perform all the other tasks you can think of in the real world.
This system of AI-enabled training allows humanoids to adapt to the range of scenarios they might face in the real world. Now you might ask: well, now that it has been trained to perform that task, how does it actually translate the output of these AI models, i.e., the intended action, into motion? How does it move its arms and legs in alignment with the specific action?
That’s where the next core system comes into play.
Core System 6 – Control System (Action)
The sixth core system is the control system, which you can think of as the robot's nervous system. This is the layer that takes the high-level plans and policies dictated by the trained artificial neural networks and translates them into precise motor commands for the motors located in the humanoid's joints. In short, it turns those decisions into actions.
This system is composed of two core components. The first one is the whole-body controller, sometimes called the parent controller. This controller is responsible for taking the decisions from the neural networks and breaking them down into sub-decisions, or sub-actions, for each of the many individual controllers that drive the joints located throughout the humanoid.
The second component is the collection of individual controllers, one for each motorized joint in the arms and legs. We refer to these as joint controllers. They are responsible for managing the torque, velocity, and position of each joint for a given task. Together, the parent controller and the child controllers, or joint controllers, translate the intent finalized by the artificial neural networks into precise actions for the humanoid.
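A minimal sketch of that two-layer structure: a parent controller that splits a high-level intent into per-joint targets, and simple joint controllers that chase those targets. The joint names, gains, and the way the intent is split are illustrative assumptions.

```python
class JointController:
    """Child controller: drives one joint toward a target angle with simple proportional control."""
    def __init__(self, name, gain=2.0):
        self.name = name
        self.gain = gain
        self.angle_rad = 0.0

    def step(self, target_angle_rad):
        error = target_angle_rad - self.angle_rad
        torque = self.gain * error      # proportional torque command
        self.angle_rad += 0.1 * torque  # toy plant model: the joint moves in response
        return torque

class WholeBodyController:
    """Parent controller: splits a high-level intent into per-joint targets."""
    def __init__(self, joint_names):
        self.joints = {name: JointController(name) for name in joint_names}

    def execute(self, intent):
        # 'intent' is a dict of per-joint target angles decided by the trained policy.
        return {name: self.joints[name].step(target) for name, target in intent.items()}

wbc = WholeBodyController(["shoulder_pitch", "elbow_flex", "wrist_roll"])
torques = wbc.execute({"shoulder_pitch": 0.5, "elbow_flex": 1.0, "wrist_roll": -0.2})
print(torques)
```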
Now that we understand the first six core systems that are foundational to every humanoid, we still need to ensure that all of the actions our humanoid performs are executed in harmony with nearby humans. What do we mean by that? We humans operate with a certain set of principles and values, and since humanoids will be operating around us, they have to be trained on those values and principles. This is where the final core system comes into play.
Core System 7 – Human-Robot Interaction System
In theory, a humanoid can be trained to move parts in a factory at galactic speed. That might yield ultra-high production capacity and throughput, but it comes at the expense of human values. Any action taken to the extreme yields extreme results, and this is both good and bad.
On the positive side, that means higher production in factories. But on the negative side, it also means ultra-high risks. What if a humanoid goes rogue? What if it continues to perform at that speed even when the systems around it, such as the conveyor belt, stop working? What if it keeps operating at that speed and doesn't realize a nearby human worker is exhausted and about to fall to the ground?
You see? When we humans work, whether at the office, in a store, or in a factory, we are not just performing an action. We are cognitively aware of our surroundings; that's our consciousness. We need humanoids that are not just great at performing tasks but that also have values and awareness, and that is the hardest part of this industry. We need to ensure that humanoids are aware of, and trained on, the values and principles that are core to our biology and evolutionary cycle. That's precisely the function of the human-robot interaction system.
While there are a number of experiments and research efforts exploring consciousness for humanoids, we are still at the early stages here. For now, one thing is clear: we are approaching the levels of dexterity needed to perform actions in the real world with humanoids, but we are probably decades away from understanding human consciousness with surgical depth and figuring out how to embed that level of awareness, with precision and depth, into humanoid robots.
If this essay whetted your appetite to learn robotics, you may want to explore our course on Robotics Foundations. We teach robotics through core principles and real-world case studies. From space rovers to mobile robots and quadrupeds (i.e., robot dogs), you will learn how to build real robots from scratch.
Cheers,
Prathamesh
Disclaimer: This blog is for educational purposes only and does not constitute financial, business, or legal advice. The experiences shared are based on past events. All opinions expressed are those of the author and do not represent the views of any mentioned companies. Readers are solely responsible for conducting their own due diligence and should seek professional legal or financial advice tailored to their specific circumstances. The author and publisher make no representations or warranties regarding the accuracy of the content and expressly disclaim any liability for decisions made or actions taken based on this blog.