Vision Is All You Need:
A Vision-Based Approach to Dynamics Estimation in Autonomous Navigation

École Polytechnique Fédérale de Lausanne (EPFL)
Video | Code

A snippet from a successful run shows the drone navigating through the desired gates while evading projectiles.

Abstract

In this project, we address the problem of vision-based autonomous navigation in dynamic environments. Inspired by biological systems and human pilots who navigate using visual inputs, we challenge the conventional multi-sensor approach by proposing a model that relies primarily on visual inputs. We design a specific task where an autonomous drone navigates through gates while evading incoming projectiles by relying on visual cues. We introduce two methods: (i) imitation learning to distill knowledge from a privileged state-based policy into a vision-based student policy, and (ii) a novel RMA-inspired Drone Dynamics Module (DDM) for explicit dynamics estimation. Our findings indicate that the latter method outperforms the student policy in robustness and effectiveness. However, the limitation of using a single frontal camera highlights the need for multi-camera setups in future work. The results are promising for vision-based navigation in dynamic environments, indicating directions for future research and improvements.


I. Introduction

Biological systems adeptly navigate complex environments using mainly visual inputs [1]. Studies reveal that human pilots operate drones effectively using only the onboard camera's video stream, bypassing explicit state estimation or trajectory planning thanks to their skill in selecting essential visual information [2], [3]. Inspired by this, our project questions the traditional multi-sensor strategy in autonomous systems, suggesting that visual inputs alone are sufficient for effective navigation in dynamic settings. This approach contrasts with conventional methods that depend on complex sensor suites to capture environmental dynamics [4], [5], [6], aiming instead to develop a simpler, cost-effective vision-based model for an autonomous drone. The project examines whether an autonomous agent can rely only on visual information to understand environmental dynamics and make optimal decisions. To evaluate this, we designed a task where a drone navigates through gates while dodging projectiles, relying solely on visual input to assess the environment's dynamics.

This task parallels real-world drone uses in defense, law enforcement, and search and rescue operations [7], [8]. Despite progress in vision-based drones [3], [9], challenges remain in capturing environmental dynamics effectively, particularly in complex scenarios. The agent's success hinges on (i) extracting meaningful visual representations of dynamics and (ii) training a high-performance policy using these cues.

Our method employs contrastive learning to train a perception network to identify crucial features from complex images, focusing on relevant visuals while discarding non-essential elements [3].

We develop a vision-based control policy using a privileged learning approach. Initially, we train a state-based policy with reinforcement learning (RL) to maximize the drone's performance using privileged information about the dynamics and positions of projectiles and checkpoints [10]. We then leverage this privileged information to create vision-based policies that infer environmental dynamics from visual signals.

We propose two approaches: (i) imitation learning to transfer knowledge to a vision-based policy without privileged state information [3], and (ii) training a dynamics module to predict privileged information directly from visual cues, similar to the RMA framework [11].

Our tests in a PyBullet Drone simulator [12] show encouraging results. Both methods effectively learn to avoid projectiles and navigate to checkpoints, though a gap persists between the privileged-based and vision-based policies, suggesting opportunities for further advancement.

II. Related Work

While public research on drones avoiding or interacting with projectiles is limited, many studies cover various aspects of drone navigation, particularly drone racing [3], [4], [9], [13]. We extend this research by introducing projectiles, thus increasing complexity and testing the drone's capability to understand environmental dynamics more rigorously.

Previous research on autonomous drone navigation and racing often relied on state-based methods using precise global positioning data to train neural networks for reinforcement learning (RL) policies [5], [13]. For instance, Song et al. [13] used such data to train an RL agent to navigate through gates. In contrast, our method utilizes relative positions derived from camera data, aiming to approximate these values through vision, thereby reducing dependency on precise global positioning, a shift from traditional state-based approaches.

Initially, vision-based methods in drone racing utilized direct trajectory planning from visual inputs. For example, Foehn et al. [4] combined visual-inertial odometry with CNN-based detection for robust state estimation, using a trajectory planner that generated paths from a simplified drone model. However, this method did not accurately reflect the drone’s actuation limits, often leading to unfeasible trajectories. In contrast, our approach, inspired by Fu et al. [3], employs imitation learning where a vision-based policy learns from a model with privileged information (gate positions), thus extracting essential details from visual cues. We aim to expand this to more complex scenarios where gates are randomly placed, testing the model's robustness without overfitting to predetermined routes.

Our Dynamics Module methodology draws from the Rapid Motor Adaptation (RMA) framework [11], which features a two-phase training procedure. Although RMA has been extended to integrate visual inputs [14], it does not fully exploit vision in dynamic settings due to its original task constraints. We adapt this approach to better predict environmental dynamics and to drive a vision-based system, enhancing the use of vision specifically for our task. We maintain the core two-phase training but tailor the architecture to better capture dynamics, aligning more closely with our project's specific needs.

III. Methodology

III-A. Task Formulation

Our task involves navigating a drone through a series of gates while dodging incoming projectiles, framing this as a dual-objective optimization problem. The primary objectives are to maximize the number of gates navigated and minimize the transit time between gates. The drone's hardware includes a single forward-facing camera and an inertial measurement unit (IMU) that tracks orientation, velocity, and acceleration.

In our dynamic environment, gates are randomly generated at various heights and orientations along the \(x\)-axis. Additionally, projectiles are launched randomly from variable positions, targeting the drone based on its current trajectory. These projectiles are fired with velocities calculated to intersect the drone's path, considering the drone's motion, position, and gravitational effects at the moment of launch.

To simplify the modeling of projectile dynamics, we define a random variable \(t\) representing the time from the projectile's launch to its intended impact on the drone. The firing velocity of the projectile $\mathbf{v}^p$ is then determined as follows: \begin{align} \begin{cases} v^p_{x} = \frac{d_{x}}{t} + v^d_{x},\\ v^p_{y} = \frac{d_{y}}{t} + v^d_{y},\\ v^p_{z} = \frac{d_{z}}{t} + v^d_{z} + \frac{1}{2}gt, \end{cases}\label{eq:proj} \end{align} where \(\mathbf{d}\) denotes the distance vector from the projectile launch point to the drone at the time of firing, \(\mathbf{v}^d\) represents the drone's velocity, and \(g\) is the gravitational acceleration; the last term compensates for the gravity-induced drop accumulated over the flight time \(t\).
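As a concrete illustration, the following NumPy sketch evaluates the equation above under its implicit assumption that the drone's velocity stays constant over the flight time \(t\); the function name, gravity constant, and example values are illustrative rather than taken from the project code.

```python
import numpy as np

G = 9.81  # gravitational acceleration (m/s^2)

def projectile_launch_velocity(d, v_drone, t):
    """Firing velocity that intercepts the drone after time t.

    d        -- distance vector from launch point to drone at firing time
    v_drone  -- drone velocity at firing time (assumed constant over t)
    t        -- sampled time-to-impact
    """
    d = np.asarray(d, dtype=float)
    v_drone = np.asarray(v_drone, dtype=float)
    v = d / t + v_drone      # lead the target on all three axes
    v[2] += 0.5 * G * t      # compensate for gravity drop on the z-axis
    return v

# Example: drone 8 m ahead and 2 m up, moving forward at 3 m/s, impact in 1.2 s
v_p = projectile_launch_velocity(d=[8.0, 0.0, 2.0], v_drone=[3.0, 0.0, 0.0], t=1.2)
```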

III-B. Base-privileged Policy

Initially, we establish a base-privileged policy \(\pi_{\text{privileged}}\) that acts as a benchmark for evaluating other models and for foundational training. This policy is theoretically designed to maximize the drone's performance in our specified tasks.

III-B-1. Observation Space

The drone's observation space includes various metrics from the inertial measurement unit (IMU), such as velocity \(\mathbf{v}^d = (v^d_x, v^d_y, v^d_z)\), angular velocity \(\mathbf{w}^d = (w^d_x, w^d_y, w^d_z)\), and orientation parameters (roll, pitch, and yaw) \(\mathbf{r} = (r, p, y)\). These metrics facilitate faster and more stable learning. In the privileged setting, the drone receives data on the relative position and orientation of the next gate \(\mathbf{g} = (g_x, g_y, g_z, \phi_x, \phi_y, \phi_z)\) as viewed from the drone's camera. Additionally, the privileged information includes the relative position and distance \(\mathbf{p}_i = (p_{i_x}, p_{i_y}, p_{i_z}, d_i)\) and velocity \(\mathbf{v}^p_i = (v^p_{i_x}, v^p_{i_y}, v^p_{i_z})\) of each \(i\)-th airborne projectile attacking the drone, where \(i \in [1, \dots, P_{\max}]\), with \(P_{\max}\) denoting the maximum number of projectiles that can be airborne simultaneously. These cumulative projectile data points \(\mathbf{p}\) and \(\mathbf{v}^p\) ensure the model accounts for dynamic threats. If fewer than \(P_{\max}\) projectiles are present, non-existent projectiles are represented by zero-padding.

Figure: Drone's observation space in the privileged setting.
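To make the zero-padding scheme concrete, here is a minimal sketch of how such a privileged observation vector could be assembled; the helper name and input layout are assumptions, not the project's actual implementation.

```python
import numpy as np

P_MAX = 2  # maximum number of simultaneously airborne projectiles

def privileged_observation(gate_rel, projectiles):
    """Concatenate gate pose and per-projectile data, zero-padding up to P_MAX.

    gate_rel    -- (6,) relative gate position and orientation (g_x .. phi_z)
    projectiles -- list of dicts with 'rel_pos' (3,), 'dist' (scalar), 'vel' (3,)
    """
    chunks = [np.asarray(gate_rel, dtype=float)]
    for i in range(P_MAX):
        if i < len(projectiles):
            p = projectiles[i]
            chunks.append(np.concatenate([p["rel_pos"], [p["dist"]], p["vel"]]))
        else:
            chunks.append(np.zeros(7))  # absent projectile -> zero padding
    return np.concatenate(chunks)       # shape: (6 + 7 * P_MAX,) = (20,)
```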

III-B-2. Policy Network

For the base-privileged policy \(\pi_{\text{privileged}}\), we use a multilayer perceptron (MLP) as the neural network architecture for the policy head. Similar to RMA [11], the policy uses the current drone state (including the previously taken action) $\textbf{s}_t = [\mathbf{v}^d_t,\, \mathbf{w}^d_t,\, \mathbf{r}_t,\, \boldsymbol{a}_{t-1}] \in \mathbb{R}^{13}$ and encoded privileged information, called the extrinsic vector, $\mathbf{z}_t \in \mathbb{R}^8$, to decide the next action. We obtain the extrinsic vector $\mathbf{z}_t$ by passing the privileged information through an MLP, denoted $\mu$, such that $\mu(\mathbf{g}_t,\, \mathbf{p}_t,\, \mathbf{v}^p_t) = \mathbf{z}_t$.

Figure: Teacher policy architecture.
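For concreteness, a PyTorch sketch of this teacher architecture is shown below, with the encoder $\mu$ and the policy head sized according to Section IV-B-1; the class names and module layout are illustrative assumptions rather than the project's actual code.

```python
import torch
import torch.nn as nn

class ExtrinsicEncoder(nn.Module):
    """mu: privileged vector (gate pose + projectile states) -> extrinsic vector z_t."""
    def __init__(self, priv_dim=20, z_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(priv_dim, 256), nn.GELU(),
            nn.Linear(256, 128), nn.GELU(),
            nn.Linear(128, z_dim),
        )

    def forward(self, priv):
        return self.net(priv)

class PolicyHead(nn.Module):
    """Maps [state s_t, extrinsic z_t] to the four rotor thrust commands a_t."""
    def __init__(self, state_dim=13, z_dim=8, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + z_dim, 128), nn.GELU(),
            nn.Linear(128, 64), nn.GELU(),
            nn.Linear(64, 32), nn.GELU(),
            nn.Linear(32, 16), nn.GELU(),
            nn.Linear(16, action_dim),
        )

    def forward(self, state, z):
        return self.net(torch.cat([state, z], dim=-1))
```

The thrust outputs are left unsquashed here; any clipping to valid rotor commands is assumed to happen in the environment wrapper.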

III-B-3. Action Space

The action space of the drone includes controlling the thrust levels of its four rotors. Specifically, the base-privileged policy maps the observational input to the thrust levels of these rotors, which are then converted to RPMs for each rotor at every time step \(t\), expressed as \(\pi_{\text{privileged}}(\textbf{s}_t, \mathbf{z}_t) = \boldsymbol{a}_t\), where \(\boldsymbol{a}_t = (w^1_t, w^2_t, w^3_t, w^4_t)\).

III-B-4. Reward Function

We use Reinforcement Learning (RL) to train the policy $\pi_\text{privileged}$ of our base-privileged agent to maximize the reward provided by the environment. Given the complexity of the tasks—passing through gates and evading projectiles—a straightforward reward function proved insufficient for mastering multiple tasks simultaneously. Therefore, we segmented the learning process into three sequential curriculum phases:

1] Basic Stabilization and Hovering:

The first phase focuses on basic stabilization and learning to hover toward gates. High rewards are given for maintaining stability, and penalties are imposed for crashing or excessive tilting. The reward function at each step $t$ is: \begin{align} \begin{split} R(\mathbf{s}_t, \boldsymbol{a}_t, \mathbf{s}_{t+1}) &= t - R_{\text{crashed}} \times \mathbb{1}_{\text{crashed}}\\ &- |\max(r, p, y)| \times \mathbb{1}_{\text{tilted}}\\ &+ \lambda_\text{dist} \left(\|\mathbf{s}_t - \mathbf{g}_t\| - \|\mathbf{s}_{t+1} - \mathbf{g}_t\|\right), \end{split} \end{align} where $\mathbb{1}_{\text{crashed}}$ indicates if the drone crashed, incurring a high penalty, and $\mathbb{1}_{\text{tilted}}$ indicates if the drone is excessively tilted, with penalties based on the largest tilting angle. A progress reward is given based on the distance change to the next gate, modulated by $\lambda_\text{dist}$.

2] Navigation:

The second phase emphasizes navigation. Rewards increase as the drone decreases the distance to gates and successfully navigates through them. The reward function at each time step $t$ is: \begin{align} \begin{split} R(\mathbf{s}_t, \boldsymbol{a}_t, \mathbf{s}_{t+1}) &= R_{\text{passed}} \times \mathbb{1}_{\text{passed}}\\ &- R_{\text{crashed}} \times \mathbb{1}_{\text{crashed}}\\ &+ \lambda_\text{dist} \left(\|\mathbf{s}_t - \mathbf{g}_t\| - \|\mathbf{s}_{t+1} - \mathbf{g}_t\|\right), \end{split} \end{align} where $\mathbb{1}_{\text{passed}}$ indicates when a gate is successfully navigated, earning a high reward.

3] Evasion and Navigation:

In the final phase, the drone must navigate gates and evade projectiles. We introduce projectiles without penalizing the drone for being hit initially, as previous trials led to a "suicide" strategy to avoid penalties. Instead, we increase the penalty for crashes and add a minor reward for survival. The reward function is: \begin{align} \begin{split} R(\mathbf{s}_t, \boldsymbol{a}_t, \mathbf{s}_{t+1}) &= R_{\text{passed}} \times \mathbb{1}_{\text{passed}}\\ &- R_{\text{crashed}} \times \mathbb{1}_{\text{crashed}}\\ &+ \lambda_\text{dist} \left(\|\mathbf{s}_t - \mathbf{g}_t\| - \|\mathbf{s}_{t+1} - \mathbf{g}_t\|\right)\\ &+ R_{\text{projectile}} \times \#~\text{airborne projectiles}, \end{split} \end{align} where an additional reward $R_{\text{projectile}}$ is given for each airborne projectile.
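As an illustration of the final curriculum phase, the sketch below evaluates the phase-3 reward using the constants reported in Section IV-B-1; the function signature and indicator flags are assumptions made for readability.

```python
def phase3_reward(passed_gate, crashed, dist_prev, dist_curr, n_airborne,
                  R_passed=100.0, R_crashed=50.0, lam_dist=10.0, R_projectile=2.0):
    """Curriculum phase 3: gate passing, crash penalty, progress, survival bonus.

    dist_prev / dist_curr -- distance to the next gate before and after the step,
                             so positive progress yields a positive reward term.
    n_airborne            -- number of projectiles currently in flight.
    """
    reward = 0.0
    reward += R_passed if passed_gate else 0.0
    reward -= R_crashed if crashed else 0.0
    reward += lam_dist * (dist_prev - dist_curr)   # progress toward the gate
    reward += R_projectile * n_airborne            # small bonus for surviving threats
    return reward
```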

III-C. Image Feature Representation Learning

To transition from a base-privileged policy to a vision-based policy, we need an effective image feature extractor that can produce meaningful visual features. This requires extracting latent representations that capture relevant information about projectiles, targets, and environmental dynamics. We use a two-stage approach to learn visual feature embeddings from raw image observations.

In the first stage, we fine-tune the YOLOv5 model [15] to leverage its powerful object detection capabilities. Specifically, we captured 2,500 images from our simulated environment and annotated the bounding boxes for the current gate, next gate, and projectiles. We then fine-tuned a pre-trained YOLOv5s model with this custom dataset. This step allows the model to accurately detect and localize these critical entities within the visual scene, effectively converting the object detector into a customized feature extractor.

Figure: A sample YOLOv5 prediction on the test set.

The second stage involves extracting compact, yet informative visual features from the fine-tuned YOLOv5s model's output. We append an average pooling layer to the model's detection head, which downsamples the output feature maps. The resulting low-dimensional feature vectors are then concatenated, and L2 normalization is applied for stability during training.
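The pooling-and-normalization step can be sketched as follows. How the multi-scale feature maps are hooked out of the fine-tuned YOLOv5s detection head is implementation-specific and not shown, so the input to this function is assumed to be already extracted.

```python
import torch
import torch.nn.functional as F

def pool_and_normalize(feature_maps, pooled_size=1):
    """Compress detector feature maps into a single L2-normalized embedding.

    feature_maps -- list of tensors of shape (B, C_i, H_i, W_i), e.g. the
                    multi-scale maps feeding the YOLOv5 detection head.
    """
    pooled = [
        F.adaptive_avg_pool2d(fm, pooled_size).flatten(start_dim=1)  # (B, C_i)
        for fm in feature_maps
    ]
    embedding = torch.cat(pooled, dim=1)       # concatenate the pooled scales
    return F.normalize(embedding, p=2, dim=1)  # L2-normalize for training stability
```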

III-D. Student Imitation Policy

Once we have obtained an optimal teacher policy that can navigate the drone through gates while avoiding projectiles and an image feature extractor that produces meaningful visual features, we can distill the base-privileged policy knowledge into a vision-based student policy. The key difference is that the student policy relies exclusively on the last $k$ time steps of drone states $\textbf{s}_{t-k:t}$ and visual observations $\textbf{o}_{t-k:t}$ without direct access to privileged information. The drone state $\mathbf{s}_t$ is given as in Section III-B-2.

The student policy architecture comprises three main components: a feature extractor, a memory-based network, and a policy network. The feature extractor, as described in the previous section, converts the raw image observations $\mathbf{o}_{t-k:t}$ into compact visual embeddings $\text{YOLOv5}(\mathbf{o}_{t-k:t}) = \mathbf{z}_{t-k:t}\in\mathbb{R}^{k\times 128}$. These embeddings, concatenated with the drone state information, serve as the input to the memory-based network.

Figure: The student-teacher policy architecture.

We created two alternative architectures for the memory-based network: a Temporal Convolutional Network (TCN) and a Transformer-based network.

For the first architecture, we employ a Temporal Convolutional Network (TCN) [16] to capture temporal dependencies and extract relevant information from the history of observations and states. The TCN operates on the sequence of concatenated embeddings $[\mathbf{s}_t \oplus \mathbf{z}_t,\, \mathbf{s}_{t-1} \oplus \mathbf{z}_{t-1}, \dots,\, \mathbf{s}_{t-k} \oplus \mathbf{z}_{t-k}]$, where $\oplus$ denotes concatenation, producing a rich temporal representation that encodes the conditions of the environment. We call this model StudentTCN.

For the second architecture, inspired by the success of transformer architectures in various sequence modeling tasks, including applying Transformers for policy modeling [17], we explored a transformer-based network. Similar to the TCN, this network receives a concatenated sequence of drone states and visual embeddings as input. Instead of using temporal convolutions, we employed a transformer encoder module [18] to capture long-range dependencies and intricate interactions between state and visual information across time steps. We call this model StudentTransformer.

Finally, a multi-layer perceptron (MLP) policy network takes the memory-based network's output, aggregated using mean pooling, as input and predicts the desired control command, mirroring the action space of the teacher policy. Through this imitation learning process, the student policy learns to implicitly infer relevant gate information from visual cues, mimicking the teacher's optimal decision-making without relying on privileged state information.
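A minimal PyTorch sketch of the StudentTransformer variant is given below; the layer counts follow Section IV-B-2, while the number of attention heads, the input projection, and the class layout are assumptions.

```python
import torch
import torch.nn as nn

class StudentTransformer(nn.Module):
    def __init__(self, state_dim=13, vis_dim=128, hidden=128, action_dim=4,
                 n_layers=6, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(state_dim + vis_dim, hidden)
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=n_heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.policy_head = nn.Sequential(
            nn.Linear(hidden, 128), nn.GELU(),
            nn.Linear(128, 64), nn.GELU(),
            nn.Linear(64, 32), nn.GELU(),
            nn.Linear(32, 16), nn.GELU(),
            nn.Linear(16, action_dim),
        )

    def forward(self, states, vis_embeds):
        # states: (B, k, 13), vis_embeds: (B, k, 128) from the YOLOv5 extractor
        x = self.proj(torch.cat([states, vis_embeds], dim=-1))  # (B, k, hidden)
        x = self.encoder(x)                                      # temporal encoding
        x = x.mean(dim=1)                                        # mean pooling over time
        return self.policy_head(x)                               # predicted action a_t
```

The StudentTCN variant would replace the transformer encoder with three dilated 1D convolutions while keeping the same projection and policy head.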

The optimization objective for imitation learning is defined as the mean squared error between the outputs of the teacher policy and the student policy: \begin{align} \mathcal{L}(\theta) = \|{\pi_\text{student}(\textbf{s}_{t-k:t}, \textbf{o}_{t-k:t}|\theta) - \pi_\text{privileged}(\textbf{s}_t, \mathbf{z}_t)}\|^2_2 \end{align} where $\theta$ represents the learnable parameters of the student policy.
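A hedged sketch of a single distillation update on pre-collected teacher rollouts follows; the batch layout and the frozen-teacher call are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, optimizer):
    """One imitation-learning update: match the frozen teacher's actions."""
    states, vis_embeds, state_t, z_t = batch  # histories plus current privileged inputs
    with torch.no_grad():                     # teacher stays frozen during distillation
        target_action = teacher(state_t, z_t)
    pred_action = student(states, vis_embeds)
    loss = F.mse_loss(pred_action, target_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```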

III-E. Drone Dynamics Module-based Policy (DDMP)

Building upon insights from the Rapid Motor Adaptation (RMA) [11] framework, we propose an alternative approach for leveraging visual observations for navigation in dynamic environments. Instead of distilling knowledge from a privileged-based policy, we train a dedicated Drone Dynamics Module to estimate environmental dynamics from visual and state information.

The Drone Dynamics Module (DDM) is a transformer-based architecture that receives two inputs: a sequence of the last $k$ drone state vectors $\textbf{s}_{t-k:t}$ and a sequence of the last $k$ visual embeddings $\mathbf{z}_{t-k:t}$. Specifically, we use the compact visual embeddings obtained from the fine-tuned YOLOv5 feature extractor, $\text{YOLOv5}(\mathbf{o}_{t-k:t}) = \mathbf{z}_{t-k:t} \in \mathbb{R}^{k \times 128}$.

Formally, the DDM operates on the sequence of concatenated embeddings $[\mathbf{s}_t \oplus \mathbf{z}_t,\, \mathbf{s}_{t-1} \oplus \mathbf{z}_{t-1}, \dots,\, \mathbf{s}_{t-k} \oplus \mathbf{z}_{t-k}]$. This sequence is processed by a transformer encoder; the output is then aggregated using mean pooling and passed through an MLP to predict the encoded privileged dynamics information, i.e., the extrinsic vector $\mathbf{\hat{z}}_t$ described in Section III-B-2.

Given the trained privileged-based policy, we optimize the DDM to predict the extrinsic vector of privileged information using the mean squared error loss: \begin{align} \mathbf{\hat{z}}_t &= \text{DDM}\left([\mathbf{s}_t \oplus \mathbf{z}_t,\, \mathbf{s}_{t-1} \oplus \mathbf{z}_{t-1}, \dots,\, \mathbf{s}_{t-k} \oplus \mathbf{z}_{t-k}]\,|\,\omega\right),\\ \mathcal{L}(\omega) &= \|\mathbf{\hat{z}}_t - \mathbf{z}_t\|^2_2, \end{align} where $\omega$ represents the learnable parameters of the DDM and $\mathbf{z}_t$ in the loss denotes the extrinsic vector produced by $\mu$ from the true privileged information. The DDMP policy is then defined by reusing the privileged policy head with the estimated extrinsic vector: \begin{align} \pi_\text{DDMP}\left(\textbf{s}_t, \text{DDM}(\textbf{s}_{t-k:t}, \textbf{o}_{t-k:t})\right) = \pi_{\text{privileged}}(\textbf{s}_t, \mathbf{\hat{z}}_t). \end{align}
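The sketch below illustrates this two-part design: a transformer-based DDM that regresses $\mathbf{\hat{z}}_t$, and a DDMP step that feeds the estimate into a frozen privileged policy head (such as the PolicyHead sketched earlier). Module names and the number of attention heads are assumptions.

```python
import torch
import torch.nn as nn

class DroneDynamicsModule(nn.Module):
    """Predicts the extrinsic vector z_t from a history of states and visual embeddings."""
    def __init__(self, state_dim=13, vis_dim=128, hidden=128, z_dim=8,
                 n_layers=6, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(state_dim + vis_dim, hidden)
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=n_heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.head = nn.Sequential(nn.Linear(hidden, 64), nn.GELU(), nn.Linear(64, z_dim))

    def forward(self, states, vis_embeds):
        x = self.proj(torch.cat([states, vis_embeds], dim=-1))
        x = self.encoder(x).mean(dim=1)   # mean-pool the temporal encoding
        return self.head(x)               # estimated extrinsic vector z_hat_t

def ddmp_action(ddm, privileged_policy_head, states, vis_embeds, state_t):
    """DDMP: feed the estimated extrinsic vector into the frozen privileged policy head."""
    z_hat = ddm(states, vis_embeds)
    return privileged_policy_head(state_t, z_hat)
```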

IV. Experiments

IV-A. Experimental Setup

IV-A-1. Simulator Environment

We utilize the Gym Pybullet Drones environment [12] for its simplicity and realistic physics simulations. While alternatives like Flightmare [13] and AirSim [19] offer more detailed realism, their complexity and the time required for mastery make them less suitable for our project's timeline. PyBullet is also chosen for its straightforward integration with the Gymnasium library and ease of customization in Python.

Given the camera's limited field of view, the environment is only partially observable, posing challenges in spatial awareness and reacting to dynamic obstacles. To address this, we implement several adjustments to create a theoretical framework for successful vision-based navigation.

We set the gates' orientation and position on the $y$-axis such that the maximum angle between consecutive gates is $22.5^\circ$, facilitating manageable turns and keeping successive gates visible. Both training and evaluation involve passing through five gates in total. During training, the gates are randomly placed at the beginning of each episode. For evaluation, the drone's performance is assessed in six environments with randomly positioned gates, each generated from a fixed, distinct seed.

Projectiles are fired towards the drone with a probability of $0.02$ per time step, and only from directions the camera is facing at the moment of firing, so that the drone's camera can capture them; this accommodates the simulation's single-camera limitation. A maximum of two projectiles may be airborne simultaneously to maintain a balanced level of challenge.

IV-A-2. Evaluation

To evaluate our policies, we designed several metrics. We measure the average number of gates passed before the drone crashes due to a projectile hit or any other reason. Additionally, we measure the average time taken to transit between two consecutive gates. These two metrics capture the essential aspects of the optimization problem: passing through as many checkpoints as possible as quickly as possible. Finally, to provide more context for comparisons, we measure the average number of projectiles encountered per episode before either completing the episode or being struck down by a projectile.

These metrics are calculated across 20 gate layouts with different seeds, with each layout evaluated for 30 seconds, approximately corresponding to 3 episodes.
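For clarity, a small sketch of how these per-layout metrics could be aggregated from episode logs is shown below; the log field names are assumptions.

```python
from statistics import mean

def aggregate_metrics(episodes):
    """episodes: list of dicts with 'gates_passed', 'gate_times', 'projectiles_encountered'."""
    avg_gates = mean(ep["gates_passed"] for ep in episodes)
    gate_times = [t for ep in episodes for t in ep["gate_times"]]
    avg_time_to_gate = mean(gate_times) if gate_times else float("nan")
    avg_projectiles = mean(ep["projectiles_encountered"] for ep in episodes)
    return {"avg_gates_passed": avg_gates,
            "avg_time_to_gate": avg_time_to_gate,
            "avg_projectiles": avg_projectiles}
```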

IV-B. Training Details

IV-B-1. Base-privileged Policy Training

The MLP $\mu$ that maps privileged information to the extrinsic vector $\mathbf{z}_t$ is a 3-layer MLP with hidden layer sizes of 256 and 128. It encodes a privileged vector $[\mathbf{g}_t,\, \mathbf{p}_t,\, \mathbf{v^p}_t] \in \mathbb{R}^{20}$ (a 6-dimensional vector for the next gate's relative position and orientation, plus two 7-dimensional vectors for up to two airborne projectiles) into an extrinsic vector $\mathbf{z}_t \in \mathbb{R}^8$. The base policy head is a 4-layer MLP with hidden sizes of 128, 64, 32, and 16. It takes as input the current drone state $\textbf{s}_t = [\mathbf{v}^d_t,\, \mathbf{w}^d_t,\, \mathbf{r}_t,\, \boldsymbol{a}_{t-1}] \in \mathbb{R}^{13}$ and the extrinsic vector $\mathbf{z}_t \in \mathbb{R}^8$, and outputs the next action $\boldsymbol{a}_t$. GELU activation [20] is used for all MLP layers.

We trained the policy using Proximal Policy Optimization (PPO) [22] from Stable Baselines3 [21], with its default training parameters and rollouts collected in 32 parallel environments. Although Soft Actor-Critic (SAC) [23] is often considered a strong alternative for continuous control problems, it converged much more slowly during the initial stabilization-and-hovering curriculum step.
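A hedged sketch of this PPO setup with Stable-Baselines3 is shown below; `DroneGateEnv` is a placeholder for the project's custom gym-pybullet-drones task, the timestep budget is illustrative, and in practice the policy architecture would be customized as described above rather than using the plain "MlpPolicy".

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# DroneGateEnv is a placeholder for the project's custom gym-pybullet-drones task.
vec_env = make_vec_env(lambda: DroneGateEnv(curriculum_step=3), n_envs=32)

model = PPO("MlpPolicy", vec_env, verbose=1)   # default PPO hyperparameters
model.learn(total_timesteps=10_000_000)        # illustrative training budget
model.save("privileged_policy_step3")
```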

The reward constants from Section III-B-4 used to train base-privileged agents are as follows: $R_\text{crashed}=5$ for the first two curriculum steps, and $R_\text{crashed}=50$ in the third step to impose a higher penalty for drone crashes caused by projectiles, enabling faster learning. The reward for passing gates is $R_\text{passed}=100$. The distance progress reward is scaled with $\lambda_\text{dist}=10$ to amplify the usually small distance progress rewards, and the reward for encountering a projectile is $R_\text{projectile}=2$.

Figure: Reward curves for curriculum steps 2 and 3.

A small final note regarding the base-privileged agent: Following the proposals from Song et al. [10] and Fu et al. [3], we provided the agent with privileged positional and orientation information for the next gate and the one after it. While this approach enabled the policy to perform well during the first two curriculum steps, the agent did not successfully learn to avoid projectiles.

IV-B-2. Student Policy and DDMP Training

As described in Section III-D, for the student policy, we explore both TCN and Transformer architectures for the memory-based network. The TCN is structured with three sequential layers of 1D convolutions, each with 128 output channels and exponentially increasing dilations. The Transformer uses six Transformer encoder layers, also with a hidden size of 128. The policy head for both architectures is a 4-layer MLP with hidden sizes of 128, 64, 32, and 16, matching the policy head of the base-privileged agents.

For the DDMP, we experiment only with the Transformer architecture for the Drone Dynamics Module, due to the discouraging results obtained with the TCN, as discussed in Section IV-C. This setup also uses six Transformer encoder layers with a hidden size of 128.

Figure: Drone Dynamics Module-based Policy (explicit dynamics estimation).

For both vision-based implementations, we use the previous drone states and observations from the last 0.5 seconds, corresponding to $k=12$ previous steps. This helps the model estimate the dynamics of the environment, either implicitly (student policy) or explicitly (DDMP).

To collect the training data of the state-action history for these policies, we unroll the trained base policy $\pi_\text{privileged}$ [3]. For our task, we gathered approximately 1 million episode steps from various randomly generated gate layouts.

For training the student policy and the DDM, we use the AdamW optimizer [24] with a learning rate of $1\times10^{-4}$ for 300 epochs. A MultiStepLR scheduler halves the learning rate every 100 epochs.
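This optimizer and scheduler configuration maps directly onto PyTorch; `model` and `train_one_epoch` below are placeholders for the student policy or DDM and its training loop.

```python
import torch

# `model` stands for either the student policy or the DDM.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 200], gamma=0.5)

for epoch in range(300):
    train_one_epoch(model, optimizer)   # placeholder for the actual training loop
    scheduler.step()                    # halve the learning rate after epochs 100 and 200
```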

IV-B-3. Training Hardware

The models were trained using a CUDA-enabled NVIDIA V100 PCIe 32 GB GPU with 7 TFLOPS provided by the Scitas cluster.

IV-C. Policy Comparison

The results show a significant discrepancy between the baseline privileged agent and our two approaches. We also observe that the complex dynamics in this setup are not adequately captured by less descriptive TCN models, necessitating the use of Transformer architecture for both the student policy and Drone Dynamics Module. Direct comparisons indicate that the DDMP approach slightly outperforms the student policy approach in both metrics.

Approach | Avg. Time to Gate (s) | Avg. Gates Passed (out of 5) | Avg. Projectiles Avoided
Baseline (Privileged Info) | 1.71 | 3.7 | 2.1
Student Policy (TCN) | 1.62 | 0.2 | -
Student Policy (Transformer) | 1.60 | 1.1 | 2.5
Drone Dynamics Module-based Policy | 1.60 | 1.3 | 2.9
Table: Experiment results with projectiles (curriculum step 3).

Despite the larger number of parameters in the Transformer architecture, the policy maintains a reasonable inference time of approximately 0.004 seconds on a CPU.

Insights from the number of projectiles encountered per episode show that both the Student Policy and DDMP face and avoid more projectiles on average than the base-privileged policy. This indicates that the performance drop is not due to an inability to capture dynamic information about projectiles and gates, but rather to the policies losing their way to the next gate.

This is confirmed by results where, even with projectiles turned off, the average number of gates passed is almost identical for our vision-based policies. In contrast, the base-privileged policy, which always has information on the next gate, shows a significant increase in the average number of gates passed.

Approach | Avg. Time to Gate (s) | Avg. Gates Passed (out of 5)
Baseline (Privileged Info) | 1.6 | 4.6
Student Policy (Transformer) | 1.5 | 1.3
Drone Dynamics Module-based Policy | 1.5 | 1.3
Table: Experiment results without projectiles (curriculum step 2).

Further qualitative analysis reveals that both the student agent and DDMP perform well in avoiding projectiles but struggle to find the next gate. The student agent often misses the second gate and gets completely lost, while DDMP is more successful, only losing sight of the gate while avoiding projectiles and attempting to recover.

Both vision-based agents suffer significantly from losing sight of the gate. This issue should be fixable in future work by expanding to multi-camera systems, allowing better coverage and robustness. Additionally, both agents have a very short memory span of only half a second. Integrating a Simultaneous Localization and Mapping (SLAM) system to map and approximate the environment could prevent the agents from losing sight of the gates.

V. Conclusion & Limitations

In this work, we designed and evaluated two approaches for vision-based drone navigation: imitation learning (implicit dynamics estimation) and our innovative RMA-inspired [11] Drone Dynamics Module (explicit dynamics estimation). Both approaches demonstrated promising results, particularly in estimating environmental dynamics through vision, as evidenced by their performance in projectile avoidance.

Overall, we find that the dynamics module approach is more robust and effective than the student policy method. However, further analysis in more complex and realistic environments is needed to fully validate this conclusion.

Additionally, the limitation of using a single frontal camera proved significant, particularly for gate navigation: the drone often loses sight of the gates during evasive maneuvers. Expanding to multi-camera setups and integrating Simultaneous Localization and Mapping (SLAM) could substantially improve the performance of our methods in future research.

VI. Individual Contributions

Lazar Milikic originated the initial project idea of applying RMA principles to the dynamics estimation of objects in the environment. Marko Mekjavic and Lazar Milikic designed the task used in the project, which was further refined by Ahmad Jarrar. Ahmad Jarrar worked on creating the custom gym environment for the project. Lazar Milikic worked on creating the reward function, the curriculum learning design, and training the baseline model. Lazar and Ahmad created the algorithm for the parabolic projectiles. Lazar Milikic, Said Gurbuz, and Marko Mekjavic worked on the vision-based drone and models. Said, Marko, and Ahmad worked on tuning the YOLO model to our needs. Marko Mekjavic and Lazar Milikic worked on the project presentation. Lazar and Marko wrote the report, and all members contributed to the final report. Lazar Milikic worked on designing the models for all agents and the experiments. Ahmad Jarrar created the demonstration videos and evaluation results. Said worked on finalizing the code structure.

Video Presentation

References

  • O. Trullier, S. I. Wiener, A. Berthoz, and J.-A. Meyer, “Biologically based artificial navigation systems: Review and prospects,” Progress in Neurobiology, vol. 51, no. 5, pp. 483–544, 1997. [Online]. Available: sciencedirect:S0301008296000603.
  • C. Pfeiffer and D. Scaramuzza, “Human-piloted drone racing: Visual processing and control,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 3467–3474, 2021.
  • J. Fu, Y. Song, Y. Wu, F. Yu, and D. Scaramuzza, “Learning deep sensorimotor policies for vision-based autonomous drone racing,” 2022.
  • P. Foehn, A. Romero, and D. Scaramuzza, “Time-optimal planning for quadrotor waypoint flight,” Science Robotics, vol. 6, no. 56, Jul. 2021. [Online]. Available: scirobotics.abh1221
  • H. X. Pham, H. M. La, D. Feil-Seifer, and L. V. Nguyen, “Autonomous uav navigation using reinforcement learning,” 2018.
  • W. Koch, R. Mancuso, R. West, and A. Bestavros, “Reinforcement learning for uav attitude control,” ACM Trans. Cyber-Phys. Syst., vol. 3, no. 2, feb 2019. [Online]. Available: doi.3301273
  • C. J. A. A. (Retired), “Combat search and rescue by drone,” Aug 2023. [Online]. Available: combat-search-and-rescue-drone
  • C. Chell, “The global impact of ukraine’s drone revolution on military forces,” Mar 2024. [Online]. Available: global-impact-of-ukraines-drone-revolution
  • J. Xing, L. Bauersfeld, Y. Song, C. Xing, and D. Scaramuzza, “Contrastive learning for enhancing robust scene transfer in vision-based agile flight,” 2024.
  • Y. Song, M. Steinweg, E. Kaufmann, and D. Scaramuzza, “Autonomous drone racing with deep reinforcement learning,” 2021.
  • A. Kumar, Z. Fu, D. Pathak, and J. Malik, “RMA: Rapid motor adaptation for legged robots,” 2021.
  • J. Panerati, H. Zheng, S. Zhou, J. Xu, A. Prorok, and A. P. Schoellig, “Learning to fly – a gym environment with pybullet physics for reinforcement learning of multi-agent quadcopter control,” 2021.
  • Y. Song, S. Naji, E. Kaufmann, A. Loquercio, and D. Scaramuzza, “Flightmare: A flexible quadrotor simulator,” in Proceedings of the 2020 Conference on Robot Learning, 2021, pp. 1147–1157.
  • Z. Fu, A. Kumar, A. Agarwal, H. Qi, J. Malik, and D. Pathak, “Coupling vision and proprioception for navigation of legged robots,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 17252–17262, 2022. [Online]. Available: CorpusID:244896056
  • Ultralytics, “ultralytics/yolov5: v7.0 - YOLOv5 SOTA Realtime Instance Segmentation,” https://github.com/ultralytics/yolov5, 2022, accessed: 1st July, 2024. [Online]. Available: zenodo.7347926
  • C. Lea, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks: A unified approach to action segmentation,” 2016.
  • L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch, “Decision transformer: Reinforcement learning via sequence modeling,” 2021.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2023.
  • S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and Service Robotics, 2017. [Online]. Available: arxiv.1705.05065
  • D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” 2023.
  • A. Hill, A. Raffin, M. Ernestus, A. Gleave, A. Kanervisto, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu, “Stable baselines.”
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” 2017.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” 2018.
  • I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” 2019.