Implicit Mapping

The key idea here is to have a scene representation that is efficient and predictive.

  • Efficient: Memory consumption does not grow much as the scene size increases. A voxel grid map is not an efficient representation, though using an octree makes it a bit better.
  • Predictive: Predict scene properties even where direct observations are not available.

With the rise of NeRF, deep neural networks (DNNs) seem to be a good choice for representing a scene. Given a position (x, y, z) query, a DNN can predict colors, depths, signed distance values, textures, etc.

In this post, I will share a couple of papers related to this topic. It is by no means a comprehensive literature review. Most of them are from Dr. Andrew J. Davison's group.

iMAP: Implicit Mapping and Positioning in Real-Time

  • System input: RGB-D images
  • System output: camera poses + scene representation NN

Scene Representation - iMAP

Network structure

  • $(c, \rho) = F_\theta(p)$
  • Input: 3D coordinate [$p = (x, y, z)$]
  • Output: Color [$c$], Volume Density [$\rho$]
  • Trainable parameters [$\theta$]: 4 hidden layers of feature size 256, two output heads, and a Gaussian positional embedding (TODO); a minimal sketch follows this list
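
A minimal PyTorch sketch of such a network, following the bullet above. The embedding width, the scale of the random projection matrix `B`, and the exact head wiring are placeholders of mine, not values from the paper:

```python
import torch
import torch.nn as nn

class ImplicitSceneMLP(nn.Module):
    """Sketch of an iMAP-style MLP: 4 hidden layers of width 256,
    a Gaussian positional embedding, and two heads (color, density)."""

    def __init__(self, embed_dim=93, hidden=256):
        super().__init__()
        # Gaussian positional embedding: project (x, y, z) with a random
        # matrix B and take the sine of the result (assumed form).
        self.register_buffer("B", torch.randn(3, embed_dim) * 25.0)
        self.backbone = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.color_head = nn.Linear(hidden, 3)    # c = (r, g, b)
        self.density_head = nn.Linear(hidden, 1)  # rho (volume density)

    def forward(self, p):                 # p: (N, 3) world coordinates
        emb = torch.sin(p @ self.B)       # (N, embed_dim)
        h = self.backbone(emb)
        return self.color_head(h), self.density_head(h)
```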

Retrieve predicted color and depth for a pixel

  • Get the pixel position in the image as $(u, v)$
  • Get the viewing direction in the world frame $r = R_{WC}K^{-1}[u, v, 1]^T$, where $K$ is the camera intrinsic matrix
  • Take $N$ samples along the ray as $p_i = d_i r$ with depths $d_i$
  • Query the DNN to get an output for each sample position as $(c_i, \rho_i)$
  • Calculate the ray termination probability $w_i = o_i \prod_{j=1}^{i-1}(1-o_j)$, with $o_i = 1-\exp(-\rho_i (d_{i+1}-d_{i}))$
    • From here we can see the predicted density $\rho_i$ is transformed into an occupancy probability value $o_i$.
    • Then if I want to generate a mesh using marching cubes, I can use the occupancy probability at each voxel center (x, y, z).
  • Generate the predicted depth and color as weighted sums: $\hat{D}[u, v] = \sum_i{w_i d_i}$, $\hat{I}[u, v] = \sum_i{w_i c_i}$
    • In addition to the mean, we can also get the depth variance, which is used to weight the residuals in the non-linear optimization (see the rendering sketch below).
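
The list above fits in a few lines of code. This PyTorch sketch (function and variable names are mine) turns per-sample colors, densities, and depths for one ray into the rendered color, depth, and depth variance:

```python
import torch

def render_ray(c, rho, d):
    """Render one ray from per-sample colors c (N, 3), densities rho (N,),
    and sample depths d (N,), following the weighting above."""
    delta = d[1:] - d[:-1]                              # bin widths d_{i+1} - d_i
    o = 1.0 - torch.exp(-rho[:-1] * delta)              # occupancy probabilities o_i
    # ray termination probability: w_i = o_i * prod_{j<i} (1 - o_j)
    transmittance = torch.cumprod(
        torch.cat([torch.ones_like(o[:1]), 1.0 - o[:-1]]), dim=0)
    w = o * transmittance
    depth = (w * d[:-1]).sum()                          # \hat{D}[u, v]
    color = (w[:, None] * c[:-1]).sum(dim=0)            # \hat{I}[u, v]
    depth_var = (w * (d[:-1] - depth) ** 2).sum()       # depth variance
    return color, depth, depth_var
```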

Joint SLAM Optimization - iMAP

Optimization Parameters

  • Keyframe Poses (6 DoF): $T$
  • Network Parameter: $\theta$

Objective function

  • Photometric loss: $|I[u, v] - \hat{I}[u, v]|$
  • Geometric loss: $|D[u, v] - \hat{D}[u, v]|$ (a rough sketch of the combined objective follows this list)
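
As a rough sketch of how these residuals could be combined over a batch of sampled pixels. The L1 norms, the depth-variance weighting, and the balance weight `lam` are my assumptions for illustration, not values taken from the paper:

```python
import torch

def slam_loss(I, D, I_hat, D_hat, D_var, lam=0.5, eps=1e-6):
    """Photometric + geometric loss over a batch of sampled pixels.
    I, I_hat: (M, 3) observed / rendered colors; D, D_hat, D_var: (M,) depths.
    The variance weighting and lam are illustrative assumptions."""
    photometric = (I - I_hat).abs().mean()
    geometric = ((D - D_hat).abs() / (D_var.sqrt() + eps)).mean()
    return photometric + lam * geometric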

Tracking - iMAP

The DNN parameters $\theta$ are copied as $\theta_i$ when keyframe $i$ is inserted, and $\theta_i$ is used for tracking.

Optimization Parameters

  • New frame pose (6 DoF): $T$

Objective function

  • Photometric loss: $|I[u, v] - \hat{I}[u, v]|$
  • Geometric loss: $|D[u, v] - \hat{D}[u, v]|$

Active Sampling - iMAP

The idea is simple. Using all pixels in every image of the stream for online training is not feasible, so we have to select a subset of pixels and images. Instead of selecting them uniformly, we sample them with weights computed from the rendering loss: the larger the loss, the larger the weight, as sketched below.
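
A minimal NumPy sketch of loss-proportional sampling (the function and variable names are mine, and the region granularity is a placeholder):

```python
import numpy as np

def sample_regions(per_region_loss, n_samples, rng=None):
    """Draw pixel-region indices with probability proportional to their
    most recent rendering loss (larger loss -> sampled more often)."""
    rng = rng or np.random.default_rng()
    weights = per_region_loss / per_region_loss.sum()
    return rng.choice(len(per_region_loss), size=n_samples, p=weights)

# Example: regions with larger loss get picked more often.
losses = np.array([0.1, 0.1, 0.8, 2.0])
idx = sample_regions(losses, n_samples=100)
```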

Keyframe Selection - iMAP

Render a uniform set of pixel samples $S$ and calculate the proportion $P$ with a normalised depth error smaller than a threshold, to measure the fraction of the frame already explained by our map snapshot.

$$ \begin{align} P = \frac{1}{|S|}\sum_{[u, v] \in S}\mathbb{1}(\frac{|D[u, v]-\hat{D}[u, v]|}{D[u, v]} < t_D) \end{align} $$
  • The smaller $\frac{|D[u, v]-\hat{D}[u, v]|}{D[u, v]}$ is, the better the map snapshot explains that pixel.
  • The larger $P$ is, the better the map snapshot explains the frame.
  • So when $P$ is smaller than a threshold, we add a new keyframe (sketched below).
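
A small NumPy sketch of this check; the threshold values `t_D` and `P_min` are illustrative placeholders, not the paper's settings:

```python
import numpy as np

def keyframe_needed(D, D_hat, t_D=0.1, P_min=0.65):
    """Decide whether to insert a keyframe: P is the fraction of sampled
    pixels whose normalized depth error is below t_D."""
    err = np.abs(D - D_hat) / np.maximum(D, 1e-6)
    P = np.mean(err < t_D)
    return P < P_min
```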

NICE-SLAM: Neural Implicit Scalable Encoding for SLAM

Scene Representation

iMAP uses a single MLP to represent the whole scene, while NICE-SLAM uses a hierarchical scene representation. My understanding is that this allows it to capture local structure better, because a single scene-representation network has to be updated globally. However, I think this also increases memory usage, similar to voxel grid maps.

Network structure

  • 4 feature grids:
    • color feature grid $\psi_w$: 16 cm voxels, trainable parameters
    • geometry feature grid $\phi_0$: 2 m voxels, coarse grid, trainable parameters
    • geometry feature grid $\phi_1$: 32 cm voxels, mid-level grid, trainable parameters
    • geometry feature grid $\phi_2$: 16 cm voxels, fine-level grid, trainable parameters
  • 4 feature decoders:
    • color feature decoder $g(h(p), \psi_w(p)) \rightarrow rgb$:
      • $rgb$: color channels
      • $h()$: positional encoding, trainable parameters
      • $\psi_w()$: tri-linear interpolation, trainable parameters
    • geometry decoder $f_0(h(p), \phi_0(p)) \rightarrow o_0$:
      • $o$ is the occupancy probability
      • $\phi_0()$: tri-linear interpolation, fixed pretrained parameters
    • geometry decoder $f_1(h(p), \phi_1(p)) \rightarrow o_1$:
      • $\phi_1()$: tri-linear interpolation, fixed pretrained parameters
    • geometry decoder $f_2(h(p), \phi_2(p)) \rightarrow \Delta o$:
      • $\phi_2()$: tri-linear interpolation, fixed pretrained parameters
  • Hidden feature dimension of 32 and 5 fully-connected blocks (a feature-query sketch follows this list)
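
To make "query a feature grid with tri-linear interpolation, then decode" concrete, here is a rough PyTorch sketch of one grid level. This is not the NICE-SLAM implementation: the grid resolution, the normalized-coordinate convention, the decoder layers, and the omission of the positional encoding $h(p)$ are all simplifications of mine:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GridLevel(nn.Module):
    """One feature grid plus its decoder. p is expected in normalized
    coordinates in [-1, 1]^3 (an assumption for this sketch)."""

    def __init__(self, res=32, feat_dim=32):
        super().__init__()
        self.grid = nn.Parameter(torch.zeros(1, feat_dim, res, res, res))
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, p):                       # p: (N, 3) in [-1, 1]
        # grid_sample on a 5D tensor performs the tri-linear interpolation phi(p).
        q = p.view(1, -1, 1, 1, 3)
        feat = F.grid_sample(self.grid, q, align_corners=True)  # (1, C, N, 1, 1)
        feat = feat.view(self.grid.shape[1], -1).t()            # (N, C)
        return torch.sigmoid(self.decoder(feat))                # occupancy in [0, 1]
```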

Retrieve predicted color and depth for a pixel

Same as iMAP except that:

  • The network directly predicts occupancy
  • Sample more points near the depth observation
  • Two depths are rendered: a coarse depth obtained using $o_0$ and a fine depth obtained using $o_1+\Delta o$

Joint SLAM Optimization

Optimization Parameters

  • Keyframe Poses (6 DoF): $T$
  • Network Parameter: $\theta$

Objective function

  • Photometric loss: $ I[u, v] - \hat{I}[u, v]$
  • Geometric loss including using both coarse and fine depths: ${D}[u, v] - \hat{D}[u, v]$

Multi-stage training

  • First: only train the mid-level feature grid
  • Second: train both the mid-level and fine-level grids
  • Finally: train everything: coarse-level feature grid, mid-level and fine-level grids, color features, color decoder, and poses (see the staged-optimizer sketch below)
  • **??? When to switch to the next stage? How to run in real time?**
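
One way to picture the staged schedule is as different optimizer parameter groups per stage. This is only a sketch of my reading of the schedule above; the grouping, learning rate, and data structures (`grids` as a dict of parameters, `poses` as a list of pose tensors) are assumptions of mine:

```python
import torch

def build_optimizer(stage, grids, color_decoder, poses, lr=1e-3):
    """Pick which parameters are optimized at each training stage."""
    if stage == 1:                       # stage 1: mid-level grid only
        params = [grids["mid"]]
    elif stage == 2:                     # stage 2: mid + fine grids
        params = [grids["mid"], grids["fine"]]
    else:                                # stage 3: everything, including poses
        params = (list(grids.values())
                  + list(color_decoder.parameters())
                  + list(poses))
    return torch.optim.Adam(params, lr=lr)
```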

Tracking

Same as iMAP, but it also filters out pixels with a large depth/color re-rendering loss.

Keyframe Selection

Same as iMAP.

iSDF: Real-Time Neural Signed Distance Fields for Robot Perception

Key ideas:

  • Uses posed images, which means it requires an additional tracking system. I assume it would be a sparse feature-based SLAM system.
  • Directly regresses the signed distance function (SDF) and normals.
  • I think the Atlas paper below is more interesting.

Atlas: End-to-End 3D Scene Reconstruction from Posed Images

CodeSLAM — Learning a Compact, Optimisable Representation for Dense Visual SLAM

Key ideas:

  • Pre-train a decoder network and an encoder network (the encoder is not used in the SLAM system).
  • Given an RGB image and a code (essentially a feature embedding vector), the decoder network outputs a depth map.
  • The SLAM system minimizes geometric errors by optimizing over camera poses and the code, while the decoder network weights are kept fixed.
  • Important note: the paper still uses LM-based iterative non-linear least-squares solvers, so in order to calculate the Jacobian with respect to the code, they use a decoder network that is linear in the code (a minimal Gauss-Newton sketch follows this list).
  • I think this paper can be considered pioneering work, but it is also somewhat outdated by now.
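
To illustrate why a linear-in-code decoder matters for the solver: if the decoder produces $D(c) = D_0 + Jc$ from the image, the Jacobian of the depth with respect to the code is just the constant matrix $J$. The sketch below shows a single damped Gauss-Newton step on the code under that assumption; the names and shapes are mine, not CodeSLAM's:

```python
import numpy as np

def gauss_newton_code_step(D_obs, D0, J, code, damping=1e-3):
    """One damped Gauss-Newton / LM step on the code for the geometric
    residual r(c) = D_obs - (D0 + J @ c). Because the decoder is linear in
    the code, its Jacobian w.r.t. the code is the constant matrix -J.
    D_obs, D0: (H*W,) depths; J: (H*W, code_dim); code: (code_dim,)."""
    r = D_obs - (D0 + J @ code)                  # geometric residual
    H = J.T @ J + damping * np.eye(J.shape[1])   # damped approximate Hessian
    delta = np.linalg.solve(H, J.T @ r)
    return code + delta
```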

Reference:

  • Sucar et al. iMAP: Implicit Mapping and Positioning in Real-Time
  • Zhu et al. NICE-SLAM: Neural Implicit Scalable Encoding for SLAM
  • Ortiz et al. iSDF: Real-Time Neural Signed Distance Fields for Robot Perception
  • Sucar et al. NodeSLAM: Neural Object Descriptors for Multi-View Shape Reconstruction
  • Czarnowski et al. DeepFactors: Real-Time Probabilistic Dense Monocular SLAM
  • Bloesch et al. CodeSLAM — Learning a Compact, Optimisable Representation for Dense Visual SLAM
  • Mildenhall et al. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
  • Murez et al. Atlas: End-to-End 3D Scene Reconstruction from Posed Images

Interesting Deep SLAM Papers:

  • Teed et al. RAFT: Recurrent All-Pairs Field Transforms for Optical Flow (ECCV 2020 best paper)
  • Guizilini et al. Learning Optical Flow, Depth, and Scene Flow without Real-World Labels
  • Teed et al. DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras

Deep SLAM papers

RAFT: Recurrent All-Pairs Field Transforms for Optical Flow

This is one of my favorite deep learning papers so far. The idea is simple and clear, and it mimics the traditional approach to estimating optical flow.

This blog is converted from imap.ipynb
Written on April 26, 2022