Created by Matthew Earl on March 06, 2021.

## Introduction

The landing of the Mars 2020 Perseverance rover last month amazed the world. The fascinating footage of the descent shows each stage of the sequence. If you have not seen it already, you can watch it here.

One thing that I found remarkable was the self-similarity of the Martian terrain. As the lander descends towards the surface it is hard to get a sense of scale, since there is no familiar frame of reference to tell us how far away the ground is. This led me to embark on a project in which I reproject the footage onto a satellite image obtained from the Mars Express orbiter, along with a scale to show how large features on the ground really are:

In this post I'm going to explain how I used Python, OpenCV, PyTorch, and Blender to produce the above footage.

## Keypoints and correspondences

Producing my video involves distorting the frames of the original footage so that each frame lines up with the satellite image. The standard way of doing this is to:

- extract some salient keypoints from each image
- find correspondences between the points
- find a mathematical function that maps points in the first image to those in the second image.

The details of implementing the above are described in this OpenCV tutorial, but I will summarize the approach here.

Breaking this down, on the left is a frame from the video that we wish to align, with the reference satellite image on the right:

First of all, we use OpenCV's Scale Invariant Feature Transform (SIFT) keypoint detector to pull out salient keypoints from the image:

Each red cross here marks a potentially "interesting" point as determined by the SIFT algorithm. Associated with each point (but not shown) is a vector of 128 values which describes the part of the image that surrounds the keypoint. The idea is that this descriptor is invariant to things like scale (as the name implies), rotation, and lighting differences. We can then match up points in our pair of images with similar descriptors:

## Projective transformations

Now that we have found the keypoint pairs, the next step is to find a transformation that maps the keypoints from the video frame onto the corresponding keypoints of the satellite image. To do this, we make use of a class of transformations known as projective transformations. Projective transformations describe how fixed points on a flat plane change apparent position when viewed from a different location and perspective, which is useful to us because the surface of Mars can be well approximated by a flat plane at these distances. This assumes that the camera conforms to a rectilinear perspective projection (i.e. without lens distortion), which appears to be the case.

A projective transformation is represented by a 3×3 matrix \(M\). To apply such a transformation to a 2D point \(v\) we first append a 1 to give a 3-vector, then multiply by the matrix:

\[v' = M \begin{bmatrix}
v_x \\
v_y \\
1
\end{bmatrix}\]

To get back to a 2D point, the result is divided by its third element and truncated to a 2-vector:

\[v_\text{projected} = \begin{bmatrix}
v'_x / v'_z \\
v'_y / v'_z
\end{bmatrix}\]

This can be visualized by plotting the points on the \(z=1\) plane, applying the transformation, and then projecting each point towards the origin, back onto the \(z=1\) plane:
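In code, the two formulas above amount to just a few lines of NumPy (the translation matrix here is an arbitrary example):

```python
import numpy as np

def apply_projective(M, v):
    """Apply the 3x3 projective transformation M to a 2D point v."""
    v_prime = M @ np.array([v[0], v[1], 1.0])  # append a 1, multiply by M
    return v_prime[:2] / v_prime[2]            # divide by the third element

# A pure translation by (5, -3), written as a projective transformation.
M = np.array([[1.0, 0.0,  5.0],
              [0.0, 1.0, -3.0],
              [0.0, 0.0,  1.0]])
```

For example, this matrix maps the point (2, 2) to (7, −1).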

When we talk about composing projective transformations, what we are really doing is multiplying the underlying matrices: projective transformations have the property that the composition of two transformations is the same as the projective transformation given by the product of their respective matrices. Written symbolically:

\[\forall x \in \mathbb{R}^2 \colon p_{M_1} ( p_{M_2} (x) ) = p_{M_1 M_2} (x)\]

where \(p_{M}\) denotes the projective transformation associated with the 3×3 matrix \(M\).
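This property is easy to check numerically with a pair of arbitrary matrices:

```python
import numpy as np

def project(M, v):
    # Projective transformation of a 2D point: append 1, multiply, divide by z.
    w = M @ np.array([v[0], v[1], 1.0])
    return w[:2] / w[2]

rng = np.random.default_rng(1)
M1 = rng.random((3, 3)) + np.eye(3)  # two arbitrary well-conditioned matrices
M2 = rng.random((3, 3)) + np.eye(3)
x = (0.3, 0.5)

composed = project(M1, project(M2, x))  # p_{M1}(p_{M2}(x))
product = project(M1 @ M2, x)           # p_{M1 M2}(x)
```

Both ways of evaluating the transformation give the same 2D point, up to floating-point error.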

Finding the transformation is done using a RANSAC approach. For more details on RANSAC please see my Pluto flyby post.

Once we have a transformation for each frame, we can reproject each video frame into the frame of reference of the satellite image, thus obtaining the stabilized view.

## Finding transformations for each frame

Unfortunately it is not simply a case of repeating the above process for every frame in order to produce a complete video, because the algorithm is not able to extract enough correspondences for every frame.

To solve this, we also look for transformations between the video frames themselves. The idea being that if a frame has no direct transformation linking it to a satellite image, but we do have a transformation linking it to another frame that is itself linked to the satellite image, then we can simply compose the two transformations to map the original frame onto the satellite view.

So, I labelled every thirtieth frame (i.e. one frame per second) as a "keyframe", and then exhaustively searched for transformations between every pair of keyframes. For the remaining frames I searched for transformations to the nearest keyframe.

This results in a fairly dense graph with one node per frame, and one edge per transformation found. Here is a simplified example, with keyframes at every 5 frames instead of at every 30:

Any path from the satellite node to a particular frame's node represents a series of transformations that, once composed, will map the frame onto the satellite view.

We can start by selecting one path for each node. Doing a breadth-first search from the satellite node will give us a path to every frame while also guaranteeing that it is the shortest possible:

We want the shortest possible path, because small errors accumulate with each extra transformation.
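The path selection itself is a textbook breadth-first search. A minimal sketch over a toy version of the graph (node names here are made up for illustration):

```python
from collections import deque

# Toy graph: "sat" is the satellite node; the rest are video frames.
# Each edge is a transformation that was successfully found.
edges = {
    "sat": ["f0", "f5"],
    "f0": ["sat", "f5", "f1"],
    "f1": ["f0", "f2"],
    "f2": ["f1"],
    "f5": ["sat", "f0", "f10"],
    "f10": ["f5"],
}

# Breadth-first search from the satellite node, recording each node's
# parent so we can read off the shortest chain of transformations.
parent = {"sat": None}
queue = deque(["sat"])
while queue:
    node = queue.popleft()
    for nbr in edges[node]:
        if nbr not in parent:
            parent[nbr] = node
            queue.append(nbr)

def path_to(node):
    """Return the shortest path from "sat" to the given node."""
    out = []
    while node is not None:
        out.append(node)
        node = parent[node]
    return out[::-1]
```

Composing the transformations along `path_to(frame)` then maps that frame onto the satellite view.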

Here is a short clip made using shortest-path transformations:

## Optimization

While the above approach yields a decent reprojection, it is not perfect. There are discrete mode switches wherever the shortest path changes.

If we incorporate all correspondences, and not just those on the shortest path, this provides more information and results in smoother and more accurate transformations.

To do this, I wrote a loss function which returns the overall reprojection error, given a satellite-relative transformation for each image:
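The original listing is not reproduced here; a PyTorch sketch consistent with the description that follows (argument names as described, implementation details assumed) might look like:

```python
import torch

def loss(frame_transforms, src_pts, dst_pts, src_idx, dst_idx):
    # frame_transforms[i] maps satellite -> frame i, so to bring each
    # source point back into the satellite frame we solve with (rather
    # than multiply by) the source frame's matrix.
    sat_pts = torch.linalg.solve(frame_transforms[src_idx],
                                 src_pts.unsqueeze(-1))
    # Map those satellite-frame points into the second frame of each pair.
    proj = (frame_transforms[dst_idx] @ sat_pts).squeeze(-1)
    # Compare on the z=1 plane: sum of squared Euclidean distances.
    proj_2d = proj[:, :2] / proj[:, 2:3]
    dst_2d = dst_pts[:, :2] / dst_pts[:, 2:3]
    return ((proj_2d - dst_2d) ** 2).sum()
```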

`src_pts` and `dst_pts` are both `N x 3` arrays, representing each pair of points in the dataset. `frame_transforms` is an `M x 3 x 3` array representing the candidate transformations, `M` being the number of frames in the video. `frame_transforms` are relative to the satellite image, which is to say a point in the satellite image, when transformed with `frame_transforms[i]`, should give the corresponding point in frame `i`.

Since there are many point-pairs per frame, `src_idx` and `dst_idx` are used to map each half of each point-pair to the corresponding video frame.

The `loss` function proceeds by taking the first points from each pair, mapping them back into the satellite image's frame of reference, then mapping them into the frame of reference of the second image. With accurate frame transformations and perfect correspondences, these transformed points should lie very close to the corresponding set of second points. The final line of the `loss` function then measures the Euclidean distance (sum of squares) between the reprojected first points and the (unmodified) second points. The idea is that if we find a set of `frame_transforms` with a lower loss, then we will have a more accurate set of transformations.

`loss` is written using Torch. Torch is an automatic differentiation framework with functionality for applying gradient descent (among other things). As such we can use it to iteratively improve our `frame_transforms`:

`dataset` is constructed from the set of correspondences, and the `initial_frame_transforms` are those derived from composing the transformations along the shortest paths.

After running this loop for a while we obtain the final set of transformations for each frame. This produces a more accurate set of transformations:

## Rendering

To produce the final video I used the 3D modelling and rendering application Blender. I used Blender's rich Python scripting interface to animate a quad whose corners track the reprojected video's corners. To get the right texture for the quad I took advantage of Blender's shader system:

In general, the shader system decides how a particular point on a surface should be shaded, which is typically a function of incoming light, view direction, and properties of the surface. Here I am using it in a very simple way which calculates what colour a point on the quad should be, given the point's coordinates in 3D space.

Here is a breakdown of the different stages:

- Take the position of the point being coloured, and replace the Z component with a 1. This is the first stage of the projective transformation, where we turn the two-vector into a 3-vector by appending a one.
- Multiply this 3-vector by a matrix defined by the constants shown here. These constants are in fact animated so that on any given frame they refer to `frame_transforms[frame_num]`.
- Divide through by z (project onto the z=1 plane).
- At this point the coordinates are in terms of pixels in the video frame. However, the next stage needs them to be in the range 0 to 1, so divide by the video width and height here.
- Look up the given coordinates in the video, and output the corresponding colour.
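Outside Blender, the same chain of stages can be written down directly. A NumPy sketch, where `frame_transform` stands in for the animated constants:

```python
import numpy as np

def quad_point_to_uv(point_xy, frame_transform, video_w, video_h):
    """Map a point on the quad to 0..1 texture coordinates, mirroring
    the shader stages described above."""
    v = np.array([point_xy[0], point_xy[1], 1.0])  # replace Z with a 1
    v = frame_transform @ v                        # multiply by the matrix
    v = v / v[2]                                   # divide through by z
    return v[0] / video_w, v[1] / video_h          # scale into 0..1
```

For instance, with the identity transform, the point (640, 360) in a 1280×720 video lands at texture coordinates (0.5, 0.5).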

## Final touches

There are a few extra points that needed addressing to produce the final video:

- I used multiple satellite images rather than just one. However, I designate one as the "reference frame" (i.e. the frame with the identity transformation) and treat the rest as if they were video keyframes.
- During the early part of the video, the rover's heatshield is visible. Without intervention, some frame correspondences track the heatshield (which is itself moving) rather than the terrain, causing poor tracking. So, I manually extracted some keypoints from the heatshield on a particular frame, and ignored all keypoints that matched one of the heatshield's keypoints.
- Rarely, degenerate frame correspondences are found. When all matching keypoints lie on a line you get multiple solutions, corresponding to rotations about that line. Even if the matching keypoints are not exactly on a line but are close to one, the transformation found can be inaccurate. There was one such image pair that caused this problem in my video, which I manually excluded.

## Conclusion

I have shown that the footage from the Perseverance rover's descent can be stabilized and aligned with a reference satellite image. While I am happy with the result, and it certainly helps give context to the raw footage, there are many ways in which it could be improved. For example, during the early part of the video there are not many keypoints found by SIFT, which manifests itself as inaccuracy in the tracking. Perhaps experimenting with different keypoint algorithms would yield more usable keypoints.

There are also other routes to solving the problem which I have not explored here. For example, the problem is quite similar to that of conventional video stabilization. Perhaps I could use an off-the-shelf solver to produce a similar result.