
CVPR 2021 | The Right Remedy: Finding Predefined Landmarks with Image Segmentation and Pixel Voting

2022-02-22

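The goal of visual localization is to estimate the six-degree-of-freedom (6-DoF) camera pose from an image, that is, three degrees of freedom for position and three for rotation. Two families of methods currently dominate: SfM-based visual localization and scene coordinate regression.

Although scene coordinate regression methods already perform well for visual localization in small static scenes, they still regress many low-quality scene coordinates, which degrades the accuracy of camera pose estimation. To address this problem, we propose a novel visual localization framework, VS-Net, and evaluate it on several public datasets, where it outperforms previous scene coordinate regression methods and some representative SfM-based visual localization methods.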



VS-Net: Voting and Segmentation for Visual Localization

Zhaoyang Huang1,2*  Han Zhou1*  Yijin Li1  Bangbang Yang1  Yan Xu2  Xiaowei Zhou1  Hujun Bao1  Guofeng Zhang1  Hongsheng Li2,3

1State Key Lab of CAD&CG, Zhejiang University  2CUHK-SenseTime Joint Laboratory, The Chinese University of Hong Kong  3School of CTS, Xidian University


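Part 1 Paper Overview

Although scene coordinate regression methods already perform well for visual localization in small static scenes, they still regress many low-quality scene coordinates, which degrades the accuracy of camera pose estimation. To address this problem, we propose a novel visual localization framework that defines a set of learnable scene-specific landmarks for each scene and uses them to establish 2D-to-3D correspondences between the query image and the 3D map. In the landmark generation stage, the 3D surface of the target scene is uniformly partitioned into small patches, and the center of each patch is taken as a scene-specific landmark. To recover the scene-specific landmarks robustly and accurately, we propose VS-Net, a network that jointly predicts segmentation and pixel votes: its segmentation branch assigns the pixels of the 2D image to different landmark patches, and its pixel voting branch estimates the landmark location of each patch within the 2D image. Because a scene may contain 5,000 landmarks or more, training a segmentation network with so many classes using the usual cross-entropy loss is too expensive in both computation and GPU memory. We therefore further propose a new prototype-based triplet loss with an online negative mining strategy, which can effectively supervise semantic segmentation networks with a large number of labels. In summary, the main contributions of this work are as follows: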


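We propose performing visual localization with scene-specific landmarks, and locating these landmarks in the image through voting-by-segmentation, which makes camera pose estimation more accurate and robust.

Because the number of scene landmarks is very large (that is, the image segmentation has a very large number of labels), we propose a prototype-based triplet loss to handle image segmentation with many labels. To the best of our knowledge, we are the first to address image segmentation in this large-label regime. For image segmentation at 640x480 resolution with 5,000 label classes, the proposed loss needs only about 0.1% of the computation and GPU memory of the conventional cross-entropy loss (26.7 MFLOPS vs. 36.9 GFLOPS; 3.08 MB vs. 5.7 GB).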
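The cost figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check: it assumes float32 values and 1 GB = 1024 MB, and the idea that the cross-entropy memory figure is dominated by one dense H x W x classes logits tensor is our reading, not something the article states.

```python
# Back-of-the-envelope check of the quoted cost figures.
H, W, NUM_CLASSES = 480, 640, 5000
BYTES_PER_FLOAT32 = 4  # assumption: values are stored as float32

# Memory for one dense logits tensor of shape (H, W, NUM_CLASSES).
logits_bytes = H * W * NUM_CLASSES * BYTES_PER_FLOAT32
logits_gib = logits_bytes / 2**30

# Ratios of the quoted costs (proposed loss vs. cross-entropy).
flops_ratio = 26.7e6 / 36.9e9
memory_ratio = 3.08 / (5.7 * 1024)  # MB vs. GB, assuming 1 GB = 1024 MB

print(f"dense logits tensor: {logits_gib:.2f} GiB")
print(f"compute ratio: {flops_ratio:.4%}")
print(f"memory ratio:  {memory_ratio:.4%}")
```

Under these assumptions a single dense logits tensor already takes about 5.72 GiB, close to the quoted 5.7 GB, and both ratios come out slightly below 0.1%, consistent with the "about 0.1%" statement.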


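Part 2 Related Work

1. Visual localization based on SfM (Structure-from-Motion)

Traditional visual localization frameworks build a map with SfM, using general-purpose feature detectors and descriptors. Given a query image, they extract the same kind of 2D features and match them to 3D features in the map via descriptors. The feature detector and descriptor are critical in this framework because they affect both the map quality and how well 2D-3D correspondences are matched in the query image, which together determine localization accuracy. In an SfM-based system, 3D feature points in the map are reconstructed by triangulating multiple corresponding 2D points. These 3D feature points can be very cluttered (see Figure 1(a)), because a single 3D point in the real scene is often represented by several different 3D feature points: when viewpoints change significantly during mapping, the 2D features fail to match. Such low-quality maps hurt localization performance.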


1.jpg

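Figure 1. Comparison of a map built by SfM and a map built with a depth sensor


2. Visual localization based on scene coordinate regression

With the development of deep learning, training a scene-specific neural network to encode the map and using it to localize images of that scene has become another approach to visual localization. Scene coordinate regression methods train a neural network to predict the scene coordinate of every image pixel, thereby building 2D-3D correspondences, and then compute the camera pose with the classic RANSAC-PnP method. This approach can use a 3D map that has no feature database but is more accurate (Figure 1(b) shows a dense map reconstructed with a depth sensor), and it achieves excellent results in small and medium-sized scenes. However, the 2D-3D correspondences it produces are still not accurate enough and contain a high ratio of outliers (see Figure 2(b)). In contrast, our VS-Net produces sparse but more accurate and robust 2D-3D correspondences (see Figure 2(c)), which improves both the accuracy and the robustness of localization.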


2.jpg

Figure 2. Comparison of the reprojection errors of 2D-3D correspondences


Part 3 Method


3.jpg

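Figure 3. The VS-Net visual localization framework


Scene coordinate regression methods are better suited to visual localization in small-scale scenes; they typically establish, for every pixel of the input query image, a 2D-3D correspondence to a point on the scene's 3D surface (the scene coordinate). However, a large fraction of the predicted scene coordinates have high reprojection errors, which raises the chance of localization failure and lowers the accuracy of the subsequent RANSAC-PnP step. To address these issues, we propose using VS-Net to identify a set of scene-specific landmarks (Figure 3) and establish their correspondences with the 3D map to achieve accurate localization. Scene-specific landmarks are a sparse set of 3D points defined directly on the scene's 3D surface. We uniformly partition the 3D surface into a set of patches and take the geometric center of each patch as a scene-specific landmark. Given training images from different viewpoints, we project the generated landmarks and their patches onto the image plane to identify their corresponding pixels. In this way we can generate landmark annotations for all training images.

In the training stage, we use pixel-wise segmentation, similar to semantic segmentation, to predict the ID of the 3D landmark that each pixel belongs to. We also add a branch that localizes each landmark in 2D: it outputs direction vectors pointing to the landmark's 2D projection, so that every pixel is responsible for estimating the 2D location of its landmark.

In the inference stage, given a new input image, we obtain a landmark segmentation map and a landmark location voting map from VS-Net, and then establish 2D-to-3D landmark correspondences from them. Scene coordinate regression methods can only rely on RANSAC-PnP to filter outliers among the 2D-to-3D correspondences. In our method, by contrast, a landmark that does not receive sufficiently high voting confidence is discarded outright, which prevents the camera pose from being estimated from poorly localized landmarks (Figure 2). In addition, correspondences built by scene coordinate methods are easily affected by unstable predictions, whereas in our method slightly perturbed votes do not affect the accuracy of the voted landmark location, because they are filtered out by the within-patch RANSAC intersection algorithm.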


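Scene-specific landmark generation:

n} ∈ R3 are selected as the scene-specific landmarks for localization. Because Supervoxel produces patches of similar size, the generated landmarks are mostly distributed uniformly over the 3D surface, which provides enough landmarks from different viewpoints and therefore benefits localization robustness.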
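As a rough illustration of turning surface patches into landmarks, the sketch below takes a point cloud whose points already carry patch labels and returns one center per patch. The function name `patch_centers`, the toy data, and the use of the plain centroid as the "center" are assumptions for illustration; the supervoxel segmentation itself is not reproduced here.

```python
import numpy as np

def patch_centers(points, patch_ids):
    """Return the sorted patch ids and the centroid of each patch's 3D points."""
    ids = np.unique(patch_ids)
    centers = np.stack([points[patch_ids == k].mean(axis=0) for k in ids])
    return ids, centers

# Toy point cloud: two patches with two surface points each (hypothetical data).
points = np.array([[0.0, 0.0, 0.0],
                   [2.0, 0.0, 0.0],
                   [0.0, 4.0, 0.0],
                   [0.0, 4.0, 2.0]])
patch_ids = np.array([1, 1, 2, 2])  # label 0 is left free for "background"

ids, landmarks = patch_centers(points, patch_ids)
print(ids)
print(landmarks)
```

Here patch 1 yields the landmark (1, 0, 0) and patch 2 yields (0, 4, 1). Reserving 0 for background mirrors the labeling convention described in the text.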

 

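Given the training images and their camera poses, the scene-specific 3D landmarks {q1, . . . , qn} and their associated 3D patches can be projected onto the 2D images. For each image, we generate a landmark segmentation map S ∈ Z^(H×W) and a landmark location voting map d ∈ R^(H×W×2). For patch-based landmark segmentation, the pixel at coordinate pi = (ui, vi) is assigned the landmark label (ID) determined by the projection of the 3D patches. If the region corresponding to a pixel is not covered by any projected patch, such as sky or distant objects, it is assigned the background label 0, indicating that the pixel is not useful for visual localization.

For landmark location voting, we first compute the projected 2D location of landmark qj, lj = P(qj, K, C) ∈ R^2, by projecting the 3D landmark with the camera intrinsic matrix K and the camera pose parameters C. Each pixel belonging to landmark j is responsible for predicting the 2D direction vector di ∈ R^2 that points to the 2D projection of landmark j, i.e.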


4.jpg


where di is a normalized 2D vector giving the direction toward landmark j.
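The ground-truth votes described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the camera pose C is taken to be a world-to-camera rotation R and translation t, the projection is a plain pinhole model, and the vote is the unit vector from the pixel to the projected landmark, which matches the verbal description of di. The function names and numeric values are hypothetical.

```python
import numpy as np

def project(q, K, R, t):
    """Pinhole projection of a 3D world point q to pixel coordinates."""
    x_cam = R @ q + t          # world frame -> camera frame
    uvw = K @ x_cam            # camera frame -> homogeneous pixel coordinates
    return uvw[:2] / uvw[2]

def direction_votes(pixels, landmark_2d, eps=1e-8):
    """Unit vectors pointing from each pixel toward the landmark projection."""
    diff = landmark_2d[None, :] - pixels
    norm = np.linalg.norm(diff, axis=1, keepdims=True)
    return diff / np.maximum(norm, eps)

# Hypothetical intrinsics and an identity pose, for illustration only.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)

q = np.array([0.2, -0.1, 2.0])      # a 3D landmark
l = project(q, K, R, t)             # its 2D projection
pixels = np.array([[370.0, 115.0],  # two pixels assumed to lie on the landmark's patch
                   [340.0, 175.0]])
votes = direction_votes(pixels, l)
print(l)
print(votes)
```

With these values the landmark projects to (370, 215), and the two pixels vote (0, 1) and (0.6, 0.8) respectively.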

 

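Having defined the ground-truth landmark segmentation map and the ground-truth direction voting map, we can supervise VS-Net to predict both. After training, VS-Net predicts the segmentation map and voting map of a query image, from which we establish accurate 2D-to-3D correspondences for robust visual localization.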


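Voting-segmentation network with prototype-based triplet supervision and online learning:


Conventional semantic segmentation typically uses the cross-entropy loss to supervise a complete one-hot class vector for every predicted pixel. Our landmark segmentation, however, must output segmentation maps with a very large number of classes (landmarks) so that every scene-specific landmark can be identified. In this setting, neither the per-pixel cross-entropy loss of conventional semantic segmentation nor the conventional triplet loss is usable.

To solve this problem, we propose a new prototype-based triplet segmentation loss with an online negative sampling strategy to supervise semantic segmentation with a large number of classes. It maintains and updates a set of learnable class prototype embeddings, each representing one semantic class; Pj denotes the embedding of the j-th class. Intuitively, the embedding of a pixel of class j should be close to Pj and far from the prototypes of other classes. Our loss is designed as a triplet loss with an online negative sampling strategy.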


5.jpg

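Figure 4. Online negative sampling strategies


Given the per-pixel feature map E output by the segmentation branch of VS-Net and the set of class prototypes P, we first L2-normalize every feature and prototype, and then optimize them with the prototype-based triplet loss so that each pixel's feature moves closer to the prototype of its own class and away from the prototypes of other classes. For positive and negative sampling we designed two strategies. In the first, all embeddings in the current predicted feature map that share the same landmark ID are averaged dimension-wise to form the anchor feature vector, and positive and negative samples are then selected from the prototype set to supervise training; this is the scheme in Figure 4(a). The negatives chosen this way may not be sufficient, however. To keep the negatives diverse without significantly increasing computation, we compute the k nearest negatives for each pixel, which is the scheme depicted in Figure 4(b).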

6.jpg

where:

7.jpg


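denotes the cosine similarity between a pixel embedding and a class prototype vector, m is the margin of the triplet loss, P_(i+) denotes the ground-truth (positive) class prototype vector corresponding to pixel i, and P_(i-) denotes a sampled non-corresponding (negative) class prototype vector.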
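To make the roles of the cosine similarity, the margin m, and the positive and negative prototypes concrete, here is a minimal NumPy sketch of a prototype-based triplet loss. It assumes the common hinge form max(0, m + cos(E_i, P_(i-)) - cos(E_i, P_(i+))) summed over pixels; the exact formula of the paper is in the equation image and may differ in detail, and the margin value and toy data are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def prototype_triplet_loss(E, P, pos_idx, neg_idx, m=0.5):
    """Hinge-style triplet loss between pixel embeddings and class prototypes.

    E: (N, D) pixel embeddings, P: (C, D) prototypes,
    pos_idx / neg_idx: (N,) indices of the positive / sampled negative class.
    """
    E, P = l2_normalize(E), l2_normalize(P)
    cos_pos = np.sum(E * P[pos_idx], axis=1)  # similarity to own prototype
    cos_neg = np.sum(E * P[neg_idx], axis=1)  # similarity to negative prototype
    return np.maximum(0.0, m + cos_neg - cos_pos).sum()

# Toy example: two pixels, three prototypes in a 2D embedding space.
E = np.array([[1.0, 0.0], [0.0, 2.0]])
P = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
pos_idx = np.array([0, 1])
neg_idx = np.array([1, 2])

loss = prototype_triplet_loss(E, P, pos_idx, neg_idx, m=0.5)
print(loss)
```

In this toy case the first pixel is already separated from its negative by more than the margin and contributes nothing; only the second pixel, whose negative prototype lies at 45 degrees to it, is penalized. Note that the loss touches only two prototypes per pixel, rather than all C classes as a dense cross-entropy would.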

 

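For each pixel, how its negative prototype Pi- is chosen in the prototype-based triplet loss above is critical to final performance; sampling negative classes at random oversimplifies training. Given an input image, we observe that the number of active landmarks (landmarks with at least one pixel in the image) is limited. Moreover, pixels belonging to the same landmark patch are close to each other in feature space and share similar negative prototypes, because their vectors are similar. We therefore propose mining representative negative classes for each active landmark; each pixel then randomly samples a negative class from the mined set to form a representative triplet.

Specifically, given a pixel i with landmark index (class) i+, we first retrieve all pixel embeddings in the input image that belong to landmark i+ and average them to obtain the mean class vector Mi+ of that landmark in the current image. The mean class vector is then used to retrieve the k nearest-neighbor negative prototypes from the prototype embedding set. These kNN negative prototypes are regarded as hard negatives. The triplet loss for pixel i uses a single negative prototype vector Pi- sampled uniformly from the kNN negative prototype set (Equation (2)).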
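The mining procedure just described can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes background pixels have already been removed, that landmark labels index directly into the prototype array, and that "nearest" means highest cosine similarity. The function names and toy data are hypothetical.

```python
import numpy as np

def mine_knn_negatives(E, labels, P, k):
    """For each active landmark, find its k most similar negative prototypes."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    mined = {}
    for c in np.unique(labels):                 # active landmarks only
        mean_vec = E[labels == c].mean(axis=0)  # mean class vector in this image
        sim = P @ mean_vec                      # similarity to every prototype
        sim[c] = -np.inf                        # never pick the positive class
        mined[int(c)] = np.argsort(-sim)[:k]    # k hardest negatives
    return mined

def sample_negatives(labels, mined, rng):
    """Uniformly sample one mined negative class for every pixel."""
    return np.array([rng.choice(mined[int(c)]) for c in labels])

# Toy example: three pixels, four prototypes, 2D embeddings.
P = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
E = np.array([[1.0, 0.1], [1.0, -0.1], [0.1, 1.0]])
labels = np.array([0, 0, 2])

mined = mine_knn_negatives(E, labels, P, k=2)
negatives = sample_negatives(labels, mined, np.random.default_rng(0))
print(mined)
print(negatives)
```

Because the k nearest prototypes are searched once per active landmark rather than once per pixel, the cost of mining grows with the number of landmarks visible in the image, not with the image resolution.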


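Direction-vector-based voting network:


Given the segmentation map produced by the segmentation decoder described above, every pixel in the input image is assigned either a landmark label or an invalid label, which marks objects or regions that are too far away (for example, sky). We use a separate voting decoder to determine the projected 2D locations of the landmarks in the image. For each pixel, this decoder outputs a 2D direction vector pointing to the 2D location of the corresponding landmark. The voting decoder is supervised with the following loss: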

8.jpg

where ‖·‖1 denotes the L1 norm, and the two vectors are the ground-truth and predicted voting directions of pixel i, respectively.
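A minimal sketch of such an L1 voting loss is shown below. Summing over pixels and skipping pixels with the invalid (background) label are assumptions made for illustration; the text only states that the loss compares ground-truth and predicted directions with an L1 norm.

```python
import numpy as np

def voting_loss(d_true, d_pred, valid_mask):
    """Sum of L1 distances between direction maps over valid pixels.

    d_true, d_pred: (H, W, 2) direction maps; valid_mask: (H, W) booleans.
    """
    per_pixel = np.abs(d_true - d_pred).sum(axis=-1)  # L1 norm per pixel
    return per_pixel[valid_mask].sum()

# Toy 1x2 image: the first prediction is off, the second is exact.
d_true = np.array([[[1.0, 0.0], [0.0, 1.0]]])
d_pred = np.array([[[0.8, 0.6], [0.0, 1.0]]])

loss_all = voting_loss(d_true, d_pred, np.array([[True, True]]))
loss_masked = voting_loss(d_true, d_pred, np.array([[False, True]]))
print(loss_all, loss_masked)
```

Masking the first pixel removes its contribution entirely, which is how pixels without a landmark would be kept from influencing the voting branch under this assumption.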


Training and localization:

The overall loss Loverall combines the landmark segmentation loss and the landmark direction voting loss:

9.jpg

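where λ weights the contributions of the loss terms.

At localization time, we group the pixels predicted to share the same landmark label in the landmark segmentation map, and estimate the corresponding landmark location by computing the intersection of that landmark's direction votes in the predicted voting map. We call this the voting-by-segmentation algorithm.

Specifically, given the segmentation map, we first filter out landmark patches whose pixel count is below a threshold Ts, because the 2D landmark location voted for by a patch that is too small is usually unstable. An initial estimate of the landmark's 2D location is then computed by RANSAC with a vector-intersection model: it generates multiple location hypotheses, each as the intersection of two randomly sampled direction votes, and keeps the hypothesis supported by the most inlier votes. The location is further refined by an iterative EM algorithm. In the E-step, we collect the inlier voting vectors of landmark j from a circular region around the current estimate. In the M-step, we adopt the least-squares method introduced by Antonio et al. to compute the updated landmark location from the votes in the circular region. During the iterations, a voted landmark that is not supported by enough direction votes, which indicates low voting consistency, is discarded.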
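The RANSAC stage of this procedure can be sketched as below. This is an illustrative sketch only: it covers the hypothesis-and-inlier-count step, omits the EM refinement and the patch-size filter, and counts a vote as an inlier when its cosine to the hypothesis direction exceeds a threshold, which is one plausible criterion rather than the one confirmed by the article. The function names, thresholds, and toy data are hypothetical.

```python
import numpy as np

def intersect(p1, d1, p2, d2):
    """Intersection of lines p1 + s*d1 and p2 + t*d2, or None if near-parallel."""
    A = np.column_stack([d1, -d2])
    if abs(np.linalg.det(A)) < 1e-8:
        return None
    s, _ = np.linalg.solve(A, p2 - p1)
    return p1 + s * d1

def ransac_landmark(pixels, dirs, iters=100, cos_thresh=0.99, seed=0):
    """Estimate a landmark's 2D location from per-pixel direction votes."""
    rng = np.random.default_rng(seed)
    best, best_count = None, -1
    for _ in range(iters):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        h = intersect(pixels[i], dirs[i], pixels[j], dirs[j])
        if h is None:
            continue
        # A vote is an inlier if it points (almost) straight at the hypothesis.
        to_h = h - pixels
        to_h /= np.maximum(np.linalg.norm(to_h, axis=1, keepdims=True), 1e-8)
        count = int(np.sum(np.sum(to_h * dirs, axis=1) > cos_thresh))
        if count > best_count:
            best, best_count = h, count
    return best, best_count

# Toy patch: six consistent votes for (50, 40) plus two corrupted votes.
landmark = np.array([50.0, 40.0])
inliers = np.array([[40, 40], [50, 30], [60, 50],
                    [45, 55], [30, 20], [70, 35]], dtype=float)
inlier_dirs = landmark - inliers
inlier_dirs /= np.linalg.norm(inlier_dirs, axis=1, keepdims=True)
outliers = np.array([[10, 10], [80, 80]], dtype=float)
outlier_dirs = np.array([[0.0, 1.0], [1.0, 0.0]])

pixels = np.vstack([inliers, outliers])
dirs = np.vstack([inlier_dirs, outlier_dirs])
estimate, support = ransac_landmark(pixels, dirs)
print(estimate, support)
```

In the toy data, any hypothesis built from a corrupted vote lands far from (50, 40) and gathers little support, so the hypothesis with the most inliers is the true location. A landmark whose best hypothesis has too little support could then be dropped, mirroring the discard rule described above.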


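Part 4 Experimental Results

We compare against SfM-based and scene coordinate regression-based visual localization methods on two datasets, Microsoft 7-Scenes and Cambridge Landmarks. As Table 1 shows, our landmark-based approach achieves the best accuracy in all scenes and significantly outperforms the other methods in some of them (for example, GreatCourt and Office).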


10.jpg

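Table 1. Comparison of visual localization accuracy. We compare accuracy using the median camera translation error and the median camera rotation error.


We also compare localization results on some challenging query images. For a given query image, after computing the camera pose with the localization system, we project the reconstructed 3D model into the image at the estimated pose. Comparing the query image with the reprojected rendering gives a qualitative comparison of localization results. As Figure 5 shows, our method still estimates the camera pose well despite severe occlusion by dynamic objects (Figure 5(a)) and poor lighting conditions (Figure 5(b)).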


11.jpg

Figure 5. Visual localization on challenging images

Reference:

[1]Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. Building rome in a day. Communications of the ACM, 54(10):105–112, 2011.

 

[2]Franklin Antonio. Faster line segment intersection. In Graphics Gems III (IBM Version), pages 199–202. Elsevier, 1992.

 

[3]Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. Netvlad: Cnn architecture for weakly supervised place recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5297–5307, 2016.

 

[4]Clemens Arth, Daniel Wagner, Manfred Klopschitz, Arnold Irschara, and Dieter Schmalstieg. Wide area localization on mobile phones. In 2009 8th ieee international symposium on mixed and augmented reality, pages 73–82. IEEE, 2009.

 

[5]Nicolas Aziere and Sinisa Todorovic. Ensemble deep manifold similarity learning using hard proxies. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7299–7307, 2019.

 

[6]Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. In Proceedings of the European conference on computer vision, pages 404–417. Springer, 2006.

 

[7]Eric Brachmann, Alexander Krull, Sebastian Nowozin, Jamie Shotton, Frank Michel, Stefan Gumhold, and Carsten Rother. Dsac-differentiable ransac for camera localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6684–6692, 2017.

 

[8]Eric Brachmann and Carsten Rother. Learning less is more - 6d camera localization via 3d surface regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4654–4662, 2018.

 

[9]Eric Brachmann and Carsten Rother. Expert sample consensus applied to camera re-localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 7525–7534, 2019.

 

[10]Samarth Brahmbhatt, Jinwei Gu, Kihwan Kim, James Hays, and Jan Kautz. Geometry-aware learning of maps for camera localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2616–2625, 2018.

 

[11]Ignas Budvytis, Marvin Teichmann, Tomas Vojir, and Roberto Cipolla. Large scale joint semantic re-localisation and scene understanding via globally unique instance coordinate regression. arXiv preprint arXiv:1909.10239, 2019.

 

[12]Federico Camposeco, Andrea Cohen, Marc Pollefeys, and Torsten Sattler. Hybrid scene compression for visual localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7653–7662, 2019.

 

[13]Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. CoRR, abs/1706.05587, 2017.

 

[14]Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.

 

[15]Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 224–236, 2018.

 

[16]Michael Donoser and Dieter Schmalstieg. Discriminative feature-to-point matching in image-based localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 516–523, 2014.

 

[17]Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-net: A trainable CNN for joint description and detection of local features. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 8092–8101, 2019.

 

[18]Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

 

[19]Yixiao Ge, Haibo Wang, Feng Zhu, Rui Zhao, and Hongsheng Li. Self-supervising fine-grained region similarities for large-scale image localization. arXiv preprint arXiv:2006.03926, 2020.

 

[20]Yisheng He, Wei Sun, Haibin Huang, Jianran Liu, Haoqiang Fan, and Jian Sun. Pvn3d: A deep point-wise 3d keypoints voting network for 6dof pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11632–11641, 2020.

 

[21]Zhaoyang Huang, Yan Xu, Jianping Shi, Xiaowei Zhou, Hujun Bao, and Guofeng Zhang. Prior guided dropout for robust visual localization in dynamic environments. In Proceedings of the IEEE International Conference on Computer Vision, pages 2791–2800, 2019.

 

[22]Marco Imperoli and Alberto Pretto. Active detection and localization of textureless objects in cluttered environments. arXiv preprint arXiv:1603.07022, 2016.

 

[23]Alex Kendall and Roberto Cipolla. Geometric loss functions for camera pose regression with deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5974–5983, 2017.

 

[24]Alex Kendall, Matthew Grimes, and Roberto Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer vision, pages 2938–2946, 2015.

 

[25]Xiaotian Li, Shuzhe Wang, Yi Zhao, Jakob Verbeek, and Juho Kannala. Hierarchical scene coordinate classification and regression for visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11983–11992, 2020.

 

[26]Yunpeng Li, Noah Snavely, and Daniel P Huttenlocher. Location recognition using prioritized feature matching. In European conference on computer vision, pages 791–804. Springer, 2010.

 

[27]Yutian Lin, Lingxi Xie, Yu Wu, Chenggang Yan, and Qi Tian. Unsupervised person re-identification via softened similarity learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3390–3399, 2020.

 

[28]Yuan Liu, Zehong Shen, Zhixuan Lin, Sida Peng, Hujun Bao, and Xiaowei Zhou. Gift: Learning transformation-invariant dense visual descriptors via group cnns. In Advances in Neural Information Processing Systems, pages 6990–7001, 2019.

 

[29]Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.

 

[30]David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.

 

[31]Jean-Michel Morel and Guoshen Yu. Asift: A new framework for fully affine invariant image comparison. SIAM journal on imaging sciences, 2(2):438–469, 2009.

 

[32]Yair Movshovitz-Attias, Alexander Toshev, Thomas K Leung, Sergey Ioffe, and Saurabh Singh. No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pages 360–368, 2017.

 

[33]Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew W Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In ISMAR, volume 11, pages 127–136, 2011.

 

[34]Markus Oberweger, Mahdi Rad, and Vincent Lepetit. Making deep heatmaps robust to partial occlusions for 3d object pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 119–134, 2018.

 

[35]Yuki Ono, Eduard Trulls, Pascal Fua, and Kwang Moo Yi. Lf-net: learning local features from images. In Advances in neural information processing systems, pages 6234–6244, 2018.

 

[36]Jeremie Papon, Alexey Abramov, Markus Schoeler, and Florentin Wörgötter. Voxel cloud connectivity segmentation - supervoxels for point clouds. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, Portland, Oregon, June 22-27 2013.

 

[37]Georgios Pavlakos, Xiaowei Zhou, Aaron Chan, Konstantinos G Derpanis, and Kostas Daniilidis. 6-dof object pose from semantic keypoints. In 2017 IEEE international conference on robotics and automation (ICRA), pages 2011–2018. IEEE, 2017.

 

[38]Sida Peng, Yuan Liu, Qixing Huang, Xiaowei Zhou, and Hujun Bao. Pvnet: Pixel-wise voting network for 6dof pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4561–4570, 2019.

 

[39]Qi Qian, Lei Shang, Baigui Sun, Juhua Hu, Hao Li, and Rong Jin. Softtriple loss: Deep metric learning without triplet sampling. In Proceedings of the IEEE International Conference on Computer Vision, pages 6450–6458, 2019.

 

[40]Tong Qin, Peiliang Li, and Shaojie Shen. Vins-mono: A robust and versatile monocular visual-inertial state estimator. IEEE Transactions on Robotics, 34(4):1004–1020, 2018.

 

[41]Jerome Revaud, Cesar De Souza, Martin Humenberger, and Philippe Weinzaepfel. R2d2: Reliable and repeatable detector and descriptor. In Advances in Neural Information Processing Systems, pages 12405–12415, 2019.

 

[42]Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.

 

[43]Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the IEEE international conference on Computer Vision (ICCV), pages 2564–2571. IEEE, 2011.

 

[44]Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12716–12725, 2019.

 

[45]Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Improving image-based localization by active correspondence search. In European conference on computer vision, pages 752–765. Springer, 2012.

 

[46]Johannes L Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.

 

[47]Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.

 

[48]Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.

 

[49]Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2930–2937, 2013.

 

[50]Chen Song, Jiaru Song, and Qixing Huang. Hybridpose: 6d object pose estimation under hybrid representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 431–440, 2020.

 

[51]Julien Valentin, Matthias Nießner, Jamie Shotton, Andrew Fitzgibbon, Shahram Izadi, and Philip HS Torr. Exploiting uncertainty in regression forests for accurate camera relocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4400–4408, 2015.

 

[52]Bing Wang, Changhao Chen, Chris Xiaoxuan Lu, Peijun Zhao, Niki Trigoni, and Andrew Markham. Atloc: Attention guided camera localization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10393–10401, 2020.

 

[53]Qianqian Wang, Xiaowei Zhou, Bharath Hariharan, and Noah Snavely. Learning feature descriptors using camera pose supervision. arXiv preprint arXiv:2004.13324, 2020.

 

[54]Philippe Weinzaepfel, Gabriela Csurka, Yohann Cabon, and Martin Humenberger. Visual localization by learning objects-of-interest dense match regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5634–5643, 2019.

 

[55]Changchang Wu et al. Visualsfm: A visual structure from motion system. 2011.

 

[56]Chao-Yuan Wu, R Manmatha, Alexander J Smola, and Philipp Krähenbühl. Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2840–2848, 2017.

 

[57]Tong Xiao, Shuang Li, Bochao Wang, Liang Lin, and Xiaogang Wang. Joint detection and identification feature learning for person search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3415–3424, 2017.

 

[58]Yan Xu, Zhaoyang Huang, Kwan-Yee Lin, Xinge Zhu, Jianping Shi, Hujun Bao, Guofeng Zhang, and Hongsheng Li. Selfvoxelo: Self-supervised lidar odometry with voxel-based deep neural networks. Conference on Robot Learning, 2020.

 

[59]Fei Xue, Xin Wu, Shaojun Cai, and Junqiu Wang. Learning multi-view camera relocalization with graph neural networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11372–11381. IEEE, 2020.

 

[60]Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.

 

[61]Bernhard Zeisl, Torsten Sattler, and Marc Pollefeys. Camera pose voting for large-scale image-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 2704–2712, 2015.

 

[62]Guofeng Zhang, Zilong Dong, Jiaya Jia, Tien-Tsin Wong, and Hujun Bao. Efficient non-consecutive feature tracking for structure-from-motion. In European Conference on Computer Vision, pages 422–435. Springer, 2010.

 

[63]Liang Zheng, Yujia Huang, Huchuan Lu, and Yi Yang. Pose-invariant embedding for deep person re-identification. IEEE Transactions on Image Processing, 28(9):4500–4509, 2019.

 

[64]Zilong Zhong, Zhong Qiu Lin, Rene Bidart, Xiaodan Hu, Ibrahim Ben Daya, Zhifeng Li, Wei-Shi Zheng, Jonathan Li, and Alexander Wong. Squeeze-and-attention networks for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13065–13074, 2020.

 

[65]Zhun Zhong, Liang Zheng, Zhiming Luo, Shaozi Li, and Yi Yang. Learning to adapt invariance in memory for person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

 

[66]Lei Zhou, Zixin Luo, Tianwei Shen, Jiahui Zhang, Mingmin Zhen, Yao Yao, Tian Fang, and Long Quan. Kfnet: Learning temporal camera relocalization using kalman filtering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4919–4928, 2020.

 

[67]Siyu Zhu, Tianwei Shen, Lei Zhou, Runze Zhang, Jinglu Wang, Tian Fang, and Long Quan. Parallel structure from motion from local increment to global averaging. arXiv preprint arXiv:1702.08601, 2017.
