// publib.js
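// Publication library data: lookup tables ('venue', 'full_venue', 'author',
// 'coauthor', 'distinction', 'link') followed by 'lib', an array of
// per-publication records whose field order follows the 'keys' array below.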
const publib = {
'venue': ['TPAMI', 'IJCV', 'CVPR', 'CVPRW', 'ICCV', 'ICCVW', 'ECCV', 'ECCVW', 'ICLR', 'AAAI', 'ACCV', 'WACV', 'BMVC', 'ICPR', 'CVIU', 'ICASSP', 'PatternRecognition', 'NeuroComputing', 'EuroGraphics', 'Cybernetics', 'IROS', 'RSS', 'CORL', 'ACM MM', 'PCI', 'NeurIPS', 'PP', 'ICLRW', 'TIP', 'TNNLS', 'ICML', 'EMNLP'],
'full_venue': ['IEEE Transactions on Pattern Analysis and Machine Intelligence', 'International Journal of Computer Vision', 'IEEE Conference on Computer Vision and Pattern Recognition', 'IEEE Conference on Computer Vision and Pattern Recognition Workshops', 'IEEE International Conference on Computer Vision', 'IEEE International Conference on Computer Vision Workshops', 'European Conference on Computer Vision', 'European Conference on Computer Vision Workshops', 'International Conference on Learning Representations', 'Association for the Advancement of Artificial Intelligence', 'Asian Conference on Computer Vision', 'IEEE Winter Conference on Applications of Computer Vision', 'British Machine Vision Conference', 'International Conference on Pattern Recognition', 'Computer Vision and Image Understanding', 'IEEE International Conference on Acoustics, Speech and Signal Processing', 'Pattern Recognition Journal', 'Neurocomputing Journal', 'EuroGraphics Computer Graphics Forum', 'IEEE Transactions on Cybernetics', 'International Conference on Intelligent Robots and Systems', 'Robotics Science and Systems', 'Conference on Robot Learning', 'ACM International Conference on Multimedia', 'Proceedings of the Combustion Institute', 'Neural Information Processing Systems', 'Plant Physiology', 'International Conference on Learning Representations Workshops', 'IEEE Transactions on Image Processing', 'IEEE Transactions on Neural Networks and Learning Systems', 'International Conference on Machine Learning', 'Empirical Methods in Natural Language Processing'],
'author': ['Bernard Ghanem', 'Ali Thabet', 'Silvio Giancola', 'Adel Bibi', 'Guohao Li', 'Hani Itani', 'Humam Alwassel', 'Jean Lahoud', 'Jesus Zarzar', 'Modar Alfadly', 'Mengmeng Xu', 'Salman Alsubaihi', 'Sara Shaheen', 'Lama Affara', 'Fabian Caba Heilbron', 'Victor Escorcia', 'Baoyuan Wu', 'Ganzhao Yuan', 'Jia-Hong Huang', 'Jian Zhang', 'Yancheng Bai', 'Matthias Müller', 'Yongqiang Zhang', 'Cuong Duc Dao', 'Gopal Sharma', 'Juan Carlos Niebles', 'Rafał Protasiuk', 'Tarek Dghaily', 'Vincent Casser', 'Xin Yu', 'Neil Smith', 'Sally Sisi Qu', 'Peter Wonka', 'Dominik L Michels', 'Shyamal Buch', 'Jens Schneider', 'Aditya Khosla', 'Akshat Dave', 'Alexandre Heili', 'Alexey Dosovitskiy', 'Alyn Rockwood', 'Basura Fernando', 'Caigui Jiang', 'Changsheng Xu', 'Chuanqi Shen', 'Daniel Asmar', 'Fan Jia', 'Fatemeh Shiri', 'Fatih Porikli', 'Hailin Jin', 'Hongxun Yao', 'Indriyati Atmosukarto', 'Jagannadan Varadarajan', 'Jean-Marc Odobez', 'Joon-Young Lee', 'Joshua Peterson', 'Karthik Muthuswamy', 'Li Fei-Fei', 'Liangliang Nan', 'Marc Farra', 'Marc Pollefeys', 'Martin R Oswald', 'Maya Kreidieh', 'Ming-Hsuan Yang', 'Mingli Ding', 'Mohammed Hachama', 'Mohieddine Amine', 'Narendra Ahuja', 'Peng Sun', 'Qiang Ji', 'Rachit Dubey', 'René Ranftl', 'Richard Hartley', 'Shaunak Ahuja', 'Shuicheng Yan', 'Si Liu', 'Siwei Lyu', 'Tianzhu Zhang', 'Vladlen Koltun', 'Wayner Barrios', 'Wei Liu', 'Wei-Shi Zheng', 'Weidong Chen', 'Yongping Zhao', 'Yongqiang Li', 'Yuanhao Cao', 'Zhenjie Zhang', 'Zhifeng Hao', 'El Houcine Bergou', 'Peter Richtarik', 'Ozan Sener', 'Eduard Gorbunov', 'Abdullah Hamdi', 'Sara Rojas', 'Juan C. Pérez', 'Guillaume Jeanneret', 'Pablo Arbeláez', 'Motasem Alfarra', 'Chen Zhao', 'Alejandro Pardo', 'Nursulu Kuzhagaliyeva', 'Eshan Singh', 'S. Mani Sarathy', 'Dhruv Mahajan', 'Bruno Korbar', 'Lorenzo Torresani', 'Du Tran', 'Justine Braguy', 'Merey Ramazanova', 'Muhammad Jamil', 'Boubacar A. Kountche', 'Randa Zarban', 'Abrar Felemban', 'Jian You Wang', 'Pei-Yu Lin', 'Imran Haider', 'Matias Zurbriggen', 'Salim Al-Babili', 'Anthony Cioppa', 'Adrien Deliège', 'Floriane Magera', 'Olivier Barnich', 'Marc Van Droogenbroeck', 'Meisam J. Seikavandi', 'Jacob V. Dueholm', 'Kamal Nasrollahi', 'Thomas B. Moeslund', 'Juan León Alcázar', 'Long Mai', 'Federico Perazzi', 'Guocheng Qian', 'Abdulellah Abualshour', 'Itzel Carolina Delgadillo Perez', 'Bing Li', 'Chia-Wen Lin', 'Cheng Zheng', 'Shan Liu', 'Wen Gao', 'C.-C. Jay Kuo', 'Maria A. Bravo', 'Thomas Brox', 'Shibiao Xu', 'Juan-Manuel Perez-Rua', 'Brais Martinez', 'Xiatian Zhu', 'Li Zhang', 'Tao Xiang', 'Jialin Gao', 'Xin Sun', 'Xi Zhou', 'Hasan Abed Al Kader Hammoud', 'Mattia Soldan', 'Jesper Tegner', 'Naeemullah Khan', 'Philip H.S. Torr', 'Maksim Makarenko', 'Arturo Burguete-Lopez', 'Qizhou Wang', 'Fedor Getman', 'Andrea Fratalocchi', 'Kezhi Kong', 'Mucong Ding', 'Zuxuan Wu', 'Chen Zhu', 'Gavin Taylor', 'Tom Goldstein', 'Andrés Villa', 'Kumail Alhamoud', 'Xindi Wu', 'Anurag Kumar', 'Pablo Arbelaez', 'Miguel Martin', 'Yoichi Sato', 'Fiona Ryan', 'Eric Zhongcong Xu', 'Richard Newcombe', 'Takuma Yagi', 'Kiran Somasundaram', 'Ilija Radosavovic', 'Audrey Southerland', 'Xingyu Liu', 'Zachary Chavis', 'Jackson Hamburger', 'Akshay Erapalli', 'Ruijie Tao', 'Tullie Murrell', 'Dhruv Batra', 'Qichen Fu', 'Chao Li', 'Aude Oliva', 'Jianbo Shi', 'Christian Fuegen', 'Cristina Gonzalez', 'Federico Landini', 'James Hillis', 'Zhenqiang Li', 'Takumi Nishiyasu', 'Morrie Doulaty', 'Andrew Westbury', 'David Crandall', 'Yunyi Zhu', 'Yifei Huang', 'Haizhou Li', 'Hanbyul Joo', 'Jayant Sharma', 'Antonino Furnari', 'Kristen Grauman', 'Paola Ruiz Puentes', 'Kris Kitani', 'Wenqi Jia', 'Antonio Torralba', 'Sean Crane', 'Dima Damen', 'Vincent Cartillier', 'Adriano Fragomeni', 'Hyun Soo Park', 'Yanghao Li', 'Minh Vo', 'Miao Liu', 'C. V. Jawahar', 'Jonathan Munro', 'Santhosh Kumar Ramakrishnan', 'Jitendra Malik', 'Yusuke Sugano', 'Siddhant Bansal', 'Weslie Khoo', 'Christoph Feichtenhofer', 'Satwik Kottur', 'Michael Wray', 'Giovanni Maria Farinella', 'Rohit Girdhar', 'Leda Sari', 'Raghava Modhugu', 'Xuhua Huang', 'Mike Zheng Shou', 'Will Price', 'Tien Do', 'Tushar Nagarajan', 'Mingfei Yan', 'Abrham Gebreselasie', 'Ziwei Zhao', 'Karttikeya Mangalam', 'Jachym Kolar', 'Vamsi Krishna Ithapu', 'Hao Jiang', 'Eugene Byrne', 'Yuchen Wang', 'James M. Rehg', 'Anna Frühstück', 'Gabriel Pérez S.'],
'coauthor': ['', '*', '+', '×', '▪', '▴', '▾'],
'distinction': ['Short Paper', 'Spotlight', 'Oral', 'Best Paper Award'],
'link': ['Code', 'Data', 'Video', 'Poster', 'Slides', 'More', 'Website'],
'keys': ['handle', 'theme', 'year', 'venue', 'thumbnail', 'paper', 'title', 'authors', 'coauthors', 'distinctions', 'links', 'abstract'],
'lib': [
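// Each record below is positional and follows 'keys':
// [handle, theme, year, venue, thumbnail, paper, title, authors, coauthors, distinctions, links, abstract]
// Reading of the encoding (inferred from the data, not documented in this file):
// 'venue' indexes publib.venue, 'authors' indexes publib.author, 'coauthors'
// indexes publib.coauthor, 'distinctions' indexes publib.distinction, and
// 'year' appears to be a two-digit 20xx year. The 'links' array aligns
// position-wise with publib.link ('Code', 'Data', 'Video', 'Poster', 'Slides',
// 'More', 'Website'), with 0 or '' seemingly marking an absent link; 'theme'
// and the 'thumbnail'/'paper' identifiers are opaque ids whose meaning is not
// defined here.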
['10754/681726', 3, 22, 2, '1E-MJ-GeK5-1we_SUtc5gqtgHrP7OnVaH', '1a9-S1ljum6PSPHtni0iK_GXWSrY4C8jg', '3DeformRS: Certifying Spatial Deformations on Point Clouds', [249, 97, 94, 2, 0],
[1, 1, 1, 0, 0],
[],
['https://github.com/gaperezsa/3DeformRS', 0, 'https://motasemalfarra.netlify.app/publication/3deformrs/', '', '', '', ''], "3D computer vision models are commonly used in security-critical applications such as autonomous driving and surgical robotics. Emerging concerns over the robustness of these models against real-world deformations must be addressed practically and reliably. In this work, we propose 3DeformRS, a method to certify the robustness of point cloud Deep Neural Networks (DNNs) against real-world deformations. We developed 3DeformRS by building upon recent work that generalized Randomized Smoothing (RS) from pixel-intensity perturbations to vector-field deformations. In particular, we specialized RS to certify DNNs against parameterized deformations (e.g. rotation, twisting), while enjoying practical computational costs. We leverage the virtues of 3DeformRS to conduct a comprehensive empirical study on the certified robustness of four representative point cloud DNNs on two datasets and against seven different deformations. Compared to previous approaches for certifying point cloud DNNs, 3DeformRS is fast, scales well with point cloud size, and provides comparable-to-better certificates. For instance, when certifying a plain PointNet against a 3° z-rotation on 1024-point clouds, 3DeformRS grants a certificate 3x larger and 20x faster than previous work."
],
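// Hedged example of the decoding above: for this first record, venue index 2
// resolves to 'CVPR' and year 22 presumably to 2022; the author indices map
// into publib.author (index 0 is 'Bernard Ghanem'), and the filled link slots
// sit at the 'Code' and 'Video' positions of publib.link.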
['10754/677988', 3, 22, 6, '1jConO8a1tvjtsxh74liIOUgJHMLgGKwq', '1i6rXW5ATYbduIW05CLB9HVKsBCG0kJ27', 'On the Robustness of Quality Measures for GANs', [97, 94, 248, 154, 32, 0],
[0, 0, 0, 0, 0, 0],
[],
['https://github.com/MotasemAlfarra/R-FID-Robustness-of-Quality-Measures-for-GANs', 0, 'https://motasemalfarra.netlify.app/publication/on_the_robustness_of_quality_measures_for_gans/', '', '', '', ''], "This work evaluates the robustness of quality measures of generative models such as Inception Score (IS) and Fréchet Inception Distance (FID). Analogous to the vulnerability of deep models against a variety of adversarial attacks, we show that such metrics can also be manipulated by additive pixel perturbations. Our experiments indicate that one can generate a distribution of images with very high scores but low perceptual quality. Conversely, one can optimize for small imperceptible perturbations that, when added to real world images, deteriorate their scores. Furthermore, we extend our evaluation to generative models themselves, including the state of the art network StyleGANv2. We show the vulnerability of both the generative model and the FID against additive perturbations in the latent space. Finally, we show that the FID can be robustified by directly replacing the Inception model by a robustly trained Inception. We validate the effectiveness of the robustified metric through extensive experiments, which show that it is more robust against manipulation."
],
['10754/672925', 1, 22, 2, '1cSlSO_y3ZwAw0FUYTrcJ4xVy4o50xEHo', '1M8sSjYSZ4-jcHjvImiIx_Vp4O6y2Fh81', 'Ego4D: Around the World in 3,000 Hours of Egocentric Video', [206, 198, 245, 181, 205, 230, 182, 244, 218, 180, 171, 237, 178, 221, 173, 204, 228, 10, 174, 98, 224, 186, 213, 211, 236, 197, 183, 226, 214, 187, 239, 192, 194, 233, 201, 209, 225, 242, 227, 169, 193, 188, 216, 195, 241, 232, 220, 185, 196, 235, 207, 108, 231, 177, 179, 223, 184, 217, 246, 168, 176, 240, 200, 170, 199, 212, 229, 191, 0, 243, 219, 203, 208, 202, 175, 189, 215, 247, 172, 190, 234, 210, 105, 238, 222],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[2],
['https://ego4d-data.org/', 0, '', '', '', '', ''], "We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception."
],
['10754/677989', 1, 22, 2, '1Cn4J8OeIKk6CR1MmcJ54TNrQInMHDvCm', '1pd9OvVBjBNTV-GFCpbDUnhJ7mcaJ4G1t', 'vCLIMB: A Novel Video Class Incremental Learning Benchmark', [166, 167, 15, 14, 127, 0],
[0, 0, 0, 0, 0, 0],
[2],
['https://vclimb.netlify.app/', 0, '', '', '', '', ''], "Continual learning (CL) is under-explored in the video domain. The few existing works contain splits with imbalanced class distributions over the tasks, or study the problem in unsuitable datasets. We introduce vCLIMB, a novel video continual learning benchmark. vCLIMB is a standardized test-bed to analyze catastrophic forgetting of deep models in video continual learning. In contrast to previous work, we focus on class incremental continual learning with models trained on a sequence of disjoint tasks, and distribute the number of classes uniformly across the tasks. We perform in-depth evaluations of existing CL methods in vCLIMB, and observe two unique challenges in video data. First, the selection of instances to store in episodic memory is performed at the frame level. Second, untrimmed training data influences the effectiveness of frame sampling strategies. We address these two challenges by proposing a temporal consistency regularization that can be applied on top of memory-based continual learning methods. Our approach significantly improves the baseline, by up to 24% on the untrimmed continual learning task."
],
['10754/665805', 2, 22, 2, '1v8zXOMWE0yPZYEUhER-UwF5yvj5bhxRx', '12y7PvVIY9d6MPxgQ54dpA_-UbV40AjJ8', 'FLAG: Adversarial Data Augmentation for Graph Neural Networks', [160, 4, 161, 162, 163, 0, 164, 165],
[0, 0, 0, 0, 0, 0, 0, 0],
[],
['https://github.com/devnkong/FLAG', 0, '', '', '', '', ''], "Data augmentation helps neural networks generalize better, but it remains an open question how to effectively augment graph data to enhance the performance of GNNs (Graph Neural Networks). While most existing graph regularizers focus on augmenting graph topological structures by adding/removing edges, we offer a novel direction to augment in the input node feature space for better performance. We propose a simple but effective solution, FLAG (Free Large-scale Adversarial Augmentation on Graphs), which iteratively augments node features with gradient-based adversarial perturbations during training, and boosts performance at test time. Empirically, FLAG can be easily implemented with a dozen lines of code and is flexible enough to function with any GNN backbone, on a wide variety of large-scale datasets, and in both transductive and inductive settings. Without modifying a model's architecture or training setup, FLAG yields a consistent and salient performance boost across both node and graph classification tasks. Using FLAG, we reach state-of-the-art performance on the large-scale ogbg-molpcba, ogbg-ppa, and ogbg-code datasets."
],
['10754/676851', 2, 22, 2, '1S2Zf03jjw7R9teDSKX0Imz3xHMTEwPfZ', '1fGmKhYkyRykLU12fcbQ_KkZKQ1VDq5Kl', 'Real-time Hyperspectral Imaging in Hardware via Trained Metasurface Encoders', [155, 156, 157, 158, 2, 0, 159],
[0, 0, 0, 0, 0, 0, 0],
[],
['https://github.com/makamoa/hyplex', 0, '', '', '', '', ''], "Hyperspectral imaging has attracted significant attention to identify spectral signatures for image classification and automated pattern recognition in computer vision. State-of-the-art implementations of snapshot hyperspectral imaging rely on bulky, non-integrated, and expensive optical elements, including lenses, spectrometers, and filters. These macroscopic components do not allow fast data processing for, e.g., real-time and high-resolution videos. This work introduces Hyplex, a new integrated architecture addressing the limitations discussed above. Hyplex is a CMOS-compatible, fast hyperspectral camera that replaces bulk optics with nanoscale metasurfaces inversely designed through artificial intelligence. Hyplex does not require spectrometers but makes use of conventional monochrome cameras, opening up the possibility for real-time and high-resolution hyperspectral imaging at inexpensive costs. Hyplex exploits a model-driven optimization, which connects the physical metasurfaces layer with modern visual computing approaches based on end-to-end training. We design and implement a prototype version of Hyplex and compare its performance against the state-of-the-art for typical imaging tasks such as spectral reconstruction and semantic segmentation. In all benchmarks, Hyplex reports the smallest reconstruction error. We additionally present what is, to the best of our knowledge, the largest publicly available labeled hyperspectral dataset for semantic segmentation."
],
['10754/673930', 1, 22, 2, '1OWVtUM3YYXq2e3pBLszwFpUcIQP0FvgX', '152AtWrr6FxX6m1sCPhQ5A3o5-m6zPsy1', 'MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions', [151, 99, 127, 14, 98, 2, 0],
[0, 0, 0, 0, 0, 0, 0],
[],
['https://github.com/Soldelli/MAD', 0, 'https://webcast.kaust.edu.sa/Mediasite/Showcase/default/Presentation/88b31db635ba445bb5fcbf43f7d5136f1d', '', '', '', ''], "The recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques. In comparison, limited effort has been made at assessing the fitness of these datasets for the video-language grounding task. Recent works have begun to discover significant limitations in these datasets, suggesting that state-of-the-art techniques commonly overfit to hidden dataset biases. In this work, we present MAD (Movie Audio Descriptions), a novel benchmark that departs from the paradigm of augmenting existing video datasets with text annotations and focuses on crawling and aligning available audio descriptions of mainstream movies. MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of videos and exhibits a significant reduction in the currently diagnosed biases for video-language grounding datasets. MAD's collection strategy enables a novel and more challenging version of video-language grounding, where short temporal moments (typically seconds long) must be accurately grounded in diverse long-form videos that can last up to three hours. "
],
['10754/669602', 2, 22, 9, '15ZWp1RW4Dw50noa0XzxPTFa0xN5KOx60', '1nvBpTdZb0uY-vcY5YV6PJEpIfMrkgcPu', 'SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation', [133, 135, 2, 0],
[0, 0, 0, 0],
[],
['', 0, '', '', '', '', ''], "We propose a novel scene flow estimation approach to capture and infer 3D motions from point clouds. Estimating 3D motions for point clouds is challenging, since a point cloud is unordered and its density is significantly non-uniform. Such unstructured data poses difficulties in matching corresponding points between point clouds, leading to inaccurate flow estimation. We propose a novel architecture named Sparse Convolution-Transformer Network (SCTN) that equips the sparse convolution with the transformer. Specifically, by leveraging the sparse convolution, SCTN transfers irregular point cloud into locally consistent flow features for estimating continuous and consistent motions within an object/local object part. We further propose to explicitly learn point relations using a point transformer module, different from exiting methods. We show that the learned relation-based contextual information is rich and helpful for matching corresponding points, benefiting scene flow estimation. In addition, a novel loss function is proposed to adaptively encourage flow consistency according to feature similarity. Extensive experiments demonstrate that our proposed approach achieves a new state of the art in scene flow estimation. Our approach achieves an error of 0.038 and 0.037 (EPE3D) on FlyingThings3D and KITTI Scene Flow respectively, which significantly outperforms previous methods by large margins."
],
['10754/668407', 3, 22, 9, '11dUs6FOGlMkylgdwlH6K_HiBDSzDhza3', '1QMaReBuZvmkbXNn2s-KEFGqRclW1LzZp', 'Combating Adversaries with Anti-Adversaries', [97, 94, 1, 3, 154, 0],
[0, 0, 0, 0, 0, 0],
[],
['https://github.com/MotasemAlfarra/Combating-Adversaries-with-Anti-Adversaries', 0, '', '', '', '', ''], "Deep neural networks are vulnerable to small input perturbations known as adversarial attacks. Inspired by the fact that these adversaries are constructed by iteratively minimizing the confidence of a network for the true class label, we propose the anti-adversary layer, aimed at countering this effect. In particular, our layer generates an input perturbation in the opposite direction of the adversarial one and feeds the classifier a perturbed version of the input. Our approach is training-free and theoretically supported. We verify the effectiveness of our approach by combining our layer with both nominally and robustly trained models and conduct large-scale experiments from black-box to adaptive attacks on CIFAR10, CIFAR100, and ImageNet. Our layer significantly enhances model robustness while coming at no cost on clean accuracy."
],
['10754/670197', 3, 22, 9, '1Jb_D1ObFynJJzQpsLT2G1HINTyNQRKvD', '1N_O-h-98xJKItmNsptfs4NDwcItxk9FU', 'DeformRS: Certifying Input Deformations with Randomized Smoothing', [97, 3, 153, 154, 0],
[1, 1, 0, 0, 0],
[2],
['https://github.com/MotasemAlfarra/DeformRS', 0, '', '', '', '', ''], "Deep neural networks are vulnerable to input deformations in the form of vector fields of pixel displacements and to other parameterized geometric deformations e.g. translations, rotations, etc. Current input deformation certification methods either (i) do not scale to deep networks on large input datasets, or (ii) can only certify a specific class of deformations, e.g. only rotations. We reformulate certification in randomized smoothing setting for both general vector field and parameterized deformations and propose DeformRS-VF and DeformRS-PAR, respectively. Our new formulation scales to large networks on large input datasets. For instance, DeformRS-PAR certifies rich deformations, covering translations, rotations, scaling, affine deformations, and other visually aligned deformations such as ones parameterized by Discrete-Cosine-Transform basis. Extensive experiments on MNIST, CIFAR10, and ImageNet show competitive performance of DeformRS-PAR achieving a certified accuracy of 39% against perturbed rotations in the set [−10, 10] on ImageNet."
],
['10754/666190', 3, 21, 12, '14KyD2ftMe6Zijhc8_68o7YkcR4skfzZK', '1J5Zod8SN-IbCsH5gDz3FUU5W8RoBgdpJ', 'Rethinking Clustering for Robustness', [94, 97, 3, 1, 96, 0],
[1, 1, 0, 0, 0, 0],
[],
['https://github.com/clustr-official-account/Rethinking-Clustering-for-Robustness', 0, '', '', '', '', ''], "This paper studies how encouraging semantically-aligned features during deep neural network training can increase network robustness. Recent works observed that Adversarial Training leads to robust models, whose learnt features appear to correlate with human perception. Inspired by this connection from robustness to semantics, we study the complementary connection: from semantics to robustness. To do so, we provide a robustness certificate for distance-based classification models (clustering-based classifiers). Moreover, we show that this certificate is tight, and we leverage it to propose ClusTR (Clustering Training for Robustness), a clustering-based and adversary-free training framework to learn robust models. Interestingly, ClusTR outperforms adversarially-trained networks by up to 4% under strong PGD attacks."
],
['10754/666178', 2, 21, 4, '1mZz2bEjXcxj-6cBoV7q6TJU0CFgSU3cZ', '1NbMpJIU10N3J45Jj-0yxVb0ndIHzfG9y', 'MVTN: Multi-View Transformation Network for 3D Shape Recognition', [92, 2, 0],
[0, 0, 0],
[],
['https://github.com/ajhamdi/MVTN', 0, 'https://youtu.be/1zaHx8ztlhk', '', '', '', 'https://abdullahamdi.com/publication/mvtn-iccv/'], "Multi-view projection methods have demonstrated their ability to reach state-of-the-art performance on 3D shape recognition. Those methods learn different ways to aggregate information from multiple views. However, the camera view-points for those views tend to be heuristically set and fixed for all shapes. To circumvent the lack of dynamism of current multi-view methods, we propose to learn those view-points. In particular, we introduce the Multi-View Transformation Network (MVTN) that regresses optimal view-points for 3D shape recognition, building upon advances in differentiable rendering. As a result, MVTN can be trained end-to-end along with any multi-view network for 3D shape classification. We integrate MVTN in a novel adaptive multi-view pipeline that can render either 3D meshes or point clouds. MVTN exhibits clear performance gains in the tasks of 3D shape classification and 3D shape retrieval without the need for extra training supervision. In these tasks, MVTN achieves state-of-the-art performance on ModelNet40, ShapeNet Core55, and the most recent and realistic ScanObjectNN dataset (up to 6% improvement). Interestingly, we also show that MVTN can provide network robustness against rotation and occlusion in the 3D domain."
],
['10754/666102', 1, 21, 5, '11XQ0-GagbNZXPRVu8-1o4jgDh_Ni0kYI', '1zhH_oIw1JyNHLY5BrVjTSC7ESrW3-xrb', 'VLG-Net: Video-language graph matching network for video grounding', [151, 10, 31, 152, 0],
[1, 1, 1, 0, 0],
[],
['https://github.com/Soldelli/VLG-Net', 0, 'https://youtu.be/7aYgLALc_24?t=24750', '', '', '', ''], "Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query. The solution to this challenging task demands understanding videos' and queries' semantic content and the fine-grained reasoning about their multi-modal interactions. Our key idea is to recast this challenge into an algorithmic graph matching problem. Fueled by recent advances in Graph Neural Networks, we propose to leverage Graph Convolutional Networks to model video and textual information as well as their semantic alignment. To enable the mutual exchange of information across the modalities, we design a novel Video-Language Graph Matching Network (VLG-Net) to match video and query graphs. Core ingredients include representation graphs built atop video snippets and query tokens separately and used to model intra-modality relationships. A Graph Matching layer is adopted for cross-modal context modeling and multi-modal fusion. Finally, moment candidates are created using masked moment attention pooling by fusing the moment's enriched snippet features. We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets for temporal localization of moments in videos with language queries: ActivityNet-Captions, TACoS, and DiDeMo."
],
['10754/670594', 1, 21, 4, '1y0hM665ltzfXK3cCQtvcU0lRmceepQuZ', '1pPzUIrBCHMMAnd_rM892eVhdZXjDI8uk', 'Learning to Cut by Watching Movies', [99, 14, 127, 1, 0],
[0, 0, 0, 0, 0],
[],
['https://github.com/PardoAlejo/LearningToCut', 0, 'https://www.youtube.com/watch?v=0-FA2G8eGr0', '', '', '', 'https://www.alejandropardo.net/publication/learning-to-cut/'], "Video content creation keeps growing at an incredible pace; yet, creating engaging stories remains challenging and requires non-trivial video editing expertise. Many video editing components are astonishingly hard to automate primarily due to the lack of raw video materials. This paper focuses on a new task for computational video editing, namely the task of ranking cut plausibility. Our key idea is to leverage content that has already been edited to learn fine-grained audiovisual patterns that trigger cuts. To do this, we first collected a data source of more than 10K videos, from which we extract more than 260K cuts. We devise a model that learns to discriminate between real and artificial cuts via contrastive learning. We set up a new task and a set of baselines to benchmark video cut generation. We observe that our proposed model outperforms the baselines by large margins. To demonstrate our model in real-world applications, we conduct human studies in a collection of unedited videos. The results show that our model does a better job at cutting than random and alternative baselines."
],
['10754/670906', 1, 21, 3, '1MdqlD7CUn5hMJVqzNlwgiNNHch3Xh12J', '14D4pOKMniqMWli391fi1NTok_D_JUa4C', 'BAOD: Budget-Aware Object Detection', [99, 10, 1, 96, 0],
[1, 1, 0, 0, 0],
[3],
['', 0, 'https://www.youtube.com/watch?v=khJOvrZhL14', '', '', '', 'https://www.alejandropardo.net/publication/baod/'], "We study the problem of object detection from a novel perspective in which annotation budget constraints are taken into consideration, appropriately coined Budget Aware Object Detection (BAOD). When provided with a fixed budget, we propose a strategy for building a diverse and informative dataset that can be used to optimally train a robust detector. We investigate both optimization and learning-based methods to sample which images to annotate and what type of annotation (strongly or weakly supervised) to annotate them with. We adopt a hybrid supervised learning framework to train the object detector from both these types of annotation. We conduct a comprehensive empirical study showing that a handcrafted optimization method outperforms other selection techniques including random sampling, uncertainty sampling and active learning. By combining an optimal image/annotation selection scheme with hybrid supervised learning to solve the BAOD problem, we show that one can achieve the performance of a strongly supervised detector on PASCAL-VOC 2007 while saving 12.8% of its original annotation budget. Furthermore, when 100% of the budget is used, it surpasses this performance by 2.0 mAP percentage points."
],
['10754/660731', 2, 20, 2, '1y8VPQ2nlrHeY9JtjIYkUTDgBB1rULMVX', '1aDvx5ro3g0kAYoEjxXjjV183NYBgjSbl', 'SGAS: Sequential Greedy Architecture Search', [4, 130, 132, 21, 1, 0],
[1, 1, 1, 0, 0, 0],
[],
['https://github.com/lightaime/sgas', 0, 'https://www.youtube.com/watch?v=I2ILmGJwO38&ab_channel=GuohaoLi', '', '', '', 'https://www.deepgcns.org/auto/sgas'], "Architecture design has become a crucial component of successful deep learning. Recent progress in automatic neural architecture search (NAS) shows a lot of promise. However, discovered architectures often fail to generalize in the final evaluation. Architectures with a higher validation accuracy during the search phase may perform worse in the evaluation. Aiming to alleviate this common issue, we introduce sequential greedy architecture search (SGAS), an efficient method for neural architecture search. By dividing the search procedure into sub-problems, SGAS chooses and prunes candidate operations in a greedy fashion. We apply SGAS to search architectures for Convolutional Neural Networks (CNN) and Graph Convolutional Networks (GCN). Extensive experiments show that SGAS is able to find state-of-the-art architectures for tasks such as image classification, point cloud classification and node classification in protein-protein interaction graphs with minimal computational cost."
],
['', 2, 21, 25, '1n7X7GhClGrMEwP3daRjTj5UbfUQgUqS9', '1NKZNb2_scA0b2WW_4I6ZXHW1k_7F5aj-', 'Anisotropic Separable Set Abstraction for Efficient Point Cloud Representation Learning', [130, 150, 4, 1, 0],
[0, 0, 0, 0, 0],
[1],
['https://github.com/guochengqian/ASSANet', 0, '', '', '', '', ''], "Access to 3D point cloud representations has been widely facilitated by LiDAR sensors embedded in various mobile devices. This has led to an emerging need for fast and accurate point cloud processing techniques. In this paper, we revisit and dive deeper into PointNet++, one of the most influential yet under-explored networks, and develop faster and more accurate variants of the model. We first present a novel Separable Set Abstraction (SA) module that disentangles the vanilla SA module used in PointNet++ into two separate learning stages: (1) learning channel correlation and (2) learning spatial correlation. The Separable SA module is significantly faster than the vanilla version, yet it achieves comparable performance. We then introduce a new Anisotropic Reduction function into our Separable SA module and propose an Anisotropic Separable SA (ASSA) module that substantially increases the network’s accuracy. We later replace the vanilla SA modules in PointNet++ with the proposed ASSA modules, and denote the modified network as ASSANet. Extensive experiments on point cloud classification, semantic segmentation, and part segmentation show that ASSANet outperforms PointNet++ and other methods, achieving much higher accuracy and faster speeds. In particular, ASSANet outperforms PointNet++ by 7.4 mIoU on S3DIS Area 5, while maintaining 1.6× faster inference speed on a single NVIDIA 2080Ti GPU. Our scaled ASSANet variant achieves 66.8 mIoU and outperforms KPConv, while being more than 54× faster."
],
['10754/664502', 2, 20, 3, '16qwEsW6Gjk2bPcMqCmJhUvGJbtEzlk9W', '1AMWkb28d-W-TBUa0xMNoe2ypzh2WKAyq', 'Self-Supervised Learning of Local Features in 3D Point Clouds', [1, 6, 0],
[1, 1, 0],
[],
['https://github.com/alitabet/morton-net', 0, '', '', '', '', 'http://www.humamalwassel.com/publication/mortonnet/'], "We present a self-supervised task on point clouds, in order to learn meaningful point-wise features that encode local structure around each point. Our self-supervised network operates directly on unstructured/unordered point clouds. Using a multi-layer RNN, our architecture predicts the next point in a point sequence created by a popular and fast Space Filling Curve, the Morton-order curve. The final RNN state (coined Morton feature) is versatile and can be used in generic 3D tasks on point clouds. Our experiments show how our self-supervised task results in features that are useful for 3D segmentation tasks, and generalize well between datasets. We show how Morton features can be used to significantly improve performance (+3% for 2 popular algorithms) in semantic segmentation of point clouds on the challenging and large-scale S3DIS dataset. We also show how our self-supervised network pretrained on S3DIS transfers well to another large-scale dataset, vKITTI, leading to 11% improvement."
],
['10754/672843', 1, 21, 31, '1gvL5L_2sSJVlpFtvhKKX8-KJ1uTtRnZN', '1k1kr2i3Mb6aQIZmRI8yezBWfwmNhUNA7', 'Relation-aware Video Reading Comprehension for Temporal Language Grounding', [147, 148, 10, 149, 0],
[1, 1, 0, 0, 0],
[],
['', 0, '', '', '', '', ''], "Temporal language grounding in videos aims to localize the temporal span relevant to the given query sentence. Previous methods treat it either as a boundary regression task or a span extraction task. This paper will formulate temporal language grounding into video reading comprehension and propose a Relation-aware Network (RaNet) to address it. This framework aims to select a video moment choice from the predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor is proposed to match the visual and textual information simultaneously in sentence-moment and token-moment levels, leading to a coarse-and-fine cross-modalinteraction. Moreover, a novel multi-choicerelation constructor is introduced by leverag-ing graph convolution to capture the dependen-cies among video moment choices for the bestchoice selection. Extensive experiments onActivityNet-Captions, TACoS, and Charades-STA demonstrate the effectiveness of our solution. Codes will be released soon"
],
['10754/668420', 1, 21, 25, '16H7LbNz3cuJFsDwO5721HawhyrWW19-X', '13SsB7PQnL-SLjrjZs8YasNjD_d0r7ziL', 'Low-Fidelity End-to-End Video Encoder Pre-training for Temporal Action Localization', [10, 142, 144, 0, 143],
[0, 0, 0, 0, 0],
[],
['', 0, '', '', '', '', ''], "Temporal action localization (TAL) is a fundamental yet challenging task in video understanding. Existing TAL methods rely on pre-training a video encoder through action classification supervision. This results in a task discrepancy problem for the video encoder – trained for action classification, but used for TAL. Intuitively, end-to-end model optimization is a good solution. However, this is not operable for TAL subject to the GPU memory constraints, due to the prohibitive computational cost in processing long untrimmed videos. In this paper, we resolve this challenge by introducing a novel low-fidelity end-to-end (LoFi) video encoder pre-training method. Instead of always using the full training configurations for TAL learning, we propose to reduce the mini-batch composition in terms of temporal, spatial or spatio-temporal resolution so that end-to-end optimization for the video encoder becomes operable under the memory conditions of a mid-range hardware budget. Crucially, this enables the gradient to flow backward through the video encoder from a TAL loss supervision, favourably solving the task discrepancy problem and providing more effective feature representations. Extensive experiments show that the proposed LoFi pre-training approach can significantly enhance the performance of existing TAL methods. Encouragingly, even with a lightweight ResNet18 based video encoder in a single RGB stream, our method surpasses two-stream ResNet50 based alternatives with expensive optical flow, often by a good margin."
],
['10754/666111', 1, 21, 4, '1sntqZ6oVB0cFdmnGCjdOT9v2Pqr9mkqu', '1XJtIhSYuas1h0CBgflfiXUaWq-wzpRHi', 'Boundary-sensitive Pre-training for Temporal Localization in Videos', [10, 142, 15, 143, 144, 145, 0, 146],
[0, 0, 0, 0, 0, 0, 0, 0],
[],
['http://github.com/frostinassiky/bsp', 0, '', '', '', '', 'https://frostinassiky.github.io/bsp/'], "Many video analysis tasks require temporal localization thus detection of content changes. However, most existing models developed for these tasks are pre-trained on general video action classification tasks. This is because large scale annotation of temporal boundaries in untrimmed videos is expensive. Therefore no suitable datasets exist for temporal boundary-sensitive pre-training. In this paper for the first time, we investigate model pre-training for temporal localization by introducing a novel boundary-sensitive pretext (BSP) task. Instead of relying on costly manual annotations of temporal boundaries, we propose to synthesize temporal boundaries in existing video action classification datasets. With the synthesized boundaries, BSP can be simply conducted via classifying the boundary types. This enables the learning of video representations that are much more transferable to downstream temporal localization tasks. Extensive experiments show that the proposed BSP is superior and complementary to the existing action classification based pre-training counterpart, and achieves new state-of-the-art performance on several temporal localization tasks."
],
['10754/666126', 1, 21, 5, '1WrihTVD-32IGQ5R7s5W9HnCIIdbdb6Dr', '1L6-S2T4DzwlnIZZt5RZ09kXae6s8l0yb', 'TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks', [6, 2, 0],
[0, 0, 0],
[1],
['https://github.com/HumamAlwassel/TSP', 0, 'https://www.youtube.com/watch?v=oLcL3vY_UFM&ab_channel=AIforCreativeVideoEditingandUnderstanding', '', '17ki2XyvTdTKZUbRZFXCUsHH8g4TW6us-', 'https://openaccess.thecvf.com/content/ICCV2021W/CVEU/supplemental/Alwassel_TSP_Temporally-Sensitive_Pretraining_ICCVW_2021_supplemental.pdf', 'http://www.humamalwassel.com/publication/tsp/'], "Due to the large memory footprint of untrimmed videos, current state-of-the-art video localization methods operate atop precomputed video clip features. These features are extracted from video encoders typically trained for trimmed action classification tasks, making such features not necessarily suitable for temporal localization. In this work, we propose a novel supervised pretraining paradigm for clip features that not only trains to classify activities but also considers background clips and global video information to improve temporal sensitivity. Extensive experiments show that using features trained with our novel pretraining strategy significantly improves the performance of recent state-of-the-art methods on three tasks: Temporal Action Localization, Action Proposal Generation, and Dense Video Captioning. We also show that our pretraining approach is effective across three encoder architectures and two pretraining datasets. We believe video feature encoding is an important building block for localization algorithms, and extracting temporally-sensitive features should be of paramount importance in building more accurate models. The code and pretrained models are available on our project website."
],
['10754/669611', 3, 21, 30, '1LuDN-i83ez1M1ta57mYW7X2ENSVhTUBh', '1LD6LuPM44yew1qsWoQL6R82LFbRIsb7e', 'Training Graph Neural Networks with 1000 Layers', [4, 21, 0, 78],
[0, 0, 0, 0],
[],
['https://github.com/lightaime/deep_gcns_torch/tree/master/examples/ogb_eff/', 0, '', '', '', '', 'https://www.deepgcns.org/arch/gnn1000'], "Deep graph neural networks (GNNs) have achieved excellent results on various tasks on increasingly large graph datasets with millions of nodes and edges. However, memory complexity has become a major obstacle when training deep GNNs for practical applications due to the immense number of nodes, edges, and intermediate activations. To improve the scalability of GNNs, prior works propose smart graph sampling or partitioning strategies to train GNNs with a smaller set of nodes or sub-graphs. In this work, we study reversible connections, group convolutions, weight tying, and equilibrium models to advance the memory and parameter efficiency of GNNs. We find that reversible connections in combination with deep network architectures enable the training of overparameterized GNNs that significantly outperform existing methods on multiple datasets. Our models RevGNN-Deep (1001 layers with 80 channels each) and RevGNN-Wide (448 layers with 224 channels each) were both trained on a single commodity GPU and achieve an ROC-AUC of 87.74±0.13 and 88.24±0.15 on the ogbn-proteins dataset. To the best of our knowledge, RevGNN-Deep is the deepest GNN in the literature by one order of magnitude."
],
['10754/667854', 0, 20, 29, '1MEr2s55q-RwAuYtIiXsbowL5uQZVwdzo', '1YtukkUsWgxvSV85xHsJpyfrCjnJZqSLq', 'KGSNet: Key-Point-Guided Super-Resolution Network for Pedestrian Detection in the Wild', [22, 20, 64, 141, 0],
[0, 0, 0, 0, 0],
[],
['', 0, '', '', '', '', ''], "In real-world scenarios (i.e., in the wild), pedestrians are often far from the camera (i.e., small scale), and they often gather together and occlude with each other (i.e., heavily occluded). However, detecting these small-scale and heavily occluded pedestrians remains a challenging problem for the existing pedestrian detection methods. We argue that these problems arise because of two factors: 1) insufficient resolution of feature maps for handling small-scale pedestrians and 2) lack of an effective strategy for extracting body part information that can directly deal with occlusion. To solve the above-mentioned problems, in this article, we propose a key-point-guided super-resolution network (coined KGSNet) for detecting these small-scale and heavily occluded pedestrians in the wild. Specifically, to address factor 1), a super-resolution network is first trained to generate a clear super-resolution pedestrian image from a small-scale one. In the super-resolution network, we exploit key points of the human body to guide the super-resolution network to recover fine details of the human body region for easier pedestrian detection. To address factor 2), a part estimation module is proposed to encode the semantic information of different human body parts where four semantic body parts (i.e., head and upper/middle/bottom body) are extracted based on the key points. Finally, based on the generated clear super-resolved pedestrian patches padded with the extracted semantic body part images at the image level, a classification network is trained to further distinguish pedestrians/backgrounds from the inputted proposal regions. Both proposed networks (i.e., super-resolution network and classification network) are optimized in an alternating manner and trained in an end-to-end fashion. Extensive experiments on the challenging CityPersons data set demonstrate the effectiveness of the proposed method, which achieves superior performance over previous state-of-the-art methods, especially for those small-scale and heavily occluded instances. Beyond this, we also achieve state-of-the-art performance (i.e., 3.89% MR⁻² on the reasonable subset) on the Caltech data set."
],
['10754/660665', 1, 21, 14, '1xlFM70o7hDccKqnN8hJeApv_81pHj4La', '1BaJCAW8s-9YweTCBqUkdFJCyXhwEjVUE', 'MAIN: Multi-Attention Instance Network for Video Segmentation', [127, 139, 95, 1, 140, 96, 0],
[0, 0, 0, 0, 0, 0, 0],
[],
['', 0, '', '', '', '', ''], "Instance-level video segmentation requires a solid integration of spatial and temporal information. However, current methods rely mostly on domain-specific information (online learning) to produce accurate instance-level segmentations. We propose a novel approach that relies exclusively on the integration of generic spatio-temporal attention cues. Our strategy, named Multi-Attention Instance Network (MAIN), overcomes challenging segmentation scenarios over arbitrary videos without modelling sequence- or instance-specific knowledge. We design MAIN to segment multiple instances in a single forward pass, and optimize it with a novel loss function that favors class agnostic predictions and assigns instance-specific penalties. We achieve state-of-the-art performance on the challenging Youtube-VOS dataset and benchmark, improving the unseen Jaccard and F-Metric by 6.8% and 12.7% respectively, while operating at real-time (30.3 FPS)."
],
['', 2, 21, 28, '1T3uzKFRtRTiUfYt4uTsp0va44ro9EsSG', '1ONgbXp-GoMby7TQB_Cz-xpZyYQnnyPpz', 'Shape-Preserving Stereo Object Remapping via Object-Consistent Grid Warping', [133, 134, 135, 136, 0, 137, 138],
[0, 0, 0, 0, 0, 0, 0],
[],
['', 0, '', '', '', '', ''], "Viewing various stereo images under different viewing conditions has escalated the need for effective object-level remapping techniques. In this paper, we propose a new object spatial mapping scheme, which adjusts the depth and size of the selected object to match user preference and viewing conditions. Existing warping-based methods often distort the shape of important objects or cannot faithfully adjust the depth/size of the selected object due to improper warping such as local rotations. In this paper, by explicitly reducing the transformation freedom degree of warping, we propose a optimization model based on axis-aligned warping for object spatial remapping. The proposed axis-aligned warping based optimization model can simultaneously adjust the depths and sizes of selected objects to their target values without introducing severe shape distortions. Moreover, we propose object consistency constraints to ensure the size/shape of parts inside a selected object to be consistently adjusted. Such constraints improve the size/shape adjustment performance while remaining robust to some extent to incomplete object extraction. Experimental results demonstrate that the proposed method achieves high flexibility and effectiveness in adjusting the size and depth of objects compared with existing methods."
],
['10754/660654', 3, 21, 0, '1Si7jIKy8icB18DnG-Vi6KK-RPPqZdhin', '1jJVTRFklVkwFZIoA-z4lPvgmD-zZws_K', 'DeepGCNs: Making GCNs Go as Deep as CNNs', [4, 21, 130, 132, 131, 1, 0],
[1, 1, 1, 0, 0, 0, 0],
[],
['https://github.com/lightaime/deep_gcns_torch', 0, 'https://www.youtube.com/watch?v=ffI5rsJjVkE', '', '', '', 'https://www.deepgcns.org/'], "Convolutional Neural Networks have been very successful at solving a variety of computer vision tasks such as object classification and detection, semantic segmentation, activity understanding, to name just a few. One key enabling factor for their great performance has been the ability to train very deep networks. Despite their huge success in many tasks, CNNs do not work well with non-Euclidean data, which is prevalent in many real-world applications. Graph Convolutional Networks offer an alternative that allows for non-Euclidean data input to a neural network. While GCNs already achieve encouraging results, they are currently limited to architectures with a relatively small number of layers, primarily due to vanishing gradients during training. This work transfers concepts such as residual/dense connections and dilated convolutions from CNNs to GCNs in order to successfully train very deep GCNs. We show the benefit of using deep GCNs experimentally across various datasets and tasks. Specifically, we achieve promising performance in part segmentation and semantic segmentation on point clouds and in node classification of protein functions across biological protein-protein-interaction graphs. We believe that the insights in this work will open avenues for future research on GCNs and their application to further tasks not explored in this paper."
],
['10754/660655', 3, 21, 27, '126iSfsv0aP21Zx3IMApBiy7UHfhS1zaG', '1FCuZKvYq5A5-A2Yfj7IPmON9k_x7N-AO', 'Expected Tight Bounds for Robust Deep Neural Network Training', [11, 3, 9, 92, 0],
[1, 1, 1, 0, 0],
[],
['https://github.com/ModarTensai/ptb', 0, '', 'https://drive.google.com/file/d/14QVnmcHkjdg9dIdU9lML1Bwmwsp0vB4A/view?usp=sharing', '', '', ''], "Training Deep Neural Networks (DNNs) that are robust to norm bounded adversarial attacks remains an elusive problem. While verification based methods are generally too expensive to robustly train large networks, it was demonstrated in Gowal et al that bounded input intervals can be inexpensively propagated per layer through large networks. This interval bound propagation (IBP) approach led to high robustness and was the first to be employed on large networks. However, due to the very loose nature of the IBP bounds, particularly for large networks, the required training procedure is complex and involved. In this paper, we closely examine the bounds of a block of layers composed of an affine layer followed by a ReLU nonlinearity followed by another affine layer. To this end, we propose expected bounds, true bounds in expectation, that are provably tighter than IBP bounds in expectation. We then extend this result to deeper networks through blockwise propagation and show that we can achieve orders of magnitudes tighter bounds compared to IBP. With such tight bounds, we demonstrate that a simple standard training procedure can achieve the best robustness-accuracy trade-off across several architectures on both MNIST and CIFAR10."
],
['10754/660725', 2, 21, 2, '1HqNkSeWGEdLZLcYXsfxQzAJ-T20lPu1N', '1s6T629L3N6GpTCORe8b3b47QRhQk0M-4', 'PU-GCN: Point Cloud Upsampling using Graph Convolutional Networks', [130, 131, 4, 1, 0],
[1, 1, 0, 0, 0],
[],
['https://github.com/guochengqian/PU-GCN', 0, '', '', '', '', 'https://sites.google.com/kaust.edu.sa/pugcn'], "Upsampling sparse, noisy, and non-uniform point clouds is a challenging task. In this paper, we propose 3 novel point upsampling modules: Multi-branch GCN, Clone GCN, and NodeShuffle. Our modules use Graph Convolutional Networks (GCNs) to better encode local point information. Our upsampling modules are versatile and can be incorporated into any point cloud upsampling pipeline. We show how our 3 modules consistently improve state-of-the-art methods in all point upsampling metrics. We also propose a new multi-scale point feature extractor, called Inception DenseGCN. We modify current Inception GCN algorithms by introducing DenseGCN blocks. By aggregating data at multiple scales, our new feature extractor is more resilient to density changes along point cloud surfaces. We combine Inception DenseGCN with one of our upsampling modules (NodeShuffle) into a new point upsampling pipeline: PU-GCN. We show both qualitatively and quantitatively the advantages of PU-GCN against the state-of-the-art in terms of fine-grained upsampling quality and point cloud uniformity. The website and source code of this work is available at https://sites.google.com/kaust.edu.sa/pugcn and https://github.com/guochengqian/PU-GCN respectively."
],
['10754/669426', 1, 21, 3, '1E6UvZ5pt5ZAhpxBtrO-wdlhSt7l0Ll3m', '11gwxxeSATAWQWSPmzznO06KdSZQdceol', 'APES: Audiovisual Person Search in Untrimmed Video', [127, 128, 129, 54, 96, 0, 14],
[0, 0, 0, 0, 0, 0, 0],
[],
['https://github.com/fuankarion/audiovisual-person-search', 0, '', '', '', '', ''], "Humans are arguably one of the most important subjects in video streams, many real-world applications such as video summarization or video editing workflows often require the automatic search and retrieval of a person of interest. Despite tremendous efforts in the person reidentification and retrieval domains, few works have developed audiovisual search strategies. In this paper, we present the Audiovisual Person Search dataset (APES), a new dataset composed of untrimmed videos whose audio (voices) and visual (faces) streams are densely annotated. APES contains over 1.9K identities labeled along 36 hours of video, making it the largest dataset available for untrimmed audiovisual person search. A key property of APES is that it includes dense temporal annotations that link faces to speech segments of the same identity. To showcase the potential of our new dataset, we propose an audiovisual baseline and benchmark for person retrieval. Our study shows that modeling audiovisual cues benefits the recognition of people's identities. To enable reproducibility and promote future research, the dataset annotations and baseline code are available at: https://github.com/fuankarion/audiovisual-person-search."
],
['10754/666179', 1, 21, 3, '1Ktdjmkha_rTH8KQSS9urgQYi5c4KCymr', '1XxNnR7IbeOlU8T3gp1kIrrGXV8hfM66n', 'SoccerNet-v2 : A Dataset and Benchmarks for Holistic Understanding of Broadcast Soccer Videos', [119, 118, 2, 123, 124, 125, 0, 126, 122],
[1, 1, 1, 1, 1, 0, 0, 0, 0],
[],
['https://github.com/SilvioGiancola/SoccerNetv2-DevKit/', 0, '', '', '', '', 'https://soccer-net.org/'], "Understanding broadcast videos is a challenging task in computer vision, as it requires generic reasoning capabilities to appreciate the content offered by the video editing. In this work, we propose SoccerNet-v2, a novel large-scale corpus of manual annotations for the SoccerNet video dataset, along with open challenges to encourage more research in soccer understanding and broadcast production. Specifically, we release around 300k annotations within SoccerNet's 500 untrimmed broadcast soccer videos. We extend current tasks in the realm of soccer to include action spotting, camera shot segmentation with boundary detection, and we define a novel replay grounding task. For each task, we provide and discuss benchmark results, reproducible with our open-source adapted implementations of the most relevant works in the field. SoccerNet-v2 is presented to the broader research community to help push computer vision closer to automatic solutions for more general video understanding and production purposes."
],
['10754/668843', 1, 21, 3, '1cC3WZqHju6s6nrB8CohYXkX4DqY_Ttib', '1yxawNijAp6akKbvVzek-RauXhVpRPScr', 'Temporally-Aware Feature Pooling for Action Spotting in Soccer Broadcasts', [2, 0],
[0, 0],
[],
['https://github.com/SilvioGiancola/SoccerNetv2-DevKit/', 0, '', '', '', '', 'https://soccer-net.org/'], "Toward the goal of automatic production for sports broadcasts, a paramount task consists in understanding the high-level semantic information of the game in play. For instance, recognizing and localizing the main actions of the game would allow producers to adapt and automatize the broadcast production, focusing on the important details of the game and maximizing the spectator engagement. In this paper, we focus our analysis on action spotting in soccer broadcast, which consists in temporally localizing the main actions in a soccer game. To that end, we propose a novel feature pooling method based on NetVLAD, dubbed NetVLAD++, that embeds temporally-aware knowledge. Different from previous pooling methods that consider the temporal context as a single set to pool from, we split the context before and after an action occurs. We argue that considering the contextual information around the action spot as a single entity leads to sub-optimal learning for the pooling module. With NetVLAD++, we disentangle the context from the past and future frames and learn specific vocabularies of semantics for each subset, avoiding blending and blurring such vocabularies in time. Injecting such prior knowledge creates more informative pooling modules and more discriminative pooled features, leading to a better understanding of the actions. We train and evaluate our methodology on the recent large-scale dataset SoccerNet-v2, reaching 53.4% Average-mAP for action spotting, a +12.7% improvement w.r.t. the current state-of-the-art."
],
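/*
  A minimal sketch of the temporally-aware split described in the abstract above: the context
  window is divided into the frames before and after the action, and each half is pooled with its
  own pooling function before concatenation. Plain average pooling is used here only as a
  stand-in for the learned NetVLAD vocabularies of the paper, and the toy features are illustrative.

  const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length;
  const poolHalf = frames =>                       // frames: non-empty array of per-frame feature vectors
    frames[0].map((_, d) => mean(frames.map(f => f[d])));

  function temporallyAwarePool(frames, actionIdx) {
    const before = frames.slice(0, actionIdx);     // past context
    const after = frames.slice(actionIdx);         // future context (kept separate, not blended)
    return [...poolHalf(before), ...poolHalf(after)];
  }

  // toy example: 4 frames with 2-dim features, action at frame index 2
  console.log(temporallyAwarePool([[1, 0], [2, 0], [0, 3], [0, 5]], 2));
*/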
['10754/668881', 1, 21, 3, '1hh1TDx-8JUWc24IhO6MJbTNljhvXgL-z', '1FMLSkAbP0YuYQtBobVuL75v6QElsbFaD', 'Camera Calibration and Player Localization in SoccerNet-v2 and Investigation of their Representations for Action Spotting', [118, 119, 120, 2, 121, 0, 122],
[1, 1, 1, 1, 0, 0, 0],
[],
['https://github.com/SilvioGiancola/SoccerNetv2-DevKit/', 0, '', '', '', '', 'https://soccer-net.org/'], "Soccer broadcast video understanding has been drawing a lot of attention in recent years among data scientists and industrial companies. This is mainly due to the lucrative potential unlocked by effective deep learning techniques developed in the field of computer vision. In this work, we focus on the topic of camera calibration and on its current limitations for the scientific community. More precisely, we tackle the absence of a large-scale calibration dataset and of a public calibration network trained on such a dataset. Specifically, we distill a powerful commercial calibration tool into a recent neural network architecture on the large-scale SoccerNet dataset, composed of untrimmed broadcast videos of 500 soccer games. We further release our distilled network, and leverage it to provide 3 ways of representing the calibration results along with player localization. Finally, we exploit those representations within the current best architecture for the action spotting task of SoccerNet-v2, and achieve new state-of-the-art performances."
],
['10754/668823', 1, 21, 26, '1rTUYjSTTmP6NUWxzsq_XmTVMx1yed9Lm', '14ZzioEeOLA9zJlJYjQdlFNPD2CFr6M6w', 'SeedQuant: A deep learning-based tool for assessing stimulant and inhibitor activity on root parasitic seeds', [107, 108, 2, 109, 110, 111, 112, 113, 114, 115, 116, 0, 117],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[],
['', 0, '', '', '', '', ''], "Witchweeds (Striga spp.) and broomrapes (Orobanchaceae and Phelipanche spp.) are root parasitic plants that infest many crops in warm and temperate zones, causing enormous yield losses and endangering global food security. Seeds of these obligate parasites require rhizospheric, host-released stimulants to germinate, which opens up possibilities for controlling them by applying specific germination inhibitors or synthetic stimulants that induce lethal germination in the host's absence. To determine their effect on germination, root exudates or synthetic stimulants/inhibitors are usually applied to parasitic seeds in in vitro bioassays, followed by assessment of germination ratios. Although these protocols are very sensitive, the germination recording process is laborious, representing a challenge for researchers and impeding high-throughput screens. Here, we developed an automatic seed census tool to count and discriminate germinated from non-germinated seeds. For object detection, we combined deep learning, a powerful data-driven framework that can accelerate the procedure and increase its accuracy, with the latest developments in computer vision, building on the Faster R-CNN algorithm. Our method showed an accuracy of 94% in counting seeds of Striga hermonthica and reduced the required time from ~5 minutes to 5 seconds per image. Our proposed software, SeedQuant, will be of great help for seed germination bioassays and enable high-throughput screening for germination stimulants/inhibitors. SeedQuant is an open-source software that can be further trained to count different types of seeds for research purposes."
],
['10754/660737', 1, 20, 25, '16sCX9INUnJyhSulNeJAiZZEpvU98lO3_', '1TIWNtoOM18kFxetNaRpfV8d1TlUA2y-g', 'Self-Supervised Learning by Cross-Modal Audio-Video Clustering', [6, 103, 104, 105, 0, 106],
[0, 0, 0, 0, 0, 0],
[1],
['https://github.com/HumamAlwassel/XDC', 0, 'https://videos.neurips.cc/search/Self-Supervised%20Learning%20by%20Cross-Modal%20Audio-Video%20Clustering/video/slideslive-38937777', 'http://www.humamalwassel.com/pdf/xdc-poster.pdf', '1zkJpTvQjVgpLVetHcuFQISfe_jUfCdbP', 'https://papers.nips.cc/paper/2020/file/6f2268bd1d3d3ebaabb04d6b5d099425-Supplemental.pdf', 'http://www.humamalwassel.com/publication/xdc/'], "Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps XDC utilize the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks. Most importantly, our video model pretrained on large-scale unlabeled data significantly outperforms the same model pretrained with full-supervision on ImageNet and Kinetics for action recognition on HMDB51 and UCF101. To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture."
],
['10754/665996', 0, 21, 24, '1J59rRTZKxWvvYni2-ul6R9Y9aebs9TZ1', '1A3eZWNGfq4EZUHkYT72RyUr4PvDpyXr1', 'Using deep neural networks to diagnose engine pre-ignition', [100, 1, 101, 0, 102],
[0, 0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], "Engine downsizing and boosting have been recognized as effective strategies for improving engine efficiency. However, operating the engines at high load promotes abnormal combustion events, such as pre-ignition and potential superknock. Currently the most effective method for detecting pre-ignition is by using in-cylinder pressure sensors that have high precision and sensitivity, but also high cost. Due to rapid advances in automotive technology such as autonomous driving, computer-aided designs and future connectivity, we propose to use a complimentary data-driven strategy for diagnosing abnormal combustion events. To this end, a data-driven diagnostics approach for pre-ignition detection with deep neural networks is proposed. The success of convolutional neural networks (CNNs) in object detection and recurrent neural networks (RNNs) in sequence forecasting inspired us to develop these models for pre-ignition detection. For a cost-effective strategy, we use data from less expensive sensors, such as lambda and low-resolution exhaust back pressure (EBP), instead of high resolution in-cylinder pressure measurements. The first deep learning model is combined with a commonly used dimensionality reduction tool–Principal Component Analysis (PCA). The second model eliminates this step and directly processes time-series data. Results indicate that the first model with reduced input dimensions, and correspondingly smaller size of the network, shows better performance in detecting pre-ignition cycles with an F1 score of 79%. Overall, the proposed deep learning approach is a promising alternative for abnormal combustion diagnostics using data from low resolution sensors."
],
['10754/660670', 1, 21, 11, '17utki4TWyxoBM8AreKhzfpdqNhz1ODXJ', '1Z9N_R18M0Kzn4F1p3U3EkKfVkSRB9lyn', 'RefineLoc: Iterative refinement for weakly-supervised action localization', [99, 6, 14, 1, 0],
[1, 1, 0, 0, 0],
[],
['https://github.com/HumamAlwassel/RefineLoc', 0, 'https://www.youtube.com/watch?v=80AoaXIC_eM', 0, 0, '1lm9sZ0WicdeJWXg9nuSrv-bua0RBJUod', 'http://www.humamalwassel.com/publication/refineloc/'], "Video action detectors are usually trained using datasets with fully-supervised temporal annotations. Building such datasets is an expensive task. To alleviate this problem, recent methods have tried to leverage weak labeling, where videos are untrimmed and only a video-level label is available. In this paper, we propose RefineLoc, a novel weakly-supervised temporal action localization method. RefineLoc uses an iterative refinement approach by estimating and training on snippet-level pseudo ground truth at every iteration. We show the benefit of this iterative approach and present an extensive analysis of five different pseudo ground truth generators. We show the effectiveness of our model on two standard action datasets, ActivityNet v1.2 and THUMOS14. RefineLoc shows competitive results with the state-of-the-art in weakly-supervised temporal localization. Additionally, our iterative refinement process is able to significantly improve the performance of two state-of-the-art methods, setting a new state-of-the-art on THUMOS14."
],
['10754/660666', 1, 20, 23, '1FpELZiuQyKiPN_mPfxFSwMHfpjW_BHnP', '1ehfRi2vE17D3NlYRFqPyktXC2o7lfyM1', 'ThumbNet: One Thumbnail Image Contains All You Need for Recognition', [98, 0],
[0, 0],
[],
[0, 0, 'https://www.youtube.com/watch?v=m0Ms0xtSKFI&list=PLC28kDljnOrj-_w-MHKW36gVRvUe3XFjx', 0, 0, 'https://drive.google.com/file/d/1bBf_lT7tHrYgbRIlkmGV44HcOxqIKNY_/view?usp=sharing', ''], "Although deep convolutional neural networks (CNNs) have achieved great success in computer vision tasks, their real-world application is still impeded by their voracious demand for computational resources. Current works mostly seek to compress the network by reducing its parameters or parameter-incurred computation, neglecting the influence of the input image on the system complexity. Based on the fact that input images of a CNN contain substantial redundancy, in this paper, we propose a unified framework, dubbed as ThumbNet, to simultaneously accelerate and compress CNN models by enabling them to infer on one thumbnail image. We provide three effective strategies to train ThumbNet. In doing so, ThumbNet learns an inference network that performs equally well on small images as the original-input network on large images. With ThumbNet, not only do we obtain the thumbnail-input inference network that can drastically reduce computation and memory requirements, but also we obtain an image downscaler that can generate thumbnail images for generic classification tasks. Extensive experiments show the effectiveness of ThumbNet, and demonstrate that the thumbnail-input inference network learned by ThumbNet can adequately retain the accuracy of the original-input network even when the input images are downscaled 16 times."
],
['10754/666421', 3, 20, 6, '1PN0gQqEG2tGZUUp_Wo8gMCMRthcTw-0w', '1yte6P8pPA9v-DuuApQb2BPdFD9Dwshhu', 'Gabor Layers Enhance Network Robustness', [94, 97, 95, 3, 1, 0, 96],
[1, 1, 1, 0, 0, 0, 0],
[],
['https://github.com/BCV-Uniandes/Gabor_Layers_for_Robustness', 0, 'https://www.youtube.com/watch?v=R4uLxHP5ryk', 0, 0, 0, ''], "We revisit the benefits of merging classical vision concepts with deep learning models. In particular, we explore the effect on robustness against adversarial attacks of replacing the first layers of various deep architectures with Gabor layers, i.e. convolutional layers with filters that are based on learnable Gabor parameters. We observe that architectures enhanced with Gabor layers gain a consistent boost in robustness over regular models and preserve high generalizing test performance, even though these layers come at a negligible increase in the number of parameters. We then exploit the closed form expression of Gabor filters to derive an expression for a Lipschitz constant of such filters, and harness this theoretical result to develop a regularizer we use during training to further enhance network robustness. We conduct extensive experiments with various architectures (LeNet, AlexNet, VGG16 and WideResNet) on several datasets (MNIST, SVHN, CIFAR10 and CIFAR100) and demonstrate large empirical robustness gains. Furthermore, we experimentally show how our regularizer provides consistent robustness improvements."
],
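/*
  A minimal sketch of how a single Gabor filter kernel is generated from the learnable parameters
  mentioned in the abstract above (orientation theta, wavelength lambda, phase psi, envelope sigma,
  aspect ratio gamma); in a Gabor layer, kernels of this form replace the free-form weights of the
  first convolutional layers. The parameter values in the example are illustrative.

  function gaborKernel(size, theta, lambda, psi, sigma, gamma) {
    const half = Math.floor(size / 2);
    const kernel = [];
    for (let y = -half; y <= half; y++) {
      const row = [];
      for (let x = -half; x <= half; x++) {
        const xr = x * Math.cos(theta) + y * Math.sin(theta);    // rotated coordinates
        const yr = -x * Math.sin(theta) + y * Math.cos(theta);
        const envelope = Math.exp(-(xr * xr + gamma * gamma * yr * yr) / (2 * sigma * sigma));
        row.push(envelope * Math.cos(2 * Math.PI * xr / lambda + psi));
      }
      kernel.push(row);
    }
    return kernel;                                               // size x size convolution weight
  }

  console.log(gaborKernel(5, Math.PI / 4, 4.0, 0.0, 2.0, 0.5));
*/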
['10754/660663', 3, 20, 7, '160N74RpceVei0KORT1G3tE_1v4MogOcu', '', 'Towards Analyzing Semantic Robustness of Deep Neural Networks', [92, 0],
[0,0],
[3],
['https://github.com/ajhamdi/semantic-robustness', 0, 'https://www.youtube.com/watch?v=rf5ynrBap2Q&feature=youtu.be', '', `https://drive.google.com/file/d/1HypbEcQ5EbubWgHxCnSnor3neGHP2ijg/view?usp=sharing`, 0, 'https://eccv20-adv-workshop.github.io/'], "Despite the impressive performance of Deep Neural Networks (DNNs) on various vision tasks, they still exhibit erroneous high sensitivity toward semantic primitives (e.g. object pose). We propose a theoretically grounded analysis of DNN robustness in the semantic space. We qualitatively analyze the semantic robustness of different DNNs by visualizing the DNN global behavior as semantic maps and observe interesting behavior of some DNNs. Since generating these semantic maps does not scale well with the dimensionality of the semantic space, we develop a bottom-up approach to detect robust regions of DNNs. To achieve this, we formalize the problem of finding robust semantic regions of the network as optimization of integral bounds and develop expressions for update directions of the region bounds. We use our developed formulations to quantitatively evaluate the semantic robustness of different famous network architectures. We show through extensive experimentation that several networks, though trained on the same dataset and while enjoying comparable accuracy, do not necessarily perform similarly in semantic robustness. For example, InceptionV3 is more accurate despite being less semantically robust than ResNet50. We hope that this tool will serve as the first milestone towards understanding the semantic robustness of DNNs."
],
['10754/660765', 3, 20, 6, '1b8oBK8J24mz0VYDqTacyCsfLLWb5GdWN', '12vc9sDMEMPbi5WmVygryFfpRZfdVsrNy', 'AdvPC: Transferable Adversarial Perturbations on 3D Point Clouds', [92, 93,1, 0],
[0,0, 0, 0],
[],
['https://github.com/ajhamdi/AdvPC', 0, 'https://www.youtube.com/watch?v=Aq1PPcbhG8Y&feature=youtu.be', '', `https://drive.google.com/file/d/1ZDg-39cBYcN9T-5mKfAcMD-QiF3scGtQ/view?usp=sharing`, 0, ''], "Deep neural networks are vulnerable to adversarial attacks, in which imperceptible perturbations to their input lead to erroneous network predictions. This phenomenon has been extensively studied in the image domain, and has only recently been extended to 3D point clouds. In this work, we present novel data-driven adversarial attacks against 3D point cloud networks. We aim to address the following problems in current 3D point cloud adversarial attacks: they do not transfer well between different networks, and they are easy to defend against via simple statistical methods. To this extent, we develop a new point cloud attack (dubbed AdvPC) that exploits the input data distribution by adding an adversarial loss, after Auto-Encoder reconstruction, to the objective it optimizes. AdvPC leads to perturbations that are resilient against current defenses, while remaining highly transferable compared to state-of-the-art attacks. We test AdvPC using four popular point cloud networks: PointNet, PointNet++ (MSG and SSG), and DGCNN. Our proposed attack increases the attack success rate by up to 40% for those transferred to unseen networks (transferability), while maintaining a high success rate on the attacked network. AdvPC also increases the ability to break defenses by up to 38% as compared to other baselines on the ModelNet40 dataset."
],
['10754/664563', 2, 20, 9, '1I5s5MMG85gZWBBsqDrrSfI7tTmGiGReq', '1vkp7GKuK9oOH3WKSlJ7Pfak6wC0ISl5', 'SADA: Semantic Adversarial Diagnostic Attacks for Autonomous Applications', [92, 21, 0],
[0, 0, 0],
[1],
['https://github.com/ajhamdi/SADA', 0, 'https://www.youtube.com/watch?v=clguL24kVG0&feature=youtu.be', 'https://drive.google.com/file/d/1AbMZyIHDmjbArVwT6AQ6XbWMMPDDM-Yw/view?usp=sharing', `https://drive.google.com/file/d/1xRnytHRXzEkVa5dEeBkAZCE3fWZ91me4/view?usp=sharing`, 0, ''], "One major factor impeding more widespread adoption of deep neural networks (DNNs) is their lack of robustness, which is essential for safety-critical applications such as autonomous driving. This has motivated much recent work on adversarial attacks for DNNs, which mostly focus on pixel-level perturbations void of semantic meaning. In contrast, we present a general framework for adversarial attacks on trained agents, which covers semantic perturbations to the environment of the agent performing the task as well as pixel-level attacks. To do this, we re-frame the adversarial attack problem as learning a distribution of parameters that always fools the agent. In the semantic case, our proposed adversary (denoted as BBGAN) is trained to sample parameters that describe the environment with which the black-box agent interacts, such that the agent performs its dedicated task poorly in this environment. We apply BBGAN on three different tasks, primarily targeting aspects of autonomous navigation: object detection, self-driving, and autonomous UAV racing. On these tasks, BBGAN can generate failure cases that consistently fool a trained agent."
],
['10754/660283', 3, 20, 8, '', '', 'A Stochastic Derivative Free Optimization Method with Momentum', [91, 3, 90, 88, 89],
[0, 0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, ''], "We consider the problem of unconstrained minimization of a smooth objective function in a setting where only function evaluations are possible. We propose and analyze stochastic zeroth-order method with heavy ball momentum. In particular, we propose, SMTP, a momentum version of the stochastic three-point method (STP) Bergou et al. 2019). We show new complexity results for non-convex, convex and strongly convex functions. We test our method on a collection of earning to continuous control tasks on several MuJoCo Todorov et al. (2012) environments with varying difficulty and compare against STP, other state-of-the-art derivative-free optimization algorithms and against policy gradient methods. SMTP significantly outperforms STP and all other methods that we considered in our numerical experiments. Our second contribution is SMTP with importance sampling which we call SMTP_IS. We provide convergence analysis of this method for non-convex, convex and strongly convex objectives."
],
['10754/653108', 3, 20, 9, '', '', 'A Stochastic Derivative-Free Optimization Method with Importance Sampling: Theory and Learning to Control', [3, 88, 90, 0, 89],
[0, 0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, ''], "We consider the problem of unconstrained minimization of a smooth objective function in a setting where only function evaluations are possible. While importance sampling is one of the most popular techniques used by machine learning practitioners to accelerate the convergence of their models when applicable, there is not much existing theory for this acceleration in the derivative-free setting. In this paper, we propose the first derivative free optimization method with importance sampling and derive new improved complexity results on non-convex, convex and strongly convex functions. We conduct extensive experiments on various synthetic and real LIBSVM datasets confirming our theoretical results. We further test our method on a collection of continuous control tasks on MuJoCo environments with varying difficulty. Experiments suggest that our algorithm is practical for high dimensional continuous control problems where importance sampling results in a significant sample complexity improvement."
],
['10754/655902', 3, 19, 0, '1U95Ix4mSaGouumQntgZQUP2CsQ8_iLi5', '1pTK9miOIGi9DwHyEWGs6_6AvSKJkxyV3', 'Can We See More? Joint Frontalization and Hallucination of Unaligned Tiny Faces', [29, 47, 0, 48],
[0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 'https://ieeexplore.ieee.org/document/8704962'], "In popular TV programs (such as CSI), a very low-resolution face image of a person, who is not even looking at the camera in many cases, is digitally super-resolved to a degree that suddenly the person's identity is made visible and recognizable. Of course, we suspect that this is merely a cinematographic special effect and such a magical transformation of a single image is not technically possible. Or, is it? In this paper, we push the boundaries of super-resolving (hallucinating to be more accurate) a tiny, non-frontal face image to understand how much of this is possible by leveraging the availability of large datasets and deep networks. To this end, we introduce a novel Transformative Adversarial Neural Network (TANN) to jointly frontalize very-low resolution (i.e. $16\\times 16$ pixels) out-of-plane rotated face images (including profile views) and aggressively super-resolve them ($8\\times$), regardless of their original poses and without using any 3D information. TANN is composed of two components: a transformative upsampling network which embodies encoding, spatial transformation and deconvolutional layers, and a discriminative network that enforces the generated high-resolution frontal faces to lie on the same manifold as real frontal face images. We evaluate our method on a large set of synthesized non-frontal face images to assess its reconstruction performance. Extensive experiments demonstrate that TANN generates both qualitatively and quantitatively superior results achieving over 4 dB improvement over the state-of-the-art."
],
[0, 2, 19, 2, '179WogDvo3OglftnBVCD4t72dRzqEKDDf', '1Yoe7ggGY8NC6G6H3qf6pVDszMeQ4-FlV', 'Leveraging Shape Completion for 3D Siamese Tracking', [2, 8, 0],
[1, 1, 0],
[],
['https://github.com/SilvioGiancola/ShapeCompletion3DTracking', 0, 0, 0, 0, 'https://youtu.be/2-NAaWSSrGA', 0], 'Point clouds are challenging to process due to their sparsity, therefore autonomous vehicles rely more on appearance attributes than pure geometric features. However, 3D LIDAR perception can provide crucial information for urban navigation in challenging light or weather conditions. In this paper, we investigate the versatility of Shape Completion for 3D Object Tracking in LIDAR point clouds. We design a Siamese tracker that encodes model and candidate shapes into a compact latent representation. We regularize the encoding by enforcing the latent representation to decode into an object model shape. We observe that 3D object tracking and 3D shape completion complement each other. Learning a more meaningful latent representation shows better discriminatory capabilities, leading to improved tracking performance. We test our method on the KITTI Tracking set using car 3D bounding boxes. Our model reaches a 76.94% Success rate and 81.38% Precision for 3D Object Tracking, with the shape completion regularization leading to an improvement of 3% in both metrics.'
],
[0, 2, 19, 3, '1Gl_iVDaOdUglHXBjIaXS3Pv2op6XXwQt', '1iYNnUNY52E3aOhyJqwS2UhE4R71KAv8g', 'Learning a Controller Fusion Network by Online Trajectory Filtering for Vision-based UAV Racing', [21, 4, 28, 30, 33, 0],
[0, 0, 0, 0, 0, 0],
[2, 3],
[0, 0, 0, 0, 0, 0, 'https://matthias.pw/publication/cfn/'], 'Autonomous UAV racing has recently emerged as an interesting research problem. The dream is to beat humans in this new fast-paced sport. A common approach is to learn an end-to-end policy that directly predicts controls from raw images by imitating an expert. However, such a policy is limited by the expert it imitates and scaling to other environments and vehicle dynamics is difficult. One approach to overcome the drawbacks of an end-to-end policy is to train a network only on the perception task and handle control with a PID or MPC controller. However, a single controller must be extensively tuned and cannot usually cover the whole state space. In this paper, we propose learning an optimized controller using a DNN that fuses multiple controllers. The network learns a robust controller with online trajectory filtering, which suppresses noisy trajectories and imperfections of individual controllers. The result is a network that is able to learn a good fusion of filtered trajectories from different controllers leading to significant improvements in overall performance. We compare our trained network to controllers it has learned from, end-to-end baselines and human pilots in a realistic simulation; our network beats all baselines in extensive experiments and approaches the performance of a professional human pilot.'
],
[0, 2, 19, 3, '1_V_VGo5jFzJbjXRPfHKcseV99O5-1-u0', '1r0R9P_65P8noVG76tX_z0T7R2obDj1PU', 'Semantic Part RCNN for Real-World Pedestrian Detection', [10, 20, 31, 0],
[0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 'http://openaccess.thecvf.com/CVPR2019_workshops/CVPR2019_Weakly_Supervised_Learning_for_RealWorld_Computer_Vision_Applications.py'], 'Recent advances in pedestrian detection, a fundamental problem in computer vision, have been attained by transferring the learned features of convolutional neural networks (CNN) to pedestrians. However, existing methods often show a significant drop in performance when heavy occlusion and deformation happen because most methods rely on holistic modeling. Unlike most previous deep models that directly learn a holistic detector, we introduce the semantic part information for learning the pedestrian detector. Rather than defining semantic parts manually, we detect key points of each pedestrian proposal and then extract six semantic parts according to the predicted key points, e.g., head, upper-body, left/right arms and legs. Then, we crop and resize the semantic parts and pad them with the original proposal images. The padded images containing semantic part information are passed through CNN for further classification. Extensive experiments demonstrate the effectiveness of adding semantic part information, which achieves superior performance on the Caltech benchmark dataset.'
],
[0, 3, 19, 3, '1X6kxlLO1dTq0TcgmRi7pEeaTUAhMNeuq', '1jSSWrXKMmKUehy0kuHu7Q-4H9ddsh8pr', 'Missing Labels in Object Detection', [10, 20, 0],
[0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 'http://openaccess.thecvf.com/CVPR2019_workshops/CVPR2019_Weakly_Supervised_Learning_for_RealWorld_Computer_Vision_Applications.py'], 'Object detection is a fundamental problem in computer vision. Impressive results have been achieved on large-scale detection benchmarks by fully-supervised object detection (FSOD) methods. However, FSOD performance is highly affected by the quality of annotations available in training. Furthermore, FSOD approaches require tremendous instance-level annotations, which are time-consuming to collect. In contrast, weakly supervised object detection (WSOD) exploits easily-collected image-level labels while it suffers from relatively inferior detection performance. In this paper, we study the effect of missing annotations on FSOD methods and analyze approaches to train an object detector from a hybrid dataset, where both instance-level and image-level labels are employed. Extensive experiments on the challenging PASCAL VOC 2007 and 2012 benchmarks strongly demonstrate the effectiveness of our method, which gives a trade-off between collecting fewer annotations and building a more accurate object detector. Our method is also a strong baseline bridging the wide gap between FSOD and WSOD performances.'
],
[0, 3, 19, 4, '1dBsem8ufxELFXHFnKj42cp8uUjg1iOS5', '1nvU-btKvFktIba0dY2Dkk2HNpNxvlvUk', 'Can GCNs Go as Deep as CNNs?', [4, 21, 1, 0],
[0, 0, 0, 0],
[2],
[0, 0, 0, 0, 0, 0, 'https://sites.google.com/view/deep-gcns'], 'Convolutional Neural Networks (CNNs) achieve impressive results in a wide variety of fields. Their success benefited from a massive boost with the ability to train very deep CNN models. Despite their positive results, CNNs fail to properly address problems with non-Euclidean data. To overcome this challenge, Graph Convolutional Networks (GCNs) build graphs to represent non-Euclidean data, and borrow concepts from CNNs and apply them to train these models. GCNs show promising results, but they are limited to very shallow models due to the vanishing gradient problem. As a result, most state-of-the-art GCN algorithms are no deeper than 3 or 4 layers. In this work, we present new ways to successfully train very deep GCNs. We borrow concepts from CNNs, mainly residual/dense connections and dilated convolutions, and adapt them to GCN architectures. Through extensive experiments, we show the positive effect of these deep GCN frameworks. Finally, we use these new concepts to build a very deep 56-layer GCN, and show how it significantly boosts performance (+3.7% mIoU over state-of-the-art) in the task of point cloud semantic segmentation. The project website is available at https://sites.google.com/view/deep-gcns.'
],
[0, 2, 19, 4, '1oO2I62BMAxJ8aQXDCwYWO1zPKwJVIcJJ', '1Rh8-TS_vRNo5ZxU78NddH-VaXE-5Znov', '3D Instance Segmentation via Multi-task Metric Learning', [7, 0, 60, 61],
[0, 0, 0, 0],
[2],
[0, 0, 0, 0, 0, 0, 0], "We propose a novel method for instance label segmentation of dense 3D voxel grids. We target volumetric scene representations which have been acquired with depth sensors or multi-view stereo methods and which have been processed with semantic 3D reconstruction or scene completion methods. The main task is to learn shape information about individual object instances in order to accurately separate them, including connected and incompletely scanned objects. We solve the 3D instance-labeling problem with a multi-task learning strategy. The first goal is to learn an abstract feature embedding which groups voxels with the same instance label close to each other while separating clusters with different instance labels from each other. The second goal is to learn instance information by estimating directional information of the instances' centers of mass densely for each voxel. This is particularly useful to find instance boundaries in the clustering post-processing step, as well as for scoring the quality of segmentations for the first goal. Both synthetic and real-world experiments demonstrate the viability of our approach. Our method achieves state-of-the-art performance on the ScanNet 3D instance segmentation benchmark."
],
[0, 3, 19, 8, '1WtVenF9RmTjXqkvYVY3lh3s6_be80ACq', '1Am-YHOTURvPCGqYpQg917Kl0E0nG3VAG', 'Deep Layers as Stochastic Solvers', [3, 0, 78, 71],
[0, 0, 0, 0],
[],
[0, 0, 0, 'https://drive.google.com/file/d/12ccARxNLd70gZ3U4-yCuAJZfUsC-K-DZ/view', 0, 0, 'http://www.adelbibi.com/publication/stochastic_solvers/'], 'We provide a novel perspective on the forward pass through a block of layers in a deep network. In particular, we show that a forward pass through a standard dropout layer followed by a linear layer and a non-linear activation is equivalent to optimizing a convex optimization objective with a single iteration of a τ-nice Proximal Stochastic Gradient method. We further show that replacing standard Bernoulli dropout with additive dropout is equivalent to optimizing the same convex objective with a variance-reduced proximal method. By expressing both fully-connected and convolutional layers as special cases of a high-order tensor product, we unify the underlying convex optimization problem in the tensor setting and derive a formula for the Lipschitz constant L used to determine the optimal step size of the above proximal methods. We conduct experiments with standard convolutional networks applied to the CIFAR-10 and CIFAR-100 datasets, and show that replacing a block of layers with multiple iterations of the corresponding solver, with step size set via L, consistently improves classification accuracy.'
],
[0, 3, 19, 9, '1iTL6OWRgLhxruqKfozpxB7EX4Db3PtLv', '1a_Iizys9IYW1gC6aeQTjeBK75VTb2e8d', 'A Novel Framework for Robustness Analysis of Visual QA Models', [18, 23, 9, 0],
[0, 1, 1, 0],
[2],
['https://github.com/IVUL-KAUST/VQABQ', 0, 'https://youtu.be/s1mEvQVPS8E', 0, 0, 0, 0], 'Deep neural networks have been playing an essential role in many computer vision tasks including Visual Question Answering (VQA). Until recently, the study of their accuracy was the main focus of research but now there is a trend toward assessing the robustness of these models against adversarial attacks by evaluating their tolerance to varying noise levels. In VQA, adversarial attacks can target the image and/or the proposed main question, and yet there is a lack of proper analysis of the latter. In this work, we propose a flexible framework that focuses on the language part of VQA and uses semantically relevant questions, dubbed basic questions, as controllable noise to evaluate the robustness of VQA models. We hypothesize that the level of noise is positively correlated to the similarity of a basic question to the main question. Hence, to apply noise on any given main question, we rank a pool of basic questions based on their similarity by casting this ranking task as a LASSO optimization problem. Then, we propose a novel robustness measure Rscore and two large-scale basic question datasets (BQDs) in order to standardize robustness analysis for VQA models.'
],
[0, 3, 19, 11, '1GoZ89Co5Dbl6BU0z4lDvpcnwNkAqZrQ6', '1KPX77OUuhqP1FyfgRAaGAUiLOBR9JRHu', 'Local Color Mapping Combined with Color Transfer for Underwater Image Enhancement', [26, 3, 0],
[0, 0, 0],
[],
['https://github.com/rprotasiuk/underwater_enhancement', 0, 0, 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/WACV_2019_Supp.pdf', 0], 'Color correction and color transfer methods have gained a lot of attention in the past few years to circumvent color degradation that may occur due to various sources. In this paper, we propose a novel simple yet powerful strategy to profoundly enhance color distorted underwater images. The proposed approach combines both local and global information through a simple yet powerful affine transform model. Local and global information are carried through local color mapping and color covariance mapping between an input and some reference source, respectively. Several experiments on degraded underwater images demonstrate that the proposed method performs favourably to all other methods including ones that are tailored to correcting underwater images by explicit noise modelling.'
],
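/*
  A minimal sketch of the global statistics-matching idea that color-transfer methods of this kind
  build on: each channel of the input is re-mapped so that its mean and standard deviation match
  those of the reference. This per-channel matching is only a simple stand-in for the paper's
  combined local mapping and covariance-based affine model, and the pixel values are illustrative.

  const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length;
  const std = xs => { const m = mean(xs); return Math.sqrt(mean(xs.map(x => (x - m) * (x - m)))); };

  // input and reference are arrays of per-pixel values for one color channel
  function matchChannel(input, reference) {
    const mi = mean(input), si = std(input) || 1;                // guard against zero variance
    const mr = mean(reference), sr = std(reference);
    return input.map(v => (v - mi) * (sr / si) + mr);
  }

  console.log(matchChannel([10, 20, 30, 40], [100, 110, 120, 130]));   // -> [100, 110, 120, 130]
*/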
[0, 2, 19, 21, '1mWx7nbh8g2cL6pn20k9vD0tfq9h1WCwW', '1q9v0_8Ak-_JtqMwVxsu2-gXKMFw8Wfyq', 'OIL: Observational Imitation Learning', [4, 21, 28, 30, 33, 0],
[0, 0, 0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 'https://matthias.pw/publication/oil/'], 'Recent work has explored the problem of autonomous navigation by imitating a teacher and learning an end-to-end policy, which directly predicts controls from raw images. However, these approaches tend to be sensitive to mistakes by the teacher and do not scale well to other environments or vehicles. To this end, we propose Observational Imitation Learning (OIL), a novel imitation learning variant that supports online training and automatic selection of optimal behavior by observing multiple imperfect teachers. We apply our proposed methodology to the challenging problems of autonomous driving and UAV racing. For both tasks, we utilize the Sim4CV simulator that enables the generation of large amounts of synthetic training data and also allows for online learning and evaluation. We train a perception network to predict waypoints from raw image data and use OIL to train another network to predict controls from these waypoints. Extensive experiments demonstrate that our trained network outperforms its teachers, conventional imitation learning (IL) and reinforcement learning (RL) baselines and even humans in simulation. The project website is available at https://matthias.pw/publication/oil/, and a video is also available.'
],
[0, 3, 18, 0, '1h2Snnstq3-BYS-9DakjpZP4lNcn_ANro', '1f7TfNBIKWIT1yzXZ7fA3QSxp3c5hJD5M', 'Lp-Box ADMM: A Versatile Framework for Integer Programming', [16, 0],
[0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'This paper revisits the integer programming (IP) problem, which plays a fundamental role in many computer vision and machine learning applications. The literature abounds with many seminal works that address this problem, some focusing on continuous approaches (e.g., linear program relaxation), while others on discrete ones (e.g., min-cut). However, since many of these methods are designed to solve specific IP forms, they cannot adequately satisfy the simultaneous requirements of accuracy, feasibility, and scalability. To this end, we propose a novel and versatile framework called lp-box ADMM, which is based on two main ideas. (1) The discrete constraint is equivalently replaced by the intersection of a box and an lp-norm sphere. (2) We infuse this equivalence into the ADMM (Alternating Direction Method of Multipliers) framework to handle the continuous constraints separately and to harness its attractive properties. More importantly, the ADMM update steps can lead to manageable sub-problems in the continuous domain. To demonstrate its efficacy, we apply it to an optimization form that occurs often in computer vision and machine learning, namely binary quadratic programming (BQP). In this case, the ADMM steps are simple and computationally efficient. Moreover, we present a theoretical analysis of the global convergence of lp-box ADMM by adding a perturbation with a sufficiently small factor ϵ to the original IP problem. Specifically, the globally converged solution generated by lp-box ADMM for the perturbed IP problem will be close to the stationary and feasible point of the original IP problem within O(ϵ). We demonstrate the applicability of lp-box ADMM on three important applications: MRF energy minimization, graph matching, and clustering. Results clearly show that it significantly outperforms existing generic IP solvers both in runtime and objective. It also achieves very competitive performance compared to state-of-the-art methods designed specifically for these applications.'
],
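/*
  A small numeric check of the geometric equivalence behind idea (1) in the abstract above, for the
  p = 2 case: the binary set {0,1}^n is exactly the intersection of the box [0,1]^n with the l2-sphere
  centered at 0.5*1 with squared radius n/4. The test vectors below are illustrative.

  const inBox = x => x.every(v => v >= 0 && v <= 1);
  const onSphere = x => {
    const sq = x.reduce((s, v) => s + (v - 0.5) * (v - 0.5), 0);
    return Math.abs(sq - x.length / 4) < 1e-9;
  };
  const isBinary = x => inBox(x) && onSphere(x);

  console.log(isBinary([0, 1, 1, 0]));            // true  : a genuine binary vector
  console.log(isBinary([0.5, 1, 0, 0]));          // false : inside the box but off the sphere
  console.log(isBinary([1.5, 0.5, 0.5, 0.5]));    // false : on the sphere but outside the box
*/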
[0, 3, 18, 1, '1sSX4zZw6W0Rk-t2JGXPhRcOTaniM8lfW', '1Nnm0funn3RGpGy_dqEETqQdyKuyChZQ0', 'Multi-label Learning with Missing Labels Using Mixed Dependency', [16, 46, 80, 0, 76],
[0, 0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'This work focuses on the problem of multi-label learning with missing labels (MLML), which aims to label each test instance with multiple class labels given training instances that have an incomplete/partial set of these labels (i.e., some of their labels are missing). The key point to handle missing labels is propagating the label information from the provided labels to missing labels, through a dependency graph in which each label of each instance is treated as a node. We build this graph by utilizing different types of label dependencies. Specifically, the instance-level similarity serves as undirected edges to connect the label nodes across different instances and the semantic label hierarchy is used as directed edges to connect different classes. This base graph is referred to as the mixed dependency graph, as it includes both undirected and directed edges. Furthermore, we present another two types of label dependencies to connect the label nodes across different classes. One is the class co-occurrence, which is also encoded as undirected edges. Combining with the above base graph, we obtain a new mixed graph, called mixed graph with co-occurrence (MG-CO). The other is the sparse and low rank decomposition of the whole label matrix, to embed high-order dependencies over all labels. Combining with the base graph, the new mixed graph is called MG-SL (mixed graph with sparse and low rank decomposition). Based on MG-CO and MG-SL, we further propose two convex transductive formulations of the MLML problem, denoted as MLMG-CO and MLMG-SL respectively. In both formulations, the instance-level similarity is embedded through a quadratic smoothness term, while the semantic label hierarchy is used as a linear constraint. In MLMG-CO, the class co-occurrence is also formulated as a quadratic smoothness term, while the sparse and low rank decomposition is incorporated into MLMG-SL, through two additional matrices (one assumed to be sparse, and the other assumed to be low rank) and an equivalence constraint between the summation of these two matrices and the original label matrix. Interestingly, two important applications, including image annotation and tag based image retrieval, can be jointly handled using our proposed methods. Experimental results on several benchmark datasets show that our methods lead to significant improvements in performance and robustness to missing labels over the state-of-the-art methods.'
],
['10754/626562', 2, 18, 1, '1JRiGrhw3rNKL47HFLHiCYrd65WwWt0Rr', '1KnpnunwyB_tmxONnqlmOANjvYsHB-Lul', 'Sim4CV: A photo-realistic simulator for computer vision applications', [21, 28, 7, 30, 0],
[0, 0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'We present a photo-realistic training and evaluation simulator (UE4Sim) with extensive applications across various fields of computer vision. Built on top of the Unreal Engine, the simulator integrates full-featured physics-based cars, unmanned aerial vehicles (UAVs), and animated human actors in diverse urban and suburban 3D environments. We demonstrate the versatility of the simulator with two case studies: autonomous UAV-based tracking of moving objects and autonomous driving using supervised learning. The simulator fully integrates several state-of-the-art tracking algorithms with a benchmark evaluation tool, as well as a deep neural network (DNN) architecture for training vehicles to drive autonomously. It generates synthetic photo-realistic datasets with automatic ground truth annotations to easily extend existing real-world datasets and provides extensive synthetic data variety through its ability to reconfigure synthetic worlds on the fly using an automatic world generation tool.'
],
[0, 3, 18, 2, '1ebjRvif9ybFgshW0QcTDH8v_XUx9bxOz', '183T-npqzO_7928bahmF8RW84LYj0OueC', 'Analytic Expressions for Probabilistic Moments of PL-DNN with Gaussian Input', [3, 9, 0],
[1, 1, 0],
[2],
['https://github.com/ModarTensai/network_moments', 0, 'https://youtu.be/op9IBox_TTc?t=673', 'https://drive.google.com/file/d/1au9sd2NALWLrNZWFNDe4AkTRv7b9Z_kA/view?usp=sharing', 'https://drive.google.com/file/d/1cu2jdYItdPblB_MJoWIL73n_tDtRy0xV/view?usp=sharing', 'https://ivul.kaust.edu.sa/Documents/more/supplementary/Analytic%20Expressions%20for%20Probabilistic%20Moments%20of%20PL-DNN%20with%20Gaussian%20Input.pdf', 0], 'The outstanding performance of deep neural networks (DNNs), for the visual recognition task in particular, has been demonstrated on several large-scale benchmarks. This performance has immensely strengthened the line of research that aims to understand and analyze the driving reasons behind the effectiveness of these networks. One important aspect of this analysis has recently gained much attention, namely the reaction of a DNN to noisy input. This has spawned research on developing adversarial input attacks as well as training strategies that make DNNs more robust against these attacks. To this end, we derive in this paper exact analytic expressions for the first and second moments (mean and variance) of a small piecewise linear (PL) network (Affine, ReLU, Affine) subject to general Gaussian input. We experimentally show that these expressions are tight under simple linearizations of deeper PL-DNNs, especially popular architectures in the literature (e.g. LeNet and AlexNet). Extensive experiments on image classification show that these expressions can be used to study the behaviour of the output mean of the logits for each class, the interclass confusion and the pixel-level spatial noise sensitivity of the network. Moreover, we show how these expressions can be used to systematically construct targeted and non-targeted adversarial attacks.'
],
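/*
  A minimal scalar sketch of the kind of closed-form moments the abstract above refers to: for a single
  Gaussian input x ~ N(mu, sigma^2) passed through a ReLU, the first two moments of max(0, x) have the
  standard closed forms used below. This is only the one-dimensional building block, not the paper's
  full network-level (Affine-ReLU-Affine) expressions; the erf approximation is Abramowitz-Stegun.

  function erf(z) {
    const s = z < 0 ? -1 : 1, a = Math.abs(z);
    const t = 1 / (1 + 0.3275911 * a);
    const poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t;
    return s * (1 - poly * Math.exp(-a * a));
  }
  const phi = t => Math.exp(-t * t / 2) / Math.sqrt(2 * Math.PI);   // standard normal pdf
  const Phi = t => 0.5 * (1 + erf(t / Math.SQRT2));                 // standard normal cdf

  function reluMoments(mu, sigma) {
    const r = mu / sigma;
    const mean = mu * Phi(r) + sigma * phi(r);
    const secondMoment = (mu * mu + sigma * sigma) * Phi(r) + mu * sigma * phi(r);
    return { mean, variance: secondMoment - mean * mean };
  }

  console.log(reluMoments(0, 1));   // mean ~ 0.3989, variance ~ 0.3408
*/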
[0, 3, 18, 2, '12o_qho7H3ijEb8sPEfDQxOVxPNt4c48h', '1wG2CWbzrKeVvxthe3oDNU7noEO87i-Up', 'Tagging Like Humans: Diverse and Distinct Image Annotation', [16, 82, 68, 80, 0, 76],
[0, 0, 0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'In this work we propose a new automatic image annotation model, dubbed diverse and distinct image annotation (D2IA). The generative model D2IA is inspired by the ensemble of human annotations, which create semantically relevant, yet distinct and diverse tags. In D2IA, we generate a relevant and distinct tag subset, in which the tags are relevant to the image contents and semantically distinct to each other, using sequential sampling from a determinantal point process (DPP) model. Multiple such tag subsets that cover diverse semantic aspects or diverse semantic levels of the image contents are generated by randomly perturbing the DPP sampling process. We leverage a generative adversarial network (GAN) model to train D2IA. Extensive experiments including quantitative and qualitative comparisons, as well as human subject studies, on two benchmark datasets demonstrate that the proposed model can produce more diverse and distinct tags than the state-of-the-arts.'
],
[0, 2, 18, 2, '1psm5UEWQz64cvaI0H3JKu1O7bgnuMc1D', '1W__LTrhSFyrhBZFXfJ16QTrbEHeWIjyy', 'ISTA-Net: Interpretable Optimization-Inspired Deep Network for Image Compressive Sensing', [19, 0],
[0, 0],
[],
['http://jianzhang.tech/projects/ISTA-Net', 0, 0, 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/ISTA-Net%20Interpretable%20Optimization-Inspired%20Deep%20Network%20for%20Image.pdf', 0], 'With the aim of developing a fast yet accurate algorithm for compressive sensing (CS) reconstruction of natural images, we combine in this paper the merits of two existing categories of CS methods: the structure insights of traditional optimization-based methods and the speed of recent network-based ones. Specifically, we propose a novel structured deep network, dubbed ISTA-Net, which is inspired by the Iterative Shrinkage-Thresholding Algorithm (ISTA) for optimizing a general L1 norm CS reconstruction model. To cast ISTA into deep network form, we develop an effective strategy to solve the proximal mapping associated with the sparsity-inducing regularizer using nonlinear transforms. All the parameters in ISTA-Net (e.g. non-linear transforms, shrinkage thresholds, step sizes, etc.) are learned end-to-end, rather than being hand-crafted. Moreover, considering that the residuals of natural images are more compressible, an enhanced version of ISTA-Net in the residual domain, dubbed ISTA-Net+, is derived to further improve CS reconstruction. Extensive CS experiments demonstrate that the proposed ISTA-Nets outperform existing state-of-the-art optimization-based and network-based CS methods by large margins, while maintaining fast computational speed.'
],
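/*
  A minimal sketch of the classical ISTA iteration that the abstract above says is unrolled into network
  layers: x <- soft(x - rho * A^T (A x - y), lambda * rho), where soft() is element-wise soft-thresholding.
  In ISTA-Net the transforms, thresholds and step sizes are learned instead; A, y, rho and lambda below
  are tiny illustrative values.

  const matVec = (A, x) => A.map(row => row.reduce((s, a, j) => s + a * x[j], 0));
  const matTVec = (A, v) => A[0].map((_, j) => A.reduce((s, row, i) => s + row[j] * v[i], 0));
  const soft = (v, t) => v.map(x => Math.sign(x) * Math.max(Math.abs(x) - t, 0));

  function ista(A, y, rho, lambda, iters) {
    let x = new Array(A[0].length).fill(0);
    for (let k = 0; k < iters; k++) {
      const residual = matVec(A, x).map((v, i) => v - y[i]);     // A x - y
      const grad = matTVec(A, residual);                         // A^T (A x - y)
      x = soft(x.map((v, j) => v - rho * grad[j]), lambda * rho);
    }
    return x;
  }

  // toy compressive-sensing problem: 2 measurements of a 3-dimensional sparse signal
  console.log(ista([[1, 0, 1], [0, 1, 1]], [1, 0], 0.3, 0.05, 200));
*/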
[0, 2, 18, 2, '13N6jcV6UKEJMATFSF2RotZJ-Ha5Wc0hL', '1dQGQEtDzDvlB5ORRbKwC8wq9NrabhE6_', 'Finding Tiny Faces in the Wild with Generative Adversarial Network', [20, 22, 64, 0],
[0, 0, 0, 0],
[2],
[0, 0, 'https://youtu.be/DFHgJWQDE6o?t=354', 0, 0, 0, 0], 'Face detection techniques have been developed for decades, and one of the remaining open challenges is detecting small faces in unconstrained conditions. The reason is that tiny faces often lack detailed information and are blurry. In this paper, we propose an algorithm to directly generate a clear high-resolution face from a blurry small one by adopting a generative adversarial network (GAN). Toward this end, the basic GAN formulation achieves it by super-resolving and refining sequentially (e.g. SR-GAN and cycle-GAN). However, we design a novel network to address the problem of super-resolving and refining jointly. We also introduce new training losses to guide the generator network to recover fine details and to promote the discriminator network to distinguish real vs. fake and face vs. non-face simultaneously. Extensive experiments on the challenging dataset WIDER FACE demonstrate the effectiveness of our proposed method in restoring a clear high-resolution face from a blurry small one, and show that the detection performance outperforms other state-of-the-art methods.'
],
[0, 2, 18, 2, '1YTn8NZdL-sKpyPOhyrqhBHQEg7g2jBTv', '1GCw5SWSnCuvVgAvjPMuHf1drUE15lhZS', 'W2F: A Weakly-Supervised to Fully-Supervised Framework', [22, 20, 64, 84, 0],
[0, 0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/W2F%20A%20Weakly-Supervised%20to%20Fully-Supervised%20Framework.pdf', 0], 'Weakly-supervised object detection has attracted much attention lately, since it does not require bounding box annotations for training. Although significant progress has also been made, there is still a large gap in performance between weakly-supervised and fully-supervised object detection. Recently, some works use pseudo ground-truths which are generated by a weakly-supervised detector to train a supervised detector. Such approaches tend to find the most representative parts of objects, and only seek one ground-truth box per class even though many same-class instances exist. To overcome these issues, we propose a weakly-supervised to fully-supervised framework, where a weakly-supervised detector is implemented using multiple instance learning. Then, we propose a pseudo ground-truth excavation (PGE) algorithm to find the pseudo ground-truth of each instance in the image. Moreover, the pseudo ground-truth adaptation (PGA) algorithm is designed to further refine the pseudo ground-truths from PGE. Finally, we use these pseudo ground-truths to train a fully-supervised detector. Extensive experiments on the challenging PASCAL VOC 2007 and 2012 benchmarks strongly demonstrate the effectiveness of our framework. We obtain 52.4% and 47.8% mAP on VOC2007 and VOC2012 respectively, a significant improvement over previous state-of-the-art methods.'
],
[0, 1, 18, 3, '1WaMpmypgTBU7VgXHpu2nypB289mhaGho', '1RkNnTRlf4m4XyLqD6pIm3xR08k0ndLZA', 'SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos', [2, 66, 27, 0],
[0, 0, 0, 0],
[2],
[0, 'https://cemse.kaust.edu.sa/ivul/soccernet', 'https://youtu.be/x4E3DPy84xM', 0, 0, 0, 'https://silviogiancola.github.io/SoccerNet/'], 'In this paper, we introduce SoccerNet, a benchmark for action spotting in soccer videos. The dataset is composed of 500 complete soccer games from six main European leagues, covering three seasons from 2014 to 2017 and a total duration of 764 hours. A total of 6,637 temporal annotations are automatically parsed from online match reports at a one minute resolution for three main classes of events (Goal, Yellow/Red Card, and Substitution). As such, the dataset is easily scalable. These annotations are manually refined to a one second resolution by anchoring them at a single timestamp following well-defined soccer rules. With an average of one event every 6.9 minutes, this dataset focuses on the problem of localizing very sparse events within long videos. We define the task of spotting as finding the anchors of soccer events in a video. Making use of recent developments in the realm of generic action recognition and detection in video, we provide strong baselines for detecting soccer events. We show that our best model for classifying temporal segments of length one minute reaches a mean Average Precision (mAP) of 67.8%. For the spotting task, our baseline reaches an Average-mAP of 49.7% for tolerances ranging from 5 to 60 seconds.'
],
[0, 2, 18, 3, '1TrHZhzWBy5cH-jYYwJ1ig_ZIKoIMfOnh', '1mg07T9nDc74H5EUuUcqzpEoK8OnKY9AA', 'Integration of Absolute Orientation Measurements in the KinectFusion Reconstruction pipeline', [2, 35, 32, 0],
[0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'In this paper, we show how absolute orientation measurements provided by low-cost but high-fidelity IMU sensors can be integrated into the KinectFusion pipeline. We show that this integration improves the runtime, robustness, and quality of the 3D reconstruction. In particular, we use this orientation data to seed and regularize the ICP registration technique. We also present a technique to filter the pairs of 3D matched points based on the distribution of their distances. This filter is implemented efficiently on the GPU. Estimating the distribution of the distances helps control the number of iterations necessary for the convergence of the ICP algorithm. Finally, we show experimental results that highlight improvements in robustness, a speed-up of almost 12%, and a gain in tracking quality of 53% for the ATE metric on the Freiburg benchmark.'
],
[0, 1, 18, 6, '1nu192a-faDxdlrg0EU7_WQREOheLq_XI', '1nyotw_mLa-AmiXLL40Npb9e8zTY8qVH2', 'What do I Annotate Next? An Empirical Study of Active Learning for Action Localization', [14, 54, 49, 0],
[0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/What%20do%20I%20Annotate%20Next%20An%20Empirical%20Study%20of%20Active%20Learning%20for%20Action%20Localization-supp.pdf', 'https://cabaf.github.io/what-to-annotate-next/'], 'Despite tremendous progress achieved in temporal action localization, state-of-the-art methods still struggle to train accurate models when annotated data is scarce. In this paper, we introduce a novel active learning framework for temporal localization that aims to mitigate this data dependency issue. We equip our framework with active selection functions that can reuse knowledge from previously annotated datasets. We study the performance of two state-of-the-art active selection functions as well as two widely used active learning baselines. To validate the effectiveness of each one of these selection functions, we conduct simulated experiments on ActivityNet. We find that using previously acquired knowledge as a bootstrapping source is crucial for active learners aiming to localize actions. When equipped with the right selection function, our proposed framework exhibits significantly better performance than standard active learning strategies, such as uncertainty sampling. Finally, we employ our framework to augment the newly compiled Kinetics action dataset with ground-truth temporal annotations. As a result, we collect Kinetics-Localization, a novel large-scale dataset for temporal action localization, which contains more than 15K YouTube videos.'
],
[0, 1, 18, 6, '1lHyJ7o3gwYpAKisdlHq9ZTngoypwAUfK', '1gP6DKLWbfNsbEAFl1Uwh6LqRfL_tLMYi', 'Action Search: Spotting Actions in Videos and Its Application to Temporal Action Localization', [6, 14, 0],
[1, 1, 0],
[],
[0, 0, 'https://youtu.be/HHGoz4Y5QzM', 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/Action%20Search%20Spotting%20Targets%20in%20Videos%20and%20Its%20Application%20to%20Temporal%20Action%20Localization-supp.pdf', 'http://humamalwassel.com/publication/action-search/'], 'State-of-the-art temporal action detectors inefficiently search the entire video for specific actions. Despite the encouraging progress these methods achieve, it is crucial to design automated approaches that only explore parts of the video which are the most relevant to the actions being searched for. To address this need, we propose the new problem of action spotting in video, which we define as finding a specific action in a video while observing a small portion of that video. Inspired by the observation that humans are extremely efficient and accurate in spotting and finding action instances in video, we propose Action Search, a novel Recurrent Neural Network approach that mimics the way humans spot actions. Moreover, to address the absence of data recording the behavior of human annotators, we put forward the Human Searches dataset, which compiles the search sequences employed by human annotators spotting actions in the AVA and THUMOS14 datasets. We consider temporal action localization as an application of the action spotting problem. Experiments on the THUMOS14 dataset reveal that our model is not only able to explore the video efficiently (observing on average 17.3% of the video) but it also accurately finds human activities with 30.8% mAP.'
],
[0, 1, 18, 6, '1uT-7-jWMbujIV7nKeERegpH5toXX8Gpv', '1jKlm47vzrExFbBLwRFvJYbkkyin2n9al', 'Diagnosing Error in Temporal Action Detectors', [6, 14, 15, 0],
[1, 1, 1, 0],
[],
['https://github.com/HumamAlwassel/DETAD', 0, 'https://youtu.be/rnndiuF2ouM', 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/Diagnosing%20Error%20in%20Temporal%20Action%20Detectors-supp.pdf', 'http://humamalwassel.com/publication/detad/'], 'Despite the recent progress in video understanding and the continuous rate of improvement in temporal action localization throughout the years, it is still unclear how far (or close?) we are to solving the problem. To this end, we introduce a new diagnostic tool to analyze the performance of temporal action detectors in videos and compare different methods beyond a single scalar metric. We exemplify the use of our tool by analyzing the performance of the top rewarded entries in the latest ActivityNet action localization challenge. Our analysis shows that the most impactful areas to work on are: strategies to better handle temporal context around the instances, improving the robustness w.r.t. the instance absolute and relative size, and strategies to reduce the localization errors. Moreover, our experimental analysis finds that the lack of agreement among annotators is not a major roadblock to attaining progress in the field. Our diagnostic tool is publicly available to keep fueling the minds of other researchers with additional insights about their algorithms.'
],
[0, 2, 18, 6, '1FUGH7yQy9PzBXPM61Fw_LI3RfikkM4EV', '1aqPVbSdIp61TGqEveGgQ01jBxR0CnKTc', 'TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild', [21, 3, 2, 11, 0],
[1, 1, 1, 0, 0],
[],
[0, 0, 'https://youtu.be/5n09hq3eweM', 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/TrackingNet%20A%20Large%20Scale%20Dataset%20and%20Benchmark%20for%20Object%20Tracking%20in%20the%20Wild-supp.pdf', 'http://www.tracking-net.org'], 'Despite the numerous developments in object tracking, further improvement of current tracking algorithms is limited by small and mostly saturated datasets. As a matter of fact, data-hungry trackers based on deep-learning currently rely on object detection datasets due to the scarcity of dedicated large-scale tracking datasets. In this work, we present TrackingNet, the first large-scale dataset and benchmark for object tracking in the wild. We provide more than 30K videos with more than 14 million dense bounding box annotations. Our dataset covers a wide selection of object classes in broad and diverse context. By releasing such a large-scale dataset, we expect deep trackers to further improve and generalize. In addition, we introduce a new benchmark composed of 500 novel videos, modeled with a distribution similar to our training dataset. By sequestering the annotation of the test set and providing an online evaluation server, we provide a fair benchmark for future development of object trackers. Deep trackers fine-tuned on a fraction of our dataset improve their performance by up to 1.6% on OTB100 and up to 1.7% on TrackingNet Test. We provide an extensive benchmark on TrackingNet by evaluating more than 20 trackers. Our results suggest that object tracking in the wild is far from being solved.'
],
[0, 3, 18, 6, '1Hw2mmgajfHnZyzf48SIrX-sfjIskWT30', '1XwV6TkyQKBN9LVK4VCLghQMpXzABVb8Y', 'Face Super-resolution Guided by Facial Component Heatmaps', [29, 41, 0, 48, 72],
[0, 0, 0, 0, 0],
[],
['https://github.com/XinYuANU/Facial-Heatmaps-Guided-Hallucination', 0, 0, 0, 0, 0, 0], 'State-of-the-art face super-resolution methods leverage deep convolutional neural networks to learn a mapping between low-resolution (LR) facial patterns and their corresponding high-resolution (HR) counterparts by exploring local appearance information. However, most of these methods do not account for facial structure and suffer from degradations due to large pose variations and misalignments. In this paper, we propose a method that explicitly incorporates structural information of faces into the face super-resolution process by using a multi-task convolutional neural network (CNN). Our CNN has two branches: one for super-resolving face images and the other branch for predicting salient regions of a face coined facial component heatmaps. These heatmaps encourage the upsampling stream to generate super-resolved faces with higher-quality details. Our method not only uses low-level information (i.e., intensity similarity), but also middle-level information (i.e., face structure) to further explore spatial constraints of facial components from LR input images. Therefore, we are able to super-resolve very small unaligned face images (16×16 pixels) with a large upscaling factor of 8×, while preserving face structure. Extensive experiments demonstrate that our network achieves superior face hallucination results and outperforms the state-of-the-art.'
],
[0, 2, 18, 6, '1aBEMhMZSNi883sWs58sbzG4B6Tm-4MMY', '1C93ZhO-_0YsVS1lF3mbuo5FUcpbAcNvd', 'SOD-MTGAN: Small Object Detection via Multi-Task Generative Adversarial Network', [20, 22, 64, 0],
[1, 1, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'Object detection is a fundamental and important problem in computer vision. Although impressive results have been achieved on large/medium sized objects on large-scale detection benchmarks (e.g. the COCO dataset), the performance on small objects is far from satisfactory. The reason is that small objects lack sufficient detailed appearance information, which can distinguish them from the background or similar objects. To deal with the small object detection problem, we propose an end-to-end multi-task generative adversarial network (MTGAN). In the MTGAN, the generator is a super-resolution network, which can up-sample small blurred images into fine-scale ones and recover detailed information for more accurate detection. The discriminator is a multitask network, which describes each super-resolution image patch with a real/fake score, object category scores, and bounding box regression offsets. Furthermore, to make the generator recover more details for easier detection, the classification and regression losses in the discriminator are back-propagated into the generator during training. Extensive experiments on the challenging COCO dataset demonstrate the effectiveness of the proposed method in restoring a clear super-resolution image from a blurred small one, and show that the detection performance, especially for small sized objects, improves over state-of-the-art methods.'
],
[0, 2, 18, 7, '1spLy_koUIHmeQAhKu0UhV7t9g8AAuiGo', '1LWrQY-OXg3l3FTTS-FvXIvGswystU0DA', 'Teaching UAVs to Race: End-to-End Regression of Agile Controls in Simulation', [21, 28, 30, 33, 0],
[1, 1, 0, 0, 0],
[2, 3],
[0, 0, 0, 0, 0, 0, 'https://matthias.pw/publication/deep-fpv-racer/'], 'Automating the navigation of unmanned aerial vehicles (UAVs) in diverse scenarios has gained much attention in recent years. However, teaching UAVs to fly in challenging environments remains an unsolved problem, mainly due to the lack of training data. In this paper, we train a deep neural network to predict UAV controls from raw image data for the task of autonomous UAV racing in a photo-realistic simulation. Training is done through imitation learning with data augmentation to allow for the correction of navigation mistakes. Extensive experiments demonstrate that our trained network (when sufficient data augmentation is used) outperforms state-of-the-art methods and flies more consistently than many human pilots. Additionally, we show that our optimized network architecture can run in real-time on embedded hardware, allowing for efficient on-board processing critical for real-world deployment. From a broader perspective, our results underline the importance of extensive data augmentation techniques to improve robustness in end-to-end learning setups.'
],
[0, 2, 18, 16, '1kQEIbNdHDR4DkC_ARgtmGZtfdiSW9-Zm', '1XssG8q0oa1uPtNMiRzu1Z50VkVyWcaUL', 'Weakly-supervised object detection via mining pseudo ground truth bounding-boxes', [22, 20, 64, 84, 0],
[0, 0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'Recently, weakly-supervised object detection has attracted much attention, since it does not require expensive bounding-box annotations while training the network. Although significant progress has also been made, there is still a large gap in performance between weakly-supervised and fully-supervised object detection. To mitigate this gap, some works try to use the pseudo ground truths generated by a weakly-supervised detector to train a supervised detector. However, such approaches are inclined to find the most representative parts instead of the whole body of an object, and only seek one ground truth bounding-box per class even though many same-class instances exist in an image. To address these issues, we propose a weakly-supervised to fully-supervised framework (W2F), where a weakly-supervised detector is implemented using multiple instance learning. Then, we propose a pseudo ground-truth excavation (PGE) algorithm to find the accurate pseudo ground truth bounding-box for each instance. Moreover, the pseudo ground-truth adaptation (PGA) algorithm is designed to further refine those pseudo ground truths mined by the PGE algorithm. Finally, the mined pseudo ground truths are used as supervision to train a fully-supervised detector. Additionally, we also propose an iterative ground-truth learning (IGL) approach, which enhances the quality of the pseudo ground truths by using the predictions of the fully-supervised detector iteratively. Extensive experiments on the challenging PASCAL VOC 2007 and 2012 benchmarks strongly demonstrate the effectiveness of our method. We obtain 53.1% and 49.4% mAP on VOC2007 and VOC2012 respectively, which is a significant improvement over previous state-of-the-art methods.'
],
['10754/632519', 2, 18, 22, '1U7pfMeYF2QOkvLislulPRk15L3HnU_Bv', '1Hz_vll5chebyjvB8mdjErW3wXdyBhUj3', 'Driving Policy Transfer via Modularity and Abstraction', [21, 39, 0, 78],
[0, 0, 0, 0],
[],
[0, 0, 'https://youtu.be/BrMDJqI6H5U', 0, 0, 0, 'https://matthias.pw/publication/driving-policy-transfer/'], 'End-to-end approaches to autonomous driving have high sample complexity and are difficult to scale to realistic urban driving. Simulation can help end-to-end driving systems by providing a cheap, safe, and diverse training environment. Yet training driving policies in simulation brings up the problem of transferring such policies to the real world. We present an approach to transferring driving policies from simulation to reality via modularity and abstraction. Our approach is inspired by classic driving systems and aims to combine the benefits of modular architectures and end-to-end deep learning approaches. The key idea is to encapsulate the driving policy such that it is not directly exposed to raw perceptual input or low-level vehicle dynamics. We evaluate the presented approach in simulated urban environments and in the real world. In particular, we transfer a driving policy trained in simulation to a 1/5-scale robotic truck that is deployed in a variety of conditions, with no finetuning, on two continents. The supplementary video can be viewed at https://youtu.be/BrMDJqI6H5U'
],
[0, 3, 17, 0, '1Eeww9huQkF5iGZpfbPTviMN-jGLsSHTv', '1KJqf7h-3ozOCehQIwYQpxsKCRRUcmrTa', 'L0TV: A Sparse Optimization Method for Impulse Noise Image Restoration', [17, 0],
[0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'Total Variation (TV) is an effective and popular prior model in the field of regularization-based image processing. This paper focuses on total variation for removing impulse noise in image restoration. This type of noise frequently arises in data acquisition and transmission due to many reasons, e.g. a faulty sensor or analog-to-digital converter errors. Removing this noise is an important task in image restoration. State-of-the-art methods such as Adaptive Outlier Pursuit (AOP) [57], which is based on TV with l02-norm data fidelity, only give sub-optimal performance. In this paper, we propose a new sparse optimization method, called l0TV-PADMM, which solves the TV-based restoration problem with l0-norm data fidelity. To effectively deal with the resulting non-convex non-smooth optimization problem, we first reformulate it as an equivalent biconvex Mathematical Program with Equilibrium Constraints (MPEC), and then solve it using a proximal Alternating Direction Method of Multipliers (PADMM). Our l0TV-PADMM method finds a desirable solution to the original l0-norm optimization problem and is proven to be convergent under mild conditions. We apply l0TV-PADMM to the problems of image denoising and deblurring in the presence of impulse noise. Our extensive experiments demonstrate that l0TV-PADMM outperforms state-of-the-art image restoration methods.'
],
['10754/623954', 3, 17, 2, '1YN7RDl6BAPAI15JIXitxue-oczxI11nN', '1l_nITGVloFPWnsjoxpTtO5eyA57J66sj', 'FFTLasso: Large-Scale LASSO in the Fourier Domain', [3, 5, 0],
[0, 0, 0],
[2],
['https://github.com/adelbibi/FFTLasso', 0, 'https://youtu.be/UQNb7a9KUYk', 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/FFTLasso%20supplementary%20material%20PDF.pdf', 'http://www.adelbibi.com/'], 'In this paper, we revisit the LASSO sparse representation problem, which has been studied and used in a variety of different areas, ranging from signal processing and information theory to computer vision and machine learning. In the vision community, it found its way into many important applications, including face recognition, tracking, super resolution, image denoising, to name a few. Despite advances in efficient sparse algorithms, solving large-scale LASSO problems remains a challenge. To circumvent this difficulty, people tend to downsample and subsample the problem (e.g. via dimensionality reduction) to maintain a manageable sized LASSO, which usually comes at the cost of losing solution accuracy. This paper proposes a novel circulant reformulation of the LASSO that lifts the problem to a higher dimension, where ADMM can be efficiently applied to its dual form. Because of this lifting, all optimization variables are updated using only basic element-wise operations, the most computationally expensive of which is a 1D FFT. In this way, there is no need for a linear system solver nor matrix-vector multiplication. Since all operations in our FFTLasso method are element-wise, the subproblems are completely independent and can be trivially parallelized (e.g. on a GPU). The attractive computational properties of FFTLasso are verified by extensive experiments on synthetic and real data and on the face recognition task. They demonstrate that FFTLasso scales much more effectively than a state-of-the-art solver.'
],
[0, 3, 17, 2, '18CEb_ZBCgcwlN96bVgUJ4k2Kzqns6afr', '1Fq8E1UQwbNcwmZZSOtjf5lTzVwWAoUe7', 'Diverse Image Annotation', [16, 46, 80, 0],
[0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/Diverse%20Image%20Annotation%20Supplement.zip', 0], 'In this work, we study a new image annotation task called diverse image annotation (DIA). Its goal is to describe an image using a limited number of tags, whereby the retrieved tags need to cover as much useful information about the image as possible. As compared to the conventional image annotation task, DIA requires the tags to be not only representative of the image but also diverse from each other, so as to reduce redundancy. To this end, we treat DIA as a subset selection problem, based on the conditional determinantal point process (DPP) model, which encodes representation and diversity jointly. We further explore semantic hierarchy and synonyms among candidate tags to define weighted semantic paths. It is encouraged that two tags with the same semantic path are not retrieved simultaneously for the same image. This restriction is embedded into the algorithm used to sample from the learned conditional DPP model. Interestingly, we find that conventional metrics for image annotation (e.g., precision, recall, and F1 score) only consider an overall representative capacity of all the retrieved tags, while ignoring their diversity. Thus, we propose new semantic metrics based on our proposed weighted semantic paths. An extensive subject study verifies that the proposed metrics are much more consistent with human evaluation than conventional annotation metrics. Experiments on two benchmark datasets show that the proposed method produces more representative and diverse tags, compared with existing methods.'
],
[0, 1, 17, 2, '15-pBfBp20MKmlOIV4eJZltusjBMUgK2B', '1dET2PZbiqaXFPncX_qIFIF15boyeKkae', 'SCC: Semantic Context Cascade for Efficient Action Detection', [14, 79, 15, 0],
[0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 'http://www.cabaf.net/scc/supplementary.html', 'http://www.cabaf.net/scc'], 'Despite the recent advances in large-scale video analysis, action detection remains one of the most challenging unsolved problems in computer vision. This snag is in part due to the large volume of data that needs to be analyzed to detect actions in videos. Existing approaches have mitigated the computational cost, but still, these methods lack rich high-level semantics that help them localize the actions quickly. In this paper, we introduce a Semantic Context Cascade (SCC) model that aims to detect actions in long video sequences. By embracing semantic priors associated with human activities, SCC produces high-quality class-specific action proposals and prunes unrelated activities in a cascade fashion. Experimental results on ActivityNet reveal that SCC achieves state-of-the-art performance for action detection while operating in real time.'
],
[0, 2, 17, 2, '1aAQpe-hke8St5-ZuQ9hhUp5AwdFYfdiq', '12Kkw2ziI8X9tqFyfWUOv9rQ5lrCq4QXu', 'A Matrix Splitting Method for Composite Function Minimization', [17, 81, 0],
[0, 0, 0],
[],
[0, 0, 0, 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/A%20Matrix%20Splitting%20Method%20for%20Composite%20Function%20Minimization%20Supplement.pdf', 0], 'Composite function minimization captures a wide spectrum of applications in both computer vision and machine learning. It includes bound constrained optimization and cardinality regularized optimization as special cases. This paper proposes and analyzes a new Matrix Splitting Method (MSM) for minimizing composite functions. It can be viewed as a generalization of the classical Gauss-Seidel method and the Successive Over-Relaxation method for solving linear systems in the literature. Incorporating a new Gaussian elimination procedure, the matrix splitting method achieves state-of-the-art performance. For convex problems, we establish the global convergence, convergence rate, and iteration complexity of MSM, while for non-convex problems, we prove its global convergence. Finally, we validate the performance of our matrix splitting method on two particular applications: nonnegative matrix factorization and cardinality regularized sparse coding. Extensive experiments show that our method outperforms existing composite function minimization techniques in terms of both efficiency and efficacy.'
],
['10754/623951', 2, 17, 2, '1BQNfvagp6RcDkNxvrGlcR_XgZWAhjyso', '1SS-y8wYoT7Nr2LfypCRaco--IAOd5ZTU', 'Context-Aware Correlation Filter Tracking', [21, 30, 0],
[0, 0, 0],
[2],
['https://goo.gl/gJ2vTs', 0, 'https://youtu.be/-mEkFAAag2Q', 0, 0, 'https://goo.gl/Mv0gnX', 'http://matthias.pw/'], 'Correlation filter (CF) based trackers have recently gained a lot of popularity due to their impressive performance on benchmark datasets, while maintaining high frame rates. A significant amount of recent research focuses on the incorporation of stronger features for a richer representation of the tracking target. However, this only helps to discriminate the target from background within a small neighborhood. In this paper, we present a framework that allows the explicit incorporation of global context within CF trackers. We reformulate the original optimization problem and provide a closed form solution for single and multi-dimensional features in the primal and dual domain. Extensive experiments demonstrate that this framework significantly improves the performance of many CF trackers with only a modest impact on frame rate.'
],
[0, 1, 17, 2, '1AhIuJqmOnhzxzmh_1S2uBt6QYJiA5V28', '1pQM7xTz42DsKEHvMUN7aiacytMz79Oax', 'SST: Single-Stream Temporal Action Proposals', [34, 15, 44, 0, 25],
[0, 0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 'https://github.com/shyamal-b/sst/'], 'Our paper presents a new approach for temporal detection of human actions in long, untrimmed video sequences. We introduce Single-Stream Temporal Action Proposals (SST), a new effective and efficient deep architecture for the generation of temporal action proposals. Our network can run continuously in a single stream over very long input video sequences, without the need to divide input into short overlapping clips or temporal windows for batch processing. We demonstrate empirically that our model outperforms the state-of-the-art on the task of temporal action proposal generation, while achieving some of the fastest processing speeds in the literature. Finally, we demonstrate that using SST proposals in conjunction with existing action classifiers results in state-of-the-art temporal action detection performance.'
],
[0, 3, 17, 4, '1df5PsY2e3Fl5-4jGNOyzAasj7FMut5mc', '1P-Dj07W8GyKzoAlXgOXymFaK2zxa_Dlx', 'High Order Tensor Formulation for Convolutional Sparse Coding', [3, 0],
[0, 0],
[],
[0, 0, 0, 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/Tensor_CSC_supp.pdf', 'http://adelbibi.com'], 'Convolutional sparse coding (CSC) has gained attention for its successful role as a reconstruction and a classification tool in the computer vision and machine learning community. Current CSC methods can only reconstruct single-feature 2D images independently. However, learning multidimensional dictionaries and sparse codes for the reconstruction of multi-dimensional data is very important, as it examines correlations among all the data jointly. This provides more capacity for the learned dictionaries to better reconstruct data. In this paper, we propose a generic and novel formulation for the CSC problem that can handle an arbitrary order tensor of data. Backed with experimental results, our proposed formulation can not only tackle applications that are not possible with standard CSC solvers, including colored video reconstruction (5D tensors), but it also performs favorably in reconstruction with much fewer parameters as compared to naive extensions of standard CSC to multiple features/channels.'
],
[0, 2, 17, 4, '1eVzwfVf15cDH7ClQub_TDiV2guUURAqQ', '1BLXqeniZ3Jp8ODoqRg07x76kqOmTlS9V', '2D-Driven 3D Object Detection in RGB-D Images', [7, 0],
[0, 0],
[],
[0, 0, 0, 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/2D-Driven%203D%20Object%20Detection%20in%20RGB-D%20Images_supp.pdf', 0], 'In this paper, we present a technique that places 3D bounding boxes around objects in an RGB-D scene. Our approach makes best use of the 2D information to quickly reduce the search space in 3D, benefiting from state-of-the-art 2D object detection techniques. We then use the 3D information to orient, place, and score bounding boxes around objects. We independently estimate the orientation for every object, using previous techniques that utilize normal information. Object locations and sizes in 3D are learned using a multilayer perceptron (MLP). In the final step, we refine our detections based on object class relations within a scene. When compared to state-of-the-art detection methods that operate almost entirely in the sparse 3D domain, extensive experiments on the well-known SUN RGBD dataset [29] show that our proposed method is much faster (4.1s per image) in detecting 3D objects in RGB-D images and performs better (3 mAP higher) than the state-of-the-art method that is 4.7 times slower and comparably to the method that is two orders of magnitude slower. This work hints at the idea that 2D-driven object detection in 3D should be further explored, especially in cases where the 3D input is sparse.'
],
[0, 3, 17, 4, '1zL0cWJYibp9h4J19dQQmDbFxmzBrxzIb', '1G9PKLW44AAI5-Mjp-vcHAlrFWUO-MxVq', 'Constrained Convolutional Sparse Coding for Parametric Based Reconstruction Of Line Drawings', [12, 13, 0],
[0, 0, 0],
[],
[0, 0, 0, 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/Constrained%20Convolutional%20Sparse%20Coding%20for%20Parametric%20Based%20Reconstruction_supp.zip', 0], 'Convolutional sparse coding (CSC) plays an essential role in many computer vision applications ranging from image compression to deep learning. In this work, we shed light on a new application where CSC can effectively serve, namely line drawing analysis. The process of drawing a line drawing can be approximated as the sparse spatial localization of a number of typical basic strokes, which in turn can be cast as a non-standard CSC model that considers the line drawing formation process from parametric curves. These curves are learned to optimize the fit between the model and a specific set of line drawings. Parametric representation of sketches is vital in enabling automatic sketch analysis, synthesis and manipulation. A couple of sketch manipulation examples are demonstrated in this work. Consequently, our novel method is expected to provide a reliable and automatic method for parametric sketch description. Through experiments, we empirically validate the convergence of our method to a feasible solution.'
],
[0, 3, 17, 9, '1L4YOoCba_hvXCbWh3vo1eZnHaEIAdEqY', '1L5_ErOayykJnzgykgMpduxmyVPDOkgDc', 'An Exact Penalty Method for Binary Optimization Based on MPEC Formulation', [17, 0],
[0, 0],
[],
['https://ivul.kaust.edu.sa/Documents/more/code/An%20Exact%20Penalty%20Method%20for%20Binary%20Optimization%20Based%20on%20MPEC%20Formulation.zip', 0, 0, 0, 0, 0, 0], 'Binary optimization is a central problem in mathematical optimization and its applications are abundant. To solve this problem, we propose a new class of continuous optimization techniques, which is based on Mathematical Programming with Equilibrium Constraints (MPECs). We first reformulate the binary program as an equivalent augmented biconvex optimization problem with a bilinear equality constraint, then we propose an exact penalty method to solve it. The resulting algorithm seeks a desirable solution to the original problem via solving a sequence of linear programming convex relaxation subproblems. In addition, we prove that the penalty function, induced by adding the complementarity constraint to the objective, is exact, i.e., it has the same local and global minima as those of the original binary program when the penalty parameter is over some threshold. The convergence of the algorithm can be guaranteed, since it essentially reduces to block coordinate descent in the literature. Finally, we demonstrate the effectiveness of our method on the problem of dense subgraph discovery. Extensive experiments show that our method outperforms existing techniques, such as iterative hard thresholding and linear programming relaxation.'
],
[0, 1, 17, 12, '1iM5eoOY_VCiXC83kT5Ume6A2wx1ngzEK', '18h2lT9ucA9SyD1CFmjHEuPrgMFReXT_X', 'End-to-End, Single-Stream Temporal Action Detection in Untrimmed Videos (SS-TAD)', [34, 15, 0, 57, 25],
[0, 0, 0, 0, 0],
[2],
['https://github.com/shyamal-b/ss-tad', 0, 0, 0, 0, 0, 'https://github.com/shyamal-b/ss-tad'], "In this work, we present a new intuitive, end-to-end approach for temporal action detection in untrimmed videos. We introduce our new architecture for Single-Stream Temporal Action Detection (SS-TAD), which effectively integrates joint action detection with its semantic sub-tasks in a single unifying end-to-end framework. We develop a method for training our deep recurrent architecture based on enforcing semantic constraints on intermediate modules that are gradually relaxed as learning progresses. We find that such a dynamic learning scheme enables SS-TAD to achieve higher overall detection performance, with fewer training epochs. By design, our single-pass network is very efficient and can operate at 701 frames per second, while simultaneously outperforming the state-of-the-art methods for temporal action detection on THUMOS'14."
],
[0, 3, 17, 18, '1uaTZHL9ukbwnVEXUW9_9XwJEkTQ83drf', '1oZ2WBlvt9eaAD15kTyT5awfZJcz9-OFU', 'Stroke Style Transfer', [12, 0],
[0, 0],
[0],
[0, 0, 0, 0, 0, 0, 0], 'We propose a novel method to transfer sketch style at the stroke level from one free-hand line drawing to another, whereby these drawings can be from different artists. It aims to transfer the style of the input sketch at the stroke level to the style encountered in sketches by other artists. This is done by modifying all the parametric stroke segments in the input, so as to minimize a global stroke-level distance between the input and target styles. To do this, we exploit a recent work on stroke authorship recognition to define the stroke-level distance [SRG15], which is in turn minimized using conventional optimization tools. We showcase the quality of style transfer qualitatively by applying the proposed technique on several input-target combinations.'
],
['10754/622773', 3, 16, 2, '1bQW2I4a1K5csV26rFGaiQQUtWfV175n8', '1P4qbtdtxtC5Px6A_q4kG8bbVMEkqsAbV', '3D Part-Based Sparse Tracker with Automatic Synchronization and Registration', [3, 77, 0],
[0, 0, 0],
[],
[0, 0, 'https://youtu.be/5YZEZseOYG4', 0, 0, 0, 0], 'In this paper, we present a part-based sparse tracker in a particle filter framework where both the motion and appearance model are formulated in 3D. The motion model is adaptive and directed according to a simple yet powerful occlusion handling paradigm, which is intrinsically fused in the motion model. Also, since 3D trackers are sensitive to synchronization and registration noise in the RGB and depth streams, we propose automated methods to solve these two issues. Extensive experiments are conducted on a popular RGBD tracking benchmark, which demonstrate that our tracker can achieve superior results, outperforming many other recent and state-of-the-art RGBD trackers.'
],
['10754/622892', 1, 16, 2, '19ZGZ_GyKAuw8XFWFZZwZQsUrLbKe2NvP', '1DUgPNkvgpLpyG0zlv1pOnIkgivdt9nKL', 'Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos', [14, 25, 0],
[0, 0, 0],
[],
['https://github.com/cabaf/sparseprop', 0, 0, 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/Fast%20Temporal%20Activity%20Proposals%20for%20Efficient%20Detection%20of%20Human%20Actions%20in%20Untrimmed%20Videos.pdf', 0], 'In many large-scale video analysis scenarios, one is interested in localizing and recognizing human activities that occur in short temporal intervals within long untrimmed videos. Current approaches for activity detection still struggle to handle large-scale video collections and the task remains relatively unexplored. This is in part due to the computational complexity of current action recognition approaches and the lack of a method that proposes fewer intervals in the video, where activity processing can be focused. In this paper, we introduce a proposal method that aims to recover temporal segments containing actions in untrimmed videos. Building on techniques for learning sparse dictionaries, we introduce a learning framework to represent and retrieve activity proposals. We demonstrate the capabilities of our method in not only producing high quality proposals but also in its efficiency. Finally, we show the positive impact our method has on recognition performance when it is used for action detection, while running at 10FPS.'
],
['10754/622775', 2, 16, 2, '17Dw3M6IfIiJHgV18pjOgjpJcXzu9EelB', '1T9DGD2332msAJMeKuBNpZzlETlwTrkcO', 'In Defense of Sparse Tracking: Circulant Sparse Tracker', [77, 3, 0],
[0, 0, 0],
[1],
[0, 0, 'https://youtu.be/6yGEM2f_WZA', 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/In%20Defense%20of%20Sparse%20Tracking%20Circulant%20Sparse%20Tracker.pdf', 0], 'Sparse representation has been introduced to visual tracking by finding the best target candidate with minimal reconstruction error within the particle filter framework. However, most sparse representation based trackers have high computational cost, less than promising tracking performance, and limited feature representation. To deal with the above issues, we propose a novel circulant sparse tracker (CST), which exploits circulant target templates. Because of the circulant structure property, CST has the following advantages: (1) It can refine and reduce particles using circular shifts of target templates. (2) The optimization can be efficiently solved entirely in the Fourier domain. (3) High dimensional features can be embedded into CST to significantly improve tracking performance without sacrificing much computation time. Both qualitative and quantitative evaluations on challenging benchmark sequences demonstrate that CST performs better than all other sparse trackers and favorably against state-of-the-art methods.'
],
['10754/623939', 2, 16, 6, '1Qbi9IZ-VdhZFWltZ4f33sJPYj7HOP8Jf', '1AjmUhLofEWC_6guQD_EBj6uzvE4wkxQY', 'Target Response Adaptation for Correlation Filter Tracking', [3, 21, 0],
[0, 0, 0],
[1],
['https://github.com/adelbibi/Target-Response-Adaptation-for-Correlation-Filter-Tracking', 0, 'https://youtu.be/yZVY_Evxm3I', 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/Target%20Response%20Adaptation%20for%20Correlation%20Filter%20Tracking-supp.pdf', 0], 'Most correlation filter (CF) based trackers utilize the circulant structure of the training data to learn a linear filter that best regresses this data to a hand-crafted target response. These circularly shifted patches are only approximations to actual translations in the image, which become unreliable in many realistic tracking scenarios including fast motion, occlusion, etc. In these cases, the traditional use of a single centered Gaussian as the target response impedes tracker performance and can lead to unrecoverable drift. To circumvent this major drawback, we propose a generic framework that can adaptively change the target response from frame to frame, so that the tracker is less sensitive to the cases where circular shifts do not reliably approximate translations. To do that, we reformulate the underlying optimization to solve for both the filter and target response jointly, where the latter is regularized by measurements made using actual translations. This joint problem has a closed form solution and thus allows for multiple templates, kernels, and multi-dimensional features. Extensive experiments on the popular OTB100 benchmark [19] show that our target adaptive framework can be combined with many CF trackers to realize significant overall performance improvement (ranging from 3%-13.5% in precision and 3.2%-13% in accuracy), especially in categories where this adaptation is necessary (e.g. fast motion, motion blur, etc.).'
],
['10754/622212', 2, 16, 6, '1ZjNEga2pArQ3tmLyIqUJuobW4rcJv4bg', '1ZtBaAMxmQM8THpR5zx3-aGpN-VgukOl-', 'Large Scale Asset Extraction for Urban Images', [13, 58, 0, 32],
[0, 0, 0, 0],
[],
['https://github.com/lamaaffara/UrbanAssetExtraction', 0, 'https://youtu.be/sc02tD35gi4', 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/Large%20Scale%20Asset%20Extraction%20for%20Urban%20Images-supp.pdf', 0], 'Object proposals are currently used for increasing the computational efficiency of object detection. We propose a novel adaptive pipeline for interleaving object proposals with object classification and use it as a formulation for asset detection. We first preprocess the images using a novel and efficient rectification technique. We then employ a particle filter approach to keep track of three priors, which guide proposed samples and get updated using classifier output. Tests performed on over 1000 urban images demonstrate that our rectification method is faster than existing methods without loss in quality, and that our interleaved proposal method outperforms current state-of-the-art. We further demonstrate that other methods can be improved by incorporating our interleaved proposals.'
],
['10754/622126', 2, 16, 6, '1GeX5Jkxi9oGnlWvVCL3B3WAfu5k3ZqJi', '1YDlCOWxH8HMPlmblJ2OvxFdSFecuiM3Z', 'A Benchmark and Simulator for UAV Tracking', [21, 30, 0],
[0, 0, 0],
[],
[0, 'https://ivul.kaust.edu.sa/Pages/Dataset-UAV123.aspx', 0, 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/A%20Benchmark%20and%20Simulator%20for%20UAV%20Tracking%20-%20Supplementary%20Material.pdf', 'http://matthias.pw/'], 'In this paper, we propose a new aerial video dataset and benchmark for low altitude UAV target tracking, as well as, a photorealistic UAV simulator that can be coupled with tracking methods. Our benchmark provides the first evaluation of many state-of-the-art and popular trackers on 123 new and fully annotated HD video sequences captured from a low-altitude aerial perspective. Among the compared trackers, we determine which ones are the most suitable for UAV tracking both in terms of tracking accuracy and run-time. The simulator can be used to evaluate tracking algorithms in real-time scenarios before they are deployed on a UAV “in the field”, as well as, generate synthetic but photo-realistic tracking datasets with automatic ground truth annotations to easily extend existing real-world datasets. Both the benchmark and simulator are made publicly available to the vision community on our website to further research in the area of object tracking from UAVs. (https://ivul.kaust.edu.sa/Pages/pub-benchmark-simulator-uav.aspx.).'
],
['10754/604944', 1, 16, 6, '1OWPmF7naJbQYt6F95xJ8DadLXuSKf4dN', '1I36Gj1mbjMEp4XYbN3Cr3PZAxI6z52r-', 'DAPs: Deep Action Proposals for Action Understanding', [15, 14, 25, 0],
[0, 0, 0, 0],
[],
['https://github.com/escorciav/daps', 0, 0, 0, 0, 0, 0], 'Object proposals have contributed significantly to recent advances in object understanding in images. Inspired by the success of this approach, we introduce Deep Action Proposals (DAPs), an effective and efficient algorithm for generating temporal action proposals from long videos. We show how to take advantage of the vast capacity of deep learning models and memory cells to retrieve from untrimmed videos temporal segments, which are likely to contain actions. A comprehensive evaluation indicates that our approach outperforms previous work on a large scale action benchmark, runs at 134 FPS making it practical for large-scale scenarios, and exhibits an appealing ability to generalize, i.e. to retrieve good quality temporal proposals of actions unseen in training.'
],
[0, 3, 16, 9, '1EcYIb1Zh_fjTDI6bh3oc4YYfaC2kGDgN', '1qXnMaFNfj8_L4ZfPT8fYzjF2zFNC1Oms', 'Constrained Submodular Minimization for Missing Labels and Class Imbalance in Multi-label Learning', [16, 76, 0],
[0, 0, 0],
[],
['https://sites.google.com/site/baoyuanwu2015/demo-MMIB-AAAI2016-BYWU.zip?attredirects=0&d=1', 0, 0, 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/Constrained%20Submodular%20Minimization%20Towards%20Missing%20Labels.pdf', 0], 'In multi-label learning, there are two main challenges: missing labels and class imbalance (CIB). The former assumes that only a partial set of labels are provided for each training instance while other labels are missing. CIB is observed from two perspectives: first, the number of negative labels of each instance is much larger than its positive labels; second, the rates of positive instances (i.e. the number of positive instances divided by the total number of instances) of different classes are significantly different. Both missing labels and CIB lead to significant performance degradation. In this work, we propose a new method to handle these two challenges simultaneously. We formulate the problem as a constrained submodular minimization that is composed of a submodular objective function that encourages label consistency and smoothness, as well as, class cardinality bound constraints to handle class imbalance. We further present a convex approximation based on the Lovasz extension of submodular functions, leading to a linear program, which can be efficiently solved by the alternating direction method of multipliers (ADMM). Experimental results on several benchmark datasets demonstrate the improved performance of our method over several state-of-the-art methods.'
],
[0, 3, 16, 9, '1Vr-UfsjicGxbR3t95EGZ9GEa4KPa-nHA', '1SfnWLfzlvXar7zVXZ4MLSjCic7XD3kiQ', 'A Proximal Alternating Direction Method for Semi-Definite Rank Minimization', [17, 0],
[0, 0],
[],
['https://ivul.kaust.edu.sa/Documents/more/code/A%20Proximal%20Alternating%20Direction%20Method%20for%20Semi-Definite%20Rank%20Minimization.rar', 0, 0, 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/A%20Proximal%20Alternating%20Direction%20Method%20for%20Semi-Definite%20Rank%20Minimization.pdf', 0], 'Semi-definite rank minimization problems model a wide range of applications in both signal processing and machine learning fields. This class of problems is NP-hard in general. In this paper, we propose a proximal Alternating Direction Method (ADM) for the well-known semi-definite rank regularized minimization problem. Specifically, we first reformulate this NP-hard problem as an equivalent biconvex MPEC (Mathematical Program with Equilibrium Constraints), and then solve it using proximal ADM, which involves solving a sequence of structured convex semi-definite subproblems to find a desirable solution to the original rank regularized optimization problem. Moreover, based on the Kurdyka-Łojasiewicz inequality, we prove that the proposed method always converges to a KKT stationary point under mild conditions. We apply the proposed method to the widely studied and popular sensor network localization problem. Our extensive experiments demonstrate that the proposed algorithm outperforms state-of-the-art low-rank semi-definite minimization algorithms in terms of solution quality.'
],
['10754/615897', 3, 16, 16, '1AM6alAuWtv7kzkxyxfLRP8vGW3QdgXt2', '1CL7DkhialcpK-lQ64AZ24DbUAqICXw1D', 'Facial Action Unit Recognition under Incomplete Data Based on Multi-label Learning with Missing Labels', [84, 16, 0, 83, 50, 69],
[0, 0, 0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'Facial action unit (AU) recognition has been applied in a wide range of fields, and has attracted great attention in the past two decades. Most existing works on AU recognition assumed that the complete label assignment for each training image is available, which is often not the case in practice. Labeling AUs is an expensive and time-consuming process. Moreover, due to the AU ambiguity and subjective difference, some AUs are difficult to label reliably and confidently. Many AU recognition works try to train the classifier for each AU independently, which is of high computation cost and ignores the dependency among different AUs. In this work, we formulate AU recognition under incomplete data as a multi-label learning with missing labels (MLML) problem. Most existing MLML methods usually employ the same features for all classes. However, we find this setting is unreasonable in AU recognition, as the occurrence of different AUs produces changes of skin surface displacement or face appearance in different face regions. If shared features are used for all AUs, much noise will be introduced due to the occurrence of other AUs. Consequently, the changes of the specific AUs cannot be clearly highlighted, leading to performance degradation. Instead, we propose to extract the most discriminative features for each AU individually, which are learned by the supervised learning method. The learned features are further embedded into the instance-level label smoothness term of our model, which also includes the label consistency and the class-level label smoothness. Both a global solution using st-cut and an approximated solution using conjugate gradient (CG) descent are provided. Experiments on both posed and spontaneous facial expression databases demonstrate the superiority of the proposed method in comparison with several state-of-the-art works.'
],
['10754/622585', 2, 16, 20, '1vsPALrtV2ZElDeG5wPmKQlKlkZZ8kBKO', '1R4KtSVjQee8t8FqdnVr_6aK1SA4P4kBQ', 'Persistent Aerial Tracking System for UAVs', [21, 24, 30, 0],
[0, 0, 0, 0],
[],
[0, 0, 'https://youtu.be/_vR81qJxnNQ', 0, 0, 0, 'http://matthias.pw/'], 'In this paper, we propose a persistent, robust and autonomous object tracking system for unmanned aerial vehicles (UAVs) called Persistent Aerial Tracking (PAT). A computer vision and control strategy is applied to a diverse set of moving objects (e.g. humans, animals, cars, boats, etc.) integrating multiple UAVs with a stabilized RGB camera. A novel strategy is employed to successfully track objects over a long period, by ‘handing over the camera’ from one UAV to another. We evaluate several state-of-the-art trackers on the VIVID aerial video dataset and additional sequences that are specifically tailored to low altitude UAV target tracking. Based on the evaluation, we select the leading tracker and improve upon it by optimizing for both speed and performance, integrate the complete system into an off-the-shelf UAV, and obtain promising results showing the robustness of our solution in real-world aerial scenarios.'
],
[0, 2, 15, 2, '1Cp4GABrmisjYqbz16HXQ8doWEcpxksG8', '1QA5lYGHPzpVvEmX4eBtd50zEUxGfOECr', 'Robust Manhattan Frame Estimation from a Single RGB-D Image', [0, 1, 25, 14],
[0, 0, 0, 0],
[],
['https://ivul.kaust.edu.sa/Documents/more/code/MFE.zip', 'https://ivul.kaust.edu.sa/Documents/Data/Robust%20Manhattan%20Frame%20Estimation%20from%20a%20Single%20RGB-D%20Image.zip', 'https://ivul.kaust.edu.sa/Documents/more/video/MFE-video.mp4', 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/MFE-supp.pdf', 0], 'This paper proposes a new framework for estimating the Manhattan Frame (MF) of an indoor scene from a single RGB-D image. Our technique formulates this problem as the estimation of a rotation matrix that best aligns the normals of the captured scene to canonical world axes. By introducing sparsity constraints, our method can simultaneously estimate the scene MF, the surfaces in the scene that are best aligned to one of three coordinate axes, and the outlier surfaces that do not align with any of the axes. To test our approach, we contribute a new set of annotations to determine ground truth MFs in each image of the popular NYUv2 dataset. We use this new benchmark to experimentally demonstrate that our method is more accurate, faster, more reliable and more robust than the methods used in the literature. We further motivate our technique by showing how it can be used to address the RGB-D SLAM problem in indoor scenes by incorporating it into and improving the performance of a popular RGB-D SLAM method.'
],
[0, 1, 15, 2, '1PUSxHYFdHLdwQbtnpCDRUFN5xVWUuMAx', '14ZYbhAldrL-bWpV7Y7J36eyL5sXmvVnk', 'ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding', [14, 15, 0, 25],
[0, 0, 0, 0],
[],
[0, 'https://cemse.kaust.edu.sa/ivul/activity-net', 0, 0, 0, 'https://dl.dropboxusercontent.com/u/18955644/website_files/2015/ActivityNet_CVPR2015_supp_material.zip', 'http://activity-net.org/'], 'In spite of many dataset efforts for human action recognition, current computer vision algorithms are still severely limited in terms of the variability and complexity of the actions that they can recognize. This is in part due to the simplicity of current benchmarks, which mostly focus on simple actions and movements occurring on manually trimmed videos. In this paper we introduce ActivityNet, a new large-scale video benchmark for human activity understanding. Our benchmark aims at covering a wide range of complex human activities that are of interest to people in their daily living. In its current version, ActivityNet provides samples from 203 activity classes with an average of 137 untrimmed videos per class and 1.41 activity instances per video, for a total of 849 video hours. We illustrate three scenarios in which ActivityNet can be used to compare algorithms for human activity understanding: untrimmed video classification, trimmed activity classification and activity detection.'
],
['10754/556156', 3, 15, 2, '1RV4EKdO2wXMRNkD1_ZIng7ZWb0jg0EBy', '1AtIKusQDMvKdfJ72YRD12FHqdUtxFNvl', 'L0TV: A New Method for Image Restoration in the Presence of Impulse Noise', [17, 0],
[0, 0],
[2],
['http://yuanganzhao.weebly.com/uploads/1/0/7/5/10759809/l0tv.zip', 0, 0, 0, 'http://yuanganzhao.weebly.com/uploads/1/0/7/5/10759809/slide-l0tv.pdf', 'https://dl.dropboxusercontent.com/u/18955644/website_files/L0TV_CVPR2015_supp_material.pdf', 0], 'Total Variation (TV) is an effective and popular prior model in the field of regularization-based image processing. This paper focuses on TV for image restoration in the presence of impulse noise. This type of noise frequently arises in data acquisition and transmission due to many reasons, e.g. a faulty sensor or analog-to-digital converter errors. Removing this noise is an important task in image restoration. State-of-the-art methods such as Adaptive Outlier Pursuit (AOP), which is based on TV with L02-norm data fidelity, only give sub-optimal performance. In this paper, we propose a new method, called L0TV-PADMM, which solves the TV-based restoration problem with L0-norm data fidelity. To effectively deal with the resulting non-convex nonsmooth optimization problem, we first reformulate it as an equivalent MPEC (Mathematical Program with Equilibrium Constraints), and then solve it using a proximal Alternating Direction Method of Multipliers (PADMM). Our L0TV-PADMM method finds a desirable solution to the original L0-norm optimization problem and is proven to be convergent under mild conditions. We apply L0TV-PADMM to the problems of image denoising and deblurring in the presence of impulse noise. Our extensive experiments demonstrate that L0TV-PADMM outperforms state-of-the-art image restoration methods.'
],
['10754/556107', 2, 15, 2, '1HZgjm7BCRIa4K33nkKWsBZYsmeNAje_G', '1FW0FVKUkkQlx-ivpew4YL1GcK1psfm-h', 'Structural Sparse Tracking', [77, 75, 43, 74, 0, 67, 63],
[0, 0, 0, 0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'Sparse representation has been applied to visual tracking by finding the best target candidate with minimal reconstruction error by use of target templates. However, most sparse representation based trackers only consider holistic or local representations and do not make full use of the intrinsic structure among and inside target candidates, thereby making the representation less effective when similar objects appear or under occlusion. In this paper, we propose a novel Structural Sparse Tracking (SST) algorithm, which not only exploits the intrinsic relationship among target candidates and their local patches to learn their sparse representations jointly, but also preserves the spatial layout structure among the local patches inside each target candidate. We show that our SST algorithm accommodates most existing sparse trackers with the respective merits. Both qualitative and quantitative evaluations on challenging benchmark image sequences demonstrate that the proposed SST algorithm performs favorably against several state-of-the-art methods.'
],
[0, 1, 15, 2, '1qtqoI5MoANTVmB6gtqOw4Ab4U8fMQJ26', '1nRtaHISp0aTJQYtYP1j4MmDWalXoxl85', 'On the Relationship between Visual Attributes and Convolutional Networks', [15, 25, 0],
[0, 0, 0],
[],
[0, 0, 0, 0, 0, 'https://dl.dropboxusercontent.com/u/18955644/website_files/0052-supp.zip', 0], 'One of the cornerstone principles of deep models is their abstraction capacity, i.e. their ability to learn abstract concepts from ‘simpler’ ones. Through extensive experiments, we characterize the nature of the relationship between abstract concepts (specifically objects in images) learned by popular and high performing convolutional networks (conv-nets) and established mid-level representations used in computer vision (specifically semantic visual attributes). We focus on attributes due to their impact on several applications, such as object description, retrieval and mining, and active (and zero-shot) learning. Among the findings we uncover, we show empirical evidence of the existence of Attribute Centric Nodes (ACNs) within a conv-net, which is trained to recognize objects (not attributes) in images. These special conv-net nodes (1) collectively encode information pertinent to visual attribute representation and discrimination, (2) are unevenly and sparsely distributed across all layers of the conv-net, and (3) play an important role in conv-net based object recognition.'
],
[0, 3, 15, 4, '1_EUnXXFEfoD7H5xznWgiWTUYDXgLWRGH', '1DU1lBeYZMkxhbQGm7yUvH7nCTQATQ0Tw', 'ML-MG: Multi-label Learning with Missing Labels Using a Mixed Graph', [16, 76, 0],
[0, 0, 0],
[],
['https://ivul.kaust.edu.sa/Documents/more/code/ML-MG%20Multi-label%20Learning%20with%20Missing%20Labels%20Using%20a%20Mixed%20Graph.zip', 0, 0, 0, 0, 0, 0], 'This work focuses on the problem of multi-label learning with missing labels (MLML), which aims to label each test instance with multiple class labels given training instances that have an incomplete/partial set of these labels (i.e. some of their labels are missing). To handle missing labels, we propose a unified model of label dependencies by constructing a mixed graph, which jointly incorporates (i) instance-level similarity and class co-occurrence as undirected edges and (ii) semantic label hierarchy as directed edges. Unlike most MLML methods, we formulate this learning problem transductively as a convex quadratic matrix optimization problem that encourages training label consistency and encodes both types of label dependencies (i.e. undirected and directed edges) using quadratic terms and hard linear constraints. The alternating direction method of multipliers (ADMM) can be used to exactly and efficiently solve this problem. To evaluate our proposed method, we consider two popular applications (image and video annotation), where the label hierarchy can be derived from WordNet. Experimental results show that our method leads to a significant improvement in performance and robustness to missing labels over the state-of-the-art methods.'
],
[0, 2, 15, 4, '190eJi8LIwR4eXe3Gm8srZBLtJS2NkEBu', '1ZPUlX18ugklzFXJ6XaeeomTdPjzf0qWu', 'Intrinsic Scene Decomposition from RGB-D images', [65, 0, 32],
[0, 0, 0],
[],
[0, 0, 'https://ivul.kaust.edu.sa/Documents/more/video/Intrinsic%20Scene%20Decomposition%20from%20RGB-D%20images.mp4', 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/Intrinsic%20Scene%20Decomposition%20from%20RGB-D%20images.pdf', 0], 'In this paper, we address the problem of computing an intrinsic decomposition of the colors of a surface into an albedo and a shading term. The surface is reconstructed from a single or multiple RGB-D images of a static scene obtained from different views. We thereby extend and improve existing works in the area of intrinsic image decomposition. In a variational framework, we formulate the problem as a minimization of an energy composed of two terms: a data term and a regularity term. The first term is related to the image formation process and expresses the relation between the albedo, the surface normals, and the incident illumination. We use an affine shading model, a combination of a Lambertian model, and an ambient lighting term. This model is relevant for Lambertian surfaces. When available, multiple views can be used to handle view-dependent non-Lambertian reflections. The second term contains an efficient combination of l2 and l1-regularizers on the illumination vector field and albedo respectively. Unlike most previous approaches, especially Retinex-like techniques, these terms do not depend on the image gradient or texture, thus reducing the mixing shading/reflectance artifacts and leading to better results. The obtained non-linear optimization problem is efficiently solved using a cyclic block coordinate descent algorithm. Our method outperforms a range of state-of-the-art algorithms on a popular benchmark dataset.'
],
['10754/621295', 3, 15, 4, '1kZY3LJZCNKDo4tbhjZn4nXk0NEuHxe6K', '1IqZ0ItUIn6HHUA1f8yN2CVa4F-IBUuvt', 'What makes an object memorable?', [70, 55, 36, 63, 0],
[0, 0, 0, 0, 0],
[],
[0, 'http://socrates.berkeley.edu/~plab/object-memorability/data/object-memorability-data-2015.zip', 0, 0, 0, 0, 'http://socrates.berkeley.edu/~plab/object-memorability/'], 'Recent studies on image memorability have shed light on what distinguishes the memorability of different images and the intrinsic and extrinsic properties that make those images memorable. However, a clear understanding of the memorability of specific objects inside an image remains elusive. In this paper, we provide the first attempt to answer the question: what exactly is remembered about an image? We augment both the images and object segmentations from the PASCAL-S dataset with ground truth memorability scores and shed light on the various factors and properties that make an object memorable (or forgettable) to humans. We analyze various visual factors that may influence object memorability (e.g. color, visual saliency, and object categories). We also study the correlation between object and image memorability and find that image memorability is greatly affected by the memorability of its most memorable object. Lastly, we explore the effectiveness of deep learning and other computational approaches in predicting object memorability in images. Our efforts offer a deeper understanding of memorability in general, thereby opening up avenues for a wide variety of applications.'
],
[0, 3, 15, 5, '1-Pf3XnvcI1ujVp-YUZTCUynGbsM-sNW-', '1-4vlaeL9hEA0dm3Obgpeen2VCx4LwJNO', 'Multi-Template Scale-Adaptive Kernelized Correlation Filters', [3, 0],
[0, 0],
[],
['https://github.com/adelbibi/Multi-Template-Scale-Adaptive-Kernelized-Correlation-Filters', 0, 0, 0, 0, 'https://ivul.kaust.edu.sa/Documents/more/supplementary/Multi-Template-Scale-Adaptive-Kernelized-Correlation-Filters.pdf', 0], 'This paper identifies the major drawbacks of a very computationally efficient, state-of-the-art tracker known as the Kernelized Correlation Filter (KCF) tracker. These drawbacks include an assumed fixed scale of the target in every frame, as well as a heuristic update strategy of the filter taps to incorporate historical tracking information (i.e. simple linear combination of taps from the previous frame). In our approach, we update the scale of the tracker by maximizing over the posterior distribution of a grid of scales. As for the filter update, we prove and show that it is possible to use all previous training examples to update the filter taps very efficiently using fixed-point optimization. We validate the efficacy of our approach on two tracking datasets, VOT2014 and VOT2015.'
],
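// Aside (illustrative comment only): the scale update described above maximizes over a posterior
// on a grid of scales. A rough sketch of that argmax step; `responseAt` and the prior weighting
// are hypothetical placeholders, not the authors' implementation.
/*
// Return the scale with the largest prior-weighted filter response.
function bestScale(scales, prior, responseAt) {
  let best = scales[0], bestScore = -Infinity;
  scales.forEach((s, i) => {
    const score = prior[i] * responseAt(s);   // responseAt(s): peak KCF response at scale s
    if (score > bestScore) { bestScore = score; best = s; }
  });
  return best;
}
*/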
['10754/556158', 1, 15, 11, '1vUo1BvycNbN_xlrzz8Yuqk_6MNUBETjI', '1mkyOFvJPysX0N7BoWPG4kxeAo_7ahGzb', 'Action Recognition using Discriminative Structured Trajectory Groups', [51, 67, 0],
[0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'In this paper, we develop a novel framework for action recognition in videos. The framework is based on automatically learning the discriminative trajectory groups that are relevant to an action. Different from previous approaches, our method does not require complex computation for graph matching or complex latent models to localize the parts. We model a video as a structured bag of trajectory groups with latent class variables. We model the action recognition problem in a weakly supervised setting and learn discriminative trajectory groups by employing a multiple instance learning (MIL) based Support Vector Machine (SVM) using pre-computed kernels. The kernels depend on the spatio-temporal relationship between the extracted trajectory groups and their associated features. We demonstrate both quantitatively and qualitatively that the classification performance of our proposed method is superior to baselines and several state-of-the-art approaches on three challenging standard benchmark datasets.'
],
['10754/556160', 3, 15, 18, '14UzxogR3V8qLksmUY7lg3_akXGOksr01', '1vPATbgaTSp1h4LyNEmLXgHZdbPaLUWTX', 'Designing Camera Networks by Convex Quadratic Programming', [0, 85, 32],
[0, 0, 0],
[],
[0, 0, 'https://youtu.be/BOE52ivrQwk', 0, 0, 0, 0], 'In this paper, we study the problem of automatic camera placement for computer graphics and computer vision applications. We extend the problem formulations of previous work by proposing a novel way to incorporate visibility constraints and camera-to-camera relationships. For example, the placement solution can be encouraged to have cameras that image the same important locations from different viewing directions, which can enable reconstruction and surveillance tasks to perform better. We show that the general camera placement problem can be formulated mathematically as a convex binary quadratic program (BQP) under linear constraints. Moreover, we propose an optimization strategy with a favorable trade-off between speed and solution quality. Our solution is almost as fast as a greedy treatment of the problem, but the quality is significantly higher, so much so that it is comparable to exact solutions that take orders of magnitude more computation time. Because it is computationally attractive, our method also allows users to explore the space of solutions for variations in input parameters. To evaluate its effectiveness, we show a range of 3D results on real-world floorplans (garage, hotel, mall, and airport).'
],
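// Aside (illustrative comment only): the abstract above compares its BQP solver against a greedy
// treatment of camera placement. A minimal sketch of such a greedy coverage baseline (not the BQP
// method itself); the data layout and names are our own assumptions.
/*
// candidates: array of Sets of point ids each candidate camera can see; budget: cameras allowed.
function greedyPlacement(candidates, budget) {
  const covered = new Set(), chosen = [];
  while (chosen.length < budget) {
    let bestIdx = -1, bestGain = 0;
    candidates.forEach((pts, i) => {
      if (chosen.includes(i)) return;
      let gain = 0;
      pts.forEach(p => { if (!covered.has(p)) gain += 1; });
      if (gain > bestGain) { bestGain = gain; bestIdx = i; }
    });
    if (bestIdx < 0) break;                       // no remaining camera adds new coverage
    chosen.push(bestIdx);
    candidates[bestIdx].forEach(p => covered.add(p));
  }
  return chosen;
}
*/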
['10754/556126', 3, 15, 18, '19tG2wvZkLk76CLHe8bQg1zs3yHFj8FfE', '1tQlaoSz-PHcfk7z4WsjbgdymaGSWUJjy', 'Template Assembly for Detailed Urban Reconstruction', [58, 42, 0, 32],
[0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 'https://www.semanticscholar.org/paper/Template-Assembly-for-Detailed-Urban-Reconstruction-Nan-Jiang/211b995bfe6f86d9d4ffd0d7d241b099d11084b5'], 'We propose a new framework to reconstruct building details by automatically assembling 3D templates on coarse textured building models. In a preprocessing step, we generate an initial coarse model to approximate a point cloud computed using Structure from Motion and Multi View Stereo, and we model a set of 3D templates of facade details. Next, we optimize the initial coarse model to enforce consistency between geometry and appearance (texture images). Then, building details are reconstructed by assembling templates on the textured faces of the coarse model. The 3D templates are automatically chosen and located by our optimization-based template assembly algorithm that balances image matching and structural regularity. In the results, we demonstrate how our framework can enrich the details of coarse models using various data sets.'
],
['10754/621413', 3, 15, 18, '1fs6t6mP9puVpujbKPGWHpVC87oqVzIpp', '1MLwSpr3xB7VdUAJa5U0NnPjvDmn7Kw4x', 'SAR: Stroke Authorship Recognition', [12, 40, 0],
[0, 0, 0],
[],
[0, 'https://cemse.kaust.edu.sa/ivul/sar-dataset', 0, 0, 0, 0, 0], "Are simple strokes unique to the artist or designer who renders them? If so, can this idea be used to identify authorship or to classify artistic drawings? Also, could training methods be devised to develop particular styles? To answer these questions, we propose the Stroke Authorship Recognition (SAR) approach, a novel method that distinguishes the authorship of 2D digitized drawings. SAR converts a drawing into a histogram of stroke attributes that is discriminative of authorship. We provide extensive classification experiments on a large variety of data sets, which validate SAR's ability to distinguish unique authorship of artists and designers. We also demonstrate the usefulness of SAR in several applications including the detection of fraudulent sketches, the training and monitoring of artists in learning a particular new style and the first quantitative way to measure the quality of automatic sketch synthesis tools. © 2015 The Eurographics Association and John Wiley & Sons Ltd."
],
['10754/556124', 2, 15, 19, '11KjEP99o589F4AQitv01t5KymBSzRuc4', '1maUs6oPctHLD_-lIqvB9P5gmEhQTdw6B', 'Robust Visual Tracking via Exclusive Context Modeling', [77, 0, 75, 43, 67],
[0, 0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'In this paper, we formulate particle filter-based object tracking as an exclusive sparse learning problem that exploits contextual information. To achieve this goal, we propose the context-aware exclusive sparse tracker (CEST) to model particle appearances as linear combinations of dictionary templates that are updated dynamically. Learning the representation of each particle is formulated as an exclusive sparse representation problem, where the overall dictionary is composed of multiple group dictionaries that can contain contextual information. With context, CEST is less prone to tracker drift. Interestingly, we show that the popular L₁ tracker [1] is a special case of our CEST formulation. The proposed learning problem is efficiently solved using an accelerated proximal gradient method that yields a sequence of closed form updates. To make the tracker much faster, we reduce the number of learning problems to be solved by using the dual problem to quickly and systematically rank and prune particles in each frame. We test our CEST tracker on challenging benchmark sequences that involve heavy occlusion, drastic illumination changes, and large pose variations. Experimental results show that CEST consistently outperforms state-of-the-art trackers.'
],
[0, 2, 14, 1, '1K_J2S3r4O9CzHGin_L8cuEK0aLaBvwpJ', '1UXMh9DvAxqCTK8eyHaBx4fW3hr8KUUF4', 'Robust Visual Tracking Via Consistent Low-Rank Sparse Learning', [77, 75, 67, 63, 0],
[0, 0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'Object tracking is the process of determining the states of a target in consecutive video frames based on properties of motion and appearance consistency. In this paper, we propose a consistent low-rank sparse tracker (CLRST) that builds upon the particle filter framework for tracking. By exploiting temporal consistency, the proposed CLRST algorithm adaptively prunes and selects candidate particles. By using linear sparse combinations of dictionary templates, the proposed method learns the sparse representations of image regions corresponding to candidate particles jointly by exploiting the underlying low-rank constraints. In addition, the proposed CLRST algorithm is computationally attractive since temporal consistency property helps prune particles and the low-rank minimization problem for learning joint sparse representations can be efficiently solved by a sequence of closed form update operations. We evaluate the proposed CLRST algorithm against 14 state-of-the-art tracking methods on a set of 25 challenging image sequences. Experimental results show that the CLRST algorithm performs favorably against state-of-the-art tracking methods in terms of accuracy and execution time.'
],
[0, 1, 14, 10, '17zqG6Va-QoiPRQfobydlHUFGBncGyOvm', '1P5NsvzYHEYHnhCILnQ8tyt_dWedNEz0B', 'Improving Head And Body Pose Estimation Through Semi-Supervised Manifold Alignment', [38, 52, 0, 67, 53],
[0, 0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'In this paper, we explore the use of a semi-supervised manifold alignment method for domain adaptation in the context of human body and head pose estimation in videos. We build upon an existing state-of-the-art system that leverages on external labelled datasets for the body and head features, and on the unlabeled test data with weak velocity labels to do a coupled estimation of the body and head pose. While this previous approach showed promising results, the learning of the underlying manifold structure of the features in the train and target data and the need to align them were not explored despite the fact that the pose features between two datasets may vary according to the scene, e.g. due to different camera point of view or perspective. In this paper, we propose to use a semi-supervised manifold alignment method to bring the train and target samples closer within the resulting embedded space. To this end, we consider an adaptation set from the target data and rely on (weak) labels, given for example by the velocity direction whenever they are reliable. These labels, along with the training labels are used to bias the manifold distance within each manifold and to establish correspondences for alignment.'
],
['10754/556148', 2, 14, 10, '1sNVnIqs7YW4Wpdogb110LQ5Rcd2uvIH5', '1dNY5uvr7g2oCuKoCUXlKU0mA15MXnucM', '3D Aware Correction and Completion of Depth Maps in Piecewise Planar Scenes', [1, 7, 45, 0],
[0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'RGB-D sensors are popular in the computer vision community, especially for problems of scene understanding, semantic scene labeling, and segmentation. However, most of these methods depend on reliable input depth measurements, while discarding unreliable ones. This paper studies how reliable depth values can be used to correct the unreliable ones, and how to complete (or extend) the available depth data beyond the raw measurements of the sensor (i.e. infer depth at pixels with unknown depth values), given a prior model on the 3D scene. We consider piecewise planar environments in this paper, since many indoor scenes with man-made objects can be modeled as such. We propose a framework that uses the RGB-D sensor’s noise profile to adaptively and robustly fit plane segments (e.g. floor and ceiling) and iteratively complete the depth map, when possible. Depth completion is formulated as a discrete labeling problem (MRF) with hard constraints and solved efficiently using graph cuts. To regularize this problem, we exploit 3D and appearance cues that encourage pixels to take on depth values that will be compatible in 3D to the piecewise planar assumption. Extensive experiments, on a new large-scale and challenging dataset, show that our approach results in more accurate depth maps (with 20 % more depth values) than those recorded by the RGB-D sensor. Additional experiments on the NYUv2 dataset show that our method generates more 3D aware depth. These generated depth maps can also be used to improve the performance of a state-of-the-art RGB-D SLAM method.'
],
[0, 1, 14, 10, '1WR_YeQkYtO06gHu7k9LUcDvEOF4JdN2v', '1sihKksW5opuFAINS2-7K2LBBia0Gb08f', 'Camera Motion and Surrounding Scene Appearance as Context for Action Recognition', [14, 1, 25, 0],
[0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'Abstract. RGB-D sensors are popular in the computer vision community, especially for problems of scene understanding, semantic scene labeling, and segmentation. However, most of these methods depend on reliable input depth measurements. The reliability of these depth values deteriorates significantly with distance. In practice, unreliable depth measurements are discarded, thus, limiting the performance of methods that use RGB-D data. This paper studies how reliable depth values can be used to correct the unreliable ones, and how to complete (or extend) the available depth data beyond the raw measurements of the sensor (i.e. infer depth at pixels with unknown depth values), given a prior model on the 3D scene. We consider piecewise planar environments in this paper, since many indoor scenes with man-made objects can be modeled as such. We propose a framework that uses the RGB-D sensor’s noise profile to adaptively and robustly fit plane segments (e.g. floor and ceiling) and iteratively complete the depth map, when possible. Depth completion is formulated as a discrete labeling problem (MRF) with hard constraints and solved efficiently using graph cuts. To regularize this problem, we exploit 3D and appearance cues that encourage pixels to take on depth values that will be compatible in 3D to the piecewise planar assumption. Extensive experiments, on a new large-scale and challenging dataset, show that our approach results in more accurate depth maps (with 20% more depth values) than those recorded by the RGB-D sensor. Additional experiments on the NYUv2 dataset show that our method generates more 3D aware depth. These generated depth maps can also be used to improve the performance of a state-of-the-art RGB-D SLAM method.'
],
[0, 3, 14, 10, '1whzdguQApivG5EdMHhnmJwtm_MFP4SKQ', '1IC0w58Qe2RlABr-e4wbF9-BPQb-5Wv-s', 'Improving Saliency Models by Predicting Human Fixation Patches', [70, 37, 0],
[0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'There is growing interest in studying the Human Visual System (HVS) to supplement and improve the performance of computer vision tasks. A major challenge for current visual saliency models is predicting saliency in cluttered scenes (i.e. high false positive rate). In this paper, we propose a fixation patch detector that predicts image patches that contain human fixations with high probability. Our proposed model detects sparse fixation patches with an accuracy of 84% and eliminates non-fixation patches with an accuracy of 84%, demonstrating that low-level image features can indeed be used to short-list and identify human fixation patches. We then show how these detected fixation patches can be used as saliency priors for popular saliency models, thus, reducing false positives while maintaining true positives. Extensive experimental results show that our proposed approach allows state-of-the-art saliency methods to achieve better prediction performance on benchmark datasets.'
],
['10754/562405', 2, 13, 1, '1O3KDJmXRVkfYz5VUrkW-b3I1-nZ8oqbg', '1QJgr-nEavVJeY6soN52upB-3vnrimH6a', 'Robust Visual Tracking via Structured Multi-Task Sparse Learning', [77, 0, 75, 67],
[0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'In this paper, we formulate object tracking in a particle filter framework as a structured multi-task sparse learning problem, which we denote as Structured Multi-Task Tracking (S-MTT). Since we model particles as linear combinations of dictionary templates that are updated dynamically, learning the representation of each particle is considered a single task in Multi-Task Tracking (MTT). By employing popular sparsity-inducing lp,q mixed norms (specifically p ∈ {2, ∞} and q = 1), we regularize the representation problem to enforce joint sparsity and learn the particle representations together. As compared to previous methods that handle particles independently, our results demonstrate that mining the interdependencies between particles improves tracking performance and overall computational complexity. Interestingly, we show that the popular L1 tracker (Mei and Ling, IEEE Trans Pattern Anal Mach Intel 33(11):2259–2272, 2011) is a special case of our MTT formulation (denoted as the L11 tracker) when p = q = 1. Under the MTT framework, some of the tasks (particle representations) are often more closely related and more likely to share common relevant covariates than other tasks. Therefore, we extend the MTT framework to take into account pairwise structural correlations between particles (e.g. spatial smoothness of representation) and denote the novel framework as S-MTT. The problem of learning the regularized sparse representation in MTT and S-MTT can be solved efficiently using an Accelerated Proximal Gradient (APG) method that yields a sequence of closed form updates. As such, S-MTT and MTT are computationally attractive. We test our proposed approach on challenging sequences involving heavy occlusion, drastic illumination changes, and large pose variations. Experimental results show that S-MTT is much better than MTT, and both methods consistently outperform state-of-the-art trackers.'
],
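// Aside (illustrative comment only): the closed-form APG updates mentioned above hinge on the
// proximal operator of the lp,q mixed norm; for p = 2, q = 1 this is row-wise group
// soft-thresholding. A minimal sketch under that assumption, with our own function name.
/*
// Proximal operator of tau * ||X||_{2,1}: shrink each row of X toward zero in its l2 norm.
function proxL21(rows, tau) {
  return rows.map(row => {
    const norm = Math.sqrt(row.reduce((s, v) => s + v * v, 0));
    const scale = norm > tau ? (norm - tau) / norm : 0;
    return row.map(v => scale * v);
  });
}
*/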
['10754/575813', 1, 13, 3, '1ZGpL7wU5Z6F4xljlWM6qLtaRtsWKU7Vg', '1tV9N12QW0c4jXR6dEjRjyswR9kALIoSa', 'Automatic Recognition of Offensive Team Formation in American Football Plays', [51, 0, 73, 56, 67],
[0, 0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], "Compared to security surveillance and military applications, where automated action analysis is prevalent, the sports domain is extremely under-served. Most existing software packages for sports video analysis require manual annotation of important events in the video. American football is the most popular sport in the United States, however most game analysis is still done manually. Line of scrimmage and offensive team formation recognition are two statistics that must be tagged by American Football coaches when watching and evaluating past play video clips, a process which takes many man hours per week. These two statistics are also the building blocks for more high-level analysis such as play strategy inference and automatic statistic generation. In this paper, we propose a novel framework where given an American football play clip, we automatically identify the video frame in which the offensive team lines in formation (formation frame), the line of scrimmage for that play, and the type of player formation the offensive team takes on. The proposed framework achieves 95% accuracy in detecting the formation frame, 98% accuracy in detecting the line of scrimmage, and up to 67% accuracy in classifying the offensive team's formation. To validate our framework, we compiled a large dataset comprising more than 800 play-clips of standard and high definition resolution from real-world football games. This dataset will be made publicly available for future comparison. © 2013 IEEE."
],
['10754/564735', 2, 13, 3, '1HLfBLdsu1RBSewAB1ab5D1xQhck5ADwl', '1_DNTC-f-WwG8OlWzeXefmcDTLyHbiR_4', 'Object Tracking by Occlusion Detection via Structured Sparse Learning', [77, 0, 43, 67],
[0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], "Sparse representation based methods have recently drawn much attention in visual tracking due to good performance against illumination variation and occlusion. They assume the errors caused by image variations can be modeled as pixel-wise sparse. However, in many practical scenarios these errors are not truly pixel-wise sparse but rather sparsely distributed in a structured way. In fact, pixels in error constitute contiguous regions within the object's track. This is the case when significant occlusion occurs. To accommodate for non-sparse occlusion in a given frame, we assume that occlusion detected in previous frames can be propagated to the current one. This propagated information determines which pixels will contribute to the sparse representation of the current track. In other words, pixels that were detected as part of an occlusion in the previous frame will be removed from the target representation process. As such, this paper proposes a novel tracking algorithm that models and detects occlusion through structured sparse learning. We test our tracker on challenging benchmark sequences, such as sports videos, which involve heavy occlusion, drastic illumination changes, and large pose variations. Experimental results show that our tracker consistently outperforms the state-of-the-art. © 2013 IEEE."
],
['10754/556147', 2, 13, 4, '1jTzAQZ6UHi-wnfNNVmqpFYW6TjWxbl-X', '1lJQkRzUEXY1KC-vByo3XhdxRFYzOc5ZN', 'Low-Rank Sparse Coding for Image Classification', [77, 0, 75, 43, 67],
[0, 0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'In this paper, we propose a low-rank sparse coding (LRSC) method that exploits local structure information among features in an image for the purpose of image-level classification. LRSC represents densely sampled SIFT descriptors, in a spatial neighborhood, collectively as low-rank, sparse linear combinations of code words. As such, it casts the feature coding problem as a low-rank matrix learning problem, which is different from previous methods that encode features independently. This LRSC has a number of attractive properties. (1) It encourages sparsity in feature codes, locality in codebook construction, and low-rankness for spatial consistency. (2) LRSC encodes local features jointly by considering their low-rank structure information, and is computationally attractive. We evaluate the LRSC by comparing its performance on a set of challenging benchmarks with that of 7 popular coding and other state-of-the-art methods. Our experiments show that by representing local features jointly, LRSC not only outperforms the state-of-the-art in classification accuracy but also improves the time complexity of methods that use a similar sparse linear representation model for feature coding.'
],
[0, 1, 13, 12, '1IiFYXiUcH-wsZfBzGwvLwnNc8PF9SlD9', '1sIa-NFqaQ9iGfVQLeKK9AdUcQxvVhZfq', 'A Topic Model Approach to Represent and Classify American Football Plays', [52, 51, 73, 0, 67],
[0, 0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'We address the problem of modeling and classifying American Football offense teams’ plays in video, a challenging example of group activity analysis. Automatic play classification will allow coaches to infer patterns and tendencies of opponents more efficiently, resulting in better strategy planning in a game. We define a football play as a unique combination of player trajectories. To this end, we develop a framework that uses player trajectories as inputs to MedLDA, a supervised topic model. The joint maximization of both likelihood and inter-class margins of MedLDA in learning the topics allows us to learn semantically meaningful play type templates, as well as classify different play types with 70% average accuracy. Furthermore, this method is extended to analyze individual player roles in classifying each play type. We validate our method on a large dataset comprising 271 play clips from real-world football games, which will be made publicly available for future comparisons.'
],
[0, 3, 13, 17, '1vbBjTCtPm5Quprr181psaJgL1Y2VOsXr', '1ux336oOu3lZCy1if5mmUAU5nRMfdF0vt', 'BILGO: Bilateral Greedy Optimization for Large Scale Semidefinite Programming', [17, 0, 86, 87],
[0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'Many machine learning tasks (e.g. metric and manifold learning problems) can be formulated as convex semidefinite programs. To enable the application of these tasks on a large scale, scalability and computational efficiency are considered desirable properties for a practical semidefinite programming algorithm. In this paper, we theoretically analyse a new bilateral greedy optimization (denoted BILGO) strategy in solving general semidefinite programs on large-scale datasets. As compared to existing methods, BILGO employs a bilateral search strategy during each optimization iteration. In such an iteration, the current semidefinite matrix solution is updated as a bilateral linear combination of the previous solution and a suitable rank-1 matrix, which can be efficiently computed from the leading eigenvector of the descent direction at this iteration. By optimizing for the coefficients of the bilateral combination, BILGO reduces the cost function in every iteration until the KKT conditions are fully satisfied, thus, it tends to converge to a global optimum. For an ε-accurate solution, we prove the number of BILGO iterations needed for convergence is O(1/ε). The algorithm thus successfully combines the efficiency of conventional rank-1 update algorithms and the effectiveness of gradient descent. Moreover, BILGO can be easily extended to handle low rank constraints. To validate the effectiveness and efficiency of BILGO, we apply it to two important machine learning tasks, namely Mahalanobis metric learning and maximum variance unfolding. Extensive experimental results clearly demonstrate that BILGO can solve large-scale semidefinite programs efficiently.'
],
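// Aside (illustrative comment only): BILGO's rank-1 update is built from the leading eigenvector
// of the descent direction. A minimal power-iteration sketch for a small dense symmetric matrix;
// the paper targets large-scale problems, so treat this purely as an illustration with our names.
/*
// Leading eigenvector of a symmetric matrix A given as an array of rows.
function leadingEigenvector(A, iters = 100) {
  let v = A.map(() => Math.random());
  for (let t = 0; t < iters; t++) {
    const w = A.map(row => row.reduce((s, a, j) => s + a * v[j], 0));  // w = A v
    const norm = Math.sqrt(w.reduce((s, x) => s + x * x, 0)) || 1;
    v = w.map(x => x / norm);
  }
  return v;
}
*/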
['10754/564560', 2, 12, 2, '1U9WLbY1uP4laarG4apAlBTW7k6OAjYz2', '1I8D9vk56o7smg9TvJT54GBI9xps0jOnQ', 'Robust Visual Tracking via Multi-Task Sparse Learning', [77, 0, 75, 67],
[0, 0, 0, 0],
[2],
[0, 0, 0, 0, 0, 0, 0], 'In this paper, we formulate object tracking in a particle filter framework as a multi-task sparse learning problem, which we denote as Multi-Task Tracking (MTT). Since we model particles as linear combinations of dictionary templates that are updated dynamically, learning the representation of each particle is considered a single task in MTT. By employing popular sparsity-inducing lp,q mixed norms (specifically p ∈ {2, ∞} and q = 1), we regularize the representation problem to enforce joint sparsity and learn the particle representations together. As compared to previous methods that handle particles independently, our results demonstrate that mining the interdependencies between particles improves tracking performance and overall computational complexity. Interestingly, we show that the popular L1 tracker [15] is a special case of our MTT formulation (denoted as the L11 tracker) when p = q = 1. The learning problem can be efficiently solved using an Accelerated Proximal Gradient (APG) method that yields a sequence of closed form updates. As such, MTT is computationally attractive. We test our proposed approach on challenging sequences involving heavy occlusion, drastic illumination changes, and large pose variations. Experimental results show that MTT methods consistently outperform state-of-the-art trackers. © 2012 IEEE.'
],
['10754/564495', 2, 12, 6, '1nWCT84Qtw1YW4UY3pnARfas6RL7ZohrW', '1A1vVO2HEjD4KxqGoi7Cjs9UGeXourFhb', 'Low-Rank Sparse Learning for Robust Visual Tracking', [77, 0, 75, 67],
[0, 0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'In this paper, we propose a new particle-filter based tracking algorithm that exploits the relationship between particles (candidate targets). By representing particles as sparse linear combinations of dictionary templates, this algorithm capitalizes on the inherent low-rank structure of particle representations that are learned jointly. As such, it casts the tracking problem as a low-rank matrix learning problem. This low-rank sparse tracker (LRST) has a number of attractive properties. (1) Since LRST adaptively updates dictionary templates, it can handle significant changes in appearance due to variations in illumination, pose, scale, etc. (2) The linear representation in LRST explicitly incorporates background templates in the dictionary and a sparse error term, which enables LRST to address the tracking drift problem and to be robust against occlusion respectively. (3) LRST is computationally attractive, since the low-rank learning problem can be efficiently solved as a sequence of closed form update operations, which yield a time complexity that is linear in the number of particles and the template size. We evaluate the performance of LRST by applying it to a set of challenging video sequences and comparing it to 6 popular tracking methods. Our experiments show that by representing particles jointly, LRST not only outperforms the state-of-the-art in tracking accuracy but also significantly improves the time complexity of methods that use a similar sparse linear representation model for particles [1]. © 2012 Springer-Verlag.'
],
[0, 3, 12, 13, '1qHb2K-ABSRaIfRj9mVgV5P3qvBaoqjGN', '1Gg7u3qegUtAehXiEzcx8exXAO5wTuQ8x', 'Do Humans Fixate on Interest Points?', [37, 70, 0],
[1, 1, 0],
[2],
[0, 0, 0, 0, 0, 0, 0], 'Interest point detectors (e.g. SIFT, SURF, and MSER) have been successfully applied to numerous applications in high level computer vision tasks such as object detection, and image classification. Despite their popularity, the perceptual relevance of these detectors has not been thoroughly studied. Here, perceptual relevance is meant to define the correlation between these point detectors and free-viewing human fixations on images. In this work, we provide empirical evidence to shed light on the fundamental question: “Do humans fixate on interest points in images?”. We believe that insights into this question may play a role in improving the performance of vision systems that utilize these interest point detectors. We conduct an extensive quantitative comparison between the spatial distributions of human fixations and automatically detected interest points on a recently released dataset of 1003 images. This comparison is done at both the global (image) level as well as the local (region) level. Our experimental results show that there exists a weak correlation between the spatial distributions of human fixation and interest points.'
],
[0, 1, 12, 13, '19YxaTQca7x_cmTT2n_jEZsHdOSPtxwpn', '1rrojnprXHBJ56dxud-T4ZomjcOezwo4U', 'Context-Aware Learning For Automatic Sports Highlight Recognition', [0, 62, 59, 77],
[0, 1, 1, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'Video highlight recognition is the procedure in which a long video sequence is summarized into a shorter video clip that depicts the most “salient” parts of the sequence. It is an important technique for content delivery systems and search systems which create multimedia content tailored to their users’ needs. This paper deals specifically with capturing highlights inherent to sports videos, especially for American football. Our proposed system exploits the multimodal nature of sports videos (i.e. visual, audio, and text cues) to detect the most important segments among them. The optimal combination of these cues is learned in a data-driven fashion using user preferences (expert input) as ground truth. Unlike most highlight recognition systems in the literature that define a highlight to be salient only in its own right (globally salient), we also consider the context of each video segment w.r.t. the video sequence it belongs to (locally salient). To validate our method, we compile a large dataset of broadcast American football videos, acquire their ground truth highlights, and evaluate the performance of our learning approach.'
],
[0, 1, 12, 13, '1IJFBtpTfwq3D41MfnNa_mDH-nU-pMfU5', '1PZKjWK3rT4xDe3amIjZ2dsFYbETr96_i', 'Trajectory-based Fisher Kernel Representation for Action Recognition in Videos', [51, 0, 67],
[0, 0, 0],
[2],
[0, 0, 0, 0, 0, 0, 0], 'Action recognition is an important computer vision problem that has many applications including video indexing and retrieval, event detection, and video summarization. In this paper, we propose to apply the Fisher kernel paradigm to action recognition. The Fisher kernel framework combines the strengths of generative and discriminative models. In this approach, given the trajectories extracted from a video and a generative Gaussian Mixture Model (GMM), we use the Fisher Kernel method to describe how much the GMM parameters are modified to best fit the video trajectories. We experiment with using the Fisher Kernel vector to create the video representation and to train an SVM classifier. We further extend our framework to select the most discriminative trajectories using a novel MIL-KNN framework. We compare the performance of our approach to the current state-of-the-art bag-of-features (BOF) approach on two benchmark datasets. Experimental results show that our proposed approach outperforms the state-of-the-art method [8] and that the selected discriminative trajectories are descriptive of the action class.'
],
[0, 3, 12, 14, '1dMK-4NEeaORio9GtxGxqZfye0yV93_be', '1C3zQIz5vcIolThs_yGtB4UCDulYG_hmH', 'Modeling Dynamic Swarms', [0, 67],
[0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'This paper proposes the problem of modeling video sequences of dynamic swarms (DSs). We define a DS as a large layout of stochastically repetitive spatial configurations of dynamic objects (swarm elements) whose motions exhibit local spatiotemporal interdependency and stationarity, i.e., the motions are similar in any small spatiotemporal neighborhood. Examples of DS abound in nature, e.g., herds of animals and flocks of birds. To capture the local spatiotemporal properties of the DS, we present a probabilistic model that learns both the spatial layout of swarm elements (based on low-level image segmentation) and their joint dynamics that are modeled as linear transformations. To this end, a spatiotemporal neighborhood is associated with each swarm element, in which local stationarity is enforced both spatially and temporally. We assume that the prior on the swarm dynamics is distributed according to an MRF in both space and time. Embedding this model in a MAP framework, we iterate between learning the spatial layout of the swarm and its dynamics. We learn the swarm transformations using ICM, which iterates between estimating these transformations and updating their distribution in the spatiotemporal neighborhoods. We demonstrate the validity of our method by conducting experiments on real and synthetic video sequences. Real sequences of birds, geese, robot swarms, and pedestrians evaluate the applicability of our model to real world data.'
],
[0, 1, 12, 15, '1GiHMRifw0k9X1-2doFlE_3_SC5LMKkBY', '1_6j4CGoJrzWhY6rRRrB27yqXPf3VUtc7', 'Robust Video Registration Applied to Field-Sports Video Analysis', [0, 77, 67],
[0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'Video (image-to-image) registration is a fundamental problem in computer vision. Registering video frames to the same coordinate system is necessary before meaningful inference can be made from a dynamic scene in the presence of camera motion. Standard registration techniques detect specific structures (e.g. points and lines), find potential correspondences, and use a random sampling method to choose inlier correspondences. Unlike these standards, we propose a parameter-free, robust registration method that avoids explicit structure matching by matching entire images or image patches. We frame the registration problem in a sparse representation setting, where outlier pixels are assumed to be sparse in an image. Here, robust video registration (RVR) becomes equivalent to solving a sequence of l1 minimization problems, each of which can be solved using the Inexact Augmented Lagrangian Method (IALM). Our RVR method is made efficient (sublinear complexity in the number of pixels) by exploiting a hybrid coarse-to-fine and random sampling strategy along with the temporal smoothness of camera motion. We showcase RVR in the domain of sports videos, specifically American football. Our experiments on real-world data show that RVR outperforms standard methods and is useful in several applications (e.g. automatic panoramic stitching and non-static background subtraction).'
],
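// Aside (illustrative comment only): IALM-style l1 minimization, as used above, alternates
// closed-form updates whose core is elementwise soft-thresholding (shrinkage). A minimal sketch;
// the function name is ours, not from the authors' code.
/*
// Proximal operator of tau * ||x||_1 applied elementwise to an array.
function softThreshold(x, tau) {
  return x.map(v => Math.sign(v) * Math.max(Math.abs(v) - tau, 0));
}
*/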
[0, 2, 12, 15, '1lqshI5YYIrpkejiczVQioxOPFB5_z5Aw', '1YmS6W6XP058Kov768jY-f75UDPMfI9r7', 'Robust multi-object tracking via cross-domain contextual information for sports video analysis', [77, 0, 67],
[0, 0, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'Multiple player tracking is one of the main building blocks needed in a sports video analysis system. In an uncalibrated camera setting, robust multi-object tracking can be very difficult due to a number of reasons including the presence of noise, occlusion, fast camera motion, low-resolution image capture, varying viewpoints and illumination changes. To address the problem of multi-object tracking in sports videos, we go beyond the video frame domain and make use of information in a homography transform domain that is denoted the homography field domain. We propose a novel particle filter based tracking algorithm that uses both object appearance information (e.g. color and shape) in the image domain and cross-domain contextual information in the field domain to improve object tracking. In the field domain, the effect of fast camera motion is significantly alleviated since the underlying homography transform from each frame to the field domain can be accurately estimated. We use contextual trajectory information (intra-trajectory and inter-trajectory context) to further improve the prediction of object states within a particle filter framework. Here, intra-trajectory contextual information is based on historical tracking results in the field domain, while inter-trajectory contextual information is extracted from a compiled trajectory dataset based on tracks computed from videos depicting the same sport. Experimental results on real world sports data show that our system is able to effectively and robustly track a variable number of targets regardless of background clutter, camera motion and frequent mutual occlusion between targets.'
],
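// Aside (illustrative comment only): the field-domain reasoning above relies on mapping image
// points through a frame-to-field homography. A minimal sketch of that projective mapping;
// the row-major 3x3 layout is our own assumption.
/*
// Apply a 3x3 homography H (array of 9 numbers, row-major) to an image point (x, y).
function toFieldDomain(H, x, y) {
  const w = H[6] * x + H[7] * y + H[8];
  return [(H[0] * x + H[1] * y + H[2]) / w,
          (H[3] * x + H[4] * y + H[5]) / w];
}
*/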
['10754/562701', 3, 13, 17, '1aaM8RXk9US-MQoG3PVAJF7SI2lCP1sIa', '1eg5fpAx3rSmsAs2aOwKqPWIJigXND88-', 'Low-Rank Quadratic Semidefinite Programming', [17, 86, 0, 87],
[0, 1, 1, 0],
[],
[0, 0, 0, 0, 0, 0, 0], 'Low rank matrix approximation is an attractive model in large scale machine learning problems, because it can not only reduce the memory and runtime complexity, but also provide a natural way to regularize parameters while preserving learning accuracy. In this paper, we address a special class of nonconvex quadratic matrix optimization problems, which require a low rank positive semidefinite solution. Despite their non-convexity, we exploit the structure of these problems to derive an efficient solver that converges to their local optima. Furthermore, we show that the proposed solution is capable of dramatically enhancing the efficiency and scalability of a variety of concrete problems, which are of significant interest to the machine learning community. These problems include the Top-k Eigenvalue problem, Distance learning and Kernel learning. Extensive experiments on UCI benchmarks have shown the effectiveness and efficiency of our proposed method. © 2012.'
]
]
};