Action Recognition with an Inflated 3D CNN

hhaemin/I3D

Analysis of several action classification papers, with an implementation of the technique.

Video Recognition

  • Video data can be decomposed into spatial and temporal components.
  • The spatial component carries information about the scenes and objects depicted in the video.
  • The temporal component carries information about the motion of the observer (the camera) and of the objects.

Paper Analysis

  1. 2D ConvNet + LSTM
  • Applies the image classification recipe directly: features are extracted from each frame independently, and the per-frame predictions are pooled over the whole video.
  • This is a bag-of-words style of image modeling, so it struggles to represent temporal structure (the order of frames).
  • Adding a recurrent layer such as an LSTM lets the model capture temporal ordering and long-range dependencies.
  • An LSTM is a type of RNN explicitly designed to avoid the long-term dependency problem, so it can learn tasks that require remembering information over long spans.
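The difference between pooling and recurrence can be illustrated with a small sketch. The feature values and the `recurrent_summary` helper are hypothetical, and the helper is a toy stand-in for a recurrent layer, not an LSTM. The point is only that averaging ignores frame order while a recurrent update does not.

```python
def mean_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


def recurrent_summary(frame_features, decay=0.5):
    """Toy recurrent update h_t = decay * h_{t-1} + x_t (illustrative, not an LSTM)."""
    h = [0.0] * len(frame_features[0])
    for x in frame_features:
        h = [decay * h_d + x_d for h_d, x_d in zip(h, x)]
    return h


# Hypothetical 2-D features for three frames of a clip.
frames = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
reversed_frames = frames[::-1]  # the same frames played backwards

pooled_forward = mean_pool(frames)
pooled_backward = mean_pool(reversed_frames)
state_forward = recurrent_summary(frames)
state_backward = recurrent_summary(reversed_frames)

print("pooling sees a difference:", pooled_forward != pooled_backward)
print("recurrence sees a difference:", state_forward != state_backward)
```

With these values the pooled vectors are identical for both playback directions, while the recurrent states differ, which is why actions that are distinguished mainly by direction are hard for a pure pooling model.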
  2. 3D ConvNet
  • Adding spatio-temporal filters to a standard convolutional network lets it build hierarchical representations of spatio-temporal data.
  • Compared with a 2D ConvNet, however, the extra kernel dimension sharply increases the number of parameters, which makes the network harder to train.
  • C3D variant: batch normalization (BN) is applied after every conv and fc layer. BN makes training easier; it does not reduce the parameter count.
  • The temporal stride of the first pooling layer is also raised from 1 to 2, which shrinks the temporal dimension early and lowers memory use.
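To make the parameter growth concrete, the sketch below counts the learnable parameters of a single convolution layer with a 2D and a 3D kernel. The channel sizes (64 in, 128 out) are illustrative assumptions, not values taken from C3D or from this repository.

```python
def conv_params(kernel_dims, in_channels, out_channels, bias=True):
    """Count the learnable parameters of a convolution with the given kernel shape."""
    weights = in_channels * out_channels
    for k in kernel_dims:
        weights *= k
    return weights + (out_channels if bias else 0)


# Assumed layer sizes, chosen only for illustration.
params_2d = conv_params((3, 3), in_channels=64, out_channels=128)
params_3d = conv_params((3, 3, 3), in_channels=64, out_channels=128)

print(params_2d)  # 73856
print(params_3d)  # 221312
```

Going from a 3x3 to a 3x3x3 kernel triples the weight count of this layer, and the same factor applies to every convolution in the network.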
  3. Two-Stream
  • A ConvNet + LSTM can model high-level variation but has difficulty capturing low-level motion.
  • It is also computationally expensive to train, because backpropagation has to run through many frames.
  • A Two-Stream network classifies a short temporal snapshot of the video by averaging the predictions of two streams: one takes a single RGB frame, the other a stack of N externally computed optical flow frames.
  • Each optical flow frame has two channels, horizontal and vertical, so the input conv layer of the flow stream takes twice as many input channels as there are flow frames.
  • At test time, multiple snapshots are sampled from the video and their predictions are averaged.
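The averaging steps described above can be sketched in plain Python. The logits are made-up numbers for a three-class problem, and the function names are hypothetical helpers introduced for this example. A real model would produce the logits from the RGB and flow networks.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def flow_input_channels(num_flow_frames):
    """Each flow frame contributes a horizontal and a vertical channel."""
    return 2 * num_flow_frames


def fuse_two_stream(rgb_logits, flow_logits):
    """Late fusion: average the class probabilities of the two streams."""
    rgb_probs = softmax(rgb_logits)
    flow_probs = softmax(flow_logits)
    return [(r + f) / 2 for r, f in zip(rgb_probs, flow_probs)]


def average_over_snapshots(per_snapshot_probs):
    """Test-time averaging over several snapshots sampled from one video."""
    n = len(per_snapshot_probs)
    num_classes = len(per_snapshot_probs[0])
    return [sum(p[c] for p in per_snapshot_probs) / n for c in range(num_classes)]


# Made-up (rgb_logits, flow_logits) pairs for two snapshots of one video.
snapshots = [
    ([2.0, 1.0, 0.1], [0.5, 2.5, 0.3]),
    ([1.5, 1.2, 0.2], [0.2, 2.0, 0.1]),
]
fused = [fuse_two_stream(rgb, flow) for rgb, flow in snapshots]
video_probs = average_over_snapshots(fused)
predicted_class = video_probs.index(max(video_probs))

print(flow_input_channels(10))  # 20 input channels for a stack of 10 flow frames
print(predicted_class)
```

In this example the RGB stream alone favors class 0, but the flow stream is confident about class 1, so the fused prediction is class 1.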
  4. 3D-Fused Two-Stream
  • Fuses the spatial and flow streams of the Two-Stream network after the last conv layer.
  • The fused features pass through a 3D conv layer with a 3x3x3 kernel over (time, x, y) and 512 output channels,
  • followed by a 3x3x3 3D max-pooling layer and an fc layer.
  • Both the Two-Stream and the 3D-Fused Two-Stream models are trained end-to-end.
  5. Inflating 2D ConvNets into 3D (I3D)
  • Uses a very simple method to convert a 2D ConvNet into a 3D ConvNet.
  • Every filter and pooling kernel gains a temporal dimension, so an NxN kernel becomes NxNxN.
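Inflation can be sketched on a single filter. The I3D paper also initializes the inflated filter from pretrained 2D weights by repeating them N times along the time axis and dividing by N; the code below follows that idea. It is a minimal sketch on nested lists, with hypothetical helper names and made-up values, not the implementation used in this repository.

```python
import math


def inflate_filter(filter_2d, time_size):
    """Repeat an N x N filter time_size times along time and rescale by 1/time_size."""
    return [[[w / time_size for w in row] for row in filter_2d]
            for _ in range(time_size)]


def response_2d(filter_2d, patch):
    """Response of a 2D filter on one image patch of the same size."""
    return sum(w * p
               for f_row, p_row in zip(filter_2d, patch)
               for w, p in zip(f_row, p_row))


def response_3d(filter_3d, clip):
    """Response of a 3D filter on a clip: sum of the per-frame 2D responses."""
    return sum(response_2d(f_slice, frame)
               for f_slice, frame in zip(filter_3d, clip))


# A made-up 3x3 filter and image patch.
filter_2d = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
patch = [[3.0, 1.0, 0.0], [4.0, 1.0, 1.0], [5.0, 2.0, 1.0]]

filter_3d = inflate_filter(filter_2d, time_size=3)  # 3x3 -> 3x3x3
static_clip = [patch] * 3  # the same frame repeated: a video with no motion

same_response = math.isclose(response_3d(filter_3d, static_clip),
                             response_2d(filter_2d, patch))
print(same_response)
```

The 1/N rescaling means the inflated filter responds to a motionless clip (one image repeated over time) the same way the 2D filter responds to that image, which is what lets 2D pretrained weights serve as a starting point.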

Dataset

UCF-101

  • UCF101 contains 13,320 videos across 101 action categories. It offers great diversity in actions, with large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, and illumination conditions. Its action categories fall into five types:
  • Human-Object Interaction
  • Body-Motion Only
  • Human-Human Interaction
  • Playing Musical Instruments
  • Sports

Conclusion

  • Among the architectures reviewed, the Two-Stream I3D model performed best.

  • The open-source baseline used Kinetics-400; this project switched it to Kinetics-600.

  • Predictions were more accurate with 200 input frames than with 50.

  • The predicted values changed with the camera angle and position in the video.

  • The model still falls short of producing accurate predictions. This may be partly because the project relied on a model that is already old for a field that advances as quickly as deep learning.
