Available models

The algorithms in zamba are designed to identify species of animals that appear in camera trap videos. There are three models that ship with the zamba package: time_distributed, slowfast, and european. For more details of each, read on!

Model summary

| Model | Geography | Relative strengths | Architecture |
| --- | --- | --- | --- |
| time_distributed | Central and West Africa | Better than slowfast at duikers, chimps, gorillas, and other larger species | Image-based TimeDistributedEfficientNet |
| slowfast | Central and West Africa | Better than time_distributed at blank detection and small species detection | Video-native SlowFast |
| european | Western Europe | Trained on non-jungle ecologies | Finetuned time_distributed model |

All models support training, fine-tuning, and inference. For fine-tuning, we recommend using the time_distributed model as the starting point.

What species can zamba detect?

time_distributed and slowfast are both trained to identify 32 common species from Central and West Africa. The output labels in these models are:

  • aardvark
  • antelope_duiker
  • badger
  • bat
  • bird
  • blank
  • cattle
  • cheetah
  • chimpanzee_bonobo
  • civet_genet
  • elephant
  • equid
  • forest_buffalo
  • fox
  • giraffe
  • gorilla
  • hare_rabbit
  • hippopotamus
  • hog
  • human
  • hyena
  • large_flightless_bird
  • leopard
  • lion
  • mongoose
  • monkey_prosimian
  • pangolin
  • porcupine
  • reptile
  • rodent
  • small_cat
  • wild_dog_jackal

european is trained to identify common species in western Europe. The 12 possible class labels, which include blank and unidentified, are:

  • bird
  • blank
  • domestic_cat
  • european_badger
  • european_beaver
  • european_hare
  • european_roe_deer
  • north_american_raccoon
  • red_fox
  • unidentified
  • weasel
  • wild_boar

time_distributed model

Architecture

The time_distributed model was built by re-training a well-known image classification architecture called EfficientNetV2 (Tan, M., & Le, Q., 2021) to identify the species in our camera trap videos. EfficientNetV2 models are convolutional neural networks designed to jointly optimize model size and training speed. EfficientNetV2 is image native, meaning it classifies each frame separately when generating predictions. The model is wrapped in a TimeDistributed layer, which enables a single prediction per video.
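To make the idea of a time-distributed wrapper concrete, here is a minimal pure-Python sketch. It applies one per-frame model to every frame and then pools the per-frame scores into a single video-level prediction. The names `time_distributed` and `toy_frame_model` are hypothetical, and mean pooling is an assumption chosen for illustration; the head zamba actually uses to combine frames may differ.

```python
from statistics import fmean
from typing import Callable, List, Sequence

Frame = Sequence[float]   # stand-in for one image (a flat list of pixel values)
Scores = List[float]      # per-class scores for one frame or one video


def time_distributed(frame_model: Callable[[Frame], Scores],
                     frames: Sequence[Frame]) -> Scores:
    """Apply the same per-frame model to every frame, then pool over time."""
    per_frame = [frame_model(frame) for frame in frames]  # n_frames x n_classes
    # Mean-pool each class over the time axis (an assumed pooling choice).
    return [fmean(class_scores) for class_scores in zip(*per_frame)]


def toy_frame_model(frame: Frame) -> Scores:
    """Hypothetical 2-class model that scores a frame by its mean brightness."""
    brightness = fmean(frame)
    return [brightness, 1.0 - brightness]


video = [[0.2, 0.4], [0.6, 0.8], [1.0, 1.0]]  # three tiny "frames"
video_scores = time_distributed(toy_frame_model, video)
```

The key property is that the frame model never sees more than one frame at a time; only the pooling step mixes information across frames.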

Training data

time_distributed was trained using data collected and annotated by partners at The Max Planck Institute for Evolutionary Anthropology and Chimp & See.

The data included camera trap videos from:

| Country | Location |
| --- | --- |
| Cameroon | Campo Ma'an National Park |
| | Korup National Park |
| Central African Republic | Dzanga-Sangha Protected Area |
| Côte d'Ivoire | Comoé National Park |
| | Guiroutou |
| | Taï National Park |
| Democratic Republic of the Congo | Bili-Uele Protected Area |
| | Salonga National Park |
| Gabon | Loango National Park |
| | Lopé National Park |
| Guinea | Bakoun Classified Forest |
| | Moyen-Bafing National Park |
| Liberia | East Nimba Nature Reserve |
| | Grebo-Krahn National Park |
| | Sapo National Park |
| Mozambique | Gorongosa National Park |
| Nigeria | Gashaka-Gumti National Park |
| Republic of the Congo | Conkouati-Douli National Park |
| | Nouabale-Ndoki National Park |
| Senegal | Kayan |
| Tanzania | Grumeti Game Reserve |
| | Ugalla River National Park |
| Uganda | Budongo Forest Reserve |
| | Bwindi Forest National Park |
| | Ngogo and Kibale National Park |

Default configuration

The full default configuration is available on GitHub.

By default, an efficient object detection model called MegadetectorLite is run on all frames to determine which are the most likely to contain an animal. Then time_distributed is run on only the 16 frames with the highest predicted probability of detection. By default, videos are resized to 240x426 pixels following frame selection.

The default video loading configuration for time_distributed is:

video_loader_config:
  model_input_height: 240
  model_input_width: 426
  crop_bottom_pixels: 50
  fps: 4
  total_frames: 16
  ensure_total_frames: true
  megadetector_lite_config:
    confidence: 0.25
    fill_mode: score_sorted
    n_frames: 16

You can choose different frame selection methods and vary the size of the images that are used by passing in a custom YAML configuration file. The only requirement for the time_distributed model is that the video loader must return 16 frames.
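The requirement that the loader return exactly 16 frames can be illustrated with a small sketch. The helper `ensure_frame_count` below is hypothetical and not part of zamba; it assumes that extra frames are clipped from the end and that short videos are padded by repeating the last frame. How zamba itself behaves when `ensure_total_frames` is true may differ.

```python
from typing import List, TypeVar

T = TypeVar("T")


def ensure_frame_count(frames: List[T], total_frames: int = 16) -> List[T]:
    """Clip or pad a list of frames so it has exactly total_frames entries."""
    if not frames:
        raise ValueError("cannot pad an empty video")
    if len(frames) >= total_frames:
        # Too many frames: keep only the first total_frames.
        return frames[:total_frames]
    # Too few frames: repeat the last frame (assumed padding strategy).
    padding = [frames[-1]] * (total_frames - len(frames))
    return frames + padding


short_clip = ensure_frame_count(list(range(5)))    # padded up to 16 entries
long_clip = ensure_frame_count(list(range(20)))    # clipped down to 16 entries
```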

slowfast model

Architecture

The slowfast model was built by re-training a video classification backbone called SlowFast (Feichtenhofer, C., Fan, H., Malik, J., & He, K., 2019). SlowFast refers to the two model pathways involved: one that operates at a low frame rate to capture spatial semantics, and one that operates at a high frame rate to capture motion over time.

Figure: Architecture showing the two pathways of the slowfast model.

Source: Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6202-6211).

Unlike time_distributed, slowfast is video native. This means it takes into account the relationship between frames in a video, rather than running independently on each frame.
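As a rough illustration of the two pathways, the sketch below splits one clip into a densely sampled "fast" input and a sparsely sampled "slow" input. The function name `pack_pathways` and the ratio `alpha = 4` are assumptions made for this example; the ratio used by the slowfast model in zamba is not stated here.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def pack_pathways(frames: Sequence[T], alpha: int = 4) -> Tuple[List[T], List[T]]:
    """Split one clip into (slow, fast) pathway inputs.

    The fast pathway sees every frame; the slow pathway sees every
    alpha-th frame, so it runs at a lower effective frame rate.
    """
    if alpha < 1:
        raise ValueError("alpha must be a positive integer")
    fast = list(frames)
    slow = fast[::alpha]
    return slow, fast


clip = list(range(32))            # frame indices stand in for real frames
slow, fast = pack_pathways(clip)  # 8 slow frames, 32 fast frames
```

Because both pathways receive frames from the same clip in temporal order, the network can relate what an animal looks like (slow pathway) to how it moves (fast pathway).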

Training data

The slowfast model was trained using the same data as the time_distributed model.

Default configuration

The full default configuration is available on GitHub.

By default, an efficient object detection model called MegadetectorLite is run on all frames to determine which are the most likely to contain an animal. Then slowfast is run on only the 32 frames with the highest predicted probability of detection. By default, videos are resized to 240x426 pixels.

The full default video loading configuration is:

video_loader_config:
  model_input_height: 240
  model_input_width: 426
  crop_bottom_pixels: 50
  fps: 8
  total_frames: 32
  ensure_total_frames: true
  megadetector_lite_config:
    confidence: 0.25
    fill_mode: score_sorted
    n_frames: 32

You can choose different frame selection methods and vary the size of the images that are used by passing in a custom YAML configuration file. The slowfast model has two requirements:

  • the video loader must return 32 frames
  • videos passed to the model must be at least 200 x 200 pixels
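Before running slowfast with a custom configuration, it can help to check the two requirements up front. The helper below is a hypothetical sketch and not part of zamba (which may perform its own validation); it takes the `video_loader_config` values as a plain dictionary that mirrors the YAML keys shown above.

```python
from typing import Dict, List

SLOWFAST_FRAMES = 32      # the loader must return this many frames
SLOWFAST_MIN_SIDE = 200   # minimum height and width in pixels


def check_slowfast_config(video_loader_config: Dict[str, int]) -> List[str]:
    """Return a list of problems; an empty list means the config passes."""
    problems = []
    if video_loader_config.get("total_frames") != SLOWFAST_FRAMES:
        problems.append("total_frames must be 32")
    height = video_loader_config.get("model_input_height", 0)
    width = video_loader_config.get("model_input_width", 0)
    if min(height, width) < SLOWFAST_MIN_SIDE:
        problems.append("frames must be at least 200 x 200 pixels")
    return problems


default_config = {"model_input_height": 240, "model_input_width": 426, "total_frames": 32}
too_small = {"model_input_height": 128, "model_input_width": 128, "total_frames": 16}

default_problems = check_slowfast_config(default_config)
small_problems = check_slowfast_config(too_small)
```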

european model

Architecture

The european model starts from the trained time_distributed model, and then replaces and trains the final output layer to predict European species.

Training data

The european model is finetuned with data collected and annotated by partners at The Max Planck Institute for Evolutionary Anthropology. The finetuning data included camera trap videos from Hintenteiche bei Biesenbrow, Germany.

Default configuration

The full default configuration is available on GitHub.

The european model uses the same frame selection as the time_distributed model. By default, an efficient object detection model called MegadetectorLite is run on all frames to determine which are the most likely to contain an animal. Then european is run on only the 16 frames with the highest predicted probability of detection. By default, videos are resized to 240x426 pixels following frame selection.

The full default video loading configuration is:

video_loader_config:
  model_input_height: 240
  model_input_width: 426
  crop_bottom_pixels: 50
  fps: 4
  total_frames: 16
  ensure_total_frames: true
  megadetector_lite_config:
    confidence: 0.25
    fill_mode: score_sorted
    n_frames: 16

As with all models, you can choose different frame selection methods and vary the size of the images that are used by passing in a custom YAML configuration file. The only requirement for the european model is that the video loader must return 16 frames.

MegadetectorLite

Frame selection for video models is critical as it would be infeasible to train neural networks on all the frames in a video. For all the species detection models that ship with zamba, the default frame selection method is an efficient object detection model called MegadetectorLite that determines the likelihood that each frame contains an animal. Then, only the frames with the highest probability of detection are passed to the model.
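The selection step can be sketched in a few lines of Python. The function `select_frames` is hypothetical: it assumes one detection score per frame, keeps the `n_frames` highest-scoring frames, and returns their indices in temporal order. The confidence threshold and the other fill modes are not modelled here.

```python
from typing import List, Sequence


def select_frames(scores: Sequence[float], n_frames: int = 16) -> List[int]:
    """Return indices of the n_frames highest-scoring frames, in time order."""
    # Rank frame indices from highest to lowest detection score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top n_frames, then restore temporal order for the model.
    return sorted(ranked[:n_frames])


detection_scores = [0.10, 0.90, 0.30, 0.80, 0.05]
selected = select_frames(detection_scores, n_frames=3)
```

Returning the indices in temporal order matters for video-native models such as slowfast, which rely on the ordering of frames.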

MegadetectorLite combines two open-source models:

  • Megadetector is a pretrained image model designed to detect animals, people, and vehicles in camera trap videos.
  • YOLOX is a high-performance, lightweight object detection model that is much less computationally intensive than Megadetector.

While highly accurate, Megadetector is too computationally intensive to run on every frame. MegadetectorLite was created by training a YOLOX model using the predictions of Megadetector as ground truth. This method is called student-teacher training.
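The following toy sketch shows the general shape of student-teacher training, not the actual MegadetectorLite training procedure. A hypothetical "teacher" function stands in for the expensive model, and a one-parameter threshold "student" is fit to the teacher's predictions instead of to human labels. All names and the data are invented for illustration.

```python
from typing import Callable, Sequence


def teacher(x: float) -> int:
    """Hypothetical expensive model: returns 1 for 'animal', 0 for 'no animal'."""
    return int(x * x > 0.25)


def fit_student(xs: Sequence[float], teacher_fn: Callable[[float], int]) -> float:
    """Fit a threshold classifier using the teacher's outputs as ground truth."""
    pseudo_labels = [teacher_fn(x) for x in xs]  # no human labels involved

    def disagreements(threshold: float) -> int:
        return sum(int(x > threshold) != y for x, y in zip(xs, pseudo_labels))

    # Choose the candidate threshold that best reproduces the teacher.
    return min(sorted(xs), key=disagreements)


inputs = [i / 20 for i in range(21)]      # stand-in for unlabeled frames
threshold = fit_student(inputs, teacher)


def student(x: float) -> int:
    """Cheap model that mimics the teacher on the training inputs."""
    return int(x > threshold)
```

The student is much cheaper to evaluate than the teacher, which is the same trade-off that motivates running a lightweight model on every frame.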