Chapter 8 Grouping

Carving the scene into objects is not the only segmentation challenge our visual systems solve. Our visual brains also provide us with an experience of groups of objects. Can tracking operate over groups, allowing us to track multiple groups of objects? Alzahabi and Cain (2021) attempted to investigate this using clusters of discs as targets and as distractors. These clusters maintained a constant spatial arrangement as they wandered about the display. Participants seemed to do well at tracking these clusters. However, the associated experiments did not rule out the possibility that participants were tracking just one disc of each cluster. I am not aware of any work providing good evidence that a tracking focus can track an entire group.

Yantis (1992) hypothesized that during MOT experiments, participants track an imaginary shape formed by the targets, specifically a polygon whose vertices are the target positions. This became a popular idea, but progress has been slow in understanding whether all participants do this or just a minority do, and in what circumstances. Merkel et al. (2014) found a result that they took as evidence that some participants track a shape defined by the targets. In their task, at the end of the trial when the targets and distractors stopped moving, four of the objects were highlighted. The task was to press one button if all four were targets (match), and to press a different button otherwise (non-match). Error rates were lowest when none of the objects highlighted were targets, and errors were progressively more common as the number of highlighted objects that were targets increased. This was unsurprising. For trials where all four of the highlighted objects were targets (match), however, error rates were much lower than when only three were targets (a non-match). Merkel et al. (2014) suggested that this reflected a “perceptual strategy of monitoring the global shape configuration of the tracked target items.” They went on to split the participants based on whether they had a relatively low error rate in the match condition, investigated the neural correlates of that, and drew conclusions about the neural processing that underlies virtual shape tracking.

The inferences of Merkel et al. (2014) rest on splitting participants according to how low their error rate was in the match condition relative to the condition where none of the highlighted objects were targets. The idea seems to be that if participants weren’t using a shape tracking strategy, error rates would steadily increase from the trials where none of the highlighted objects were targets to the trials where all of the objects highlighted were targets. While this is not necessarily the case, it does seem likely that participants use the shape to rapidly classify the trial as match or non-match. People can certainly see shapes defined only by dots at their vertices; the ancients saw animals and human figures in the stars. Subitizing, a related ability that involves processing several objects as a group, allows one to enumerate a collection much faster than by considering individual dot positions. So using shape is indeed a natural way to check for a match. However, it is harder to know how much grouping contributes to the ability to track.

In addition to behavioral measures, Merkel, Hopf, and Schoenfeld (2017) measured the electrical brain response (the event-related potential, or ERP) to a probe that was flashed while the objects were moving. The probe was flashed either directly on the virtual outline shape formed by the targets, outside that shape, or inside it. The participants were not informed of the probe and most reported not being aware of it. However, the early ERP response to the probe (~100 ms) was significantly greater when it lay on the shape than when it lay outside or inside it. By 200 ms, the response to the probe when inside the virtual outline shape was similar to that on the virtual outline, and greater than when the probe was outside the shape. This suggests that at least some of the participants were continuously tracking the shape. The source of the ERP shape advantage was localized to around the lateral occipital complex, which is known to be particularly involved in shape processing. Future work ought to test with probes in both hemifields to assess whether this virtual shape grouping occurs independently in both hemifields, as it should if it is to explain our capacity for tracking.

Representing moving objects by the virtual shape they define is merely one way that position representations may be configural rather than retinotopic. A few studies speak to this issue by manipulating the stability of different coordinate frames, and find evidence for non-retinotopic processing; I point to some papers on this in the “Omissions” part of Section 13.

8.1 Hierarchical relations

In the real world, the movement of object images on our retinas is rarely as independent as the movement of the objects in an MOT experiment. In everyday scenes, often there is a strong motion element throughout the visual field created by the movement of the observer, and recovering true object movement involves detecting deviations from that overall motion (Warren and Rushton 2007). Even when the observer and their eyes do not move, hierarchical motion relationships are common. When one views a tree on a windy day, the largest branches sway slowly, while the smaller limbs attached to the larger branches move with the larger branches but also, being more flexible and lighter, have their own, more rapid motion.

These aspects of the structure of the visual world may be one reason that our visual systems are tuned to relative motion (Tadin et al. 2002; Maruya, Holcombe, and Nishida 2013). When we see a wheel roll by, we experience individual features on the wheel as moving forward, reflecting the global frame of the entire wheel, but also as moving in a circle, reflecting the motion relative to the center of the wheel.

This decomposition of a wheel rim’s movement is so strong that people systematically mis-report the trajectory of the points on the wheel (Proffitt, Kaiser, and Whelan 1990). The red curve in Figure 11 reveals that a point on a rolling wheel follows a trajectory with up, down, and forward motion, but no backward motion. Yet people report that they see a circular trajectory, including backward motion. What we perceive, then, reflects a sophisticated parsing and grouping of retinal motion.

Figure 11: The red curve is that traced out by a point on a rolling wheel, by Zorgit
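The decomposition described above can be made concrete with a short sketch. It assumes an idealized wheel of radius 1 rolling to the right without slipping, with the rim point starting at the ground contact; the function names are illustrative only. The rim point's position is the sum of the hub's translation (the common, global-frame motion) and its rotation about the hub (the relative motion). Only the relative component ever points backward; the sum, which is what falls on the retina of a stationary observer, does not.

```python
import math

def hub_position(theta, radius=1.0):
    """Wheel centre after rolling through angle theta (radians) without slipping."""
    return radius * theta, radius

def rim_offset(theta, radius=1.0):
    """Rim point relative to the hub; it starts at the ground contact point."""
    return -radius * math.sin(theta), -radius * math.cos(theta)

def rim_position(theta, radius=1.0):
    """Rim point in the world frame: hub translation plus rotation about the hub."""
    hx, hy = hub_position(theta, radius)
    ox, oy = rim_offset(theta, radius)
    return hx + ox, hy + oy

def horizontal_velocities(theta, radius=1.0):
    """Horizontal velocity per unit of rolled angle: (world frame, relative to hub)."""
    hub_vx = radius                          # common motion of the whole wheel
    relative_vx = -radius * math.cos(theta)  # rotation about the hub
    return hub_vx + relative_vx, relative_vx

thetas = [math.radians(deg) for deg in range(0, 721)]  # two full revolutions
min_world_vx = min(horizontal_velocities(t)[0] for t in thetas)
min_relative_vx = min(horizontal_velocities(t)[1] for t in thetas)

print(f"smallest horizontal velocity in the world frame: {min_world_vx:.3f}")
print(f"smallest horizontal velocity relative to the hub: {min_relative_vx:.3f}")
```

The world-frame horizontal velocity is radius × (1 − cos θ), which is never negative, whereas the hub-relative component −radius × cos θ is negative for part of every revolution. A report of backward motion therefore matches the relative component rather than the trajectory itself.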

Bill et al. (2020) varied the motion pattern of the discs of an MOT task to show that hierarchical relations in the stimulus can facilitate tracking. Unfortunately, they did not investigate whether those relations affected how the discs’ motion was perceived, as in a rolling wheel or a flock of birds. The attentional demands, if any, of such hierarchical motion decomposition have not been explored much. Thus it remains unclear to what extent the hierarchical relations are calculated by the application of tracking or other attentional resources, versus tracking operating on a representation of hierarchical relations that was determined pre-attentively.

8.2 Eyes to the center

The human visual system performs a rapid global analysis of visual scenes, providing summary information sometimes referred to as “ensemble statistics” (G. A. Alvarez and Oliva 2009). One such ensemble statistic is the location of the center or centroid of a set of objects. This is useful for eye movement planning, among other things — to monitor a group of objects, it is helpful to look at the center of the group, as that can minimize how far into peripheral vision the objects are situated.

Zelinsky and Neider (2008) and Fehd and Seiffert (2008) both reported that during multiple object tracking, the eyes of many participants are frequently directed at blank locations near the center of the array of targets. This finding has been replicated by subsequent work (Hyönä, Li, and Oksama 2019). The nature of the central point that participants tend to look at (in addition to the individual targets, which they also look at) is not entirely clear. Researchers have suggested that the point may be the average of the targets’ locations, or the average location of all the moving objects (both targets and distractors). Another possibility that has been investigated is that participants tend to look at the centroid of the shape formed by the targets, which recalls the Yantis (1992) hypothesis that what is tracked is the shape defined by the targets. Lukavskỳ (2013) introduced the idea of an “anti-crowding point”, which minimizes the ratio between each target’s distance from the gaze point and its distance from every distractor. The idea was that participants move their gaze closer to a target when it is near a distractor to avoid confusing targets with distractors. Note, however, that the Lukavskỳ (2013) metric does not take into account the limited range of the empirical crowding zone, which is about half the eccentricity of an object.
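To make the candidate gaze points concrete, here is a minimal sketch that computes each of them for one frame of a made-up display. Everything in it is an assumption for illustration: the coordinates, the function names, and in particular the anti-crowding cost, which is one possible reading of the verbal description above (the sum, over all target-distractor pairs, of the target's distance from gaze divided by its distance from the distractor) and may differ from the published formula. The minimum is found by brute-force grid search rather than by any method used in the original studies.

```python
import math

def mean_point(points):
    """Average of a list of (x, y) locations."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def polygon_centroid(targets):
    """Area centroid of the polygon whose vertices are the targets.

    Vertices are ordered by angle around their mean, which assumes that this
    ordering yields a simple (non-self-intersecting) polygon.
    """
    mx, my = mean_point(targets)
    verts = sorted(targets, key=lambda p: math.atan2(p[1] - my, p[0] - mx))
    area2 = cx = cy = 0.0  # area2 accumulates twice the signed area
    for (x0, y0), (x1, y1) in zip(verts, verts[1:] + verts[:1]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return (cx / (3 * area2), cy / (3 * area2))

def anti_crowding_cost(gaze, targets, distractors):
    """Assumed cost: gaze-to-target distance over target-to-distractor distance,
    summed over every target-distractor pair."""
    return sum(
        math.dist(gaze, t) / math.dist(t, d)
        for t in targets
        for d in distractors
    )

def anti_crowding_point(targets, distractors, step=0.25):
    """Grid search over the objects' bounding box for the lowest-cost gaze point."""
    objects = targets + distractors
    xs = [p[0] for p in objects]
    ys = [p[1] for p in objects]
    nx = int((max(xs) - min(xs)) / step) + 1
    ny = int((max(ys) - min(ys)) / step) + 1
    best, best_cost = None, math.inf
    for i in range(nx):
        for j in range(ny):
            gaze = (min(xs) + i * step, min(ys) + j * step)
            cost = anti_crowding_cost(gaze, targets, distractors)
            if cost < best_cost:
                best, best_cost = gaze, cost
    return best

# Hypothetical frame: four targets, one of which has a distractor close by.
targets = [(-6.0, -2.0), (6.0, -2.0), (2.0, 4.0), (-2.0, 4.0)]
crowded_target = targets[1]
distractors = [(9.5, -2.0), (0.0, -12.0)]

target_mean = mean_point(targets)
all_mean = mean_point(targets + distractors)
centroid = polygon_centroid(targets)
anti_crowding = anti_crowding_point(targets, distractors)

for name, point in [("mean of targets", target_mean),
                    ("mean of all objects", all_mean),
                    ("centroid of target polygon", centroid),
                    ("anti-crowding point (assumed cost)", anti_crowding)]:
    print(f"{name}: ({point[0]:.2f}, {point[1]:.2f})")
```

In this configuration the average of the targets and the centroid of the target polygon differ, and the grid search returns a point that lies closer to the crowded target than the target average does, which is the behavior the anti-crowding idea is meant to capture.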

In a comparison of several metrics against the eyetracking data, Lukavskỳ (2013) found that the anti-crowding point best predicted participants’ gaze in his experiment, followed by the average of the target locations. These points both matched the data better than the centroid of the targets. This undermines the Yantis (1992) hypothesis that a virtual polygon is tracked, and the finding of best performance for the anti-crowding point is consistent with other results that participants tend to look closer to targets that are near other objects (Vater, Kredel, and Hossner 2017; Zelinsky and Todor 2010).

More work must be done to understand the possible role of the anti-crowding eye movement strategy suggested by Lukavskỳ (2013). Spatial interference does not seem to extend further than half an object’s eccentricity, in both static identification tasks (Gurnsey, Roddy, and Chanab 2011) and multiple object tracking (Alex O. Holcombe, Chen, and Howe 2014), but the anti-crowding point devised by Lukavskỳ (2013) does not incorporate such findings. Its performance should be compared to a measure that is similar but excludes from the calculation distractors further than about half an object’s eccentricity.
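A rough sketch of what such a comparison measure could look like follows. It is hypothetical throughout: `full_cost` is an assumed reading of the anti-crowding metric (target eccentricity divided by target-distractor separation, summed over all pairs), and `limited_cost` is the same sum restricted to distractors that fall inside a crowding zone whose radius is half the target's eccentricity. Neither function is taken from the cited papers.

```python
import math

def pair_ratio(gaze, target, distractor):
    """Target eccentricity divided by target-distractor separation."""
    return math.dist(gaze, target) / math.dist(target, distractor)

def full_cost(gaze, targets, distractors):
    """Every target-distractor pair contributes, however far apart the two are."""
    return sum(pair_ratio(gaze, t, d) for t in targets for d in distractors)

def limited_cost(gaze, targets, distractors, zone=0.5):
    """Only distractors inside the assumed crowding zone contribute.

    The zone radius is `zone` times the target's eccentricity (its distance
    from the gaze point); 0.5 follows the "about half" figure in the text.
    """
    total = 0.0
    for t in targets:
        eccentricity = math.dist(gaze, t)
        for d in distractors:
            if math.dist(t, d) < zone * eccentricity:
                total += pair_ratio(gaze, t, d)
    return total

# Hypothetical example: one target at 10 units eccentricity, so a 5-unit zone.
gaze = (0.0, 0.0)
targets = [(10.0, 0.0)]
near = (13.0, 0.0)  # 3 units from the target: inside the zone
far = (10.0, 8.0)   # 8 units from the target: outside the zone

cost_all = full_cost(gaze, targets, [near, far])
cost_zone = limited_cost(gaze, targets, [near, far])
print(f"all pairs: {cost_all:.3f}, crowding zone only: {cost_zone:.3f}")
```

Here the distant distractor raises the unrestricted cost but leaves the zone-limited cost untouched. One design issue any real version would have to settle is that the zone-limited cost is zero wherever no distractor is inside any target's zone, so a rule for choosing among such gaze points would be needed before the measure could be compared against eye-tracking data.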


Alvarez, G A, and A Oliva. 2009. “Spatial Ensemble Statistics Are Efficient Codes That Can Be Represented with Reduced Attention.” Proceedings of the National Academy of Sciences of the United States of America 106 (18): 7345–50.
Alzahabi, Reem, and Matthew S. Cain. 2021. “Ensemble Perception During Multiple-Object Tracking.” Attention, Perception, & Psychophysics 83 (3): 1263–74.
Bill, Johannes, Hrag Pailian, Samuel J. Gershman, and Jan Drugowitsch. 2020. “Hierarchical Structure Is Employed by Humans During Visual Motion Perception.” Proceedings of the National Academy of Sciences 117 (39): 24581–89.
Fehd, Hilda M., and Adriane E. Seiffert. 2008. “Eye Movements During Multiple Object Tracking: Where Do Participants Look?” Cognition 108 (1): 201–9.
Gurnsey, Rick, G Roddy, and W Chanab. 2011. “Crowding Is Size and Eccentricity Dependent.” Journal of Vision 11: 1–17.
Holcombe, Alex O, W Chen, and Piers D L Howe. 2014. “Object Tracking: Absence of Long-Range Spatial Interference Supports Resource Theories.” Journal of Vision 14 (6): 1–21.
Hyönä, Jukka, Jie Li, and Lauri Oksama. 2019. “Eye Behavior During Multiple Object Tracking and Multiple Identity Tracking.” Vision 3 (3): 37.
Lukavskỳ, Jiří. 2013. “Eye Movements in Repeated Multiple Object Tracking.” Journal of Vision 13 (7): 1–16.
Maruya, Kazushi, Alex O Holcombe, and Shinya Nishida. 2013. “Rapid Encoding of Relationships Between Spatially Remote Motion Signals.” Journal of Vision 13 (4): 1–20.
Merkel, Christian, Jens-Max Hopf, and Mircea Ariel Schoenfeld. 2017. “Spatio-Temporal Dynamics of Attentional Selection Stages During Multiple Object Tracking.” NeuroImage 146 (February): 484–91.
Merkel, Christian, Christian M. Stoppel, Steven A. Hillyard, Hans-Jochen Heinze, Jens-Max Hopf, and Mircea Ariel Schoenfeld. 2014. “Spatio-Temporal Patterns of Brain Activity Distinguish Strategies of Multiple-Object Tracking.” Journal of Cognitive Neuroscience 26 (1): 28–40.
Proffitt, Dennis R, Mary K Kaiser, and Susan M Whelan. 1990. “Understanding Wheel Dynamics.” Cognitive Psychology 22 (3): 342–73.
Tadin, Duje, Joseph S. Lappin, Randolph Blake, and Emily D. Grossman. 2002. “What Constitutes an Efficient Reference Frame for Vision?” Nature Neuroscience 5 (10): 1010–15.
Vater, Christian, Ralf Kredel, and Ernst-Joachim Hossner. 2017. “Disentangling Vision and Attention in Multiple-Object Tracking: How Crowding and Collisions Affect Gaze Anchoring and Dual-Task Performance.” Journal of Vision 17 (5): 1–13.
Warren, Paul A., and Simon K. Rushton. 2007. “Perception of Object Trajectory: Parsing Retinal Motion into Self and Object Movement Components.” Journal of Vision 7 (11): 2.
Yantis, S. 1992. “Multielement Visual Tracking: Attention and Perceptual Organization.” Cognitive Psychology 24 (3): 295–340.
Zelinsky, Gregory J., and Mark B. Neider. 2008. “An Eye Movement Analysis of Multiple Object Tracking in a Realistic Environment.” Visual Cognition 16 (5): 553–66.
Zelinsky, Gregory J., and Andrei Todor. 2010. “The Role of ‘Rescue Saccades’ in Tracking Objects Through Occlusions.” Journal of Vision 10 (14): 1–13.