Section 6 The role of motion
Motion is obviously part and parcel of multiple object tracking. The visual systems of animals such as humans have specialized motion processors. These processors are not driven by arbitrary signals of an object’s displacement; rather, they are especially responsive to particular signals associated with motion (Nishida 2011). These signals can be selectively disrupted while preserving the physical displacement of the objects. Clair, Huff, and Seiffert (2010) did this in an MOT paradigm by having the texture elements that patterned the moving objects move in a different direction than the objects themselves. Tracking was still possible, but was substantially impaired. This suggests that tracking is driven by the specialized motion detectors, which likely help carry along a selection focus of tracking as a target moves. This is further supported by studies involving large jumps between successive appearances of a moving target, also known as “apparent motion,” which find lower temporal limits on tracking for these stimuli (the notion of temporal limits will be explained in Chapter 7) (Kanaya and Sato 2012; Verstraten, Cavanagh, and Labianca 2000).
It does not seem controversial that motion processing is somewhat distinct from the tracking resource, and that it is used by tracking. A more open question is whether motion signals are used in two particular ways. One is whether the direction and speed of a target are used to anticipate future positions of the objects. If they are, then performance should be better when objects maintain straight-line trajectories than when they frequently change their direction.
The evidence below suggests that the capacity limit on the use of motion information during tracking may be more severe than that on the use of position. That is, in conditions where participants can use position information to accurately track four or five targets, they may only use motion information for one or two of the targets. This may mean that the use of motion information can be identified with the extended cognitive processing of an object that likely can only occur for one or at most a few targets, which was referred to in Chapter 4 as C≈1 processes.
Piers DL Howe and Holcombe (2012) compared a condition in which the objects moved in straight lines, only changing direction when they bounced off the arena’s boundaries, to a condition in which the objects’ trajectories were not predictable because they changed direction randomly about every half second. They found an advantage for tracking the predictable trajectories. However, this advantage was found when there were two targets, but not when there were four targets. Vul et al. (2010) asked participants to track three targets and varied how much and how often the objects changed their velocity. They found little to no detriment of the velocity changes on participants’ estimates of difficulty. Unfortunately they did not assess whether velocity continuity became beneficial with fewer targets.
In the Piers DL Howe and Holcombe (2012) experiments, participants were allowed to move their eyes. They may have moved their eyes to follow one target, or alternatively something like the centroid of the targets (see Lukavský 2013). Eye movements have some associated inertia, and that tendency to continue moving the eyes in the same direction might have contributed to the predictable trajectory benefit. It makes sense that this would boost conditions with fewer targets more, given that the eyes can only move in one direction at a time. Luu and Howe (2015) followed up on the Piers DL Howe and Holcombe (2012) results using similar experiment parameters but added a requirement that participants fixate at the center of the screen throughout a trial. They found a very similar pattern of results, which suggests that eye movements were not responsible for the predictable trajectory benefit.
These results converge nicely with those of Fencsik, Klieger, and Horowitz (2007), who made targets invisible for a brief period (307 ms) during tracking. The targets continued moving, invisibly, during the disappearance interval. Participants were able to use the motion information from before the disappearance to continue tracking afterward when there were one or two targets but not when there were four, as evidenced by better performance compared to a control condition where prior motion information was not available.
Wang and Vul (2021) devised displays that allowed them to compare the extent to which participants used position information, velocity, and acceleration during tracking. Consistent with previous investigations, velocity was used less than position, and was subject to a more severe capacity limit. Acceleration (extrapolation of change in motion direction) did not seem to be used at all.
6.1 Velocity as a feature for correspondence matching?
For tracking, motion information could be used in two different ways. One is to solve what is referred to as the “correspondence problem.” To understand this, imagine that the moving objects in an MOT display were sampled by a computer only once every three hundred milliseconds (something like this may be what the brain does when there are several targets, if attention samples objects serially; see Chapter 8). The correspondence problem is to determine the correspondence between the objects of two successive frames. That is, solving the correspondence problem means knowing where an object in the first frame is in the second frame. The rise of CCTV a few decades ago sparked a rapid growth in the development of algorithms for tracking objects in low frame rate video (Kamkar et al. 2020). The correct answers for which objects in frames 1 and 2 correspond to each other can in many situations be obtained by nearest-neighbor matching. Nearest-neighbor here simply means matching each object in frame 1 to the object closest to it in frame 2.
For some combinations of object trajectories and sampling frequencies, the nearest-neighbor match yields the wrong answer to the correspondence problem. For example, if in the interval between the two sampled frames, a distractor moving toward the target ends up very close to the target’s location in frame 1, while the target has moved farther from its frame 1 location, then using nearest-neighbor matching will mistakenly match the target in frame 1 with a distractor in frame 2. This is called a “false correspondence.”
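The nearest-neighbor rule and the false correspondence it can produce can be sketched in a few lines of Python. This is a minimal illustration, not a model of the visual system: the function name `nearest_neighbor_match` and all coordinates are hypothetical, chosen only so that the distractor ends up closer to the target's old location than the target itself does.

```python
import math

def nearest_neighbor_match(prev_positions, new_positions):
    """For each object in the previous frame, return the index of the
    closest object in the new frame. Each object is matched
    independently, so two objects may be matched to the same one."""
    matches = []
    for px, py in prev_positions:
        distances = [math.hypot(nx - px, ny - py) for nx, ny in new_positions]
        matches.append(distances.index(min(distances)))
    return matches

# Hypothetical coordinates in arbitrary units.
# Frame 1: index 0 is the target, index 1 is a distractor.
frame1 = [(0.0, 0.0), (6.0, 0.0)]

# Easy case: both objects move only a little between samples.
frame2_easy = [(1.0, 0.0), (5.0, 0.0)]
easy_matches = nearest_neighbor_match(frame1, frame2_easy)

# Hard case: the target has moved 4 units to the right, while the
# distractor has moved 5 units to the left and now sits 1 unit from
# the target's frame-1 location.
frame2_hard = [(4.0, 0.0), (1.0, 0.0)]
hard_matches = nearest_neighbor_match(frame1, frame2_hard)

print(easy_matches)  # target keeps its identity
print(hard_matches)  # target is matched to the distractor
```

In the hard case the target's frame 1 location is matched to the distractor, which is the false correspondence described above.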
Using nearest-velocity matching in conjunction with nearest-neighbor position matching can help avoid false correspondences. Velocity refers to both an object’s direction and its speed. Because moving objects maintain their current velocity for a few hundred milliseconds or more, depending on the display, when two objects in frame 2 are both very close to the location a target occupied in frame 1, the target is likely to be the object whose velocity is most similar to the velocity of the target in frame 1. This topic will be discussed more in Chapter 9, because it relates to the motion streaks idea of that chapter.
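As a rough sketch of how velocity could disambiguate nearby candidates, the Python below first restricts attention to frame 2 objects within some radius of the target's frame 1 location, and then picks the one with the most similar velocity. The function names, the `radius` parameter, and the numbers are all assumptions made for illustration. The sketch also assumes that a velocity signal is available for each frame 2 object, as it would be if motion detectors supply it.

```python
import math

def nearest_position_index(prev_pos, candidates):
    """Index of the candidate whose position is closest to prev_pos."""
    dists = [math.hypot(pos[0] - prev_pos[0], pos[1] - prev_pos[1])
             for pos, _vel in candidates]
    return dists.index(min(dists))

def nearest_velocity_index(prev_pos, prev_vel, candidates, radius):
    """Among candidates within `radius` of prev_pos, return the index of
    the one whose velocity is most similar to prev_vel.
    Returns None if no candidate lies within the radius."""
    best_index, best_diff = None, math.inf
    for i, (pos, vel) in enumerate(candidates):
        if math.hypot(pos[0] - prev_pos[0], pos[1] - prev_pos[1]) > radius:
            continue  # too far away to be a plausible match
        diff = math.hypot(vel[0] - prev_vel[0], vel[1] - prev_vel[1])
        if diff < best_diff:
            best_index, best_diff = i, diff
    return best_index

# Hypothetical values: (position, velocity) for each frame-2 object.
target_pos, target_vel = (0.0, 0.0), (2.0, 0.0)
frame2 = [
    ((1.5, 0.5), (2.0, 0.1)),   # the target, still heading rightward
    ((0.5, 0.0), (-3.0, 0.0)),  # a distractor that has moved in close
    ((9.0, 9.0), (2.0, 0.0)),   # far-away object with the same velocity
]

by_position = nearest_position_index(target_pos, frame2)
by_velocity = nearest_velocity_index(target_pos, target_vel, frame2, radius=2.0)
print(by_position, by_velocity)
```

Position alone picks the distractor, whereas velocity similarity among the nearby candidates picks the target. The radius keeps a distant object with a coincidentally similar velocity from being chosen.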
6.2 Velocity for position estimation
The use of velocity matching to solve the correspondence problem must be distinguished from the topic of this section, using velocity to estimate position. A velocity signal can be used to predict or extrapolate the next position of a moving object. Consider a discrete sampling situation where one has a set of sensory signals of object locations on frames 1, 2, and 3. One can use the velocity signal for a target at frame 2 to extrapolate where it should be on frame 3. Then, when the sensory signals for frame 3 arrive, one can use the extrapolated target position as the input for solving the correspondence problem rather than the frame 2 position. I call this extrapolation of the present because if the brain uses this scheme, the idea is not that a person would perceive a moving object in a potential future position. Instead, the process is one of using the trajectory a target was on to help determine which new sensory signal corresponds to it.
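The scheme can be made concrete with a short Python sketch. It extrapolates the target's frame 2 position by its velocity over the sampling interval and then matches the frame 3 signals to that extrapolated position instead of to the frame 2 position. The function names, the 0.3 s interval, and the coordinates are hypothetical.

```python
import math

def extrapolate(position, velocity, dt):
    """Predicted position after dt, assuming constant velocity."""
    return (position[0] + velocity[0] * dt, position[1] + velocity[1] * dt)

def closest_index(reference, observations):
    """Index of the observation nearest to the reference position."""
    dists = [math.hypot(ox - reference[0], oy - reference[1])
             for ox, oy in observations]
    return dists.index(min(dists))

# Hypothetical values: target state at frame 2 and the sampling interval.
target_pos_frame2 = (0.0, 0.0)
target_vel_frame2 = (10.0, 0.0)   # units per second
dt = 0.3                          # seconds between sampled frames

# Sensory signals at frame 3: index 0 is the target, index 1 a distractor
# that has ended up near the target's frame-2 position.
frame3 = [(3.2, 0.1), (0.8, 0.0)]

predicted = extrapolate(target_pos_frame2, target_vel_frame2, dt)
match_without_extrapolation = closest_index(target_pos_frame2, frame3)
match_with_extrapolation = closest_index(predicted, frame3)
print(match_without_extrapolation, match_with_extrapolation)
```

Matching from the frame 2 position selects the distractor, while matching from the extrapolated position selects the target. The extrapolated position is used only to resolve the correspondence; nothing here places the target ahead of its current sensory signal.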
The experiments reviewed in the first section of this chapter found evidence for the use of motion information, but that type of evidence could not distinguish between the use of motion for extrapolating the present and the use of motion for velocity matching. As we will see next, the results from two other paradigms find little to no evidence of extrapolation, which suggests that velocity matching is the way that motion information is used.
The first paradigm that was used to go looking for evidence of extrapolation, the “target recovery” paradigm, was developed by Keane and Pylyshyn (2006). They had objects abruptly disappear during MOT and then reappear hundreds of milliseconds later. In their “move” conditions, they re-appeared further along the trajectory they would take had they continued with the same velocity, whereas in “non-move” conditions they would reappear in the same position they had disappeared in. Performance was uniformly worse in the move conditions than in the non-move conditions. A follow-up study by Steven L. Franconeri, Pylyshyn, and Scholl (2012) found the same result.
The possibility of extrapolation has also been explored by simply asking participants to report the last location of a target or targets after they disappear, by clicking with a mouse on the screen. If the brain extrapolates the present, that should result in participants reporting, on average, the correct last position of the target, although individual reports might be quite noisy. The brain might alternatively extrapolate the future, as has often been suggested (e.g., Nijhawan 2008), such that on average participants would click on a position ahead of a target’s last position. Instead, studies have predominantly found that the locations participants report lag the final locations of the target, and this lag increases with the number of targets tracked (Christina J. Howard and Holcombe 2008; Christina J. Howard, Masom, and Holcombe 2011). One exception is from Iordanescu, Grabowecky, and Suzuki (2009), who found that people clicked on average slightly ahead of the target’s last position. However, Christina J. Howard, Masom, and Holcombe (2011) tried but failed to replicate this result, instead finding lags again.
Further evidence that the visual system uses a lagged representation comes from an MOT eye-tracking study by Lukavský and Děchtěrenko (2016). They were able to assess whether eye position either anticipated future positions of the objects or instead lagged their present position in an ingenious model-free way. They contrasted the eye movements in pairs of trials with object paths that were identical except that their trajectories were time-reversed. After reversing the timeline of the eye movement data from the backward trials, they time-shifted that data to find the time shift that maximized the correspondence of the eye movements for the two kinds of trials. In their first and second experiment, with four targets, the time-shifting technique of Lukavský and Děchtěrenko (2016) revealed that eye movements lagged the targets in every participant, with a mean lag of 110 ms in the first experiment and 108 ms in the second experiment.
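The logic of the time-reversal comparison can be illustrated with a one-dimensional simulation in Python. This is only a sketch under strong assumptions, not the authors' analysis pipeline: gaze is modeled as a copy of a single target path delayed by a fixed lag, the path, sampling rate, and lag are hypothetical, and correspondence is measured by mean squared difference. Under this simple model, the best-matching shift between the forward gaze and the time-reversed backward gaze equals twice the lag, so the sketch halves it.

```python
import math

def target_path(t):
    """Hypothetical 1-D target position at sample t."""
    return math.sin(0.07 * t) + 0.5 * math.sin(0.23 * t)

def simulate_gaze(path, n_samples, lag):
    """Simulated gaze: the path reproduced `lag` samples late."""
    return [path(t - lag) for t in range(n_samples)]

def best_shift(a, b, max_shift):
    """Shift s (in samples) that minimizes the mean squared difference
    between a[t] and b[t - s] over the samples where both exist."""
    best_s, best_err = None, math.inf
    n = len(a)
    for s in range(-max_shift, max_shift + 1):
        pairs = [(a[t], b[t - s]) for t in range(n) if 0 <= t - s < n]
        err = sum((x - y) ** 2 for x, y in pairs) / len(pairs)
        if err < best_err:
            best_s, best_err = s, err
    return best_s

N = 600         # samples per trial (6 s at an assumed 100 Hz)
TRUE_LAG = 11   # simulated lag in samples (110 ms at 100 Hz)

def backward_path(t):
    """The same path played in reverse, as in a backward trial."""
    return target_path((N - 1) - t)

forward_gaze = simulate_gaze(target_path, N, TRUE_LAG)
backward_gaze = simulate_gaze(backward_path, N, TRUE_LAG)

# Reverse the timeline of the backward-trial gaze, then find the shift
# that best aligns it with the forward-trial gaze.
backward_gaze_reversed = backward_gaze[::-1]
shift = best_shift(forward_gaze, backward_gaze_reversed, max_shift=50)
estimated_lag = shift / 2
print(shift, estimated_lag)
```

If gaze instead anticipated the targets, the best shift would come out negative. The appeal of the approach is that the sign and size of the shift can be read off without committing to a model of which point in the display the eyes are following.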
Recall that a promising theory of tracking is that a process switches among the targets to update their positions. This would explain the dramatic worsening of temporal limits with additional targets reviewed in Chapter 7. Such a theory also entails that not only temporal limits, but also lags should worsen with additional targets. This is precisely what was found by Christina J. Howard and Holcombe (2008), who varied the number of targets from one to seven and found that the lag of the positions participants clicked on increased with the number of targets tracked. This supports a serial position sampling theory, as discussed in Chapter 8. A trend but not a statistically significant increase, however, was found by Christina J. Howard, Masom, and Holcombe (2011); they only varied the number of targets from one to three, and perhaps that was not enough. Lukavský and Děchtěrenko (2016) also investigated whether the lag changed with the number of targets, in their case the lag of eye position. In their second experiment, the mean lag was numerically 15 ms less (93 ms) for two targets than for four, although again this was not a statistically significant difference: the 95% confidence interval spanned from 33 ms of lag to 2 ms of extrapolation. Thus while their results were compatible with the proposition that there is less lag with fewer targets, the data did not strongly support it. More work should be done in this area.
6.3 Simulation evidence indicates that extrapolation has little value in MOT
The evidence of the preceding sections suggests that tracking processes do little in the way of extrapolation, or even velocity matching, except for when there are only a few targets, when more limited-capacity cognitive processes may play a larger role. The paucity of evidence for extrapolation is surprising in light of the popularity of predictive frameworks for conceptualizing what the brain does. Many researchers believe that prediction is a critical component of much of perception. So, is the brain leaving a lot of performance gains on the table by not using extrapolation when there are more than a few targets?
A computational investigation by Zhong et al. (2014) found that there is little to be gained by extrapolation in standard MOT tasks. Zhong et al. (2014) took an approach resembling what is often called an “ideal observer” approach. The idea is to build into a model the relevant properties of our sensory limitations and then assess how well an optimal algorithm for processing those signals would do, and investigate how it would be affected by task parameters. Zhong et al. (2014) did this by turning a Kalman filter loose on estimating object positions for use in solving the correspondence problem in MOT. In the term “Kalman filter,” the word “filter” has a tendency to mislead people, as it is not a filter in the conventional sense. The Kalman filter is instead an algorithm that learns to estimate, in Bayesian fashion, the current position of the targets. Bayesian estimation is appropriate because the sensory estimates of the objects are not precise. The simulations of Zhong et al. (2014) assume that the sensory error is Gaussian-distributed, which is a reasonable approximation, although Zhong et al. (2014) also make various simplifying assumptions, such as that the Gaussian error has the same variance throughout the visual field.
The Kalman filter makes a prediction of the object’s current position, based on its best estimate of the object’s last position and its velocity. This prediction, based on previous sensory position signals and a velocity estimated from them, is combined with the current sensory position signal to yield the estimate of the object’s current position. The relative weights assigned to the prediction and the sensory signal are determined by an updating process that arrives at the optimal weights under certain assumptions.
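The predict-then-combine cycle just described can be sketched for a single object moving along one dimension. The Python below is a textbook constant-velocity Kalman filter written out by hand, not the implementation of Zhong et al. (2014). The function name `kalman_step`, the noise variances, the starting uncertainty, and the noise-free measurements are all assumptions chosen for illustration and reproducibility.

```python
def kalman_step(state, cov, measurement, dt=1.0,
                process_var=0.01, measurement_var=1.0):
    """One predict-then-update cycle of a 1-D constant-velocity Kalman filter.

    state: (position, velocity) estimate from the previous time step.
    cov:   2x2 covariance of that estimate, as nested tuples.
    Only position is measured. Returns the updated (state, cov).
    """
    p, v = state
    (a, b), (c, d) = cov

    # Predict: carry the position estimate along the estimated velocity.
    p_pred = p + dt * v
    v_pred = v
    # Predicted covariance F P F^T + Q, with F = [[1, dt], [0, 1]]
    # and Q = diag(process_var, process_var).
    a_pred = a + dt * (b + c) + dt * dt * d + process_var
    b_pred = b + dt * d
    c_pred = c + dt * d
    d_pred = d + process_var

    # Update: weigh the prediction against the new sensory position signal.
    innovation = measurement - p_pred
    s = a_pred + measurement_var   # variance of the innovation
    k_pos = a_pred / s             # gain applied to the position estimate
    k_vel = c_pred / s             # gain applied to the velocity estimate

    p_new = p_pred + k_pos * innovation
    v_new = v_pred + k_vel * innovation
    cov_new = (
        ((1 - k_pos) * a_pred, (1 - k_pos) * b_pred),
        (c_pred - k_vel * a_pred, d_pred - k_vel * b_pred),
    )
    return (p_new, v_new), cov_new

# Hypothetical demo: an object moving at 2 units per step. The filter
# starts with no knowledge of the velocity and large uncertainty.
state = (0.0, 0.0)
cov = ((100.0, 0.0), (0.0, 100.0))
for t in range(1, 21):
    measurement = 2.0 * t   # noise-free here so the run is reproducible
    state, cov = kalman_step(state, cov, measurement)

position_estimate, velocity_estimate = state
print(position_estimate, velocity_estimate)
```

The gains `k_pos` and `k_vel` are the relative weights mentioned above. They are large when the prediction is uncertain relative to the sensory signal and shrink as the filter becomes more confident in its own prediction.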
Zhong et al. (2014) took the position estimates of the targets provided by the Kalman filter on each time step and used them to solve the correspondence problem. That is, rather than matching the sensory position data of the current frame to the sensory position data from the previous frame believed to have come from targets, they matched the current frame’s data to the Kalman filter estimates of each target’s position. Zhong et al. (2014) expected that simulated MOT task accuracy would be substantially higher when the Kalman filter was used, because the Kalman filter estimates of each target’s position are substantially more accurate than the ‘raw’ sensory data.
To the surprise of the researchers, simulated MOT performance was not substantially higher for the Kalman filter than when the raw sensory data was used. This finding was robust to a range of parameter values for the simulation, so Zhong et al. (2014) concluded that extrapolation has very little benefit for the MOT tasks they investigated.
To understand this result, Zhong et al. (2014) suggested that one must first consider the situations that lead to errors in MOT. As we have suggested elsewhere in this book, most errors may arise during close encounters between targets and distractors. During the periods of an MOT trial when the targets and distractors are far from each other, there is no correspondence ambiguity and computational models such as that of Zhong et al. (2014) do not make mistakes, so extrapolation and velocity matching are certainly of no benefit there. During close encounters, by contrast, one might expect that extrapolation would reduce false correspondences. In their simulations, Zhong et al. (2014) found that extrapolation did reduce false correspondences and thereby improved task performance, but that this benefit was extremely small.
Why is there only a trivial benefit of extrapolation in the Zhong et al. (2014) simulations? False correspondences in the simulations are caused by noise in the incoming sensory position signals. The Kalman filter’s representation is less noisy than the sensory signals, in part due to extrapolation, but the improvement in accuracy is dwarfed by the sensory noise, as far as resulting false correspondences. In other words, targets end up being swapped for distractors (false correspondence) largely due to the ambiguity in correspondence created by the sensory noise. This remained true for each of the different levels of sensory noise and intermittency of sampling that Zhong et al. (2014) simulated.
More work needs to be done with this sort of approach. Zhong et al. (2014) made some assumptions that are known to be false, such as that there is a uniform level of sensory noise across the visual field, and some that are implausible, such as that the brain can determine a global solution for the correspondence problem that minimizes the sum of the distances between the targets’ position estimates provided by the Kalman filter and the new sensory observations. At least some of their results are probably robust to these assumptions, but possibly not all.
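To make concrete what a global solution to the correspondence problem involves, the Python sketch below searches every way of assigning distinct observations to the targets and keeps the assignment with the smallest summed distance. It contrasts this with matching each target independently to its nearest observation. The function names and coordinates are hypothetical, and the exhaustive search is only practical for a handful of objects. An efficient solver such as the Hungarian algorithm (available, for example, as `linear_sum_assignment` in SciPy) would be the usual choice in practice.

```python
import math
from itertools import permutations

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def independent_matches(estimates, observations):
    """Match each target estimate to its nearest observation, ignoring
    whether another target has already claimed that observation."""
    result = []
    for e in estimates:
        dists = [distance(e, o) for o in observations]
        result.append(dists.index(min(dists)))
    return result

def global_assignment(estimates, observations):
    """Assign a distinct observation to each target estimate so that the
    sum of distances is minimal. Exhaustive search over assignments."""
    best_perm, best_cost = None, math.inf
    for perm in permutations(range(len(observations)), len(estimates)):
        cost = sum(distance(e, observations[j])
                   for e, j in zip(estimates, perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost

# Hypothetical position estimates for two targets, and three new
# sensory observations (the third comes from a distant distractor).
estimates = [(0.0, 0.0), (2.0, 0.0)]
observations = [(1.2, 0.0), (3.5, 0.0), (10.0, 10.0)]

greedy = independent_matches(estimates, observations)
assignment, total_cost = global_assignment(estimates, observations)
print(greedy)       # both targets claim the same observation
print(assignment)   # the global solution gives each its own observation
```

The global solution requires comparing candidate assignments for all targets at once, which is part of why it may be an implausible assumption for a system that has limited capacity or that samples targets serially.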
6.4 A C≈1 extrapolation effect?
Another behavioral paradigm in which participants report the last position of an object does frequently elicit evidence of extrapolation. In this “representational momentum” paradigm, participants are typically shown only a single moving object and asked to report the object’s final position after it suddenly disappears. On average, participants usually indicate a position displaced in the object’s final direction of motion. Hubbard (2005) provides an extensive review of a large literature on this. Participants were also found to displace the last position of a target in the direction of gravity. The phenomenon may reflect C≈1 cognitive processes, but this remains uncertain because the number of objects is almost never varied in this literature.
Another extrapolation phenomenon has been reported for frozen-action photographs that imply motion. For example, Freyd (1983) presented a photograph such as of waves crashing on a beach, and participants judged whether a subsequently presented probe photograph was the same as the original photograph or different. The pattern of response time suggested that participants’ memory of the photograph was closer to one from later in the series of photographs than the original. Hafri, Boger, and Firestone (2022) found that this form of extrapolation generalized to changes in state that could not easily be reduced to motion. For example, they found that an image of a burning log was remembered as being more burnt than it was in the original photograph.
In the literature, the term “representational momentum” is applied to both the extrapolations of the state of a stimulus like a burning log and the reported position of a moving object whose trajectory is abruptly terminated, although I don’t know of strong evidence that these reflect the same phenomenon. However, it is plausible that these reflect a C≈1 process and thus would not show hallmarks of MOT such as hemifield independence. Because this sort of likely-cognitive or memorial process exists, researchers who are interested in the processes that underlie tracking should assess whether their findings can be explained by C≈1 processes before assuming that what they are studying is perceptual or attentional rather than cognitive. This issue is discussed further in the Recommendations section.
It is also possible that representational momentum is a memory effect, perhaps reflecting the same mechanisms that yield the boundary extension effect discovered by Helene Intraub (Intraub and Bodamer 1993). However, Nakayama and Holcombe (2021) recently discovered an extrapolation effect that appears to be perceptual, in that this “twinkle-goes” illusion shows up with immediate report and is immediately perceived by many observers in demonstrations. The results from one experiment investigating this effect suggest that it is highly resource-intensive, however, because when attention was split across two targets the effect was greatly diminished. Possibly, then, the twinkle-goes illusion reflects a C≈1 effect, although more work is needed.
In summary then, there is little evidence for extrapolation playing a role in multiple object tracking, with the possible exception of a C≈1 effect for one of the targets, if it is particularly attended to.