Intriguing... Who would have thought that a miniature image called an epitome would produce such results? I really like how the masking (or segmenting) works with a reconstructed image. The algorithm doesn't really have to "detect" anything; it just looks for a particular patch. So, as in figure 3, if the patch of ground is found in the epitome, it is easy to mask (or segment) the ground when the image is reconstructed.
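To make the patch-lookup idea concrete, here is a rough sketch of how I picture it. This is not the paper's method: a brute-force nearest-patch search (sum of squared differences) stands in for the probabilistic patch-to-epitome mapping, and the names `best_epitome_match`, `transfer_mask`, and `epitome_mask` are made up for illustration. It assumes grayscale images stored as 2-D NumPy arrays and a region already marked by hand on the epitome.

```python
import numpy as np


def best_epitome_match(patch, epitome):
    """Return (row, col) of the epitome patch with the smallest SSD to `patch`.

    Assumption: exhaustive SSD search, a stand-in for the paper's inference.
    """
    k = patch.shape[0]
    rows, cols = epitome.shape
    best_ssd, best_pos = np.inf, (0, 0)
    for r in range(rows - k + 1):
        for c in range(cols - k + 1):
            ssd = np.sum((epitome[r:r + k, c:c + k] - patch) ** 2)
            if ssd < best_ssd:
                best_ssd, best_pos = ssd, (r, c)
    return best_pos


def transfer_mask(image, epitome, epitome_mask, k=4):
    """Carry a mask drawn on the epitome over to the image.

    Every k-by-k image patch is matched to an epitome location, and the
    epitome mask at that location votes for the pixels the patch covers.
    A pixel is masked when more than half of its votes are positive.
    """
    rows, cols = image.shape
    votes = np.zeros((rows, cols))
    counts = np.zeros((rows, cols))
    for r in range(rows - k + 1):
        for c in range(cols - k + 1):
            er, ec = best_epitome_match(image[r:r + k, c:c + k], epitome)
            votes[r:r + k, c:c + k] += epitome_mask[er:er + k, ec:ec + k]
            counts[r:r + k, c:c + k] += 1
    return votes / np.maximum(counts, 1) > 0.5
```

The point is that nothing in here knows what "ground" is. Once the ground region is marked in the epitome, masking the image is just a lookup plus voting.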
Although the paper doesn't say much about tracking, I can imagine this being extended to it fairly easily. If a single object in a video sequence is being tracked, one epitome might be sufficient to track it. Of course, if tracked objects are entering and exiting the scene, the epitome would need to be updated, but probably not as often as every frame. That could be a real advantage, since the epitome wouldn't have to be relearned for each frame.
Oh, and I downloaded the MATLAB code and tried it out. The output doesn't exactly match the results in the paper, but it is similar. If you don't have time to run the code, I've put the results up here:
http://www-cse.ucsd.edu/~mclothie/matlab_epitome.jpg