Because a high-resolution image contains many pixels and is split into thousands of patches, the attention map quickly becomes huge. Self-attention compares every patch with every other patch, so the amount of computation grows quadratically with the number of patches as the resolution of the image increases.

While the challenge of “vision” is trivially solved by humans (even by little ones