A recent paper called ReconVLA attempted to solve this. I spent a significant stretch of time reading it carefully, stress-testing its assumptions, and thinking about what it would mean to implement and extend it. What I found impressed me in some ways and genuinely troubled me in others.
hackernoon.com
