Overview: This paper leverages Wikipedia to create WikiSatNet, a dataset of paired satellite images and Wikipedia article text. A model pretrained on it achieves SOTA on the fMoW benchmark.
WikiSatNet: Of the 47 million articles on Wikipedia, 1 million are written in English and geolocated, i.e. associated with a latitude and longitude. The authors use DigitalGlobe satellites to obtain high-resolution images with a ground sampling distance (GSD) of 0.3-0.5 m. The final dataset is a collection of 888,696 article-image pairs.
Weakly Supervised Learning: They first train a model using machine-generated labels derived from the Wikipedia articles: a) manually compile a list of 97 potential categories; b) use regular expressions to match category terms in the article text; c) merge small classes and remove classes without enough samples.
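The three labeling steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the category names, the regular expressions, the merge map, and the `min_count` threshold are all hypothetical stand-ins, since the actual 97 categories are not listed in these notes.

```python
import re
from collections import Counter

# (a) Hypothetical category list with one regex per category.
CATEGORY_PATTERNS = {
    "airport": re.compile(r"\bairports?\b", re.IGNORECASE),
    "bridge": re.compile(r"\bbridges?\b", re.IGNORECASE),
    "stadium": re.compile(r"\bstadiums?\b", re.IGNORECASE),
    "arena": re.compile(r"\barenas?\b", re.IGNORECASE),
}

# (c) Hypothetical merge map: fold a small class into a related larger one.
MERGE_MAP = {"arena": "stadium"}


def weak_label(text):
    """(b) Return the category whose pattern matches most often, or None."""
    counts = {name: len(p.findall(text)) for name, p in CATEGORY_PATTERNS.items()}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        return None
    return MERGE_MAP.get(best, best)


def filter_small_classes(labels, min_count):
    """(c) Replace labels of classes with fewer than min_count samples by None."""
    freq = Counter(label for label in labels if label is not None)
    return [
        label if label is not None and freq[label] >= min_count else None
        for label in labels
    ]


articles = [
    "Heathrow Airport is a major international airport.",
    "The arena hosts basketball games.",
    "A small village in the hills.",
]
labels = [weak_label(a) for a in articles]
```

Labels produced this way are noisy by construction (an article that merely mentions an airport gets the same label as an article about one), which is why the approach is described as weakly supervised.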
Text Matching Learning: The second approach uses Doc2Vec to compress each article into a vector and trains a CNN on the satellite image so that its output matches the embedding of the article.
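One way to make "matching the embedding" concrete is a cosine-distance objective between the image embedding and the article embedding. The sketch below assumes that choice of loss; these notes do not state which loss the paper uses, and the function names and toy vectors are illustrative only. Plain Python lists stand in for the CNN output and the Doc2Vec vector.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def matching_loss(image_embeddings, text_embeddings):
    """Mean cosine distance (1 - cosine similarity) over a batch of pairs."""
    losses = [
        1.0 - cosine_similarity(img, txt)
        for img, txt in zip(image_embeddings, text_embeddings)
    ]
    return sum(losses) / len(losses)


# Toy batch: the first pair is aligned, the second is orthogonal.
image_batch = [[1.0, 0.0], [1.0, 0.0]]
text_batch = [[2.0, 0.0], [0.0, 3.0]]
loss = matching_loss(image_batch, text_batch)
```

In a real training loop the image embeddings would come from the CNN and the loss would be minimized with respect to its weights, while the article embeddings stay fixed.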
Result: They achieve state-of-the-art results on the fMoW dataset, beating an ImageNet-pretrained model by more than 2%.
One point is that they found 1000 by 1000 pixel images (approximately 900 m²) cover most of the ROI at a GSD of 0.3-0.5 m. It would be interesting to know the minimal requirement, in terms of both resolution and ROI extent, for satellite images to be semantically meaningful to a human. Another point is the semi-supervised vs. weakly supervised learning debate implicitly raised by the paper: in this case the weakly supervised model (trained on weak labels from Wikipedia articles) seems to beat the semi-supervised or transferred one (pretrained on ImageNet). This could open a discussion of data efficiency.
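For the tile-size point, the ground footprint of an image follows from multiplying the pixel count by the GSD. The helper below is a small arithmetic sketch; the function name is hypothetical and it assumes square pixels and a square tile.

```python
def ground_footprint(pixels_per_side, gsd_m):
    """Return (side length in metres, area in square metres) of a square tile."""
    side_m = pixels_per_side * gsd_m
    return side_m, side_m ** 2


# Footprint of a 1000 x 1000 pixel tile at both ends of the GSD range.
side_fine, area_fine = ground_footprint(1000, 0.3)
side_coarse, area_coarse = ground_footprint(1000, 0.5)
```

Under these assumptions a 1000 by 1000 pixel tile spans roughly 300 to 500 m per side, i.e. on the order of 90,000 to 250,000 m², so the 900 m² figure quoted above may be worth checking against the paper.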