Search iNaturalist Photos With Text

We are excited to announce the launch of our Vision Language Demo, developed in collaboration with our long-time partners at the University of Massachusetts Amherst, the University of Edinburgh, University College London, and MIT, with generous support from Microsoft AI for Earth. This demo enables you to search a snapshot of 10 million iNaturalist photos using text queries. For instance, typing in "a bird eating fruit" will return matching photos ranked by their relevance to your query.



By clicking the “View these observations in Identify” button at the bottom, you can open these photos in the iNaturalist Identify tool, where you can add the observations to projects or add observation fields or annotations. We are excited to learn whether you find this tool useful for finding observations that represent different life stages (“a caterpillar”), flowering phenology (“a cluster of red berries on a leafy green branch”), captive or cultivated organisms (“a houseplant in a pot”), and so on, and for organizing them into projects and annotations.



Unlike the iNaturalist Computer Vision Model and Geomodel, which we train ourselves on iNaturalist observations, we did not train this model, nor is it trained on iNaturalist data. This demo is built on a freely available Vision Language Model that was trained on millions of captioned images, not necessarily related to the natural world. This means it knows about things other than living organisms (e.g. "a bird perched on a car"), but it also means it currently has biases and may return inappropriate or offensive results that we don’t fully understand. Please keep that in mind when using it.
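Under the hood, a demo like this most likely works by embedding each photo and each text query into a shared vector space and ranking photos by how similar their embeddings are to the query's. The following is a minimal sketch of that idea in Python, assuming a publicly available CLIP-style model loaded with the open-source open_clip library; the choice of weights, file names, and ranking loop below are illustrative and are not iNaturalist's actual pipeline.

    import torch
    import open_clip
    from PIL import Image

    # Load a publicly released CLIP model (an illustrative choice of weights).
    model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    model.eval()

    # Embed the text query once.
    query = "a bird eating fruit"
    with torch.no_grad():
        text_emb = model.encode_text(tokenizer([query]))
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    # Embed candidate photos (placeholder file names).
    photo_paths = ["photo1.jpg", "photo2.jpg"]
    images = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in photo_paths])
    with torch.no_grad():
        image_embs = model.encode_image(images)
        image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)

    # Rank photos by cosine similarity to the query, highest first.
    scores = (image_embs @ text_emb.T).squeeze(1)
    for path, score in sorted(zip(photo_paths, scores.tolist()), key=lambda x: -x[1]):
        print(f"{score:.3f}  {path}")

In a deployment at this scale, the image embeddings for all 10 million photos would presumably be precomputed and stored in a vector index, so only the text query needs to be embedded at search time.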

You can help us and our research collaborators understand how this model (or other Vision Language Models we may explore or build) performs by clicking on the “Help us Improve” button. By marking the photos on the page that are relevant or not relevant to your search (e.g. "Mating dragonflies") and clicking submit, you will help us compare the performance of different Vision Language Models at this image retrieval task.
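For context, one simple way relevance marks like these can be used to compare models on an image retrieval task is a metric such as precision@k, the fraction of the top k returned photos that users marked relevant. Here is a minimal sketch, with the caveat that the metric iNaturalist and its collaborators will actually use is not specified in this post:

    def precision_at_k(relevance_marks, k=30):
        """relevance_marks: booleans for the top-ranked photos, in rank order;
        True means a user marked that photo relevant to the query."""
        top_k = relevance_marks[:k]
        return sum(top_k) / len(top_k) if top_k else 0.0

    # Example: a user marked 12 of the first 30 photos returned for a query as relevant.
    marks = [True] * 12 + [False] * 18
    print(precision_at_k(marks))  # 0.4

Averaging such scores over many queries and users would give a way to compare two models on the same retrieval task.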



We built this demo to better understand the potential of Vision Language Models to help the community organize, explore, and explain the information contained within iNaturalist images. Building this demo has helped us understand the opportunities and challenges associated with this new technology. For example, while these models sometimes demonstrate a surprising ability to describe what is happening in images at a coarse level, they also fail to grasp finer-grained concepts such as species names.

Two exciting possible future avenues are:

1. Helping to explore and organize iNaturalist images

iNaturalist data have been used in more than 4,900 scientific publications. While many scientific applications stem from qualities of the data that are already easy to filter (species, location, date, etc.), an increasing number of studies are leveraging “secondary data” contained within the images themselves, ranging from species interactions to animal behavior to phenotypic patterns such as color. Here are some examples of recently published studies that resulted from pulling patterns out of iNaturalist images:



Conservation: Recovery plans for the endangered Red-bellied Macaw were premised on the belief that the parrot relies on fruits from a single species of palm for food. Silva and colleagues examined iNaturalist images of this parrot eating fruit and found that it has a much more diverse diet than previously thought.



Climate Change: To reveal how plants are adapting to climate change, Fukano and colleagues examined iNaturalist images of wood sorrels from around the world and found that leaf color is evolving to become redder in urban heat islands.



Animal Behavior: Jagiello and colleagues examined thousands of iNaturalist images of hermit crabs and found that they are increasingly utilizing lighter-weight plastic trash in lieu of shells. This study reveals how certain animals are able to alter their behavior to take advantage of the Anthropocene and its resulting impacts on the ecosystem.



Evolution: Most mammals are thought to have brown eyes. Tabin and Chiasson examined iNaturalist images to test this. They found an exception in the cat branch of the family tree, where eye color is extremely variable, and explored the role that sexual selection plays. This paper was covered by Science magazine.



Mimicry: Muñoz-Amezcua and colleagues used computer vision models to examine iNaturalist images and found that many more insects mimic spiders than previously thought. This study shows that, in addition to more efficiently surfacing patterns the human eye can detect (e.g. cats with blue eyes), vision models can also detect patterns that have so far gone unnoticed (e.g. moths that resemble jumping spiders).

We’re very excited to explore whether Vision Language Models can make it easier to explore and organize the rich data contained within iNaturalist images.

2. Explaining Computer Vision species identifications

As anyone using tools such as ChatGPT knows, multimodal Vision Language Models can help explain images in a way that complements more traditional Computer Vision systems. The iNaturalist Computer Vision AI does a great job of telling us what species is in a photo, but it doesn’t do a great job of explaining why that species is suggested.



Offering explanations is something the identifier community does quite well by sharing expertise in text remarks (e.g. “This is Striped Rocket Frog and not Rainforest Rocket Frog because the white stripe extends from the eye to above the leg rather than to the groin.”). We’re interested in building Vision Language Models trained on iNaturalist images and remarks that will help iNaturalist users understand why the Computer Vision AI is suggesting certain species and how to distinguish between them.



Deeply integrating Vision Language Models into iNaturalist is still far off and will require new funding opportunities and lots of product and engineering work. But we are very excited to share this small milestone on that journey. Please share your feedback on this exciting new demo!

Posted on June 26, 2024 by loarie

Comments

Exciting!

Posted by sunrise_again 3 months ago
Posted by dianastuder 3 months ago

This is very interesting. Unfortunately, my initial explorations haven't yet turned up any practical ways to use the vision language search. I think there are a couple of obstacles.

1. The VLM is not trained on iNaturalist images or identifications. That means that it is best at "understanding" quite general visual attributes, such as color, basic classes of organisms and basic types of behavior. This is a pretty limited way to explore the dataset of ~200m iNat observations, but it could be a good complement to the structured data already present in iNat.

2. At present, the visual language search doesn't let you include existing structured data from iNat. I would love to be able to do searches such as:

- All observations in Mexico identified within Angiosperms but not below Genus, with yellow flowers and brown spots.
- All unknown observations in Peru that look like animals with brown fur and a stripe on the face
- All photos of Iridaceae in Brazil that show a pollinator

Is it technically possible (in the future) to combine these two styles of search?

Posted by rupertclayton 3 months ago

Clearly still in its infancy. I requested very specific things such as Common Starling taking nectar from flowers, and cat hybrids. The results that came up were very generic and nothing that really pertained to my request.

So if you want anything kinda useful, you'd have to enter a broad generic prompt and nothing really specific.

In another test, I also filtered for a very specific species (Cape Parrot) as I was curious to test if this could potentially show what this species consumes in its diet. Since this species does not have many observations on iNaturalist, the results page was blank. Is there a minimum threshold of observations for a given species to show in the results?

Posted by dinofelis 3 months ago

With a little experimentation I discovered that you can restrict the search to a taxon of your choosing. Just add "&taxon_id=123456" (or whatever taxon ID) to the end of the visual text search URL. It does not appear that you can do the same with place IDs.
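For example (using a placeholder taxon ID and one of the example queries from the post above), the combined URL would look something like:

    https://www.inaturalist.org/vision_language_demo?q=a+bird+eating+fruit&taxon_id=123456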

Posted by rupertclayton 3 months ago

@rupertclayton, there are fields for taxon (and/or taxonomic group) filtering that you can use rather than manually altering the URL, but that works too

Posted by loarie 3 months ago

Use case #2 - explaining visible markers to differentiate species - will be game-changing, so I'm excited for this small step.

I do worry about exacerbating the feedback loop where one model IDs a picture and another model learns to explain based on that ID, reinforcing incorrect information, especially with occasional well-meaning input from a user who clicks "agree!" without critically examining the evidence. I'm sure experts are considering this deeply, and I wonder if they'll be using some of the more commonly over-specified taxon records here on iNat as a testbed. I'm thinking of the 100,000+ observations labelled Taraxacum officinale, many of which are quite probably another species in the complex...

Posted by hmheinz 3 months ago

Being picky, it is University College London not University of College London.

Posted by nyoni-pete 3 months ago

@loarie: Ah, thanks! I had misinterpreted the "Birds" example as implying that you could only choose iconic taxa. Now that I realize it's a standard taxon input field that makes it a lot easier. It would be great if we could include other observation search parameters such as lrank, hrank, without_taxon_id and all the location parameters.

Posted by rupertclayton 3 months ago

@dinofelis it's a random sample of 10M photos, so species with fewer photos have a higher probability of not having ended up in the sample. It's a bit surprising but not alarming that Cape Parrot, with 132 photos, didn't end up in the sample by chance.

Posted by loarie 3 months ago

Interesting demo! After testing out the improve function, I'd request a format more like the identify format on iNat. I would find it easier to be able to click through the photos and have hotkeys to quickly mark as relevant or not relevant.

Posted by csledge 3 months ago

Fascinating!! Excited to use this new functionality!

Posted by invertebratist 3 months ago

Cool stuff but I'm still afraid AI like this will eventually lead to the death of the internet via fake and biased results.

Posted by ipomopsis 3 months ago

Oh this is fun! It works on aesthetic qualities too - cartoon, creepy, abstract. Careful with the human faces in tree bark - but "angry face" is excellent.

Posted by jellyturtle 3 months ago

Could the AI be used to rapidly populate this project or similar with suitable orange and black observations? https://www.inaturalist.org/projects/lycidae-mimics-of-africa

Currently insects and spiders that may be mimics of lycidae are added manually and slowly to the project, which only covers Africa. And it is just a traditional project.

Perhaps somebody clever could create a project for lycidae mimics of the whole globe, using AI for rapid construction.

Posted by botswanabugs 3 months ago

Bravo!
It will only improve with time, and alternating between different queries will give you something to play with. I tried Ants and Caterpillars with Lycaenidae as the taxon, and it did identify quite a few images. The ones where it failed were very blurry images, or butterflies that were puddling in a group. It did limit the observations pretty much to the taxon, except for one or two Pieridae caterpillar observations. Next I tried Birds feeding on flowers: with Magpie robins the results were random, but with hummingbirds they were nearly perfect. So the photographer's perspective plays a part, especially if it is an interaction type of observation. We can all help it improve by marking which search images are right or wrong, using the like/dislike button under each set of images.
Overall, very exciting!

Posted by gs5 3 months ago

I tried 'Birds in Flight' and then added the observations to a project of the same name.

Posted by andrewgillespie 3 months ago

What is the ecological impact of using this AI? They are notoriously destructive to the environment due to excessive power use.
While the VLM may be free to use, where were the training images taken from? iNaturalist? Wikipedia commons? a library of CC-BY images? or were the images scraped indiscriminately?

Posted by astra_the_dragon 3 months ago

This is unbelievably rad, needed, and will help naturalists and scientists alike!!! Well done iNat team!!!!!

Posted by ecologistchris 3 months ago

I am generally hesitant about machine learning applications--I especially object to it being used to generate its own media--but using it for searching/sifting the volume of observations is useful. I have used it so far to search some terms of specific evidence types and used "&not_in_project=" to find and place those observations into projects where experts can lend their identification skill. Is there a forum thread yet for users to discuss their thoughts and potential use cases?

Posted by roboraptor 3 months ago

Tried this
https://www.inaturalist.org/vision_language_demo?q=%27rafnia+amplexicaulis%27+with+gall

I get galls - tick.
But it ignores the host.
If I put the host in the species box - I get nothing - which is probably the right answer.

Posted by dianastuder 3 months ago

These innovations are really exciting to see! It is especially exciting to think about how this technology will improve as it continues to be refined. Thanks, iNat team, for continuing to iterate, innovate, and help improve ways to connect people with data and results. I know there are limitations and dangers with incorrect data or assumptions, though I also have to balance error rates with the positive impacts as more people get connected with these tools and hopefully use them for positive future impacts.

Posted by scarletskylight 3 months ago

Hi @astra_the_dragon,

What is the ecological impact of using this AI? They are notoriously destructive to the environment due to excessive power use.

On the inference (output) side, we do predictions on a CPU. The power required isn't much different than doing computer vision suggestions.

It took a few days of running a few GPUs to prepare the 10 million photos for the demo. Based on the percentage of carbon-free power where our offices are, and the draw at the wall of our servers, I estimate this cost at 3.8 lbs of CO2e. I think this is equivalent to 4.3 miles driven in an average gas-powered vehicle.
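(As a rough check on that conversion: assuming the U.S. EPA's average of about 400 g of CO2 per mile for a gasoline car, which is roughly 0.88 lb per mile, 3.8 lb ÷ 0.88 lb/mile ≈ 4.3 miles.)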

The CLIP model itself was trained by OpenAI. I don't know the specifics of how it was trained, but this paper by Emma Strubell from UMass Amherst (https://arxiv.org/abs/1906.02243) is a good starting point for exploring the cost of training LLMs and related models like CLIP. One advantage to using a shared model like we did, instead of training our own CLIP model, is that each useful search helps amortize the cost of producing the original CLIP model.

where were the training images taken from? iNaturalist? Wikipedia commons? a library of CC-BY images? or were the images scraped indiscriminately?

Here's the paper that describes the CLIP architecture and how the model was trained: https://arxiv.org/abs/2103.00020

Best,
alex

Posted by alexshepard 3 months ago

Thank you for the details. I think iNaturalist has earned a lot of trust from its users and partners through the way that it carefully thinks through its decisions, successfully ties them back to a core goal related to conservation and environmental education, and transparently communicates them to its users. I hope that is a comfort to some when thinking about the partnerships that iNaturalist has had with large companies like Microsoft or Amazon, or in discussing new subjects like large language models.

Posted by natrudkins 3 months ago

I tried it with the term 'road kill' and, although not that enjoyable to see the results, it was very accurate and might serve as a helpful tool to add such observations to projects and/or add respective observation fields.

Posted by carnifex 3 months ago

Thank you @alexshepard for the rapid and comprehensive reply! I value the transparency and appreciate the information.

Posted by astra_the_dragon 3 months ago

"Cemeteries" as a visual search also returns interesting results - we have some projects in my area centered around studying sensitive species in remnant land on cemeteries. While there are other ways to limit results to areas that contain cemetery with mapped data, this could also be an interesting way to locate things. Unfortunately "cemetery" is too broad and is returning photos of piles of rocks, trash bins in the background that look like rocks, and retaining walls built of rock. "Gravestone" brings up pictures with things that have a similar texture to granite cemetery markers - sandy backgrounds, khaki pants, etc. It's really interesting to test and think about the possibilities as visual models improve!

Editing to add: my search was limited to the genus Panicum for this exercise.

Posted by scarletskylight 3 months ago

Exciting!

Posted by texas_nature_family 3 months ago

Personally, I'd prefer having a better UI for advanced searches over a natural language model search. The learning curve may be a bit steeper for folks who haven't used advanced search options before, but after that the speed of determining search parameters is WAY faster and clearer than trying to figure out what series of words will get a computer to actually show me what I'm looking for.

Posted by slugcycles 3 months ago

I'm really happy to see this experiment; general text search would be an extremely helpful tool to me. While the results were pretty low-quality with my searches (snake eating bird, snake eating frog), I did get pictures of snakes eating something (though only about 5-10% of the photos returned), and surfaced one record that the Snake Predation Records project hadn't already found. Excited to see the next iteration.

Posted by isaac_krone 3 months ago

Exciting!

Posted by jtcouncil 3 months ago

I'm interested in how the research team envisages improving this vision language search capability. The CLIP model currently used is derived from general (non-iNaturalist) images and uses a different training approach than the one we're familiar with through iNaturalist computer vision (CV).

If I understand correctly, CLIP is first pre-trained with a bunch of images that have descriptions and attributes, and then the bulk of the training is performed using images that do not have explicit descriptions but can be related to the earlier, known images. The idea is to build a model that can generalize about how to describe images rather than one that only recognizes the stuff that it was trained on. That makes a lot of sense for general purpose applications of finding images that match a written description.

However, my impression from exploring various searches is that CLIP is pretty limited in its awareness of the type of things that would interest an iNat user or researcher. And I don't see how CLIP's knowledge is likely to improve.

It's great to have a model that tries to answer the question "Show me photos of this type of thing" in addition to CV's ability to "Show me the most likely IDs for this photo". However, I'd say that the capabilities of the current vision language demo range from moderately successful down to useless. Even searching for something like "white flower" within a genus doesn't seem to produce predominantly white flowers. I get the strong impression that the general-purpose image understanding embodied in CLIP doesn't much care about the things that an iNat user is looking for in an image. Of course, this is version 0.1, and we shouldn't set the bar too high.

But it also seems that we lack an effective feedback loop to improve the capabilities of the vision language model. With CV, every human-provided ID can potentially improve the model's ability to recognize species. With CLIP, the model training is unconnected to iNat data. The only feedback right now seems to be the ability to provide a thumbs up/down on the search results. It's hard to see how the limited information and low volume of that feedback could make a significant difference to the model.

I'm guessing that the research team is still working to find a way to iteratively train a vision language model. But I would be interested to know what ideas people might have.

Posted by rupertclayton 3 months ago

Thanks for sharing those photos with us. I appreciate your work. I love those photos.

Posted by mattheewherrnandez 2 months ago

Platform of diverse information

Posted by suheelahmad99 2 months ago
