
Image SEO for multimodal AI

Over the last decade, image SEO has largely been an exercise in technical hygiene:

  • Compressing JPEGs to appease impatient visitors.
  • Writing alt text for accessibility.
  • Using lazy loading to keep LCP scores green.

While these practices remain fundamental to a healthy site, the rise of large multimodal models like ChatGPT and Gemini has presented new opportunities and challenges.

Multimodal search embeds text, images, and other content types in a shared vector space.
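To make the idea concrete, here is a minimal sketch of a shared vector space using the sentence-transformers library with a CLIP-style checkpoint. The model name, file path, and captions are illustrative assumptions, not a description of how any particular search engine implements retrieval.

    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    # CLIP-style model that maps images and text into one embedding space.
    model = SentenceTransformer("clip-ViT-B-32")

    # Encode an image and two candidate descriptions (file name is a placeholder).
    image_embedding = model.encode(Image.open("green-leather-watch.jpg"))
    text_embeddings = model.encode([
        "a green leather watch band on a wooden table",
        "a neon energy drink can",
    ])

    # Cosine similarity shows which description sits closest to the image
    # in the shared vector space.
    print(util.cos_sim(image_embedding, text_embeddings))

The closer an image and a piece of text land in that space, the more interchangeable they become at retrieval time.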

Now we prepare for the “machine view.”

Generative search makes visual content machine-readable by segmenting media and extracting text from imagery using optical character recognition (OCR).

Images should be readable by the machine’s eye.

If AI can’t decipher the text on a product’s packaging because of low contrast, or misses information because the image is blurry, that’s a big problem.

This article moves beyond the mechanics, shifting the focus from loading speed to machine readability.

Technical hygiene is still important

Before optimizing for machine intelligence, we must respect the gatekeeper: performance.

Images are a double-edged sword.

They drive engagement but are often the main cause of layout instability and slow load times.

The “good enough” standard has gone beyond WebP.

Once the asset is loaded, the real work begins.

Dig deeper: How multimodal retrieval is redefining SEO in the age of AI

Designing for the machine eye: Pixel-level readability

To large language models (LLMs), images, audio, and video are structured data sources.

They use a process called visual tokenization to break down the image into a grid of patches, or visual tokens, turning raw pixels into linear vectors.

This unified modeling allows AI to process “a picture of a [image token] on the table” as one coherent sentence.
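As a rough illustration of patching – not any specific model’s tokenizer – the sketch below slices an image into 16×16 patches and flattens each one into a vector. The patch size and file name are assumptions for demonstration.

    import numpy as np
    from PIL import Image

    PATCH = 16  # patch size in pixels; an illustrative choice

    img = np.asarray(Image.open("product.jpg").convert("RGB"))
    h, w, c = img.shape
    h, w = h - h % PATCH, w - w % PATCH   # crop so dimensions divide evenly
    img = img[:h, :w]

    # Reshape into a grid of patches, then flatten each patch into one vector.
    patches = img.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, PATCH * PATCH * c)

    print(patches.shape)  # (number_of_visual_tokens, 768) for RGB, 16x16 patches

Real models then project each flattened patch into the same embedding space as text tokens, which is why pixel-level quality carries straight through to the visual tokens.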

These systems rely on OCR to extract text directly from the image.

This is where quality becomes a factor.

If the image is heavily compressed with lossy artifacts, the visual tokens encode noise.

Over-aggressive optimization can cause the model to misinterpret those tokens, leading to false positives where the AI over-confidently identifies objects or text that aren’t really there because the “visual words” weren’t clear.

Repositioning alt text as grounding

In large language models, alt text takes on a new function: grounding.

It acts as a semantic cue that forces the model to resolve ambiguous visual tokens, helping to confirm its image interpretation.

As Zhang, Zhu, and Tambe note:

  • “By inserting text tokens next to the relevant visual patches, we create semantic anchors that provide ground-truth points for cross-modal attention, which guides the model.”

Tip: By describing the visual features of an image – light, texture, and text on an object – you provide high-quality grounding data that helps the machine eye associate visual tokens with text tokens.

Evaluating OCR failure points

Search engines like Google Lens and Gemini use OCR to read ingredients, instructions, and features directly from images.

They can then answer difficult user questions.

As a result, image SEO now extends to physical packaging.

Current labeling regulations – FDA 21 CFR 101.2 and EU 1169/2011 – allow type sizes as small as 4.5 pt to 6 pt, or 0.9 mm, for compact packaging.

  • “In the case of packaging or containers the largest surface of which has an area of less than 80 cm², the x-height of the font size referred to in paragraph 2 shall be equal to or greater than 0.9 mm.”

While this is legible to the human eye, it can fail the machine eye.

The minimum pixel resolution required for OCR-readable text is surprisingly high.

Character height should be at least 30 pixels.

Low contrast is also a problem: the text should differ from its background by at least 40 grayscale values.

Beware of stylized fonts, which can cause OCR systems to mistake “l” for “1” or “b” for “8.”

Beyond contrast, glare creates additional problems.

Glossy packaging reflects light, producing glare that obscures the text.

Packaging should be treated as a machine-readable feature.

If the AI can’t parse the packaging image because of blur or a stylized font, it may misread the information or, worse, skip the product altogether.
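One practical way to audit this is to run packaging shots through an OCR service and compare the output with the copy you expect a shopper – or an AI assistant – to read. The sketch below uses Google Cloud Vision’s TEXT_DETECTION feature; the file path and expected phrases are placeholders.

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()

    with open("packaging-shot.jpg", "rb") as f:   # placeholder file name
        image = vision.Image(content=f.read())

    response = client.text_detection(image=image)
    texts = response.text_annotations
    detected = texts[0].description.lower() if texts else ""

    # Phrases the packaging is supposed to communicate.
    for phrase in ["gluten free", "net wt 12 oz", "ingredients"]:
        status = "OK" if phrase in detected else "MISSING - check size, contrast, glare"
        print(f"{phrase}: {status}")

Anything flagged as missing is a candidate for a reshoot, a tighter crop, or a higher-contrast pack design.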

Originality as a proxy for experience and effort

Originality may sound like an intangible creative factor, but it can be measured as a quantifiable data point.

The original images act as a canonical signal.

The Google Cloud Vision API includes a feature called Web Detection, which returns a list of fullMatchingImages – exact duplicates found across the web – and pagesWithMatchingImages.

If your URL has the earliest index date for a distinct set of visual tokens (i.e., a specific angle of the product), Google credits your page as the origin of that visual information, improving its “experience” score.
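To check this for your own library, a minimal Web Detection call might look like the sketch below. The file path is a placeholder, and deciding who published a visual first still means reviewing the matching pages yourself.

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()

    with open("hero-product-shot.jpg", "rb") as f:   # placeholder file name
        image = vision.Image(content=f.read())

    annotations = client.web_detection(image=image).web_detection

    print("Exact duplicates found across the web:")
    for match in annotations.full_matching_images:
        print("  ", match.url)

    print("Pages embedding this image:")
    for page in annotations.pages_with_matching_images:
        print("  ", page.url)

A stock photo will return long lists of matches on other domains; a truly original shot should trace back to you.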

Dive deeper: Visual content and SEO: How to use images and videos


Co-occurrence test

AI identifies the objects in an image and uses their relationships to infer details about the product, its price point, and its target audience.

This makes object proximity a quality signal. To optimize for it, you need to audit your visual assets.

You can test this using tools like the Google Vision API.

To audit an entire media library systematically, you need to pull the raw JSON using the OBJECT_LOCALIZATION feature.

The API returns labels for things like “watch,” “plastic bag” and “disposable cup.”

Google provides this example, where the API returns the following information about the objects in the image:

Name | mid | Score | Bounds
Bicycle wheel | /m/01bqk0 | 0.89648587 | (0.32076266, 0.78941387), (0.43812272, 0.78941387), (0.43812272, 0.97331065), (0.32076266, 0.97331065)
Bicycle | /m/0199g | 0.886761 | (0.312, 0.6616471), (0.638353, 0.6616471), (0.638353, 0.9705882), (0.312, 0.9705882)
Bicycle wheel | /m/01bqk0 | 0.6345275 | (0.5125398, 0.760708), (0.6256646, 0.760708), (0.6256646, 0.94601655), (0.5125398, 0.94601655)

Good to know: mid contains a machine-generated identifier (MID) corresponding to the Google Knowledge Graph entry for the label.
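To pull these results for your own images, a minimal object localization request might look like the sketch below; the file path is a placeholder.

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()

    with open("lifestyle-shot.jpg", "rb") as f:   # placeholder file name
        image = vision.Image(content=f.read())

    objects = client.object_localization(image=image).localized_object_annotations

    for obj in objects:
        # obj.mid maps to a Google Knowledge Graph entity, as noted above.
        print(f"{obj.name} (mid: {obj.mid}, score: {obj.score:.2f})")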

The API does not know whether this context is good or bad.

You do, so check whether the visual neighbors tell the same story as your price point.

Lord Leathercraft green leather watch band

By picturing a green leather watch band next to an antique brass compass and a warm wood-grain surface, Lord Leathercraft engineers a specific semantic signal: heritage exploration.

The juxtaposition of analog mechanics, aged metal, and tactile suede suggests timeless adventure and old-world sophistication.

Picture the same watch next to a neon energy drink and a plastic digital stopwatch, and the narrative turns to dissonance.

The visual context now signals mass-market utility, undercutting the perceived value of the product.

Dive deeper: How to make products machine-readable for multimodal AI search

Measuring emotional resonance

Besides objects, these models are increasingly capable of reading emotions.

APIs, such as Google Cloud Vision, can measure emotional attributes by assigning confidence scores to emotions such as “happiness,” “sadness,” and “surprise” detected on people’s faces.

This creates a new vector for optimization: emotional alignment.

If you’re selling fun summer clothes but the models look moody or neutral – a common trope in high-fashion photography – the AI may devalue the image for that query because the visual emotion conflicts with the search intent.

For a quick check without writing code, use Google Cloud Vision’s live drag-and-drop demo to review the four main emotions: joy, sorrow, anger, and surprise.

For a positive intent, such as “happy family dinner,” you want the joy attribute to register as VERY_LIKELY.

If it reads POSSIBLE or UNLIKELY, the signal is too weak for the machine to confidently identify the image as happy.

For a rigorous audit:

  • Run a batch of images through the API.
  • Request the FACE_DETECTION feature and look for the faceAnnotations object in the JSON response.
  • Review the likelihood fields.

The API returns these values as enums, or fixed categories.

This example comes directly from the official documentation:

          "rollAngle": 1.5912293,
          "panAngle": -22.01964,
          "tiltAngle": -1.4997566,
          "detectionConfidence": 0.9310801,
          "landmarkingConfidence": 0.5775582,
          "joyLikelihood": "VERY_LIKELY",
          "sorrowLikelihood": "VERY_UNLIKELY",
          "angerLikelihood": "VERY_UNLIKELY",
          "surpriseLikelihood": "VERY_UNLIKELY",
          "underExposedLikelihood": "VERY_UNLIKELY",
          "blurredLikelihood": "VERY_UNLIKELY",
          "headwearLikelihood": "POSSIBLE"

The API grades likelihood on a fixed scale.

The goal is to move your key images from POSSIBLE to LIKELY or VERY_LIKELY for the target emotion.

  • UNKNOWN (data gap).
  • VERY_UNLIKELY (strong negative signal).
  • UNLIKELY.
  • POSSIBLE (neutral or ambiguous).
  • LIKELY.
  • VERY_LIKELY (strong positive signal – aim for this).

Use these benchmarks

You can’t adjust for emotion if the machine can’t see the person properly.

If detectionConfidence is less than 0.60, the AI had trouble recognizing the face.

As a result, any emotion reading tied to that face is statistically unreliable noise.

  • 0.90+ (Good): High definition, forward-facing, well-lit. The AI is confident. Trust the emotion scores.
  • 0.70-0.89 (Acceptable): Fine for background faces or secondary lifestyle shots.
  • < 0.60 (Fail): The face may be too small, poorly lit, in side profile, or obscured by shadows or sunglasses.

While Google’s documentation does not provide this guidance, and Microsoft provides limited access to its Azure AI Face service, Amazon Rekognition’s documentation notes that:

  • “[A] lower threshold (e.g., 80%) may be sufficient to identify family members in photographs.”
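Putting these pieces together, the sketch below batches images through FACE_DETECTION, skips faces the model barely detects, and flags shots whose joy signal is weaker than LIKELY. The file names, the 0.60 cutoff, and the choice of joy as the target emotion follow the guidance above but are otherwise illustrative.

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    TARGET = vision.Likelihood.LIKELY   # accept LIKELY or VERY_LIKELY

    for path in ["summer-dress-hero.jpg", "family-dinner.jpg"]:   # placeholder files
        with open(path, "rb") as f:
            image = vision.Image(content=f.read())

        for face in client.face_detection(image=image).face_annotations:
            if face.detection_confidence < 0.60:
                continue   # unreliable detection - any emotion reading is noise
            if face.joy_likelihood >= TARGET:
                print(f"{path}: strong joy signal ({face.joy_likelihood.name})")
            else:
                print(f"{path}: weak joy signal ({face.joy_likelihood.name})")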

Bridging the semantic gap between pixels and meaning

Treat visual assets with the same planning and strategic intent as primary content.

The semantic gap between image and text disappears.

Images are processed as part of a language sequence.

The quality, clarity, and semantic accuracy of the pixels themselves are now as important as the keywords on the page.

