Product matching for ecommerce: resolving the same product across sellers
How product matching connects ecommerce listings across sellers using identifiers, attributes, images, and similarity checks.
Product matching decides whether two ecommerce pages are talking about the same physical item.
It is easy to describe and annoying to do well.
One retailer writes "Salomon Speedcross 6 GTX." Another writes "Salomon Men's Speedcross 6 Gore-Tex Trail Running Shoes." A marketplace seller leaves out the model year. A French store translates the color. One page has a GTIN, another only has an MPN, and a third has neither.
A person can usually see the match. A system needs evidence.
If matching is weak, price comparison gets noisy. Market intelligence double-counts products. Product data enrichment gives you cleaner listings, but still does not tell you which sellers carry the same item. You can scrape thousands of pages and still miss the question people actually ask: where else is this sold, and at what price?
The unit you match matters
In ecommerce, product matching connects store listings to a stable product identity.
The unit matters. A lot of bad matching comes from mixing these levels:
| Level | Question | Example |
|---|---|---|
| Listing | What did one source page say? | One Nike page for one shoe color |
| Variant | Is this the same physical configuration across sellers? | The same shoe model and color on Nike, Foot Locker, and Amazon |
| Product family | Are these sibling variants of the same line? | Black, white, and red colorways of the same model |
Most ecommerce scraping starts with listings. A listing has a URL, title, images, seller, price, options, SKU, and source identifiers. Matching turns those listings into comparable variants. Product families come later, after the exact variants are under control.
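One way to keep the three levels apart is to give each its own record type. The sketch below is illustrative only: the class and field names are assumptions, not a schema from any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class Listing:
    """What one source page said (hypothetical fields)."""
    url: str
    seller: str
    title: str
    price: float


@dataclass
class Variant:
    """One exact physical configuration, shared across sellers."""
    variant_id: str
    listings: list[Listing] = field(default_factory=list)

    def sellers(self) -> list[str]:
        # Distinct sellers carrying this exact configuration.
        return sorted({listing.seller for listing in self.listings})


@dataclass
class ProductFamily:
    """Sibling variants of the same line, for example different colorways."""
    family_id: str
    variants: list[Variant] = field(default_factory=list)
```

Listings hang off a variant, and variants hang off a family, so a price comparison reads from `Variant.listings` and never from the family level.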
That order is boring but useful. If you match too loosely, you compare a six-pack against a twelve-pack or last year's model against this year's model. If you match too strictly, the same product stays split across several identities and every downstream report inherits the mess.
Where the evidence breaks down
Identifiers help. They just do not cover enough of the open web.
GTINs (which include UPCs and EANs), MPNs, ASINs, and source SKUs are useful when they are present and trustworthy. Many ecommerce pages omit them, bury them in page data, expose store-specific SKUs only, or reuse the same parent identifier across several options.
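A cheap first trust check is the GTIN check digit. The sketch below validates the standard GS1 modulo-10 check digit for 8, 12, 13, and 14 digit codes; the function name is an assumption. Passing the check only means the number is well-formed, not that the seller attached it to the right product.

```python
def is_valid_gtin(code: str) -> bool:
    """Return True if code is a GTIN-8/12/13/14 with a correct check digit."""
    digits = code.strip()
    if not (digits.isascii() and digits.isdigit()):
        return False
    if len(digits) not in (8, 12, 13, 14):
        return False
    body, check = digits[:-1], int(digits[-1])
    # Weights alternate 3, 1, 3, ... starting from the digit next to the check digit.
    total = sum(
        int(d) * (3 if i % 2 == 0 else 1)
        for i, d in enumerate(reversed(body))
    )
    return (10 - total % 10) % 10 == check
```

Codes that fail can be kept as raw evidence but should not be treated as identifiers for matching.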
Marketplaces make it worse. One product detail page can contain offers from many sellers, each with its own seller metadata, fulfillment terms, and price history. The page identity, offer identity, and product identity are not the same thing.
Titles are useful, but brittle. Sellers add SEO phrases, reorder tokens, translate terms, abbreviate sizes, or merge pack count into the name. A string matcher that works for shoes can fail on skincare, supplements, replacement parts, or grocery bundles.
Images help, but they do not settle the question alone. Two sellers may use the same manufacturer image for different pack sizes. The same shoe can be photographed from different angles. A retailer may show a lifestyle image instead of the product on a white background.
Attributes catch many of the mistakes that identifiers, titles, and images miss. Color, material, capacity, gender, size system, flavor, pack count, compatibility, and model year all matter, but not in the same way for every category. Those fields need to be extracted and normalized before they can carry much weight.
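As a small illustration of attribute normalization, the sketch below pulls pack count and volume out of free text and converts volume to milliliters. The patterns and function names are assumptions that cover only a few common spellings; a real catalog would need category-specific rules.

```python
import re
from typing import Optional

# Conversion factors to milliliters for the units this sketch recognizes.
_UNIT_TO_ML = {"ml": 1.0, "cl": 10.0, "l": 1000.0}


def parse_volume_ml(text: str) -> Optional[float]:
    """Find the first volume such as '330ml', '0,5 L', or '1.5L'."""
    m = re.search(r"(\d+(?:[.,]\d+)?)\s*(ml|cl|l)\b", text.lower())
    if m is None:
        return None
    # Assumption: a comma is a decimal separator, not a thousands separator.
    value = float(m.group(1).replace(",", "."))
    return value * _UNIT_TO_ML[m.group(2)]


def parse_pack_count(text: str) -> Optional[int]:
    """Find a pack count such as '6-pack', '12 ct', or 'pack of 12'."""
    t = text.lower()
    m = re.search(r"(\d+)\s*-?\s*(?:pack|pk|count|ct)\b", t)
    if m is None:
        m = re.search(r"pack of\s*(\d+)", t)
    return int(m.group(1)) if m else None
```

Once both listings expose a numeric pack count and volume, a six-pack and a twelve-pack stop looking like the same product just because their titles and images agree.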
A matching pipeline for real catalogs
A matching system that survives real catalogs starts with hard evidence and uses similarity only where it has context.
Start with identifiers. If two listings share a reliable GTIN, and the brand and category context agree, they are usually the exact same product configuration. Brand plus MPN can work when GTIN is missing. Marketplace identifiers can help, but they need source context: not every marketplace ID behaves like a universal product ID.
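The identifier precedence described here might be sketched as follows. The dictionary keys and the function name are hypothetical, and one design choice is an assumption of this sketch: when both listings carry a GTIN and the values differ, the pair is rejected even if the MPNs agree.

```python
from typing import Optional


def _norm(value) -> str:
    """Casefold and trim a field; missing values become an empty string."""
    return str(value).strip().casefold() if value else ""


def _gtin14(value) -> str:
    """Left-pad a digit-only GTIN to 14 digits so 12 and 13 digit forms compare equal."""
    digits = _norm(value)
    return digits.zfill(14) if digits.isdigit() else ""


def identifier_match(a: dict, b: dict) -> Optional[str]:
    """Return the rule that links two listings ('gtin' or 'brand+mpn'), or None."""
    brand_ok = _norm(a.get("brand")) != "" and _norm(a.get("brand")) == _norm(b.get("brand"))
    category_ok = _norm(a.get("category")) == _norm(b.get("category"))
    gtin_a, gtin_b = _gtin14(a.get("gtin")), _gtin14(b.get("gtin"))
    if gtin_a and gtin_b:
        # Both sides carry a GTIN: it either confirms or blocks the match.
        return "gtin" if gtin_a == gtin_b and brand_ok and category_ok else None
    mpn_a, mpn_b = _norm(a.get("mpn")), _norm(b.get("mpn"))
    if brand_ok and mpn_a and mpn_a == mpn_b:
        return "brand+mpn"
    return None
```

Returning the rule name instead of a bare boolean keeps the evidence for the match available to later steps.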
Normalize the product before matching. Raw merchant text is noisy. Classify each listing into a taxonomy, translate important text to a common language, extract category-specific attributes, preserve identifiers, and keep listing identity separate from offer observations. That is the enrichment step.
Then narrow the candidate set. Do not compare every listing with every other listing. Filter by category, brand, identifier hints, and sometimes country or market. Smaller matching questions produce cleaner decisions.
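Narrowing the candidate set is often called blocking. A minimal sketch, assuming each listing is a dictionary with `category` and `brand` keys (hypothetical names), groups listings by those two fields and only generates pairs inside each group.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(listings):
    """Yield pairs of listings that share a category and a brand."""
    blocks = defaultdict(list)
    for listing in listings:
        key = (listing["category"], listing["brand"].strip().casefold())
        blocks[key].append(listing)
    for block in blocks.values():
        # Pairs are only formed inside a block, never across blocks.
        yield from combinations(block, 2)
```

With four listings there are six possible pairs; if only two of them share both brand and category, the matcher is asked one question instead of six.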
Use similarity checks when identifiers are missing or messy. Text and image similarity can catch cases where two titles look different but the product is the same. These checks work better after enrichment because the input is cleaner: normalized titles, attributes, category names, and product signals instead of raw page noise.
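To show why cleaner input helps, here is a deliberately simple stand-in for a text similarity check: Jaccard overlap on title tokens after a small alias substitution. The alias table and function names are assumptions, and this is not the similarity method the article refers to; production systems typically use richer text and image signals.

```python
import re

# Hypothetical alias table mapping spelling variants to one canonical token.
ALIASES = {"gore-tex": "gtx", "goretex": "gtx"}


def tokens(title: str) -> set[str]:
    """Lowercase a title, apply aliases, and split into alphanumeric tokens."""
    text = title.lower()
    for alias, canonical in ALIASES.items():
        text = text.replace(alias, canonical)
    return set(re.findall(r"[a-z0-9]+", text))


def jaccard(a: str, b: str) -> float:
    """Share of distinct tokens that two titles have in common."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

For "Salomon Speedcross 6 GTX" against "Salomon Men's Speedcross 6 Gore-Tex Trail Running Shoes", the alias step makes "GTX" and "Gore-Tex" count as the same token, which lifts the score from 3/11 to 4/9. The score is still well below 1, which is why title similarity alone should not decide a match.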
Keep the final decision inspectable. Exact identifiers should beat fuzzy similarity. Similarity matches should be scoped by brand and category, then accepted only above a threshold. Edge cases should stay unmatched until more evidence appears.
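These rules can be written as one small decision function that always returns its reasoning. The sketch below is an assumption-heavy illustration: the `Decision` type, the parameter names, and the 0.85 threshold are all hypothetical and would need tuning per category.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    matched: bool
    method: str   # "identifier", "similarity", or "none"
    score: float  # 1.0 for identifier matches, else the similarity score
    reason: str   # human-readable evidence for audits


def decide(
    shared_identifier: Optional[str],
    similarity: float,
    same_scope: bool,
    threshold: float = 0.85,  # hypothetical cutoff
) -> Decision:
    """Apply the precedence: identifiers first, then scoped similarity."""
    if shared_identifier:
        return Decision(True, "identifier", 1.0, f"shared {shared_identifier}")
    if not same_scope:
        # Similarity is ignored when brand or category disagree.
        return Decision(False, "none", similarity, "brand or category differs")
    if similarity >= threshold:
        return Decision(True, "similarity", similarity, f"similarity >= {threshold}")
    return Decision(False, "none", similarity, "below threshold, left unmatched")
```

Storing the `Decision` next to the match makes it possible to answer "why were these two listings joined?" later, and to re-run only the similarity matches when the threshold changes.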
How Extralt handles it
Extralt treats product matching as a product-data problem, not a dashboard trick.
Extract gets the source-page evidence: product text, images, SKUs, offers, prices, identifiers, options, availability, and seller context.
Enrich turns that source data into structured ecommerce records. It classifies products into a taxonomy, translates text to English when needed, extracts category-specific attributes, preserves identifiers, and creates Listings plus append-only Offers.
Extend does cross-seller product matching. It works over enriched data, not raw pages. Identifier matches come first. Similarity checks are scoped to compatible product contexts. The output is a product relationship graph: same-product variants, alternate listings, and alternative products. Complements are a later relationship type.
The useful part is reuse. Once the same product is resolved across sellers, that identity can power price monitoring, digital shelf analytics, product data enrichment, market intelligence, and agent-facing discovery.
Where matching gets used
Price comparison
If the same product appears across ten stores, matching connects those offers to one product identity. You can compare current prices, availability, shipping context, and seller type without maintaining a spreadsheet of URLs.
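Once offers carry a matched product identity, the comparison itself is a group-by. A minimal sketch, assuming each offer is a dictionary with `product_id`, `seller`, `price`, and `in_stock` keys (a hypothetical schema):

```python
from collections import defaultdict


def cheapest_offer_per_product(offers):
    """Return the lowest-priced in-stock offer for each matched product."""
    by_product = defaultdict(list)
    for offer in offers:
        if offer["in_stock"]:
            by_product[offer["product_id"]].append(offer)
    return {
        product_id: min(group, key=lambda o: o["price"])
        for product_id, group in by_product.items()
    }
```

The join key is the matched identity, not the URL, so a seller changing a page address does not break the comparison.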
MAP monitoring
Brands need to know whether resellers are advertising below the minimum advertised price. That requires matching seller offers to the product under policy, even when marketplace titles or seller SKUs differ from the internal catalog.
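With matched offers, a MAP check reduces to a lookup and a comparison. The sketch below assumes offers keyed by a matched `product_id` and a policy table mapping product IDs to minimum advertised prices; both structures and all names are hypothetical.

```python
def map_violations(offers, map_prices):
    """Return offers advertised below the MAP for their matched product."""
    violations = []
    for offer in offers:
        floor = map_prices.get(offer["product_id"])
        # Products without a policy entry are skipped, not flagged.
        if floor is not None and offer["price"] < floor:
            gap = round(floor - offer["price"], 2)
            violations.append({**offer, "map_price": floor, "gap": gap})
    return violations
```

An offer priced exactly at the MAP is not flagged, and an offer that was never matched to a product under policy cannot be checked at all, which is the dependency on matching the paragraph above describes.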
Assortment intelligence
Category teams need to know which brands and products are gaining share. Without matching, duplicates inflate counts. A market can look larger than it is simply because the same product appears under five titles.
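The inflation is easy to see by counting the same data twice: once per listing and once per matched product. A minimal sketch, assuming every listing already carries a `brand` and a matched `product_id` (hypothetical field names):

```python
from collections import Counter


def brand_counts(listings):
    """Return (listings per brand, distinct matched products per brand)."""
    raw = Counter(listing["brand"] for listing in listings)
    distinct = {(listing["brand"], listing["product_id"]) for listing in listings}
    unique = Counter(brand for brand, _ in distinct)
    return raw, unique
```

If one brand has a single product listed under five titles and another has two products listed once each, the raw count ranks the first brand ahead while the deduplicated count ranks it behind.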
Product data enrichment
A merchant's internal catalog can be enriched from open-web evidence: identifiers, category paths, attributes, price ranges, and external seller coverage. Matching keeps that evidence attached to the right product instead of a near-duplicate.
AI shopping agents
Agents cannot make reliable "where should I buy it?" recommendations if they cannot tell that two listings are the same product. Product matching gives agentic commerce a product identity it can query instead of a pile of similar pages.
Where matches go wrong
Title matching without product structure works in demos and breaks in production. Variants, bundles, translated descriptions, and marketplace pages all create false positives.
Identifier-only matching has the opposite problem. GTINs are strong evidence, but missing or malformed identifiers are common. A system that only matches by identifier leaves too many duplicates behind.
Another failure is hiding uncertainty. Product matching is probabilistic when identifiers are absent. The system should keep confidence and evidence available instead of pretending every match is equally certain.
The expensive failure is treating matching as a dashboard feature. If matching only exists inside a UI, the rest of your pipeline cannot reuse it. APIs, exports, analytics, and agents need stable product identities they can join against.
Questions to ask before you trust it
Ask these questions before trusting a vendor or internal system:
- Which identifiers do you preserve, and which do you treat as universal?
- How do you avoid matching different pack sizes, model years, regions, or variants?
- Do you use category-specific attributes, or only title similarity?
- Can I inspect the evidence behind a match?
- Does the matched identity appear in the exported data, or only inside the dashboard?
The answers reveal whether the system is matching products or only clustering similar pages.
Matching makes ecommerce data reusable
Scraping gives you listings. Enrichment gives those listings structure. Product matching connects the same product across sellers so the rest of the stack can reuse it.
If you are building ecommerce intelligence from the open web, matching cannot sit at the end as a manual cleanup step. It has to create product identities that analysts, applications, and agents can query.
For the product workflow, read about Extend and the product data enrichment use case.