Normalizing Item Names & Synonyms

Posted by RabidFire on Stack Overflow See other posts from Stack Overflow or by RabidFire
Published on 2011-01-04T06:12:53Z Indexed on 2011/01/04 6:54 UTC
Read the original article Hit count: 274

Consider an e-commerce application with multiple stores. Each store owner can edit the item catalog of his store.

My current database schema is as follows:

item_names: id | name | description | picture | common(BOOL)
items: id | item_name_id | picture | price | description | picture
item_synonyms: id | item_name_id | name | error(BOOL)

Notes: error indicates a wrong spelling (eg. "Ericson"). description and picture of the item_names table are "globals" that can optionally be overridden by "local" description and picture fields of the items table (in case the store owner wants to supply a different picture for an item). common helps separate unique item names ("Jimmy Joe's Cheese Pizza" from "Cheese Pizza")

I think the bright side of this schema is:

Optimized searching & Handling Synonyms: I can query the item_names & item_synonyms tables using name LIKE %QUERY% and obtain the list of item_name_ids that need to be joined with the items table. (Examples of synonyms: "Sony Ericsson", "Sony Ericson", "X10", "X 10")

Autocompletion: Again, a simple query to the item_names table. I can avoid the usage of DISTINCT and it minimizes number of variations ("Sony Ericsson Xperia™ X10", "Sony Ericsson - Xperia X10", "Xperia X10, Sony Ericsson")

The down side would be:

Overhead: When inserting an item, I query item_names to see if this name already exists. If not, I create a new entry. When deleting an item, I count the number of entries with the same name. If this is the only item with that name, I delete the entry from the item_names table (just to keep things clean; accounts for possible erroneous submissions). And updating is the combination of both.

Weird Item Names: Store owners sometimes use sentences like "Harry Potter 1, 2 Books + CDs + Magic Hat". There's something off about having so much overhead to accommodate cases like this. This would perhaps be the prime reason I'm tempted to go for a schema like this:

items: id | name | picture | price | description | picture

(... with item_names and item_synonyms as utility tables that I could query)

  • Is there a better schema you would suggested?
  • Should item names be normalized for autocomplete? Is this probably what Facebook does for "School", "City" entries?
  • Is the first schema or the second better/optimal for search?

Thanks in advance!

References: (1) Is normalizing a person's name going too far?, (2) Avoiding DISTINCT

© Stack Overflow or respective owner

Related posts about database

Related posts about database-design