Normalizing Item Names & Synonyms
Posted
by
RabidFire
on Stack Overflow
See other posts from Stack Overflow
or by RabidFire
Published on 2011-01-04T06:12:53Z
Indexed on
2011/01/04
6:54 UTC
Read the original article
Hit count: 279
Consider an e-commerce application with multiple stores. Each store owner can edit the item catalog of his store.
My current database schema is as follows:
item_names: id | name | description | picture | common(BOOL)
items: id | item_name_id | picture | price | description | picture
item_synonyms: id | item_name_id | name | error(BOOL)
Notes: error
indicates a wrong spelling (eg. "Ericson"). description
and picture
of the item_names
table are "globals" that can optionally be overridden by "local" description
and picture
fields of the items
table (in case the store owner wants to supply a different picture for an item). common
helps separate unique item names ("Jimmy Joe's Cheese Pizza" from "Cheese Pizza")
I think the bright side of this schema is:
Optimized searching & Handling Synonyms: I can query the item_names
& item_synonyms
tables using name LIKE %QUERY%
and obtain the list of item_name_id
s that need to be joined with the items
table. (Examples of synonyms: "Sony Ericsson", "Sony Ericson", "X10", "X 10")
Autocompletion: Again, a simple query to the item_names
table. I can avoid the usage of DISTINCT
and it minimizes number of variations ("Sony Ericsson Xperia™ X10", "Sony Ericsson - Xperia X10", "Xperia X10, Sony Ericsson")
The down side would be:
Overhead: When inserting an item, I query item_names
to see if this name already exists. If not, I create a new entry. When deleting an item, I count the number of entries with the same name. If this is the only item with that name, I delete the entry from the item_names
table (just to keep things clean; accounts for possible erroneous submissions). And updating is the combination of both.
Weird Item Names: Store owners sometimes use sentences like "Harry Potter 1, 2 Books + CDs + Magic Hat". There's something off about having so much overhead to accommodate cases like this. This would perhaps be the prime reason I'm tempted to go for a schema like this:
items: id | name | picture | price | description | picture
(... with item_names
and item_synonyms
as utility tables that I could query)
- Is there a better schema you would suggested?
- Should item names be normalized for autocomplete? Is this probably what Facebook does for "School", "City" entries?
- Is the first schema or the second better/optimal for search?
Thanks in advance!
References: (1) Is normalizing a person's name going too far?, (2) Avoiding DISTINCT
© Stack Overflow or respective owner