Clean up domain list in Excel - regex / macros?

Posted by Tim on Stack Overflow See other posts from Stack Overflow or by Tim
Published on 2012-11-05T09:19:49Z Indexed on 2012/11/05 11:01 UTC
Read the original article Hit count: 306

Filed under:
|
|
|
|

I have a huge spreadsheet of domains that I need to clean up as follows:

  1. Remove all http:// (simple replace all - "http://" with "")
  2. Remove any www. (simple replace all - "www." with "")
  3. Delete any sub-domains (delete the actual row completely, not just the subdomain from the url)
  4. Remove anything after the domain extension (i.e. website.com/blah/blahbah/ becomes just website.com (simple replace all - "/*" with "", then replace all "/" with "")

So what I'm left with is just a spreadsheet of clean domains like "website.com".

I think I've got 1, 2 and 4 sorted (as above), but I'm really struggling with 3.

Any ideas? Can I do this with / , and actually delete the row completely?

Sample data:

http://www.scholastic.com/kids/stacks/games/
http://imgworld.teamworkonline.com/
http://topfreegraphics.com/
http://www.workcircle.co.uk/
http://www.healthycanadians.gc.ca/index-eng.php
http://gsociology.icaap.org/methods/soft.html

Post 1, 2 and 4 would leave me with:

scholastic.com
imgworld.teamworkonline.com
topfreegraphics.com
workcircle.co.uk
healthycanadians.gc.ca
gsociology.icaap.org

It's those pesky sub-domains I need to just delete completely, just delete the row. I've realised I can't just search for 2 x ".", because obviously plenty of domain extensions (i.e .co.uk) include that.

Any help appreciated.

© Stack Overflow or respective owner

Related posts about regex

Related posts about excel