Clean up domain list in Excel - regex / macros?
- by Tim
I have a huge spreadsheet of domains that I need to clean up as follows:
Remove all http:// (simple replace all - "http://" with "")
Remove any www. (simple replace all - "www." with "")
Delete any sub-domains (delete the actual row completely, not just the subdomain from the url)
Remove anything after the domain extension (i.e. website.com/blah/blahbah/ becomes just website.com (simple replace all - "/*" with "", then replace all "/" with "")
So what I'm left with is just a spreadsheet of clean domains like "website.com".
I think I've got 1, 2 and 4 sorted (as above), but I'm really struggling with 3.
Any ideas? Can I do this with regexp / vba, and actually delete the row completely?
Sample data:
http://www.scholastic.com/kids/stacks/games/
http://imgworld.teamworkonline.com/
http://topfreegraphics.com/
http://www.workcircle.co.uk/
http://www.healthycanadians.gc.ca/index-eng.php
http://gsociology.icaap.org/methods/soft.html
Post 1, 2 and 4 would leave me with:
scholastic.com
imgworld.teamworkonline.com
topfreegraphics.com
workcircle.co.uk
healthycanadians.gc.ca
gsociology.icaap.org
It's those pesky sub-domains I need to just delete completely, just delete the row. I've realised I can't just search for 2 x ".", because obviously plenty of domain extensions (i.e .co.uk) include that.
Any help appreciated.