Home > POP / PPV / Redirect >

Simple SEO tip to find more targets


12-20-2011 12:38 PM #1 julien (Member)
Simple SEO tip to find more targets

When you want to target all the URLs a domain has, you have 3 options:

1/ Google
Long and boring.

2/ site: on Google
For example, search for 'site:bestbuy.com' in Google.
You'll find all the URLs indexed by Google.
The problem is that, although you can scrape the results with a scraper, you won't always get a complete list, and you may get your IP blocked fast.

3/ Use the SEO work to your advantage:
It's very easy.
Find out where the sitemap.xml file is located.
You will very often find it listed in the robots.txt file.
Almost every domain has a robots.txt at the root.

For example: http://www.bestbuy.com/robots.txt
You have the information you want:
'Sitemap: http://www.bestbuy.com/sitemap_p_index.xml'
Just browse through the different sitemaps and you'll find every URL of the domain.
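
The robots.txt lookup described above could be sketched in Python roughly as follows. This is only an illustration, not something from the original post: the function names sitemap_urls_from_robots and fetch_text are made up for the example, and it assumes the robots.txt uses the usual 'Sitemap: <url>' line format.

```python
import urllib.request


def sitemap_urls_from_robots(robots_txt: str) -> list[str]:
    """Return every URL listed on a 'Sitemap:' line of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # skip comment lines
        # Split on the first colon only, because the URL itself contains colons.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def fetch_text(url: str, timeout: float = 20.0) -> str:
    """Download a URL and decode it as text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling sitemap_urls_from_robots(fetch_text("http://www.bestbuy.com/robots.txt")) would then return whatever sitemap URLs that file lists at the time of the request. A site may list several sitemaps, or none at all.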


12-20-2011 01:37 PM #2 tijn (Moderator)

Nice one, thanks for posting this.

I used Scrapebox in the past to do bulk sitemap scraping

but never thought of using this for PPV.


12-20-2011 01:47 PM #3 julien (Member)

Yes, I used Scrapebox too, very useful tool.
But with 'site:', it has some limits.
That's why I tried to scrape URLs this way.

I'm still a noob at PPV, but a noob with a lot of targets


01-10-2012 06:05 PM #4 brianb (Member)

Hey Julien, LOVE this tip. Thanks. I'm having issues actually viewing the sitemaps. I'm finding lots of sitemap URLs by searching google with "inurl:sitename.com sitemap", but when I try to view the pages, I think they're saying not found because I'm not a crawler or something. What's the simplest way to take the sitemap URL and extract all domains from it? Do you have a simple script that does that? If that's not something you want to share, can you describe the process in a general way, so I know how to approach it?

thanks!


01-14-2012 02:18 AM #5 jred2002 (AMC Alumnus)

After I realized I had just repeated Julien's first post, I felt bad and thought I would contribute something a little more useful. I wrote something to get those URLs out of the sitemaps. It fakes the user agent to be Yahoo Slurp, so it should be able to access most sitemaps.

NOTE: This was written very fast and not tested a whole lot. Feel free to let me know if anyone has any issues. Also, the gzipped sitemaps take a long time to download.

Hope you find it useful.

Code:
http://www.mailshed.net/xml-sitemaps/
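
For anyone who would rather run something locally, here is a minimal Python sketch of the same general idea. It is not the script behind the link above. It sends a crawler-style user agent, unpacks gzipped sitemaps, and follows sitemap index files down to the page URLs. The names fetch_bytes, parse_sitemap and collect_urls, the max_sitemaps limit, and the exact User-Agent string are all assumptions made for this example.

```python
import gzip
import urllib.request
import xml.etree.ElementTree as ET

# Assumed crawler-style User-Agent string; any other value could be substituted.
USER_AGENT = ("Mozilla/5.0 (compatible; Yahoo! Slurp; "
              "http://help.yahoo.com/help/us/ysearch/slurp)")


def fetch_bytes(url, timeout=60.0):
    """Download a sitemap, sending the custom User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def _local(tag):
    """Strip the XML namespace from a tag name, e.g. '{ns}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(data):
    """Return (page_urls, child_sitemap_urls) found in one sitemap document."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    locs = []
    for entry in root:  # each <url> or <sitemap> element
        for child in entry:
            if _local(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    # A <sitemapindex> lists other sitemaps; a <urlset> lists pages.
    if _local(root.tag) == "sitemapindex":
        return [], locs
    return locs, []


def collect_urls(sitemap_url, fetch=fetch_bytes, max_sitemaps=50):
    """Walk a sitemap (or sitemap index) and return all page URLs found."""
    pending, seen, pages = [sitemap_url], set(), []
    while pending and len(seen) < max_sitemaps:
        url = pending.pop()
        if url in seen:
            continue
        seen.add(url)
        page_urls, children = parse_sitemap(fetch(url))
        pages.extend(page_urls)
        pending.extend(children)
    return pages
```

Calling collect_urls with the sitemap URL taken from robots.txt should give back a flat list of page URLs. The max_sitemaps cap is there because large sites can have hundreds of child sitemaps, and as noted above, the gzipped ones can be slow to download.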

