Hello,
I've read your requirements carefully; let me briefly outline my experience.
1. Be able to show a history of experience in finding hard-to-find data across various web sources.
I have built around 30k web scrapers so far.
In particular, my experience covers searching for data across diverse sources, lead generation, bypassing anti-scraping protections, data cleaning, data mining, and more.
One of the most exciting projects I have done was scraping around 10k US automobile websites.
With roughly 10,000 sites, I had to build a robust script that crawled each of them once a day.
I decided to build the scrapers with Python's Scrapy framework.
To speed up the crawling, I used Scrapy's broad-crawl approach (a set of built-in settings that allow many concurrent requests at once).
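A broad crawl in Scrapy is configured through settings rather than a single function; the sketch below shows the kind of configuration meant above. The setting names are real Scrapy settings, but the specific values here are illustrative assumptions, not the exact ones used on that project.

```python
# Illustrative Scrapy broad-crawl settings; values are assumptions,
# the setting names themselves are standard Scrapy settings.
BROAD_CRAWL_SETTINGS = {
    "CONCURRENT_REQUESTS": 100,            # many requests in flight at once
    "CONCURRENT_REQUESTS_PER_DOMAIN": 8,   # spread load across ~10k sites
    "REACTOR_THREADPOOL_MAXSIZE": 20,      # extra threads for DNS lookups
    # Priority queue tuned for crawling many domains in parallel:
    "SCHEDULER_PRIORITY_QUEUE": "scrapy.pqueues.DownloaderAwarePriorityQueue",
    "RETRY_ENABLED": False,                # broad crawls favor throughput
    "DOWNLOAD_TIMEOUT": 15,                # give up quickly on slow sites
    "LOG_LEVEL": "INFO",                   # full DEBUG logging is too heavy
}
```

These would typically go in the project's `settings.py` or be passed to a `CrawlerProcess`.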
I also had to get around CAPTCHAs and rate limits.
To do that, I rotated IPs and user agents and tuned the download timeout (in practice, I used about 1,000 proxies).
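The rotation described above can be sketched as a Scrapy downloader middleware. This is a minimal illustration, not the production code: the proxy URLs and user-agent strings are placeholders, and a real deployment would load them from the 1,000-proxy pool.

```python
import random

class RotatingProxyUserAgentMiddleware:
    """Sketch of a Scrapy downloader middleware that assigns a random
    proxy and user agent to every outgoing request (placeholder lists)."""

    def __init__(self, proxies, user_agents, download_timeout=15):
        self.proxies = proxies                # e.g. ~1,000 proxy URLs
        self.user_agents = user_agents        # pool of browser UA strings
        self.download_timeout = download_timeout

    def process_request(self, request, spider):
        # Rotate identity per request so no single IP/UA hits rate limits.
        request.meta["proxy"] = random.choice(self.proxies)
        request.meta["download_timeout"] = self.download_timeout
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # returning None lets Scrapy continue downloading
```

The middleware would be enabled via `DOWNLOADER_MIDDLEWARES` in the project settings.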
Each site contains roughly 1k–5k ads, and the crawled HTML pages are stored in a designated folder under unique names.
The extracted items, such as VIN, Mileage, Price, Model, Name, and so on, are stored in MongoDB.
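Storage of the extracted items can be sketched as a Scrapy item pipeline backed by MongoDB. The database and collection names are assumptions for illustration, and the upsert-on-VIN behavior is one reasonable choice for a daily recrawl, not necessarily what the original project did.

```python
class MongoVehiclePipeline:
    """Sketch of a Scrapy item pipeline that stores vehicle records
    (VIN, Mileage, Price, Model, Name, ...) in MongoDB.
    Database/collection names here are illustrative assumptions."""

    def __init__(self, mongo_uri="mongodb://localhost:27017",
                 db_name="vehicles", collection_name="listings"):
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.collection = None  # set when the spider opens

    def open_spider(self, spider):
        # Imported here so the sketch reads without pymongo installed.
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        record = dict(item)
        # Upsert on VIN so the daily crawl updates rather than duplicates.
        self.collection.update_one(
            {"vin": record.get("vin")}, {"$set": record}, upsert=True
        )
        return item
```

Like the middleware, this would be registered via `ITEM_PIPELINES` in the settings.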
The crawler was deployed on AWS.
To distribute the crawlers, I used 100 CentOS VMs and built a system that dispatches commands from a central server to all crawlers and gathers the results back, similar to Gearman.
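The Gearman-like dispatch-and-gather pattern can be illustrated in miniature. The sketch below is a single-process stand-in, with threads playing the role of the 100 VMs and a queue standing in for the network transport; the real system would use a proper message broker or RPC layer.

```python
import queue
import threading

def worker(jobs, results):
    """One 'crawler VM': pull site commands, crawl, push results back."""
    while True:
        site = jobs.get()
        if site is None:          # sentinel value: shut this worker down
            jobs.task_done()
            break
        # Placeholder for actually crawling one automobile site.
        results.put((site, f"crawled:{site}"))
        jobs.task_done()

def run_crawl(sites, n_workers=4):
    """The 'server': dispatch commands to workers and gather results."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for site in sites:
        jobs.put(site)            # dispatch one command per site
    for _ in threads:
        jobs.put(None)            # one shutdown sentinel per worker
    for t in threads:
        t.join()
    return dict(results.queue)    # gather everything back on the server
```

Swapping the in-memory queues for a network-backed job queue turns this into the distributed version.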
The crawlers run 24/7.
Thank you
Regards