Web-scrape data for 120929 companies, including 7000+ email addresses, from [login to view URL] supplier references

Cancelled, posted 5 years ago, paid on delivery


I work as a salesperson at a surveying firm in Hong Kong. I need to find clients on a daily basis, but my company provides virtually no support for lead generation. I have decided to build an Excel spreadsheet based on this Trade Development Council (TDC) link: [login to view URL]

If you type "limited" into the company-name search box on the website, data for 120929 companies is available as of yesterday, 16/6/2018, across 2016 pages at 60 results per page. I have found some problems.
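The page count can be checked from the two figures quoted above. A minimal sketch in Python (the variable names are only illustrative):

```python
import math

TOTAL_COMPANIES = 120929  # figure quoted in this posting
RESULTS_PER_PAGE = 60     # figure quoted in this posting

# The last page is only partly filled, so round up.
pages = math.ceil(TOTAL_COMPANIES / RESULTS_PER_PAGE)

# Rows on the final page: everything not covered by the full pages before it.
last_page_rows = TOTAL_COMPANIES - (pages - 1) * RESULTS_PER_PAGE

print(pages, last_page_rows)  # prints "2016 29"
```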

The first problem: both the fax and the telephone numbers are stored as JPEG images. I suggest using OCR to convert the two sets of numbers to text.
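OCR output for digits often needs a clean-up pass. The sketch below covers only that step, in Python, and assumes the raw text has already been read from the image by an OCR engine such as Tesseract. The helper name and the substitution table are hypothetical, based on commonly confused characters rather than on anything observed on the TDC site.

```python
# Hypothetical helper: cleans raw OCR text for a telephone or fax number.
# The substitution table is an assumption about typical OCR misreads.
OCR_FIXES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def clean_ocr_number(raw: str) -> str:
    """Return only the digits of an OCR-read number, keeping a leading '+'."""
    text = raw.strip().translate(OCR_FIXES)
    prefix = "+" if text.startswith("+") else ""
    digits = "".join(ch for ch in text if ch.isdigit())
    return prefix + digits


print(clean_ocr_number("(852) 2S84 433O"))  # prints "85225844330"
```

Numbers that come out with an unexpected length can then be flagged for manual checking instead of being trusted blindly.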

The second problem: downloading the first 12 pages goes smoothly, but downloading the 13th page is blocked by the website.
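One possible cause of the block is the request rate; this is an assumption, and the site may block for other reasons. Below is a minimal Python sketch of pacing requests, with a delay that grows after each refused page. The function name and the delay values are hypothetical.

```python
import random


def next_delay(failures: int, base: float = 2.0, cap: float = 120.0) -> float:
    """Seconds to wait before the next page request.

    failures: number of consecutive blocked responses so far.
    The delay doubles with each failure, is capped at `cap`, and gets up to
    one second of random jitter so requests are not evenly spaced.
    """
    delay = min(cap, base * (2 ** failures))
    return delay + random.uniform(0.0, 1.0)


# Show how the wait grows over the first few consecutive failures.
for n in range(4):
    print(n, round(next_delay(n), 1))
```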

The third problem: about 7000 email addresses are on the TDC supplier reference pages.

In short, the project consists of 2 parts. The first part is to turn the webpages into data: download the datasets for 120,000+ companies into an MS Excel spreadsheet with the following fields:

1. Company name

2. Year of Establishment

3. Number of Staff

4. Nature of business

5. Annual Turnover

6. Industry product/services range

7. Office address

8. Country / Region

9. Telephone (needs to be converted from the image using OCR)

10. Fax (needs to be converted from the image using OCR)

11. Website

12. Contact person

13. Title of contact person
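The 13 fields above could be written to a CSV file, which Excel opens directly. A minimal Python sketch follows; the function name is hypothetical and the sample row is a placeholder, not real company data.

```python
import csv
import io

# Column names mirror the 13 fields listed above.
FIELDS = [
    "Company name", "Year of establishment", "Number of staff",
    "Nature of business", "Annual turnover",
    "Industry product/services range", "Office address",
    "Country / Region", "Telephone", "Fax", "Website",
    "Contact person", "Title of contact person",
]


def rows_to_csv(rows):
    """Serialise a list of dicts to CSV text; missing fields become empty cells."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=FIELDS, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Placeholder row, not real company data.
sample = [{"Company name": "Example Limited", "Country / Region": "Hong Kong"}]
print(rows_to_csv(sample))
```

Using `restval=""` leaves a cell blank when a company page does not show a field, so every row keeps the same 13 columns.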

Visit: [login to view URL]

Product pages are NOT needed. The important pages are COMPANY and CONTACT.

The second part of this project is to collect the 7,000+ email addresses of the 7,000+ companies that have supplier references.

14. Email address (7000+ email addresses from the 7000+ companies showing the "supplier references" ✓✓ logo)

The project is expected to be finished within 3 days. The deadline is 21 June 2018.

Language: English.

My budget is US $50, awarded to the project winner. I will pay the NDA (non-disclosure agreement) fee.

Data Mining, Optical Character Recognition (OCR), Web Scraping

Project ID: #17186366

About the project

3 proposals, remote project, active 5 years ago