Web crawler that checks websites on the internet for AdSense script code; when it is found, contact information for the site is stored in a database.

Cancelled. Posted Jul 26, 2011. Paid on delivery.

A web crawler that checks websites on the internet for Google AdSense script code. When the script is found, contact information for the site is stored in a database. If no AdSense script is found, the contact information for the site should be stored in a separate database (for sites without Google AdSense).

There are two different AdSense source codes available. One contains the word "google_ad_client" and the other the word "GA_googleFillSlot". Depending on which word is found, the third database value must be set accordingly.
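The detection and storage rule above could be sketched roughly as follows. This is only an illustration under stated assumptions: the two marker strings come from the description, but the use of SQLite, two tables in one file instead of two physical databases, the table and column names, and the rule "first listed marker wins if both appear" are all hypothetical choices.

```python
import sqlite3

# The two marker strings come from the project description; the table and
# column names below are hypothetical.
ADSENSE_MARKERS = ("google_ad_client", "GA_googleFillSlot")


def detect_adsense_variant(html):
    """Return the first AdSense marker found in the page source, or None."""
    for marker in ADSENSE_MARKERS:
        if marker in html:
            return marker
    return None


def open_store(path=":memory:"):
    """Create the two assumed tables: one for AdSense sites, one for the rest."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS adsense_sites (url TEXT, contact TEXT, variant TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS no_adsense_sites (url TEXT, contact TEXT)"
    )
    return conn


def store_site(conn, url, contact, html):
    """Route the site into the matching table and return the detected variant."""
    variant = detect_adsense_variant(html)
    if variant is None:
        conn.execute("INSERT INTO no_adsense_sites VALUES (?, ?)", (url, contact))
    else:
        # The variant is the "third database value" mentioned in the description.
        conn.execute(
            "INSERT INTO adsense_sites VALUES (?, ?, ?)", (url, contact, variant)
        )
    conn.commit()
    return variant
```

Calling store_site(conn, url, contact, html) once per crawled site would then fill either table, with the variant column recording which of the two words was found.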

If there is a link to a contact page, the crawler should follow it and try to find a mail address (only the first one if more than one is available). Webmasters often mask the mail address to prevent spiders from grabbing [url removed, login to view]. The spider (crawler) should be able to find patterns which COULD be an email address. The crawler does not need to decrypt the masked mail address; this is not its job. Just find patterns which look like an email address and write what was found into the database. Pattern markers such as @ or dot in braces are a very good [url removed, login to view]. The easiest instruction for your crawler is this: if one of the following expressions appears in the source code, grab everything before and after this marker (including the marker) up to the next HTML or script tag:

mailto:at|AT|dot|DOT

opening brace: [ or ( or {

closing brace: ] or ) or }

The crawler should recognize (at) and (" dot' ] as markers (of course, it is enough to find the first marker). The crawler should then grab everything between the HTML tag before the marker and the HTML tag after it.
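One possible reading of the marker rule, as a minimal sketch: search for the first marker, then cut the text at the nearest angle bracket on each side. The names MARKER and find_email_candidate are hypothetical, and two points are assumptions rather than part of the description: a bare @ is treated as a marker, and "up to the next HTML or script tag" is simplified to "up to the nearest < or >".

```python
import re

# Marker pattern: "mailto:", a bare "@" (assumed to count as a marker), or
# at/dot wrapped in any opening and closing brace, with optional quotes or
# spaces inside the braces.
MARKER = re.compile(
    r"""mailto:
      | @
      | [\[({] [\s"']* (?:at|dot) [\s"']* [\])}]
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_email_candidate(html):
    """Return the text around the first marker, cut at the nearest angle brackets."""
    match = MARKER.search(html)
    if match is None:
        return None
    # Walk back to the closest "<" or ">" before the marker.
    start = max(
        html.rfind("<", 0, match.start()),
        html.rfind(">", 0, match.start()),
    ) + 1
    # Walk forward to the closest "<" or ">" after the marker.
    ends = [
        i
        for i in (html.find("<", match.end()), html.find(">", match.end()))
        if i != -1
    ]
    end = min(ends) if ends else len(html)
    return html[start:end].strip()
```

The result is deliberately left undecoded, so a masked address such as "info (at) example [dot] com" would be written to the database exactly as it appears in the page source.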

The crawler should process 5 web addresses simultaneously.
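A simple way to keep five addresses in flight at once is a thread pool with five workers. The sketch below is only one option, using the Python standard library; the helper names fetch and crawl, the 10 second timeout, and the choice to record failed sites as None are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

MAX_PARALLEL = 5  # "5 web addresses simultaneously" from the description


def fetch(url, timeout=10):
    """Download one page and return its source as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(urls, fetch_page=fetch):
    """Fetch a list of URLs with at most MAX_PARALLEL requests running.

    Returns a dict mapping each URL to its page source, or to None on failure.
    """
    def safe_fetch(url):
        try:
            return fetch_page(url)
        except Exception:  # one failing site should not stop the whole crawl
            return None

    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        # pool.map keeps the results in the same order as the input list.
        return dict(zip(urls, pool.map(safe_fetch, urls)))
```

The fetch_page parameter exists so that a different downloader (or a stub for testing) can be passed in without changing the concurrency logic.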

If you have any questions, please ask.

Software Architecture

Project ID: #3468675

About the project

Remote project. Active Jul 26, 2011