crawler

Cancelled, posted Jul 29, 2010, paid on delivery

*Expert rating removed as it seems it is not very popular here.

The crawler needs to be capable of completing the tasks below:

1. Collect external links from "list pages" specified by URL (e.g. specified categories of dmoz, yahoo, and 5-10 other major link directories). A PHP sketch of this step follows the list.

   1. Analyze the PageRank of the collected links and save only sites with a PageRank higher than 3.

   2. Find RSS feed(s) on the pages found above. Save the RSS feed(s) in the DB and make the crawling result downloadable as CSV.

2. Collect Twitter URLs from specified sections of [[url removed, login to view]][1]. Visit the Twitter accounts whose follower count is greater than X (X and the URL / wefollow section are defined at the start of the crawl) and save the RSS of those accounts. A sketch of the follower filter follows the list.

   1. There is a DFD image which clarifies this, but I can't attach it here...

3. Minimalistic UI where the results of crawls can be accessed in a list and downloaded as CSV. The list contains: date of crawl, URL where the crawl started. A sketch of the CSV export follows the list.
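
A minimal PHP sketch of the list-page crawl in task 1, assuming the directory pages are plain HTML and the DOM extension is available. `getPageRank()` is a hypothetical placeholder, since the brief does not say which PageRank source should be queried:

```php
<?php
// Sketch of task 1: collect external links from a directory "list page",
// keep only sites whose PageRank is above 3, and discover their RSS feeds.

function collectExternalLinks(string $listUrl): array
{
    $html = @file_get_contents($listUrl);
    if ($html === false) {
        return [];
    }
    $doc = new DOMDocument();
    @$doc->loadHTML($html);

    $listHost = parse_url($listUrl, PHP_URL_HOST);
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        $host = parse_url($href, PHP_URL_HOST);
        // Keep only absolute links that point away from the directory itself.
        if (!empty($host) && $host !== $listHost) {
            $links[$href] = true;
        }
    }
    return array_keys($links);
}

function findRssFeeds(string $siteUrl): array
{
    $html = @file_get_contents($siteUrl);
    if ($html === false) {
        return [];
    }
    $doc = new DOMDocument();
    @$doc->loadHTML($html);

    $feeds = [];
    foreach ($doc->getElementsByTagName('link') as $link) {
        $type = strtolower($link->getAttribute('type'));
        if (in_array($type, ['application/rss+xml', 'application/atom+xml'], true)) {
            $feeds[] = $link->getAttribute('href');
        }
    }
    return $feeds;
}

function getPageRank(string $url): int
{
    // Placeholder: query whatever PageRank source the project settles on.
    return 0;
}

// Example crawl of a single list page.
foreach (collectExternalLinks('http://www.dmoz.org/Computers/') as $url) {
    if (getPageRank($url) > 3) {
        foreach (findRssFeeds($url) as $feed) {
            echo "$url -> $feed\n";
        }
    }
}
```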
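
A sketch of the follower-count filter in task 2. The `fetchFollowerCount()` helper and the feed URL scheme are placeholders, since the brief does not specify how the follower number or the Twitter RSS should be obtained (API vs. profile-page scraping):

```php
<?php
// Sketch of task 2: given Twitter profile URLs collected from a wefollow
// section, keep only accounts with more than $minFollowers followers and
// record an RSS feed URL for each of them.

function fetchFollowerCount(string $twitterUrl): int
{
    // Placeholder: query the Twitter API or parse the profile page here.
    return 0;
}

function filterTwitterAccounts(array $twitterUrls, int $minFollowers): array
{
    $kept = [];
    foreach ($twitterUrls as $url) {
        if (fetchFollowerCount($url) > $minFollowers) {
            // Assumed feed location; the real URL scheme would be
            // confirmed during implementation.
            $kept[$url] = rtrim($url, '/') . '/rss';
        }
    }
    return $kept;
}

// $minFollowers ("X" in the brief) is defined when the crawl is started.
$accounts = filterTwitterAccounts(
    ['http://twitter.com/example_user'],   // URLs scraped from the wefollow section
    1000
);
print_r($accounts);
```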
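
A sketch of the CSV download in task 3, assuming a MySQL table named `crawls` with columns `crawl_date`, `start_url`, and `feed_url`; these names are illustrative, not specified in the brief, and the real schema would follow the DFD:

```php
<?php
// Sketch of task 3: stream the results of one crawl as a CSV download.

$pdo = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'password');

header('Content-Type: text/csv');
header('Content-Disposition: attachment; filename="crawl_results.csv"');

$out = fopen('php://output', 'w');
fputcsv($out, ['crawl_date', 'start_url', 'feed_url']);

$stmt = $pdo->prepare(
    'SELECT crawl_date, start_url, feed_url FROM crawls WHERE crawl_id = ?'
);
$stmt->execute([$_GET['id'] ?? 0]);

while ($row = $stmt->fetch(PDO::FETCH_NUM)) {
    fputcsv($out, $row);
}
fclose($out);
```

The listing page itself would run a plain `SELECT crawl_date, start_url FROM crawls` and render each row with a link to this download script.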

## Deliverables

DFD attached.

Engineering, Linux, MySQL, PHP, Project Management, Software Architecture, Software Testing, Web Hosting, Website Management, Website Testing

Project ID: #3608648

About the Project

Remote project, active Jul 29, 2010