Scrape the HP website for drivers. MUST USE: Perl, Web::Scraper
Looking for quality, clean, reusable, modern Perl here. Comments expected, so fluent English speakers only.
## Deliverables
Start URL:
[login to view URL]
Follow these product links on the start URL:
Handheld Printing
Multifunction and All-in-One
Network Print Servers
Printers
Second URLs:
For each of the above products, traverse the links under it until you reach a product page (which may be one level or several levels deep). For example, follow these links:
Printers > HP LaserJet Printers > HP LaserJet P4500 Printer series > HP LaserJet P4515xm Printer (you should reach the following url if you followed the instructions correctly: [login to view URL])
You will now have arrived at the product page for the HP LaserJet P4515xm Printer. $PRODUCT_NAME1 should be set to the text of the last link traversed to get here ('HP LaserJet P4515xm Printer' in this example). $PRODUCT_NAME2 should be set to 'HP LaserJet P4510 Printer series', which appears on the product page itself.
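The traversal described above can be sketched roughly as follows. This is in Python for illustration only (the delivered code must still be Perl with Web::Scraper). The names `walk_to_products`, `get_links`, and `is_product_page` are hypothetical, and how a product page is actually recognised on the HP site is an assumption left to the implementer.

```python
def walk_to_products(start_url, get_links, is_product_page):
    """Yield (product_url, last_link_text) for each product page under start_url.

    get_links(url) is assumed to return (link_text, child_url) pairs.
    is_product_page(url) is assumed to tell a product page from a category page.
    """
    seen = set()                    # guards against loops in the link structure
    stack = [(start_url, None)]     # (url, text of the link that led to it)
    while stack:
        url, link_text = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        if is_product_page(url):
            # The last link text is what PRODUCT_NAME1 should hold.
            yield url, link_text
            continue
        # Not a product page yet: queue every link one level further down.
        for text, child_url in get_links(url):
            stack.append((child_url, text))
```

Because the walk keeps going until a product page is reached, it does not matter whether a product sits two levels or five levels below the start URL.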
The scraper must handle all languages and all operating systems. For the purposes of this explanation, make sure English (American) is selected, and select Microsoft Windows 7 (32-bit) to reach the third URL.
THIRD URL:
You should be here if you followed the instructions correctly: [login to view URL]
This page is where we get the rest of our variables. The first two are $TYPE and $TYPE_DESCRIPTION. Example: 'Driver - Universal Print Driver' gives $TYPE = Driver and $TYPE_DESCRIPTION = Universal Print Driver. (Note: sometimes the label is a single word such as 'Firmware', in which case set both variables to 'Firmware' or whatever the single type is.)
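To make the splitting rule concrete, here is a minimal sketch in Python, for illustration only (the delivered code must still be Perl). The function name `split_type_label` is hypothetical, and it assumes the separator is always ' - ' as in the example above.

```python
def split_type_label(label):
    """Split a type label into (TYPE, TYPE_DESCRIPTION)."""
    # Split on the first ' - ' only, so a description containing dashes survives.
    head, sep, tail = label.strip().partition(" - ")
    if not sep:
        # Single-word labels such as 'Firmware' fill both variables.
        return head, head
    return head.strip(), tail.strip()
```

For example, `split_type_label('Driver - Universal Print Driver')` gives `('Driver', 'Universal Print Driver')`, and `split_type_label('Firmware')` gives `('Firmware', 'Firmware')`.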
For each ($TYPE, $TYPE_DESCRIPTION) pair we need every download listed under it, along with its details. For the first download on our page, the output row (CSV, tab-delimited, or MySQL) would look like:
PRODUCT_NAME1,PRODUCT_NAME2,TYPE,TYPE_DESCRIPTION,DESCRIPTION,VERSION,DATE,SIZE,PRODUCT_URL,DRIVER_URL
HP LaserJet P4515xm Printer,HP LaserJet P4510 Printer series,Driver,Universal Print Driver,1 - HP Universal Print Driver for Windows PCL6,5.5.0.12834,27 Jun 2012,[login to view URL],$DIRECT_URL_TO_DOWNLOAD
Note the last column, DRIVER_URL, whose value is shown as $DIRECT_URL_TO_DOWNLOAD. Working out the direct URL is left to you, as the download button uses JavaScript to construct it.
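As a rough illustration of the CSV variant of the output row, the sketch below uses Python's standard `csv` module (for illustration only; the delivered code must still be Perl). The helper name `rows_to_csv` is hypothetical, and the SIZE, PRODUCT_URL, and DRIVER_URL values are placeholders, not real data.

```python
import csv
import io

# Column order exactly as given in the header line above.
COLUMNS = [
    "PRODUCT_NAME1", "PRODUCT_NAME2", "TYPE", "TYPE_DESCRIPTION",
    "DESCRIPTION", "VERSION", "DATE", "SIZE", "PRODUCT_URL", "DRIVER_URL",
]

def rows_to_csv(rows):
    """Render a list of row dicts as CSV text, header line first."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # fields containing commas are quoted automatically
    return buf.getvalue()

example_row = {
    "PRODUCT_NAME1": "HP LaserJet P4515xm Printer",
    "PRODUCT_NAME2": "HP LaserJet P4510 Printer series",
    "TYPE": "Driver",
    "TYPE_DESCRIPTION": "Universal Print Driver",
    "DESCRIPTION": "1 - HP Universal Print Driver for Windows PCL6",
    "VERSION": "5.5.0.12834",
    "DATE": "27 Jun 2012",
    "SIZE": "SIZE_PLACEHOLDER",                # placeholder, not real data
    "PRODUCT_URL": "PRODUCT_URL_PLACEHOLDER",  # placeholder, not real data
    "DRIVER_URL": "DRIVER_URL_PLACEHOLDER",    # placeholder, not real data
}
```

Using a proper CSV writer rather than joining fields by hand matters because a description that contains a comma would otherwise shift every column after it.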
NOTES:
1. If a download item's description on the product's download page says '(Downloadable Driver Not Available)', skip that item. (ex. [login to view URL])
2. Also skip the item if its download link says 'obtain software' (ex. same as above example)
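The two skip rules above can be sketched as a single check. This is in Python for illustration only (the delivered code must still be Perl). The function name `should_skip` is hypothetical, and the case-insensitive substring matching is an assumption about how the text appears on the page.

```python
SKIP_DESCRIPTION_MARKER = "(downloadable driver not available)"
SKIP_LINK_MARKER = "obtain software"

def should_skip(description, link_text):
    """Return True when a download item falls under note 1 or note 2."""
    # Compare case-insensitively in case the page capitalises the text differently.
    desc = description.casefold()
    link = link_text.casefold()
    return SKIP_DESCRIPTION_MARKER in desc or SKIP_LINK_MARKER in link
```

Keeping both rules in one function means there is a single place to update if further skip conditions turn up.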
REQUIREMENTS:
1. Written in Perl, using Web::Scraper
2. Be familiar with Perl best practices. Modular, documented, don't repeat yourself, etc.
3. Modern Perl please