Internet Programming Project 1 1/2

Due Friday Feb 5th

Summary:

You will write a Java program (a web bot) that goes to a web page, downloads the page, retrieves the relevant content, and saves it to a file.


Details

Your bot should automatically follow redirects.
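
As a rough illustration of the redirect requirement, here is a minimal sketch using only the standard library. java.net.HttpURLConnection follows redirects by default, but it does not switch protocols (for example from http to https), so this sketch follows the Location header by hand. The class name RedirectFetcher, the five-hop limit, and the UTF-8 decoding are assumptions made for the example, not requirements of the project.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name and structure are illustrative only.
class RedirectFetcher {
    // Assumed cap on redirect hops so a redirect loop cannot hang the bot.
    static final int MAX_REDIRECTS = 5;

    // True for the HTTP status codes that carry a Location header to follow.
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
            || status == 307 || status == 308;
    }

    // Resolves a possibly relative Location header against the current URL.
    static String resolve(String currentUrl, String location) {
        return URI.create(currentUrl).resolve(location).toString();
    }

    // Downloads the page at startUrl, following redirects manually.
    static String fetch(String startUrl) throws IOException {
        String url = startUrl;
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // redirects are handled below
            int status = conn.getResponseCode();
            if (isRedirect(status)) {
                String location = conn.getHeaderField("Location");
                conn.disconnect();
                if (location == null) {
                    throw new IOException("Redirect without Location: " + url);
                }
                url = resolve(url, location);
                continue;
            }
            // Assumes the page is UTF-8; a fuller bot would check Content-Type.
            StringBuilder page = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    page.append(line).append('\n');
                }
            }
            return page.toString();
        }
        throw new IOException("Too many redirects starting from " + startUrl);
    }
}
```

A call such as RedirectFetcher.fetch("http://slashdot.org/") would return the page text after any redirects have been followed.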

Additional Libraries:
You may use any class in the Java library. In addition, you may use any class in the HTML Parser project. Any additional libraries need to be cleared by the instructor. (I have HTML Parser installed in my home directory at the moment.)

Part I

Write a bot in Java that lets a user supply a web page, a start marker, and an end marker. The bot should download the text of the web page and then grab all the text between each start and end marker in it. Once you have the text, strip out the HTML and write the text to a file, with each "chunk" (each piece of text from between a start and an end marker) separated by a separator string on a line by itself (a reasonable separator string might be "+++++++++++++++++++++++++++" or perhaps "========================").

You have to strip the HTML out after slicing the page up, because an HTML tag is an acceptable value for a start or end marker.
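
The Part I steps (slice first, strip afterwards, then write with a separator) could be sketched roughly as follows. This is only one possible shape: the class and method names are hypothetical, and the regular-expression tag stripper is a deliberately naive stand-in for what the HTML Parser library can do properly (it ignores entities, scripts, and comments).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for Part I; names are illustrative only.
class ChunkExtractor {
    static final String SEPARATOR = "==================================";

    // Returns the raw HTML between each start/end marker pair, markers excluded.
    // The markers are matched against the raw page, so they may be HTML tags.
    static List<String> extractChunks(String html, String start, String end) {
        if (start.isEmpty() || end.isEmpty()) {
            throw new IllegalArgumentException("Markers must not be empty");
        }
        List<String> chunks = new ArrayList<>();
        int from = 0;
        while (true) {
            int s = html.indexOf(start, from);
            if (s < 0) break;
            int bodyStart = s + start.length();
            int e = html.indexOf(end, bodyStart);
            if (e < 0) break; // unmatched start marker: stop
            chunks.add(html.substring(bodyStart, e));
            from = e + end.length();
        }
        return chunks;
    }

    // Naive tag stripper (an assumption for this sketch): removes anything
    // that looks like <...> and trims the result.
    static String stripHtml(String html) {
        return html.replaceAll("<[^>]*>", "").trim();
    }

    // Strips each chunk only after slicing, then joins the chunks with the
    // separator on a line by itself.
    static String formatChunks(List<String> rawChunks) {
        List<String> texts = new ArrayList<>();
        for (String raw : rawChunks) {
            texts.add(stripHtml(raw));
        }
        return String.join("\n" + SEPARATOR + "\n", texts) + "\n";
    }

    // Writes the finished text to the named file as UTF-8.
    static void writeOutput(String fileName, String text) throws IOException {
        Files.write(Paths.get(fileName), text.getBytes(StandardCharsets.UTF_8));
    }
}
```

Note that extractChunks works on the raw page, which is why a marker such as an HTML tag still matches; stripping first would remove the very markers being searched for.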

Part II


Extend your bot to follow one link for each start-end marker text chunk. Follow the link and include the text from the page you get in your file, after the chunk that the link came from. See below for sample requests and outputs. Separate the text of the linked page from the source "chunk" with a separator different from the one you used in Part I.

The user should be able to choose, at their discretion, either the end marker as the link to follow or the link immediately before the start marker. This is one decision for the entire page.
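
One way to read the link choice is sketched below: in both cases the bot takes the last href that appears before a cut-off position in the raw page, where the cut-off is either the start marker or the end of the end marker (so a link wrapping or contained in the end marker is found). That reading, the class name LinkPicker, and the simple href pattern are all assumptions for illustration; unquoted hrefs, HTML entities, and hrefs with characters that are illegal in a URI are not handled.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for Part II; names are illustrative only.
class LinkPicker {
    // Separator between a chunk and its linked page, distinct from Part I's.
    static final String LINK_SEPARATOR = "******************************************";

    // Matches href="..." or href='...' (quoted values only).
    private static final Pattern HREF = Pattern.compile(
        "href\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // Last href value that appears before position index, or null if none.
    static String lastHrefBefore(String html, int index) {
        Matcher m = HREF.matcher(html.substring(0, index));
        String last = null;
        while (m.find()) {
            last = m.group(1);
        }
        return last;
    }

    // Picks the link for one chunk and resolves it against the page URL.
    // startIndex and endIndex are the positions of the chunk's markers.
    static String pickLink(String pageUrl, String html, int startIndex,
                           int endIndex, String endMarker, boolean useEndMarker) {
        int cut = useEndMarker ? endIndex + endMarker.length() : startIndex;
        String href = lastHrefBefore(html, cut);
        return href == null ? null : URI.create(pageUrl).resolve(href).toString();
    }

    // Lays out one section: the chunk text, the link separator, then the
    // text of the linked page.
    static String formatSection(String chunkText, String linkedText) {
        return chunkText + "\n" + LINK_SEPARATOR + "\n" + linkedText + "\n";
    }
}
```

Because the choice is one decision for the entire page, the useEndMarker flag would be read once from the user and passed unchanged for every chunk.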

User input for both levels:
For all user input, you choose how it will be done: you can pop up a window, take command-line arguments to the program, or use the Scanner class to read user input from the command line one item at a time.
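
Two of these input styles can be combined, as in the following sketch: use command-line arguments when all three values are supplied, and otherwise prompt for them with a Scanner. The class name BotInput, the prompt wording, and the order of the values are assumptions made for the example.

```java
import java.util.Scanner;

// Hypothetical helper class; names and prompts are illustrative only.
class BotInput {
    // Returns {url, startMarker, endMarker}. Values come from args when all
    // three are present; otherwise each one is read as a line from the Scanner.
    static String[] readOptions(String[] args, Scanner in) {
        if (args.length >= 3) {
            return new String[] { args[0], args[1], args[2] };
        }
        String[] prompts = { "Page URL: ", "Start marker: ", "End marker: " };
        String[] values = new String[prompts.length];
        for (int i = 0; i < prompts.length; i++) {
            System.out.print(prompts[i]);
            values[i] = in.nextLine();
        }
        return values;
    }
}
```

From a main method this could be called as BotInput.readOptions(args, new Scanner(System.in)). Reading whole lines matters here because markers that are HTML tags usually contain spaces.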

Sample pages:

I'm likely to try your program on websites with formats similar to the following sites:
http://www.fanblogs.com/ (look at id=post)
http://sportsblogs.org/
http://www.massively.com/
http://slashdot.org/

All of them have similar formats.

Example output

If we take slashdot as the example:

I might use the following


If you ran a bot that implements Part I but not Part II as I'm typing this, it would produce the following text as one of its sections:

==================================

Your Rights Online: Buffalo Tech Gets New Trial On Wi-Fi Patent 2008-10-07 15:52
Posted by kdawson on Tuesday October 07, @03:52PM
from the oh-give-me-a-home-router dept.

MrLint writes "It's been a long, nearly two years of silence since CSIRO won a patent battle against Buffalo Tech, causing an injunction preventing the Austin company from selling wireless routers. On September 19, 2008, a Federal Circuit Court of Appeals ruled that CSIRO patent claims are invalid and Buffalo is getting a new trial. With any luck, we will be able to get our grubby hands on low-cost Wi-Fi routers again!"
Read More

======================================
If I ran your level 2 bot with the following user input


I would get the following output:

==================================

Your Rights Online: Buffalo Tech Gets New Trial On Wi-Fi Patent 2008-10-07 15:52
Posted by kdawson on Tuesday October 07, @03:52PM
from the oh-give-me-a-home-router dept.

MrLint writes "It's been a long, nearly two years of silence since CSIRO won a patent battle against Buffalo Tech, causing an injunction preventing the Austin company from selling wireless routers. On September 19, 2008, a Federal Circuit Court of Appeals ruled that CSIRO patent claims are invalid and Buffalo is getting a new trial. With any luck, we will be able to get our grubby hands on low-cost Wi-Fi routers again!"
Read More
******************************************
Buffalo Tech Gets New Trial On Wi-Fi Patent
Posted by kdawson on Tuesday October 07, @03:52PM
from the oh-give-me-a-home-router dept.

MrLint writes "It's been a long, nearly two years of silence since CSIRO won a patent battle against Buffalo Tech, causing an injunction preventing the Austin company from selling wireless routers. On September 19, 2008, a Federal Circuit Court of Appeals ruled that CSIRO patent claims are invalid and Buffalo is getting a new trial. With any luck, we will be able to get our grubby hands on low-cost Wi-Fi routers again!"

Related Stories

[+] Hardware: CSIRO Wireless Patent Reaffirmed In US Court 147 comments
An anonymous reader writes ""The CSIRO has won a landmark US legal battle against Buffalo Technology, under which it could receive royalties from every producer of wireless local area network (WLAN) products worldwide." From the article: "The patent, granted to CSIRO in 1996, encompasses elements of the 802.11a/g wireless technology that is now an industry standard. It stems from a system developed by CSIRO in the early '90s, 'to exchange large amounts of information wirelessly at high speed, within environments such as offices and homes,' said a CSIRO spokeswoman."
Firehose:Buffalo gets new trial on WiFi patents by MrLint (519792)
Buffalo Tech Gets New Trial On Wi-Fi Patent 49 More | Login | Reply
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.

  • Are they expensive? (Score:4, Informative)

    by geekoid (135745) <dadinportland@@@yahoo...com> on Tuesday October 07, @03:53PM (#25291029) Homepage Journal

    I paid 29 bucks for mine.

    Reply to This

    ...and yes you would likely get a lot more comments here.

======================================

If your output has a few text bits that were not visible in the web browser, that's OK. There seem to be some hidden tags that show up as text when extracting the text. For a 2 1/2 week project (with a midterm in the middle), we will focus on the main points.

Writeup:

Yes, this is one of my classes, so you have to tell me what you did and how you did it, rather than just passing things in and hoping I can figure it out.

You need a text-based writeup in standard written English which addresses the following:
As always, the writeup will be worth a fair portion of the project grade (20-35% usually)

Submitting:

Zip up your folder and rename your zip file to include your last name. (I might use SantoreLab1.zip)

Just submit this one to me by email. I don't want to have you all learning one system only to move to another one on a different server for the next project.