Due Friday Feb 5th
Summary:
You will write a Java program (a web bot) that goes to a web page,
downloads the page, retrieves the relevant content and saves it to a file.
Details
Your bot should automatically follow redirects.
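One possible way to handle redirects is sketched below. It is only an illustration, not a required design: the class name PageFetcher, the limit of 5 hops, and the choice to decode the page as UTF-8 are all assumptions. The sketch follows redirects by hand because HttpURLConnection's built-in redirect handling does not cross from http to https.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name and limits are assumptions for illustration.
class PageFetcher {
    static final int MAX_REDIRECTS = 5; // assumed cap so a redirect loop cannot run forever

    // True for the HTTP status codes that carry a Location header to follow.
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    // Resolve a Location header value (absolute or relative) against the current address.
    static String resolve(String base, String location) {
        return URI.create(base).resolve(location).toString();
    }

    // Download the page text, following redirects manually.
    static String fetch(String address) throws IOException {
        String current = address;
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(current).toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // we handle redirects ourselves
            int status = conn.getResponseCode();
            if (isRedirect(status)) {
                String location = conn.getHeaderField("Location");
                conn.disconnect();
                if (location == null) {
                    throw new IOException("Redirect without a Location header");
                }
                current = resolve(current, location);
                continue;
            }
            try (InputStream in = conn.getInputStream()) {
                // Assumption: the page is UTF-8; a real bot might read the charset header.
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
        throw new IOException("Too many redirects starting from " + address);
    }
}
```

Calling PageFetcher.fetch("http://slashdot.org/") would then return the text of whatever page the chain of redirects ends on.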
Additional Libraries:
You may use any class in the Java library. In addition, you may use any class in the
HTML Parser project.
Any additional libraries need to be cleared by the instructor. (I have
HTML Parser installed in my home directory at the moment.)
Part I
Write a bot in Java that will allow a user to give a web page, a start
marker and an end marker. The bot should download the text of the web
page, and then grab all the text between each start and end marker in
the page. Once you have the text, strip out the HTML and write the text
to a file with each "chunk" (each piece of text from between a start
and end marker) separated with a separation string on a line by itself
(a reasonable separation string might be "+++++++++++++++++++++++++++"
or perhaps "========================" etc.)
You have to strip the HTML out after slicing the file up, because an HTML tag is an acceptable value for a start or end marker.
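The slice-first, strip-second order can be sketched as follows. This is a minimal illustration under stated assumptions, not the expected solution: the class name ChunkExtractor and its method names are made up, the markers are kept as part of each chunk (as the sample output further down appears to do), and the tag stripping uses a naive regular expression where a real bot might use the HTML Parser project instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; names are assumptions for illustration.
class ChunkExtractor {
    // Collect every span that runs from a start marker to the next end marker.
    // Matching is done on the raw HTML, so a marker may itself be an HTML tag.
    static List<String> extractChunks(String page, String start, String end) {
        List<String> chunks = new ArrayList<>();
        if (start.isEmpty() || end.isEmpty()) {
            return chunks; // an empty marker would match at every position
        }
        int from = 0;
        while (true) {
            int s = page.indexOf(start, from);
            if (s < 0) break;
            int e = page.indexOf(end, s + start.length());
            if (e < 0) break;
            chunks.add(page.substring(s, e + end.length())); // markers included
            from = e + end.length();
        }
        return chunks;
    }

    // Naive tag removal: replace anything that looks like a tag with a space,
    // then collapse runs of whitespace. Entities such as &amp; are left alone.
    static String stripHtml(String html) {
        return html.replaceAll("(?s)<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Build the file contents: a separator line, then the stripped chunk text.
    static String format(List<String> chunks, String separator) {
        StringBuilder out = new StringBuilder();
        for (String chunk : chunks) {
            out.append(separator).append('\n');
            out.append(stripHtml(chunk)).append('\n');
        }
        return out.toString();
    }
}
```

The string returned by format could then be written out with java.nio.file.Files.write or a PrintWriter, whichever you prefer.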
Part II
Extend your bot to be able to follow one link for each start-end
marker text chunk. Follow the link and include the text from the page you
get from following the link in your file, after the initial chunk that
you got the link from. See below for sample requests and outputs. You
should separate the text from the linked page from the source "chunk"
with a different separator than the separator you used for Part I.
You should have an option for the user to choose either the end marker
as the link to follow, or the link immediately before the start marker,
at the user's discretion. This is one decision for the entire page.
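One way to read the two link options is that both come down to "find the last link that opens before some position in the raw HTML". The sketch below takes that reading; it is an assumption, as are the class name LinkFinder, the keyword "start" (the sample input below only shows "end"), and the simple regular expression used to spot href attributes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names and keywords are assumptions for illustration.
class LinkFinder {
    // Matches an opening anchor tag and captures the quoted href value.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Return the href of the last anchor found entirely before position pos,
    // or null if there is none.
    static String lastLinkBefore(String page, int pos) {
        Matcher m = HREF.matcher(page);
        m.region(0, pos);
        String found = null;
        while (m.find()) {
            found = m.group(1);
        }
        return found;
    }

    // chunkStart: index where the start marker begins in the raw page.
    // chunkEnd:   index just past the end marker in the raw page.
    // mode "end" picks the link the end marker sits in (or is);
    // any other value picks the link immediately before the start marker.
    static String linkForChunk(String page, int chunkStart, int chunkEnd, String mode) {
        int pos = "end".equals(mode) ? chunkEnd : chunkStart;
        return lastLinkBefore(page, pos);
    }
}
```

The returned href may be relative, so it would need to be resolved against the address of the original page before it is fetched.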
User input for both levels.
For all user input, you choose how it will be done: you can pop up a
window, take command line arguments to the program, or use the Scanner
class to read in user input from the command line one item at a time.
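As a sketch of two of these options, the class below accepts the inputs either as command line arguments or, when too few are given, by prompting with a Scanner. The class name BotInput, the field names, the argument order, and the rule that a missing link choice means "Part I only" are all assumptions made for the example.

```java
import java.util.Scanner;

// Hypothetical holder for the user's choices; names are assumptions for illustration.
class BotInput {
    final String page;
    final String startMarker;
    final String endMarker;
    final String linkChoice; // null when no link should be followed (Part I only)

    BotInput(String page, String startMarker, String endMarker, String linkChoice) {
        this.page = page;
        this.startMarker = startMarker;
        this.endMarker = endMarker;
        this.linkChoice = linkChoice;
    }

    // Assumed argument order: page, start marker, end marker, optional link choice.
    static BotInput fromArgs(String[] args) {
        if (args.length < 3) {
            throw new IllegalArgumentException(
                    "usage: <page> <start marker> <end marker> [link choice]");
        }
        return new BotInput(args[0], args[1], args[2], args.length > 3 ? args[3] : null);
    }

    // Prompt for each value in turn. Markers are not trimmed, since leading or
    // trailing spaces could be part of what the user wants to match.
    static BotInput fromScanner(Scanner in) {
        System.out.print("page: ");
        String page = in.nextLine().trim();
        System.out.print("start marker: ");
        String start = in.nextLine();
        System.out.print("end marker: ");
        String end = in.nextLine();
        System.out.print("link (blank for none): ");
        String link = in.nextLine().trim();
        return new BotInput(page, start, end, link.isEmpty() ? null : link);
    }

    public static void main(String[] args) {
        BotInput input = args.length >= 3
                ? fromArgs(args)
                : fromScanner(new Scanner(System.in));
        System.out.println("Fetching " + input.page);
    }
}
```

Whichever style you pick, the writeup needs to explain it, since that is how the program will be run when it is graded.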
Sample pages:
I'm likely to try your program on websites with formats similar to the following sites:
http://www.fanblogs.com/ (look at id=post)
http://sportsblogs.org/
http://www.massively.com/
http://slashdot.org/
All of them have similar formats.
Example output
If we take slashdot as the example:
I might use the following
- page: http://slashdot.org
- start tag: Your Rights Online
- end tag: Read More...
If you ran a bot that has completed Part I but not Part II as I'm
typing this, it would produce the following text as one of its
sections:
==================================
Your Rights Online: Buffalo Tech Gets New Trial On Wi-Fi Patent 2008-10-07 15:52
Posted
by
kdawson
on Tuesday October 07, @03:52PM
from the oh-give-me-a-home-router dept.
MrLint writes "It's been a long,
nearly two years of silence since CSIRO won a patent battle against
Buffalo Tech,
causing an injunction preventing the Austin company from selling
wireless routers. On September 19, 2008, a Federal Circuit Court of
Appeals ruled that CSIRO patent claims are invalid and Buffalo is
getting a new trial. With any luck, we will be able to get our grubby
hands on low-cost Wi-Fi routers again!"
Read More
======================================
If I ran your level 2 bot with the following user input
- page: http://slashdot.org
- start tag: Your Rights Online
- end tag: Read More
- link: end
I would get the following output:
==================================
Your Rights Online: Buffalo Tech Gets New Trial On Wi-Fi Patent 2008-10-07 15:52
Posted
by
kdawson
on Tuesday October 07, @03:52PM
from the oh-give-me-a-home-router dept.
MrLint writes "It's been a long,
nearly two years of silence since CSIRO won a patent battle against
Buffalo Tech,
causing an injunction preventing the Austin company from selling
wireless routers. On September 19, 2008, a Federal Circuit Court of
Appeals ruled that CSIRO patent claims are invalid and Buffalo is
getting a new trial. With any luck, we will be able to get our grubby
hands on low-cost Wi-Fi routers again!"
If your output has a few bits of text that were not visible in the web
browser, that's OK. There seem to be some hidden tags that show up as
text when extracting the text. For a 2 1/2 week project (with a midterm
in the middle) we will focus on the main points.
Writeup:
Yes, this is one of my classes, so you have to tell me what you did and
how you did it, rather than just passing things in and hoping I can
figure it out.
You need a text-based writeup in standard written English which addresses the following:
- Your name!!
- Which level of completion do you want to be graded on?
- How do I use your program?
- and how should I enter the user preferences
- what keyword are you using for the link location if going for level 2 completion?
- did you use any additional libraries beyond the standard Java library, and if so, which ones?
- How did you write your program?
- what design decisions did you make?
- how well did it work?
- what false starts did you have?
- How well did your program work?
- are there any bugs that I will encounter?
- does your program do everything? is it easy to use?
- propose an extension to your program that another student who completed this class could make in a straightforward manner.
- describe where in the code the extension should be hooked in
- describe what the extension should do
As always, the writeup will be worth a fair portion of the project grade (20-35% usually)
Submitting:
Zip up your folder and rename your zip file to include your last name.
(I might use SantoreLab1.zip)
Just submit this one to me by email. I don't want to have you all
learning one system only to move to another one on a different server
for the next project.