Web crawler assignment

In this project you are going to implement the core of a Web crawler, and then you are going to crawl the following URLs (to be considered as domains for the purposes of this assignment) and paths:

- *.ics.uci.edu/*
- *.cs.uci.edu/*
- *.informatics.uci.edu/*
- *.stat.uci.edu/*
- today.uci.edu/department/information_computer_sciences/*

As a concrete deliverable of this project, besides the code itself, you must submit a report containing answers to the following questions (a sketch of one way to tally these statistics follows this list):

1. How many unique pages did you find? Uniqueness for the purposes of this assignment is established ONLY by the URL, discarding the fragment part. For example, http://www.ics.uci.edu#aaa and http://www.ics.uci.edu#bbb are the same URL. Even if you implement additional methods for textual similarity detection, please keep using this definition of unique pages when counting them for this assignment.
2. What is the longest page in terms of the number of words? (HTML markup doesn't count as words.)
3. What are the 50 most common words in the entire set of pages crawled under these domains? (Ignore English stop words, which can be found, for example, here.) Submit the list of common words ordered by frequency.
4. How many subdomains did you find in the ics.uci.edu domain? Submit the list of subdomains ordered alphabetically and the number of unique pages detected in each subdomain. Each line of this list should contain a URL and a number, for example: http://vision.ics.uci.edu 10 (not the actual number here).
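These statistics are easiest to accumulate incrementally as each page is downloaded. Below is a minimal sketch of one way to do that; all names here (record_page, print_report, the module-level collections) are illustrative and not part of the provided skeleton, and the stop word set is abbreviated for brevity — substitute the full list referenced above.

```python
import re
from collections import Counter
from urllib.parse import urldefrag, urlparse

# Abbreviated for illustration; use the full English stop word list linked in the assignment.
STOP_WORDS = {"the", "and", "of", "to", "in", "a", "is", "for", "on", "that"}

unique_pages = set()      # question 1: URLs seen, with fragments removed
longest_page = ("", 0)    # question 2: (url, word count) of the longest page so far
word_counts = Counter()   # question 3: word frequencies across all crawled pages
ics_subdomains = {}       # question 4: ics.uci.edu subdomain -> set of unique URLs

def record_page(url, text):
    """Update all four report statistics for one successfully downloaded page.

    `text` is the visible text of the page with HTML markup already stripped,
    e.g. what BeautifulSoup's get_text() returns.
    """
    global longest_page
    clean_url, _ = urldefrag(url)
    unique_pages.add(clean_url)

    words = re.findall(r"[a-zA-Z0-9']+", text.lower())
    if len(words) > longest_page[1]:
        longest_page = (clean_url, len(words))
    word_counts.update(w for w in words if w not in STOP_WORDS)

    # Count unique pages per subdomain of ics.uci.edu.
    host = (urlparse(clean_url).hostname or "").lower()
    if host.endswith(".ics.uci.edu"):
        ics_subdomains.setdefault(host, set()).add(clean_url)

def print_report():
    print("Unique pages:", len(unique_pages))
    print("Longest page:", longest_page)
    print("50 most common words:", word_counts.most_common(50))
    for host in sorted(ics_subdomains):  # alphabetical order, as the report requires
        print(f"http://{host} {len(ics_subdomains[host])}")
```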
What to submit: a zip file containing your modified crawler code and the report.

Grader meetings: this project requires a meeting of all members of your group with one of the TAs/Readers, where all of you will be asked questions about your crawler: your code and the operation of the crawler. These meetings will occur a few days after the submission deadline; instructions will be sent at that time.

Requirements: tbd

To get started, fork or get the crawler code from https://github.com/Mondego/spacetime-crawler4py and read the instructions in the README.md file up to and including the section "Execution". This is enough to implement the simple crawler for this project. In short, this is the minimum amount of work that you need to do:

1. Install the dependencies.
2. Set the USERAGENT variable in Config.ini so that it contains the student IDs of all group members separated by commas (the numbers! e.g. IR UW21 12312321312312312123123), and please also modify the quarter information (i.e. UW21 for Undergraduate Winter 2021, US21 for Undergraduate Spring 2021, etc.). If you fail to do this properly, your crawler will not appear in the server's logs, which will put your grade for this project at risk.
3. (This is the meat of the crawler.) Implement the scraper function in scraper.py. The scraper function receives a URL and the corresponding Web response (for example, the first one will be "http://www.ics.uci.edu" and the Web response will contain the page itself). Your task is to parse the Web response, extract enough information from the page (if it's a valid page) to be able to answer the questions for the report, and finally return the list of URLs scraped from that page. Some important notes (a sketch of such a scraper is given after this list of steps):
   - Make sure to return only URLs that are within the domains and paths mentioned above (see the is_valid function in scraper.py; you need to change it).
   - Make sure to defragment the URLs, i.e. remove the fragment part.
   - You can use whatever libraries make your life easier to parse things. Optional dependencies you might want to look at: BeautifulSoup, lxml (nudge nudge, wink wink!).
   - Optionally, in the scraper function you can also save the URL and the web page on your local disk.
4. Run the crawler from your laptop/desktop or from an ICS openlab machine (you can use either the classical ssh & scp to openlab.ics.uci.edu, or the web interface hub.ics.uci.edu from your browser; I would recommend ssh, so that you learn a skill that will probably be important for the rest of your professional life). Note that to install software on machines that you do not own, or are not authorized to sudo on, you need to install it to your user folder; with pip/pip3, use the --user option to do so. Note that this will take several hours, possibly a day! It may even never end if you are not careful with your implementation! Also note that you need to be inside the campus network or you won't be able to crawl; if your computer is outside UCI, use the VPN.
5. Monitor what your crawler is doing. If you see it trapped in a Web trap, or malfunctioning in any way, stop it, fix the problem in the code, and restart it. Sometimes you may need to restart from scratch; in that case, delete the frontier file (frontier.shelve), or move it to a backup location, before restarting the crawler (a small helper for this is sketched at the end of this section).
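For concreteness, here is a minimal sketch of what scraper.py could look like, using BeautifulSoup and lxml as suggested above. It assumes the response object exposes resp.status and resp.raw_response.content (check the attribute names against the skeleton you forked), and it deliberately omits things a real crawler needs, such as filtering out non-HTML file extensions and detecting traps.

```python
# scraper.py — a minimal sketch, not a complete solution.
import re
from urllib.parse import urldefrag, urljoin, urlparse

from bs4 import BeautifulSoup  # optional dependency suggested above (plus lxml)

# Hosts allowed by the assignment: *.ics/cs/informatics/stat.uci.edu
ALLOWED_HOSTS = re.compile(r"^(.+\.)?(ics|cs|informatics|stat)\.uci\.edu$")

def scraper(url, resp):
    links = extract_next_links(url, resp)
    return [link for link in links if is_valid(link)]

def extract_next_links(url, resp):
    """Return the defragmented absolute URLs found in the page's <a> tags."""
    if resp.status != 200 or resp.raw_response is None:
        return []
    soup = BeautifulSoup(resp.raw_response.content, "lxml")
    links = []
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(url, anchor["href"])  # resolve relative links
        defragged, _ = urldefrag(absolute)       # remove the #fragment part
        links.append(defragged)
    return links

def is_valid(url):
    """Keep only http(s) URLs inside the allowed domains and paths."""
    try:
        parsed = urlparse(url)
        if parsed.scheme not in {"http", "https"}:
            return False
        host = (parsed.hostname or "").lower()
        if ALLOWED_HOSTS.match(host):
            return True
        # today.uci.edu is allowed only under one specific path.
        return (host == "today.uci.edu"
                and parsed.path.startswith("/department/information_computer_sciences"))
        # You will probably also want to reject URLs pointing at non-HTML
        # resources (.pdf, .zip, images, etc.) and obvious crawler traps.
    except TypeError:
        return False
```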
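If you do need to reset the crawl, something like the following moves the frontier aside with a timestamp instead of deleting it outright. This is only a convenience sketch; the frontier.shelve filename is the default mentioned in step 5, so adjust it if your configuration differs.

```python
# backup_frontier.py — move the frontier file(s) to timestamped backups
# before restarting the crawler from scratch.
import shutil
import time
from pathlib import Path

def backup_frontier(name="frontier.shelve"):
    """Move frontier.shelve (and any platform-specific companion files such as
    .db/.dat/.dir that Python's shelve may create) to timestamped backups."""
    stamp = int(time.time())
    moved = False
    for path in Path(".").glob(name + "*"):
        backup = path.with_name(f"{path.name}.{stamp}.bak")
        shutil.move(str(path), str(backup))
        print(f"Moved {path} -> {backup}")
        moved = True
    if not moved:
        print("No frontier file found; nothing to back up.")

if __name__ == "__main__":
    backup_frontier()
```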
