Web crawler assignment

In this project you are going to implement the core of a Web crawler, and then you are going to crawl the following URLs (to be considered as domains for the purposes of this assignment) and paths:

- *.ics.uci.edu/*
- *.cs.uci.edu/*
- *.informatics.uci.edu/*
- *.stat.uci.edu/*
- today.uci.edu/department/information_computer_sciences/*

As a concrete deliverable of this project, besides the code itself, you must submit a report containing answers to the following questions (a sketch of one way to tally these statistics follows this list):

1. How many unique pages did you find? Uniqueness for the purposes of this assignment is established ONLY by the URL, discarding the fragment part. For example, http://www.ics.uci.edu#aaa and http://www.ics.uci.edu#bbb are the same URL. Even if you implement additional methods for textual similarity detection, please keep using this definition of unique pages when counting them for this assignment.
2. What is the longest page in terms of the number of words? (HTML markup doesn't count as words.)
3. What are the 50 most common words in the entire set of pages crawled under these domains? (Ignore English stop words, which can be found, for example, here.) Submit the list of common words ordered by frequency.
4. How many subdomains did you find in the ics.uci.edu domain? Submit the list of subdomains ordered alphabetically and the number of unique pages detected in each subdomain. Each line of this list should contain a URL and a number, for example: http://vision.ics.uci.edu 10 (not the actual number here).
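These statistics are easiest to accumulate incrementally as each page is downloaded. Below is a minimal sketch of one way to do that; all names here (record_page, print_report, the module-level collections) are illustrative and not part of the provided skeleton, and the stop word set is abbreviated for brevity — substitute the full list referenced above.

```python
import re
from collections import Counter
from urllib.parse import urldefrag, urlparse

# Abbreviated for illustration; use the full English stop word list linked in the assignment.
STOP_WORDS = {"the", "and", "of", "to", "in", "a", "is", "for", "on", "that"}

unique_pages = set()      # question 1: URLs seen, with fragments removed
longest_page = ("", 0)    # question 2: (url, word count) of the longest page so far
word_counts = Counter()   # question 3: word frequencies across all crawled pages
ics_subdomains = {}       # question 4: ics.uci.edu subdomain -> set of unique URLs

def record_page(url, text):
    """Update all four report statistics for one successfully downloaded page.

    `text` is the visible text of the page with HTML markup already stripped,
    e.g. what BeautifulSoup's get_text() returns.
    """
    global longest_page
    clean_url, _ = urldefrag(url)
    unique_pages.add(clean_url)

    words = re.findall(r"[a-zA-Z0-9']+", text.lower())
    if len(words) > longest_page[1]:
        longest_page = (clean_url, len(words))
    word_counts.update(w for w in words if w not in STOP_WORDS)

    # Count unique pages per subdomain of ics.uci.edu.
    host = (urlparse(clean_url).hostname or "").lower()
    if host.endswith(".ics.uci.edu"):
        ics_subdomains.setdefault(host, set()).add(clean_url)

def print_report():
    print("Unique pages:", len(unique_pages))
    print("Longest page:", longest_page)
    print("50 most common words:", word_counts.most_common(50))
    for host in sorted(ics_subdomains):  # alphabetical order, as the report requires
        print(f"http://{host} {len(ics_subdomains[host])}")
```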
What to submit: a zip file containing your modified crawler code and the report.

Grader meetings: this project requires a meeting of all members of your group with one of the TAs/Readers, where all of you will be asked questions about your crawler: your code and the operation of the crawler. These meetings will occur a few days after the submission deadline; instructions will be sent at that time.

Requirements: tbd

To get started, fork or get the crawler code from https://github.com/Mondego/spacetime-crawler4py and read the instructions in the README.md file up to and including the section "Execution". This is enough to implement the simple crawler for this project. In short, this is the minimum amount of work that you need to do:

1. Install the dependencies.
2. Set the USERAGENT variable in Config.ini so that it contains the student IDs of all group members separated by commas (the numbers! e.g. IR UW21 12312321312312312123123), and please also modify the quarter information (i.e. UW21 for Undergraduate Winter 2021, US21 for Undergraduate Spring 2021, etc.). If you fail to do this properly, your crawler will not appear in the server's logs, which will put your grade for this project at risk.
3. (This is the meat of the crawler.) Implement the scraper function in scraper.py. The scraper function receives a URL and the corresponding Web response (for example, the first one will be "http://www.ics.uci.edu" and the Web response will contain the page itself). Your task is to parse the Web response, extract enough information from the page (if it's a valid page) to be able to answer the questions for the report, and finally return the list of URLs scraped from that page. Some important notes (a sketch of such a scraper is given after this list of steps):
   - Make sure to return only URLs that are within the domains and paths mentioned above (see the is_valid function in scraper.py; you need to change it).
   - Make sure to defragment the URLs, i.e. remove the fragment part.
   - You can use whatever libraries make your life easier to parse things. Optional dependencies you might want to look at: BeautifulSoup, lxml (nudge nudge, wink wink!).
   - Optionally, in the scraper function you can also save the URL and the web page on your local disk.
4. Run the crawler from your laptop/desktop or from an ICS openlab machine (you can use either the classical ssh & scp to openlab.ics.uci.edu, or the web interface hub.ics.uci.edu from your browser; I would recommend ssh, so that you learn a skill that will probably be important for the rest of your professional life). Note that to install software on machines that you do not own, or are not authorized to sudo on, you need to install it to your user folder; with pip/pip3, use the --user option to do so. Note that this will take several hours, possibly a day! It may even never end if you are not careful with your implementation! Also note that you need to be inside the campus network or you won't be able to crawl; if your computer is outside UCI, use the VPN.
5. Monitor what your crawler is doing. If you see it trapped in a Web trap, or malfunctioning in any way, stop it, fix the problem in the code, and restart it. Sometimes you may need to restart from scratch; in that case, delete the frontier file (frontier.shelve), or move it to a backup location, before restarting the crawler (a small helper for this is sketched at the end of this section).
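For concreteness, here is a minimal sketch of what scraper.py could look like, using BeautifulSoup and lxml as suggested above. It assumes the response object exposes resp.status and resp.raw_response.content (check the attribute names against the skeleton you forked), and it deliberately omits things a real crawler needs, such as filtering out non-HTML file extensions and detecting traps.

```python
# scraper.py — a minimal sketch, not a complete solution.
import re
from urllib.parse import urldefrag, urljoin, urlparse

from bs4 import BeautifulSoup  # optional dependency suggested above (plus lxml)

# Hosts allowed by the assignment: *.ics/cs/informatics/stat.uci.edu
ALLOWED_HOSTS = re.compile(r"^(.+\.)?(ics|cs|informatics|stat)\.uci\.edu$")

def scraper(url, resp):
    links = extract_next_links(url, resp)
    return [link for link in links if is_valid(link)]

def extract_next_links(url, resp):
    """Return the defragmented absolute URLs found in the page's <a> tags."""
    if resp.status != 200 or resp.raw_response is None:
        return []
    soup = BeautifulSoup(resp.raw_response.content, "lxml")
    links = []
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(url, anchor["href"])  # resolve relative links
        defragged, _ = urldefrag(absolute)       # remove the #fragment part
        links.append(defragged)
    return links

def is_valid(url):
    """Keep only http(s) URLs inside the allowed domains and paths."""
    try:
        parsed = urlparse(url)
        if parsed.scheme not in {"http", "https"}:
            return False
        host = (parsed.hostname or "").lower()
        if ALLOWED_HOSTS.match(host):
            return True
        # today.uci.edu is allowed only under one specific path.
        return (host == "today.uci.edu"
                and parsed.path.startswith("/department/information_computer_sciences"))
        # You will probably also want to reject URLs pointing at non-HTML
        # resources (.pdf, .zip, images, etc.) and obvious crawler traps.
    except TypeError:
        return False
```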
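If you do need to reset the crawl, something like the following moves the frontier aside with a timestamp instead of deleting it outright. This is only a convenience sketch; the frontier.shelve filename is the default mentioned in step 5, so adjust it if your configuration differs.

```python
# backup_frontier.py — move the frontier file(s) to timestamped backups
# before restarting the crawler from scratch.
import shutil
import time
from pathlib import Path

def backup_frontier(name="frontier.shelve"):
    """Move frontier.shelve (and any platform-specific companion files such as
    .db/.dat/.dir that Python's shelve may create) to timestamped backups."""
    stamp = int(time.time())
    moved = False
    for path in Path(".").glob(name + "*"):
        backup = path.with_name(f"{path.name}.{stamp}.bak")
        shutil.move(str(path), str(backup))
        print(f"Moved {path} -> {backup}")
        moved = True
    if not moved:
        print("No frontier file found; nothing to back up.")

if __name__ == "__main__":
    backup_frontier()
```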
