Web crawler assignment

In this project you are going to implement the core of a Web crawler, and then you are going to crawl the following URLs (to be considered as domains for the purposes of this assignment) and paths:

- *.ics.uci.edu/*
- *.cs.uci.edu/*
- *.informatics.uci.edu/*
- *.stat.uci.edu/*
- today.uci.edu/department/information_computer_sciences/*

Requirements: tbd

As a concrete deliverable of this project, besides the code itself, you must submit a report containing answers to the following questions (a rough bookkeeping sketch for these questions appears below, after the Grader meetings note):

1. How many unique pages did you find? Uniqueness for the purposes of this assignment is established ONLY by the URL, discarding the fragment part. For example, http://www.ics.uci.edu#aaa and http://www.ics.uci.edu#bbb are the same URL. Even if you implement additional methods for textual similarity detection, please keep using this definition of unique pages when counting them for this assignment.
2. What is the longest page in terms of the number of words? (HTML markup doesn't count as words.)
3. What are the 50 most common words in the entire set of pages crawled under these domains? (Ignore English stop words, which can be found, for example, here.) Submit the list of common words ordered by frequency.
4. How many subdomains did you find in the ics.uci.edu domain? Submit the list of subdomains ordered alphabetically and the number of unique pages detected in each subdomain. The content of this list should be lines containing URL, number, for example: http://vision.ics.uci.edu, 10 (not the actual number here).

What to submit: a zip file containing your modified crawler code and the report.

Grader meetings: this project requires a meeting of all members of your group with one of the TAs/Readers, where all of you will be asked questions about your crawler, both the code and the operation of the crawler. These meetings will occur a few days after the submission deadline. Instructions will be sent at that time.
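To make the report questions concrete, here is a minimal, self-contained sketch of the kind of bookkeeping they call for, assuming you accumulate statistics in memory as pages are processed. The names here (record_page, STOP_WORDS, report, etc.) are illustrative only and are not part of the starter code; use a full stop-word list rather than the tiny placeholder set.

```python
# Illustrative bookkeeping for the report questions (not part of the starter code).
# Assumes you call record_page(url, text) once per successfully downloaded page,
# where text is the page's visible text with HTML markup already stripped.
import re
from collections import Counter
from urllib.parse import urldefrag, urlparse

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}  # replace with a full list

unique_pages = set()          # Q1: unique URLs, fragment removed
longest_page = ("", 0)        # Q2: (url, word count)
word_counts = Counter()       # Q3: word frequencies across all pages
subdomain_counts = Counter()  # Q4: unique pages per ics.uci.edu subdomain

def record_page(url, text):
    """Update the report statistics with one crawled page's URL and visible text."""
    global longest_page
    defragged, _ = urldefrag(url)          # #aaa and #bbb collapse to the same URL
    if defragged in unique_pages:
        return
    unique_pages.add(defragged)

    words = re.findall(r"[a-zA-Z0-9']+", text.lower())
    if len(words) > longest_page[1]:
        longest_page = (defragged, len(words))
    word_counts.update(w for w in words if w not in STOP_WORDS)

    host = urlparse(defragged).netloc.lower()
    if host == "ics.uci.edu" or host.endswith(".ics.uci.edu"):
        subdomain_counts[host] += 1

def report():
    """Print the numbers the report asks for."""
    print("Unique pages:", len(unique_pages))
    print("Longest page:", longest_page)
    print("Top 50 words:", word_counts.most_common(50))
    for host in sorted(subdomain_counts):
        print(f"http://{host}, {subdomain_counts[host]}")
```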
To get started, fork or get the crawler code from https://github.com/Mondego/spacetime-crawler4py. Read the instructions in the README.md file up to and including the section "Execution". This is enough to implement the simple crawler for this project. In short, this is the minimum amount of work that you need to do:

1. Install the dependencies.
2. Set the USERAGENT variable in Config.ini so that it contains the IDs of all students in the group, separated by a comma (the numbers! e.g. IR UW21 12312321312312312123123), and please also modify the quarter information (i.e. UW21 for Undergraduate Winter 2021, US21 for Undergraduate Spring 2021, etc.). If you fail to do this properly, your crawler will not exist in the server's log, which will put your grade for this project at risk.
3. (This is the meat of the crawler.) Implement the scraper function in scraper.py. The scraper function receives a URL and the corresponding Web response (for example, the first one will be "http://www.ics.uci.edu" and the Web response will contain the page itself). Your task is to parse the Web response, extract enough information from the page (if it is a valid page) to be able to answer the questions for the report, and finally return the list of URLs scraped from that page. Some important notes (a minimal sketch of such a scraper appears after this list):
   - Make sure to return only URLs that are within the domains and paths mentioned above! (See the is_valid function in scraper.py; you need to change it.)
   - Make sure to defragment the URLs, i.e. remove the fragment part.
   - You can use whatever libraries make your life easier to parse things. Optional dependencies you might want to look at: BeautifulSoup, lxml (nudge nudge, wink wink!).
   - Optionally, in the scraper function, you can also save the URL and the web page on your local disk.
4. Run the crawler from your laptop/desktop or from an ICS openlab machine (you can use either the classical ssh & scp to openlab.ics.uci.edu, or the web interface hub.ics.uci.edu from your browser; I would recommend ssh, so that you learn a skill that will probably be important for the rest of your professional life). Note that to install software on machines that you do not own, or are not authorized to sudo on, you need to install it to your user folder; with pip/pip3, use the --user option to do so. Note that this will take several hours, possibly a day! It may even never end if you are not careful with your implementation! Also note that you need to be inside the campus network, or you won't be able to crawl. If your computer is outside UCI, use the VPN.
5. Monitor what your crawler is doing. If you see it trapped in a Web trap, or malfunctioning in any way, stop it, fix the problem in the code, and restart it. Sometimes you may need to restart from scratch. In that case, delete the frontier file (frontier.shelve) or move it to a backup location before restarting the crawler (a small backup helper is sketched further below).
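As referenced in step 3 above, here is a minimal sketch of what scraper.py could look like. It assumes the starter skeleton's scraper(url, resp) and is_valid(url) signatures and that the downloaded page bytes are reachable through the response object; check the actual skeleton for the exact attribute names, and note that this is a starting point, not a complete solution (for instance, it does not yet filter out non-HTML file extensions or obvious traps).

```python
# Rough sketch of scraper.py, assuming the starter skeleton's scraper(url, resp)
# and is_valid(url) signatures; the attribute names on resp may differ slightly
# in the actual starter code, so verify them before relying on this.
from urllib.parse import urljoin, urldefrag, urlparse

from bs4 import BeautifulSoup  # optional dependency suggested above

ALLOWED_SUFFIXES = (".ics.uci.edu", ".cs.uci.edu", ".informatics.uci.edu", ".stat.uci.edu")
ALLOWED_HOSTS = ("ics.uci.edu", "cs.uci.edu", "informatics.uci.edu", "stat.uci.edu")

def scraper(url, resp):
    # Only parse pages that downloaded successfully and actually have content.
    if resp.status != 200 or resp.raw_response is None:
        return []
    soup = BeautifulSoup(resp.raw_response.content, "lxml")
    links = []
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(url, anchor["href"])   # resolve relative links against the page URL
        defragged, _ = urldefrag(absolute)        # drop the #fragment part
        if is_valid(defragged):
            links.append(defragged)
    return links

def is_valid(url):
    """Keep only http(s) URLs inside the allowed domains and paths."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return False
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.netloc.lower()
    if host in ALLOWED_HOSTS or host.endswith(ALLOWED_SUFFIXES):
        return True
    if host == "today.uci.edu" and parsed.path.startswith("/department/information_computer_sciences"):
        return True
    # The real is_valid should also reject non-HTML extensions (pdf, zip, images, ...),
    # along the lines of the regex already present in the starter skeleton.
    return False
```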
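For step 5, if you do need to restart from scratch, a small helper along these lines can move the frontier out of the way first. This is a hypothetical convenience script, not part of the starter code; depending on the shelve backend on your machine, the frontier may be stored as a single frontier.shelve file or as several files sharing that prefix, so the sketch moves anything matching the prefix.

```python
# Hypothetical helper for backing up the frontier before a from-scratch restart.
# Depending on the platform's shelve backend, the frontier may be one file
# (frontier.shelve) or several files with that prefix (.dat, .dir, .bak, ...).
import glob
import os
import shutil
import time

def backup_frontier(prefix="frontier.shelve", backup_root="frontier_backups"):
    """Move every file starting with the frontier prefix into a timestamped folder."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = os.path.join(backup_root, stamp)
    moved = []
    for path in glob.glob(prefix + "*"):
        os.makedirs(dest, exist_ok=True)
        shutil.move(path, os.path.join(dest, os.path.basename(path)))
        moved.append(path)
    return moved  # an empty list means there was nothing to back up

if __name__ == "__main__":
    print("Backed up:", backup_frontier())
```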
