{"id":933,"date":"2017-03-02T08:30:56","date_gmt":"2017-03-02T15:30:56","guid":{"rendered":"http:\/\/somethingk.com\/main\/?p=933"},"modified":"2017-03-02T08:30:56","modified_gmt":"2017-03-02T15:30:56","slug":"python-url-term-scrapper-kind-of-like-a-very-dumbsimple-watson","status":"publish","type":"post","link":"https:\/\/somethingk.com\/main\/python-url-term-scrapper-kind-of-like-a-very-dumbsimple-watson\/","title":{"rendered":"Python URL Term Scrapper (Kind of Like a Very Dumb\/Simple Watson)"},"content":{"rendered":"<p>I wrote this a while back to scrape a given URL for all the links it contains and then search those links for term relations. The Python scraper first finds all links at a given URL. It then searches all the links found for a list of search terms provided. It will return stats on the number of times specific provided terms show up.<\/p>\n<p>This may come in handy when trying to find more information on a given topic. I&#8217;ve used it on Google searches (be careful, you can only scrape Google once every 8 or so seconds before you are locked out) and Wikipedia pages to gather correlation statistics between topics.<\/p>\n<p>It is old code&#8230; so there might be some errors. 
Keep me posted!<\/p>\n<div class=\"snippetcpt-wrap\" id=\"snippet-936\" data-id=\"936\" data-edit=\"\" data-copy=\"\/main\/wp-json\/wp\/v2\/posts\/933?snippet=8b8b00e8ef&#038;id=936\" data-fullscreen=\"https:\/\/somethingk.com\/main\/code-snippets\/python-url-term-scrapper\/?full-screen=1\">\n\t\t\t\t<pre class=\"prettyprint linenums lang-python\" title=\"Python URL Term Scrapper\">#!\/usr\/bin\/env python\r\n#ENSURE permissions are 755 in order to have script run as executable\r\n#Written for Python 2.7 (urllib2, print statements)\r\n\r\nimport re, datetime\r\nfrom optparse import OptionParser\r\nfrom urlparse import urljoin\r\nimport logging, urllib2\r\n\r\ndef parsePage(link, terms, searchList):\r\n    #Add one to a term's tally for each page it appears on\r\n    try:\r\n        f = urllib2.urlopen(link)\r\n        data = f.read()\r\n        for item in terms:\r\n            if (item.title() in data) or (item.upper() in data) or (item.lower() in data):\r\n                searchList[item]=searchList.get(item, 0)+1\r\n                searchList[&quot;count&quot;]=searchList.get(&quot;count&quot;, 0)+1\r\n    except Exception, e:\r\n        print &quot;An error has occurred while parsing page &quot; +str(link)+&quot;.&quot;\r\n        log.error(str(datetime.datetime.now())+&quot; &quot;+str(e))\r\n\r\ndef searchUrl(search):\r\n    try:\r\n        f = urllib2.urlopen(search)\r\n        data = f.read()\r\n        pattern = r'href=&quot;([^&quot;#]+)&quot;' #regular expression to find link targets\r\n        links = [urljoin(search, href) for href in re.findall(pattern, data)] #resolve relative links\r\n        return [link for link in links if link.startswith(&quot;http&quot;)]\r\n    except Exception, e:\r\n        print &quot;An error has occurred while trying to reach the site.&quot;\r\n        log.error(str(datetime.datetime.now())+&quot; &quot;+str(e))\r\n        return [] #no links to search\r\n\r\ndef main():\r\n    try:\r\n        parser = OptionParser() #Help menu options\r\n        parser.add_option(&quot;-u&quot;, &quot;--url&quot;, dest=&quot;search&quot;, help=&quot;String containing URL to search.&quot;)\r\n        parser.add_option(&quot;-f&quot;, &quot;--file&quot;, dest=&quot;file&quot;, help=&quot;File containing search terms.&quot;)\r\n        (options, args) = parser.parse_args()\r\n        if not options.search or not options.file:\r\n            parser.error('Term file or URL to scrape not given')\r\n        else:\r\n            urls = searchUrl(options.search)\r\n            f = open(options.file, 'r')\r\n            terms = [line.strip() for line in f if line.strip()] #one term per line\r\n            f.close()\r\n            searchList = {} #tally shared across all pages\r\n            for url in urls:\r\n                parsePage(url, terms, searchList)\r\n            print &quot;Results:&quot;\r\n            print searchList\r\n    except Exception, e:\r\n        log.error(str(datetime.datetime.now())+&quot; &quot;+str(e))\r\n\r\nif __name__ == &quot;__main__&quot;:\r\n    log = logging.getLogger(&quot;error&quot;) #create error log\r\n    log.setLevel(logging.ERROR)\r\n    formatter = logging.Formatter('[%(levelname)s] %(message)s')\r\n    handler = logging.FileHandler('error.log')\r\n    handler.setFormatter(formatter)\r\n    log.addHandler(handler)\r\n    try:\r\n        main()\r\n    except Exception, e:\r\n        print &quot;An error has occurred, please review the error.log for more details.&quot;\r\n        log.error(str(datetime.datetime.now())+&quot; &quot;+str(e))<\/pre>\n\t\t\t<\/div>\n","protected":false},"excerpt":{"rendered":"<p>I wrote this a while back to scrape a given URL for all the links it contains and then search those links for term relations. The Python scraper first finds all links at a given URL. It then searches all the links found for a list of search terms provided. 
It will return stats on the number of times specific [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[177,72,322],"tags":[339,153,342,341,343,340],"class_list":["post-933","post","type-post","status-publish","format-standard","hentry","category-development","category-python","category-web","tag-python","tag-python-2-7","tag-term-scrapper","tag-url-scrapper","tag-watson","tag-web-scrapper"],"_links":{"self":[{"href":"https:\/\/somethingk.com\/main\/wp-json\/wp\/v2\/posts\/933","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/somethingk.com\/main\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/somethingk.com\/main\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/somethingk.com\/main\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/somethingk.com\/main\/wp-json\/wp\/v2\/comments?post=933"}],"version-history":[{"count":4,"href":"https:\/\/somethingk.com\/main\/wp-json\/wp\/v2\/posts\/933\/revisions"}],"predecessor-version":[{"id":938,"href":"https:\/\/somethingk.com\/main\/wp-json\/wp\/v2\/posts\/933\/revisions\/938"}],"wp:attachment":[{"href":"https:\/\/somethingk.com\/main\/wp-json\/wp\/v2\/media?parent=933"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/somethingk.com\/main\/wp-json\/wp\/v2\/categories?post=933"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/somethingk.com\/main\/wp-json\/wp\/v2\/tags?post=933"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}