Scraping a GitHub repository into DEVONthink using AppleScript

Sometimes, the easiest way to keep a small GitHub repository for later study is to save it into a document instead of cloning everything into an abyssal directory structure. Clipping it into DEVONthink also makes the project searchable and part of your knowledge base, which is amazingly handy for finding it again later :)

The following script scrapes an entire GitHub repository into DEVONthink Pro, merging all subpages into one single PDF. It's best installed in the application-specific scripts directory so that the tool is only one click away while surfing:

/Users/<youraccount>/Library/Scripts/Applications/Firefox
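
If the folder does not exist yet, a one-liner creates it. This is just a convenience sketch; run it in Script Editor (do shell script hands the command to the shell, and mkdir -p is harmless if the folder already exists):

-- Create the Firefox-specific scripts folder if it is missing
do shell script "mkdir -p ~/Library/Scripts/Applications/Firefox"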

Now, open the repository in Firefox and start the tool using the script ("scroll") symbol in the menu bar. If that menu is not visible, enable "Show Script menu in menu bar" in Script Editor's preferences.

(Click here to open the GitHub page, where you can find the latest version.)

-- AppleScript to scrape a GitHub project, currently open in Firefox,
-- into one single PDF note in DEVONthink.
-- Install this script here: /Users/<youraccount>/Library/Scripts/Applications/Firefox
-- (cc0) 
-- 2016-2017/@imifos
-- v2

global downloadedurls
global allnewnotes
global basewebsiteurl

set allnewnotes to {}
set downloadedurls to {}

-- Get current URL from Firefox
tell application "Firefox" to activate
tell application "System Events"
	keystroke "l" using command down
	keystroke "c" using command down
end tell
delay 0.5
set basewebsiteurl to the clipboard
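-- Note: Firefox exposes no AppleScript property for the current URL, which is
-- why the script simulates Cmd+L / Cmd+C and reads the clipboard. With Safari,
-- the URL could be read directly instead (untested sketch):
--   tell application "Safari" to set basewebsiteurl to URL of front document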

say "Scraping started!"
log "Started with URL " & basewebsiteurl

-- Block non-GitHub.com URLs
if basewebsiteurl does not contain "/github.com/" then
	say "Current page is not a github repository. Stop here."
	return
end if

try
	-- Recursively fetch the repository into notes
	my handle_page(basewebsiteurl)
	
	-- Merge the above notes into one single note
	my merge_all_pages(basewebsiteurl)
	
on error error_message number error_number
	say "There is an error, please check dialog window!"
	display alert "Scrap Github repository into DEVONthink" message error_message as warning
end try

log "Operation completed"
say "Operation completed."

return


-- ------------------------------------------
--
on merge_all_pages(newwebsiteurl)
	
	log "Merge single pages into one document"
	
	if (count of allnewnotes) > 0 then
		tell application id "DNtp"
			
			-- Merge newly created notes into one and get rid of the single ones
			set mergedpage to merge records allnewnotes
			set the name of mergedpage to newwebsiteurl
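			-- Optional sketch (not part of the original script): also store the
			-- source link on the merged record so DEVONthink can open the
			-- repository later:
			--   set the URL of mergedpage to newwebsiteurl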
			repeat with itemtodelete in allnewnotes
				delete record itemtodelete
			end repeat
			
		end tell
	else
		say "No pages to scraped!"
	end if
	
end merge_all_pages


-- ------------------------------------------
--
-- Downloads a page, creates a note in DT, scans for sub-URLs and recursively handles the sub-URLs 
--
on handle_page(newwebsiteurl)
	
	-- Skip various cases
	
	if newwebsiteurl does not contain "/blob/master/" and newwebsiteurl does not contain "/tree/master/" and newwebsiteurl is not basewebsiteurl then
		-- log "Skipped URL as not a master branch file " & newwebsiteurl
		return
	end if
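	-- Sketch (not in the original): to also follow files on branches other than
	-- "master", e.g. "main", the check above could be relaxed to "/blob/" and
	-- "/tree/" only, at the risk of also following links into every other branch.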
	
	if not (newwebsiteurl begins with basewebsiteurl) then
		log "Skipped URL as reference to other repository " & newwebsiteurl
		return
	end if
	
	if newwebsiteurl contains "#" then
		log "Skipped URL as it's a relative jump " & newwebsiteurl
		return
	end if
	
	if not {newwebsiteurl begins with "http:" or newwebsiteurl begins with "https:"} then
		log "Page URL does not start with http! " & newwebsiteurl
		return
	end if
	
	if newwebsiteurl contains "README.md" then
		-- No need to scrape the README as it's displayed as part of the parent page
		log "Skipped README.md at " & newwebsiteurl
		return
	end if
	
	if newwebsiteurl contains "?raw=true" then
		-- Do not scrape binary files as they do not render well in PDF format :)
		log "Skipped binary file at " & newwebsiteurl
		return
	end if
	
	if downloadedurls contains newwebsiteurl then
		log "URL already downloaded, so it's not done again " & newwebsiteurl
		return
	end if
	
	set downloadedurls to downloadedurls & newwebsiteurl
	
	-- Fetch the current page
	tell application id "DNtp"
		
		log "Tell DT to scrap " & newwebsiteurl
		
		-- Create PDF image in DEVONthink
		repeat with i from 1 to 5
			log "   Download tenative " & i
			set contentobject to create PDF document from newwebsiteurl name newwebsiteurl
			log "   DT is back!"
			if contentobject is not missing value then exit repeat
		end repeat
		
		if contentobject is missing value then
			log "  DT 'create PDF document' returned 'missing value'"
			log "      from: " & newwebsiteurl
			say "Warning! A page could not be downloaded!"
		else
			-- Add the new DEVONthink record to the "to be merged" list
			set end of allnewnotes to contentobject
		end if
		
		-- Ask DEVONthink to download the page source (no need to call a browser for this)
		set websitesource to download markup from newwebsiteurl
		
		-- Get URLs of all sub-pages
		set subpageurls to get links of websitesource base URL newwebsiteurl
		
	end tell
	
	-- Recursively handle sub pages one by one
	repeat with subpageurl in subpageurls
		my handle_page(subpageurl)
	end repeat
	
end handle_page
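
If DEVONthink does not react at all, it helps to first verify that its scripting interface responds. A minimal smoke test, assuming DEVONthink Pro is installed (run it in Script Editor):

-- List the names of all open DEVONthink databases;
-- this fails loudly if DEVONthink is not reachable via AppleScript
tell application id "DNtp" to get name of every database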
