Take 2: Download script for springerlink.com Ebooks
Tue, 02/24/2009 - 22:58
It seems quite a few people are interested in my bash script for downloading ebooks from http://springerlink.com.
That script has some quirks, the biggest being that it was written in bash, which makes it rather hard to implement new features. One requested feature was support for books which span multiple pages on SpringerLink.
So here I present springer_download.py - a Python rewrite which should handle all the old links and some more. This is the very first program I’ve written in Python. And since it has to run on the Zedat servers it’s limited to Python 2.4.x without any fancy shmancy additions (a pity, since I’d love to use urlgrabber or pycurl).
the script
You can find the sources on GitHub: http://milianw.github.com/springer_download/
I plan to put all my future code snippets in public repositories on GitHub. That way you can easily track changes and stay up to date. GitHub also has a nice “download” feature which you can use to get the current version. You can find my profile and my repositories at http://github.com/milianw
Note: This script is intended to be run under Linux or other *nix’es which fulfill the requirements (Python 2.4.x, iconv and pdftk). Windows is not supported.
TODO
- introduce multithreading for faster / simultaneous downloads
- add speed to progressbar
- use progressbar in source-downloader
- use one git-repo per project (makes links work properly)
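The first TODO item could be approached roughly as follows. This is only a sketch under assumptions: it uses `concurrent.futures`, which exists in Python 3 but not in the Python 2.4 the script targets, and `download_all` and `fetch` are hypothetical names that do not appear in springer_download.py.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=4):
    """Run fetch(url) for every URL in parallel threads.

    fetch is a caller-supplied function (hypothetical here) that
    downloads one chapter and returns whatever the caller needs,
    for example the path of the saved PDF.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the order of urls, so the chapters
        # can still be merged in the right sequence afterwards.
        return list(pool.map(fetch, urls))
```

Because the results come back in input order, the merge step would not need to change even though the downloads finish in arbitrary order.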
Comments
Very good - exactly what I Mon, 03/16/2009 - 21:20 — Anonymous (not verified)
Very good - exactly what I was looking for! Thanks!
Further wishes:
Neither is really Mon, 03/16/2009 - 22:29 — Milian Wolff
Neither is really possible, since it would require editing the PDFs. And PDF is more or less a read-only file format.
And what if the Mon, 01/04/2010 - 13:33 — Anonymous (not verified)
And what if the individual chapters could be added by name as bookmarks to the final PDF? Would that work?
Find out, I have no Mon, 01/04/2010 - 17:11 — Milian Wolff
Find out, I have no idea about PDF authoring.
Hello. This software is an Fri, 04/03/2009 - 19:55 — Anonymous (not verified)
Hello. This software is an excellent idea but I get the following error:
4]# ./springer_download.py -l "http://www.springerlink.com/content/h381wp/?sortorder=asc&v=expanded"
Please wait, link source is being downloaded…
http://www.springerlink.com/content/h381wp/
ERROR: Bad link given ([Errno socket error] (110, 'Connection timed out'))
The springerlink address is correct because I can paste it into a browser and both the webpage and the pdfs open up properly. I’m using andlinux which runs ubuntu as a service in windows vista. This could be the cause of the problem but the browser, Synaptic Package Manager and pinging from the console work. I’ve also tried to disable my firewall but this did not fix the problem. Thanks in advance for any insight into this problem.
I think I know what is the Fri, 04/03/2009 - 20:40 — Milian Wolff
I think I know what is the cause yet can’t test it myself right now. Try with the following link:
http://www.springerlink.com/content/h381wp/?sortorder=asc
Note the different layout of the page, I think that’s the cause. Hope that helps.
Thanks for the quick reply. Sat, 04/04/2009 - 16:37 — Anonymous (not verified)
Thanks for the quick reply. I think the problem is the university firewall which seems to be blocking the traffic as I can use the script from outside the university. One suggestion is to make this work as a firefox plugin (perhaps with imacros). Thanks for the useful script!
hi milian! I quickly Wed, 04/08/2009 - 19:29 — flo (not verified)
hi milian!
I quickly added a “just HASH” option, if you want it ..
it's not a “pretty” solution, but it was quick ^^
Other than that, your script is great!
Just pushed your patch, slightly Wed, 04/08/2009 - 20:32 — Milian Wolff
Just pushed a commit to GitHub with your patch (slightly modified). Thanks! :)
What about a Windows Tue, 06/02/2009 - 02:43 — Anonymous (not verified)
What about a Windows compatible script to allow download of articles from journals in an organized fashion? Thanks for ur consideration
I don’t use Windows and won’t Tue, 06/02/2009 - 12:59 — Milian Wolff
I don’t use Windows and won’t make the script Windows-compatible. Yet I’d happily accept patches. Since Python is cross-platform it shouldn’t be too hard. You’d just have to find alternatives to pdftk and iconv. These two dependencies make the script platform-dependent.
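One possible iconv alternative is Python's own `unicodedata` module. The sketch below assumes that iconv is only used to turn the book title into a plain-ASCII file name; `ascii_filename` is a hypothetical helper, written for Python 3, and not part of the script.

```python
import unicodedata


def ascii_filename(title):
    """Return an ASCII-only approximation of title (hypothetical helper)."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    # Encoding with "ignore" drops everything that is not ASCII,
    # which removes the combining marks and keeps the base letters.
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Note that this is cruder than iconv's transliteration: characters without a decomposition, such as "ß", are simply dropped.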
Hey, great script! I Mon, 07/27/2009 - 23:49 — Christian (not verified)
Hey, great script! I tried to get it running on Windows - and succeeded!!! However, I had to disable the two checks for whether pdftk and iconv are present. Both are available for Windows, and I set them up so that they can be called system-wide.
Downloading works, but there is the following problem:
Where does the further processing fail? Regards
Well, I want to pipe the name Tue, 07/28/2009 - 01:51 — Milian Wolff
Well, I want to pipe the name into iconv; I don't know whether that even works on Windows. If need be, just comment it out and live with the fact that it may try to create a file with “bad” characters in its name… Or find another way to call iconv on Windows (without echo). Or maybe install mingw - that might work…
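Calling iconv without echo could look roughly like this. It is a sketch under assumptions: it uses `subprocess.run` from Python 3.5+, `pipe_through` is a hypothetical helper, and the iconv arguments in the trailing comment are an assumed invocation, not necessarily the one the script uses.

```python
import subprocess


def pipe_through(command, text):
    """Feed text to command on stdin and return the raw bytes it prints."""
    result = subprocess.run(
        command,
        input=text.encode("utf-8"),  # written directly to the child's stdin
        stdout=subprocess.PIPE,
        check=True,                  # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


# Assumed invocation; requires an iconv binary on the PATH:
# ascii_title = pipe_through(
#     ["iconv", "-f", "UTF-8", "-t", "ASCII//TRANSLIT"], book_title)
```

Writing to stdin directly avoids spawning `echo` as a separate process, which matters on Windows, where `echo` is a shell builtin rather than an executable.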
I installed Cygwin, Thu, 07/30/2009 - 17:01 — Anonymous (not verified)
I installed Cygwin and it runs in there. Regarding iconv, I only changed it to the following:
p2 = subprocess.Popen(["iconv", "-f", "UTF-8", "-t", "CP1258"],
Umlauts then come out strange, and there is a problem when the file name contains question marks.
Another difficulty arises when a book consists of several sub-volumes; the download then fails.
The script is brilliantly done, many thanks to the author!
i can’t download with script Tue, 08/11/2009 - 14:55 — pappy (not verified)
i can’t download with script now, error here
“Please wait, link source is being downloaded…
http://springerlink.com/content/f54k582l0w11xj18/
The book you are trying to download is called ‘Architecture of an LBS Platform to Support Privacy Control for Tracking Moving Objects in a Ubiquitous Environments’
found 1 chapters
downloading chapter 1/1
http://springerlink.com/content/f54k582l0w11xj18/fulltext.pdf 100%
ERROR: downloaded chapter http://springerlink.com/content/f54k582l0w11xj18/fulltext.pdf has invalid mime type text/html - are you allowed to download it?”
Plz help :(
You need to be authenticated Tue, 08/11/2009 - 15:18 — Milian Wolff
You need to be authenticated for SpringerLink via VPN. This script does not support any other authentication.
I myself use it from my university where access to springerlink is automatically authenticated. If it is the same for your university, access one of the servers there and run the script from there. Ask your IT department.
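The "invalid mime type text/html" error means the server answered with an HTML page instead of a PDF, presumably a login or access-denied page. A check of that kind might be sketched as below; `looks_like_pdf` is a hypothetical name, and the script's actual check may differ.

```python
def looks_like_pdf(content_type):
    """Return True if a Content-Type header value announces a PDF."""
    # Strip parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime == "application/pdf"
```

A response with `text/html` would fail this test, no matter whether the URL ends in `.pdf`.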
I am authenticated via VPN Wed, 02/03/2010 - 23:18 — moohh (not verified)
I am authenticated via VPN and I can manually download the books by using Firefox, but if I want to try this script, I get the same error.
The download of a single pdf via wget doesn’t work either. I got ERROR 403.
update the script, I fixed Thu, 02/04/2010 - 00:45 — Milian Wolff
update the script, I fixed that a few hours ago.
Hi, thank you for this Wed, 02/03/2010 - 01:50 — Vitaly (not verified)
Hi,
thank you for this script.
Since about a week ago, it stopped working though:
………………………………….
$ ./springer_download.py -l http://www.springerlink.com/content/qv89j2/?p=101a335b740a47c7a7578b7d16…
$ Please wait, link source is being downloaded… http://www.springerlink.com/content/qv89j2/
ERROR: Could not evaluate book title - bad link?
Usage: springer_download.py [OPTIONS]
Options:
  -h, --help               Display this usage message
  -l LINK, --link=LINK     defines the link of the book you intend to download
  -c HASH, --content=HASH  builds the link from a given HASH (see below)
………………………………….
This error appears for whatever book I try to download. Is it because they changed directory structure or something else @ Springer?
Thank you, Vitaly
Thanks for the heads up, I Wed, 02/03/2010 - 16:01 — Milian Wolff
Thanks for the heads up, I fixed the code to circumvent this SpringerLink “protection” (it didn’t accept the default User-Agent that was sent by Python…). It should work properly now (assuming you have the rights to access this book, which I / the FU Berlin do not have, it seems).
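Overriding the default User-Agent could look roughly like this. The sketch is written for Python 3's `urllib.request`, not the Python 2.4 modules the script targets; `build_request` and the header value are hypothetical, since the post does not show what the script actually sends.

```python
import urllib.request

# Hypothetical browser-like identifier; the value used by the script
# is not shown in the post.
BROWSER_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Return a Request that announces a custom User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Passing the request to urlopen() would send the header:
# response = urllib.request.urlopen(
#     build_request("http://www.springerlink.com/content/h381wp/"))
```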
Hi, first of all thanks for the Thu, 02/04/2010 - 16:03 — Thomas (not verified)
Hi, first of all thanks for the script, I have been using it for quite a long time… However, I now have problems downloading as well. I get the following error message:
Please wait, link source is being downloaded…
http://www.springerlink.de/content/q28652/
The book you are trying to download is called ‘Regelungstechnik 1’
found 15 chapters
downloading chapter 1/15
http://www.springerlink.de/content/q28652/front-matter.pdf -819200%
ERROR: downloaded chapter http://www.springerlink.de/content/q28652/front-matter.pdf has invalid mime type text/html - are you allowed to download it?
“By hand”, however, I can download the PDFs of the individual chapters without any problem.
Hm, then something is still Fri, 02/05/2010 - 00:01 — Milian Wolff
Hm, then something is apparently still wrong - I'll have to take a look. Maybe the Referer is checked as well, or something like that - let's see what the SpringerLink people come up with to make it harder for us students to get at the books… sigh
I tried something, Fri, 02/05/2010 - 13:35 — Thomas (not verified)
I tried something, and it even seems to have worked :)
I only added this at the top:
And changed def geturl(url, dst) to:
Great, thanks for the patch. Fri, 02/05/2010 - 14:16 — Milian Wolff
Great, thanks for the patch. I included it now (slightly different). Does it work with the vanilla source from github again now? I ask since I can still not download that one book ;-)
Thanks a lot for prompt Fri, 02/12/2010 - 08:38 — Vitaly (not verified)
Thanks a lot for prompt response, Milian! It works great now.
By the way, I spotted another Fri, 02/12/2010 - 08:47 — Vitaly (not verified)
By the way, I spotted another glitch: if the book name contains the colon sign (‘:’), the book is downloaded OK but cannot be saved, as file name cannot include colons. You could substitute it with dash or something…
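The substitution suggested here could be sketched as follows, assuming the goal is a name that is also valid on Windows, which forbids the characters `< > : " / \ | ? *` in file names. `safe_filename` is a hypothetical helper, not part of the script.

```python
import re

# Characters that Windows does not allow in file names.
WINDOWS_RESERVED = re.compile(r'[<>:"/\\|?*]')


def safe_filename(title, replacement="-"):
    """Replace every Windows-reserved character in title with replacement."""
    return WINDOWS_RESERVED.sub(replacement, title)
```

On Unix only `/` and the NUL byte are invalid in file names, so a title with a colon or question mark can be saved there unchanged.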
Can you give me an example? I Fri, 02/12/2010 - 13:59 — Milian Wolff
Can you give me an example? I don’t see why a colon should be removed from a filename, it’s perfectly valid imo. At least on Unix:
Hello, I'm a long-time Windows Fri, 02/05/2010 - 20:54 — RS(15,11) (not verified)
Hello, I'm a long-time Windows user and, partly because of your script, I have been getting into Unix since yesterday. I now use Cygwin, and with the latest version of the script everything runs great. Many thanks!!
Now that I got it Mon, 02/22/2010 - 17:26 — Christian (not verified)
Now that I got it partially running on Windows, these are my first attempts with Linux, but it still doesn't download…. (VPN is enabled)
Maybe someone can give me a tip.
^^ this is where it hangs
I just wanted to Wed, 02/24/2010 - 20:31 — Anonymous (not verified)
I just wanted to report back that the script works superbly and without problems with Cygwin on Windows 7. When installing Cygwin you of course have to make sure to select the appropriate packages. A big thank-you to the author for the work!
I get an error as well Mon, 03/01/2010 - 14:26 — Seb (not verified)
I get an error as well: “WindowsError: [Error 2] Das System kann die angegebene Datei nicht finden” (“The system cannot find the file specified”). What can I do?
The book you are trying to download is called ‘Word 2007’
found 14 chapters
downloading chapter 1/14
http://www.springerlink.com/content/m5427g/front-matter.pdf 100%
downloading chapter 2/14
http://springerlink.com/content/u838718169040815/fulltext.pdf 100%
downloading chapter 3/14
http://springerlink.com/content/t347382q85766802/fulltext.pdf 100%
downloading chapter 4/14
http://springerlink.com/content/rm4183k750h28k27/fulltext.pdf 100%
downloading chapter 5/14
http://springerlink.com/content/q25m417372705567/fulltext.pdf 100%
downloading chapter 6/14
http://springerlink.com/content/t7303462187j3t17/fulltext.pdf 100%
downloading chapter 7/14
http://springerlink.com/content/n191l36489484284/fulltext.pdf 100%
downloading chapter 8/14
http://springerlink.com/content/wp7t1657u28p4774/fulltext.pdf 100%
downloading chapter 9/14
http://springerlink.com/content/l3126077869171g2/fulltext.pdf 100%
downloading chapter 10/14
http://springerlink.com/content/j6399r4330572128/fulltext.pdf 100%
downloading chapter 11/14
http://springerlink.com/content/p2262884852pl1w4/fulltext.pdf 100%
downloading chapter 12/14
http://springerlink.com/content/u647377241368kl7/fulltext.pdf 100%
downloading chapter 13/14
http://springerlink.com/content/t5h400553644360l/fulltext.pdf 100%
downloading chapter 14/14
http://www.springerlink.com/content/m5427g/back-matter.pdf 100%
merging chapters
Traceback (most recent call last):
  File "C:\Dokumente und Einstellungen\Sebastian\Desktop\sp\springer_download.py", line 238, in <module>
    main(sys.argv[1:])
  File "C:\Dokumente und Einstellungen\Sebastian\Desktop\sp\springer_download.py", line 147, in main
    p1 = subprocess.Popen(["echo", bookTitle], stdout=subprocess.PIPE)
  File "D:\Python26\lib\subprocess.py", line 621, in __init__
    errread, errwrite)
  File "D:\Python26\lib\subprocess.py", line 830, in _execute_child
    startupinfo)
WindowsError: [Error 2] Das System kann die angegebene Datei nicht finden
I don’t know, I won’t support Mon, 03/01/2010 - 20:07 — Milian Wolff
I don’t know, I won’t support Windows. Try cygwin as the poster above you said that it works.