Download script for springerlink.com Ebooks

Sat, 11/08/2008 - 17:28

After a long period of silence, I present the following bash script for downloading books from http://springerlink.com. It does not circumvent their login mechanisms: you still need proper access rights to download books. Many students in Germany, however, get free access to these ebooks through their universities. I, for example, study at the FU Berlin, so I put the script in my Zedat home folder and start the download via SSH from home. Afterwards I download the tarball to my home system.

Read on for the script.

Download the script (attached below), push it to your Zedat account, make it executable, and run it. You have to pass it a link to a book detail page. Also take a look at the example call at the top of the script.

Requires bash, wget, iconv, egrep.
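
A minimal sketch of a pre-flight check for the tools listed above. The helper name `check_deps` is hypothetical and not part of the attached script; it only reports which commands are missing from `PATH`.

```bash
# Check that every tool named on the command line is available on PATH.
# Prints each missing tool to stderr and returns non-zero if any is absent.
check_deps() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing dependency: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

check_deps bash wget iconv egrep || echo "install the missing tools before running the script"
```

Running such a check first avoids a half-finished download directory when, say, `iconv` is not installed on the remote machine.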

Note: Take a look at the comments: Faro has come up with an updated Bash script that properly handles ebooks spanning multiple pages on SpringerLink and merges the PDF files with pdftk. Thanks Faro!

Note: For those who'd prefer a Python version over the Bash version, take a look at my second attempt at a download script. The Bash version is abandoned. Long live the Python version!

    #!/bin/bash

    if [[ "$1" == "" ]]; then
        echo "Usage: $0 \"http://springerlink.com/content/.../?p=...\""
        exit 1
    fi

    target=$1

    # get whole page
    echo -n "Please wait, link source is being downloaded..."
    page=$(wget -q -O - "$target")
    echo "ok - done"

    echo -n "Validating link source..."

    # find the line number of the title heading
    title_line=$(echo "$page" 2>/dev/null | grep -n -m 1 '<h2 class="MPReader_Profiles_SpringerLink_Content_PrimitiveHeadingControlName">' | egrep -o "^[[:digit:]]+")
    if [[ "$title_line" == "" ]]; then
        echo "invalid URL"
        exit 1
    fi
    # l counts from 0 while grep -n counts from 1, so the loop stops on the
    # line directly after the <h2> tag and takes the title from there
    l=0
    title=""
    while read -r line; do
        if [[ "$l" == "$title_line" ]]; then
            title=$(echo "$line" | egrep -o "[[:alnum:]].+[[:alnum:]]" | iconv -f "UTF-8" -t "ASCII//TRANSLIT")
            break
        fi
        l=$((l + 1))
    done < <(echo "$page")
    if [[ "$title" == "" ]]; then
        echo "invalid URL"
        exit 1
    fi
    echo "ok - done"

    # check type
    type=$(echo "$page" | grep -o '<span id="ctl00_PageHeadingLabel".*</span>' | grep -o '>.*<' | egrep -o '[^<>]+')

    if [[ "$type" == "Book Chapter" ]]; then
        echo "will download book chapter '$title'"
        echo

        wget -O "$title.pdf" "$(dirname "$target")/fulltext.pdf"
    elif [[ "$type" == "Book" ]]; then
        echo "will download book '$title'"
        echo

        mkdir "$title" 2>/dev/null
        cd "$title" || exit 1

        # get links
        declare -a links
        key=0
        while read -r link; do
            links[key]=$link
            key=$((key + 1))
        done < <(echo "$page" | grep '/fulltext.pdf"><img' | egrep -o 'href="[^"]+' | cut -c 7-)

        # get front + back matter
        wget -O "0-front-matter.pdf" "$(dirname "$target")/front-matter.pdf"
        wget -O "$((${#links[@]}+1))-back-matter.pdf" "$(dirname "$target")/back-matter.pdf"

        # get chapters; the n-th chapter title belongs to the n-th PDF link
        key=0
        while read -r chapter; do
            echo "$((key+1)) - $chapter :: ${links[key]}"
            chapter=$(echo "$chapter" | iconv -f "UTF-8" -t "ASCII//TRANSLIT")
            wget -O "$((key+1))-$chapter.pdf" "http://springerlink.com/${links[key]}"
            key=$((key + 1))
        done < <(echo "$page" | egrep -o '^[[:blank:]]*<a href="/content/[^>]+&amp;pi=[[:digit:]]+">[^>]+</a>' | \
            egrep -o '>[^<]+' | cut -c 2-)

        cd ..
        tar -cvjf "$title.tar.bz2" "$title"
        rm "$title"/*.pdf
        rmdir "$title"
    else
        echo "unknown link type '$type'"
    fi
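
The link-extraction step of the script can be tried in isolation, without network access. The sketch below runs the same `grep`/`cut` pipeline over a small, made-up HTML excerpt; the sample markup and the function name `extract_pdf_links` are assumptions for illustration, and the real SpringerLink pages may have looked different.

```bash
# Hypothetical excerpt of a book page; only the PDF-icon links matter here.
sample_page='<a href="/content/abc123/fulltext.pdf"><img src="pdf.png"/></a>
<a href="/content/abc123/about/">About</a>
<a href="/content/def456/fulltext.pdf"><img src="pdf.png"/></a>'

# Same pipeline as in the script: keep lines that wrap a PDF icon,
# cut out the href="... part, then drop the 6-character prefix href="
extract_pdf_links() {
    grep '/fulltext.pdf"><img' | grep -E -o 'href="[^"]+' | cut -c 7-
}

printf '%s\n' "$sample_page" | extract_pdf_links
# prints:
# /content/abc123/fulltext.pdf
# /content/def456/fulltext.pdf
```

Because the pipeline matches on exact markup, any change to the page layout silently yields an empty list, which is worth keeping in mind when the script suddenly downloads nothing.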

Update 01/09/09:
- The script now includes chapter numbers in the file names
- The script can now handle links to single book chapters
- minor other cleanup

Update 02/20/09: - fixed types

Update 02/24/09: - rewrite script in Python

Attachment: springer_download.sh (2.33 KB)

Comments

Thu, 12/13/2012 - 16:16, Gabriel (not verified)

I tried the latest version from GitHub, but it merges the chapters in the wrong order! I used the following commands:

./springer_download.py -c 978-3-642-23253-4

and

./springer_download.py -c 978-3-642-02507-5

I am using Debian Testing (Wheezy) with the latest updates (pdftk, not stapler). What is the mistake?

Sat, 04/07/2012 - 20:08, matze (not verified)

Could you adapt this script to also support http://www.oldenbourg-link.com/ (example: http://www.oldenbourg-link.com/isbn/9783486582451)?

I think both sites are similar, so this shouldn't be a lot of work, should it?

Please help!

Sun, 04/08/2012 - 18:04, Milian Wolff

Sorry, but that's not going to work. Since I have no use for that page, why should I spend time on it?

Thu, 12/08/2011 - 18:52, macdet (not verified)

Good stuff - but I prefer the python method!

thx4all

Wed, 02/25/2009 - 00:47, Faro (not verified)

Hi Milian,

I fixed the script in bash, for those who like the simplicity of bash. Personally, I love your Python approach and will continue using that one. Thanks, Faro

    #!/bin/bash

    if [[ "$1" == "" ]]; then
        echo "Usage: $0 \"http://springerlink.com/content/.../?p=...\""
        exit 1
    fi

    # the page offset is appended below, so $target appears to be expected as
    # the bare book URL (.../content/<id>) without trailing slash or query string
    target=$1

    p_o=0
    # get whole page
    echo -n "Please wait, link source is being downloaded..."
    page=$(wget -q -O - "$target/?p_o=$p_o")
    echo "ok - done"
    echo -n "Validating link source..."
    # find the line number of the title heading
    title_line=$(echo "$page" 2>/dev/null | grep -n -m 1 '<h2 class="MPReader_Profiles_SpringerLink_Content_PrimitiveHeadingControlName">' | egrep -o "^[[:digit:]]+")
    if [[ "$title_line" == "" ]]; then
        echo "invalid URL"
        exit 1
    fi
    # the title itself sits on the line after the <h2> tag (l is 0-based)
    l=0
    title=""
    while read -r line; do
        if [[ "$l" == "$title_line" ]]; then
            title=$(echo "$line" | egrep -o "[[:alnum:]].+[[:alnum:]]" | iconv -f "UTF-8" -t "ASCII//TRANSLIT")
            break
        fi
        l=$((l + 1))
    done < <(echo "$page")
    if [[ "$title" == "" ]]; then
        echo "invalid URL"
        exit 1
    fi
    echo "ok - done"

    # check type
    type=$(echo "$page" | grep -o '<span id="ctl00_PageHeadingLabel".*</span>' | grep -o '>.*<' | egrep -o '[^<>]+')

    if [[ "$type" == "Book Chapter" ]]; then
        echo "will download book chapter '$title'"
        echo

        wget -q -O "$title.pdf" "$(dirname "$target")/fulltext.pdf"
    elif [[ "$type" == "Book" ]]; then
        echo "will download book '$title'"
        echo

        mkdir "$title" 2>/dev/null
        cd "$title" || exit 1

        wget -q -O "0.pdf" "$target/front-matter.pdf"
        until [[ "$title_line" == "" ]]; do
            # get links of the current page (reset so stale entries do not linger)
            links=()
            key=0
            while read -r link; do
                links[key]=$link
                key=$((key + 1))
            done < <(echo "$page" | grep '/fulltext.pdf"><img' | egrep -o 'href="[^"]+' | cut -c 7-)

            # get chapters; file names continue the numbering across pages
            key=0
            while read -r chapter; do
                echo "$((key+p_o+1)) - $chapter :: ${links[key]}"
                wget -q -O "$((key+p_o+1)).pdf" "http://springerlink.com/${links[key]}"
                key=$((key + 1))
            done < <(echo "$page" | egrep -o '^[[:blank:]]*<a href="/content/[^>]+&amp;pi=[[:digit:]]+">[^>]+</a>' | egrep -o '>[^<]+' | cut -c 2-)

            # stop unless the page has a title and an active "Next" link
            title_line=$(echo "$page" 2>/dev/null | grep -n -m 1 '<h2 class="MPReader_Profiles_SpringerLink_Content_PrimitiveHeadingControlName">' | egrep -o "^[[:digit:]]+")
            next_link=$(echo "$page" 2>/dev/null | grep -n -m 1 "<a href=\"/content/$(basename "$target")/?sortorder=asc&amp;p_o=$((p_o+10))\">Next</a>" | egrep -o "^[[:digit:]]+")
            if [[ "$next_link" == "" ]]; then
                break
            fi
            p_o=$((p_o + 10))
            # get whole page
            echo -n "Please wait, next page is being downloaded..."
            page=$(wget -q -O - "$target/?p_o=$p_o")
            echo "ok - done"
        done

        # back matter, numbered after the last chapter so that it sorts last
        wget -q -O "$((p_o + ${#links[@]} + 1))-back-matter.pdf" "$target/back-matter.pdf"

        echo "Merging PDF files"
        # file names contain no spaces here, so word splitting is safe
        pdftk $(ls | sort -n) cat output "../$title.pdf"
        cd ..
        #tar -cvjf "$title.tar.bz2" "$title"
        rm "$title"/*.pdf
        rmdir "$title"
        echo "Suck completed"
    else
        echo "unknown link type '$type'"
    fi

Wed, 02/25/2009 - 02:37, Milian Wolff

Great, thanks Faro!

I’ve added a note to the main article and took the liberty to enable syntax-highlighting for your code.

Fri, 02/20/2009 - 17:42, Faro (not verified)

Just a remark: you could search for the chapter count, e.g.

    <td>40 Chapters</td>

and run the script in a loop over each page, following <a href="/content/t64382/?sortorder=asc&amp;p_o=10">Next</a> until the disabled tag

    <span class="paginationDisabled">Next</span>

appears. Hope this helps.
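
The loop suggested here could be sketched roughly as follows. `has_next_page`, `walk_pages`, and `fetch_page` are hypothetical names; `fetch_page` is a stand-in that fakes three result pages so the sketch runs offline, whereas a real script would call `wget` at that point. The step size of 10 is taken from the `p_o=10` in the link above.

```bash
# Succeeds while the page still offers an active "Next" link.
has_next_page() {
    html=$1
    case $html in
        *'<span class="paginationDisabled">Next</span>'*) return 1 ;;
        *'">Next</a>'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Stand-in for the real download: offsets 0 and 10 have a Next link, 20 is the last page.
fetch_page() {
    if [ "$1" -lt 20 ]; then
        echo "<a href=\"/content/t64382/?sortorder=asc&amp;p_o=$(( $1 + 10 ))\">Next</a>"
    else
        echo '<span class="paginationDisabled">Next</span>'
    fi
}

# Visit offsets 0, 10, 20, ... until the Next link is disabled or missing.
walk_pages() {
    p_o=0
    while :; do
        page=$(fetch_page "$p_o")
        echo "processing offset $p_o"
        has_next_page "$page" || break
        p_o=$((p_o + 10))
    done
}

walk_pages
# prints:
# processing offset 0
# processing offset 10
# processing offset 20
```

Treating a missing Next link the same as a disabled one keeps the loop from running forever if the markup changes.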

Mon, 02/23/2009 - 16:08, Milian Wolff

Yes, I know that and I will fix it one day. But maybe I'll rewrite it in another language first, let's see!

Tue, 02/24/2009 - 23:00, Milian Wolff

The rewrite has started and is usable imo; take a look at http://milianw.de/code-snippets/take-2-download-script-for-springerlinkc…

Fri, 02/20/2009 - 17:10, Faro (not verified)

Thank you for your fix. It works now, but I've discovered a problem with books spread over more than one page, like this one:

http://springerlink.com/content/t64382/?p=a10f1da5c8604081a487cfce67924074

Do you have any idea how to solve this?

By the way, I've included pdftk `echo $( ls |sort -n)` cat output ../"$title".pdf to merge the PDF files into one file. The ordering is done by sort -n.
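
A small sketch of why the numeric sort matters once a book has ten or more chapter files. The file names below are a made-up listing, not output from a real download.

```bash
# Hypothetical directory listing of numbered chapter files.
names='10.pdf
2.pdf
0.pdf
1.pdf'

# Byte-wise sort puts 10.pdf before 2.pdf: 0.pdf 1.pdf 10.pdf 2.pdf
printf '%s\n' "$names" | LC_ALL=C sort

# Numeric sort restores chapter order: 0.pdf 1.pdf 2.pdf 10.pdf
printf '%s\n' "$names" | sort -n
```

Passing the numerically sorted list to pdftk is what keeps chapter 10 from being merged in right after chapter 1.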

Mon, 02/23/2009 - 16:07, Milian Wolff

Great idea, I’ll add that since that will make “Go to page” work once again.

Thu, 02/19/2009 - 23:03, Faro (not verified)

Hi,

did Springer change something, or is my script doing something wrong?

I always get:

    Please wait, link source is being downloaded...ok - done
    Validating link source...ok - done
    unknown link type ''

Thank you for your help

Fri, 02/20/2009 - 01:43, Milian Wolff

Thanks for the hint, I fixed it. You can find an updated version above.
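
For anyone hitting a similar "unknown link type" message, the type check from the script can be run against a saved page to see what it actually extracts. This is only a debugging sketch: the helper name `page_type` and both sample lines are assumptions, and it does not show what was changed in the fix.

```bash
# Same three-stage pipeline as the "check type" step of the script.
page_type() {
    grep -o '<span id="ctl00_PageHeadingLabel".*</span>' | grep -o '>.*<' | grep -E -o '[^<>]+'
}

# Markup the pipeline expects: prints "Book Chapter"
printf '%s\n' '<span id="ctl00_PageHeadingLabel">Book Chapter</span>' | page_type

# Markup with a different id: nothing is extracted, so the type stays empty
printf '%s\n' '<span id="SomethingElse">Book</span>' | page_type || echo "(no type found)"
```

An empty result is exactly the case that ends in `unknown link type ''`, so a changed `id` attribute on the site is one plausible thing to look for.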

Mon, 12/15/2008 - 18:20, Andreas (not verified)

Hi, thank you very much for your helpful script! However, I did find a small bug: if a book contains several chapters with the same name, only the first of them is downloaded and the others are omitted. Here is an example: http://springerlink.com/content/lp46u2/?p=4315112f571546c79595e6d1dd7552…

I tried to add numbers into the “chapter”-line, but this only gave me the numbers in the filenames. The script insisted on downloading only the first chapter.

Cheers (and thanks again), Andreas

Fri, 01/09/2009 - 16:53, Milian Wolff

OK, the script has been updated and should now handle chapter numbers correctly. It also handles downloading single book chapters.
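
As a rough illustration of the numbering scheme, here is how a running index keeps same-named chapters from ending up under one file name. `numbered_names` is a hypothetical helper and the chapter titles are invented; this shows only the naming idea, not the exact change made to the script.

```bash
# Reads chapter titles from stdin (one per line) and prints "<n>-<title>.pdf".
numbered_names() {
    n=0
    while IFS= read -r chapter; do
        n=$((n + 1))
        printf '%s-%s.pdf\n' "$n" "$chapter"
    done
}

printf 'Introduction\nExercises\nExercises\n' | numbered_names
# prints:
# 1-Introduction.pdf
# 2-Exercises.pdf
# 3-Exercises.pdf
```

Without the index, both "Exercises" chapters would be written to the same `Exercises.pdf`, and the second download would overwrite the first.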

Fri, 02/27/2009 - 17:34, Andreas (not verified)

Thanks! With the new version I’ll be able to get the rest of the interesting books on math&physics.

Cheers, Andreas

Sat, 12/20/2008 - 14:49, Milian Wolff

I’ll look into it and update the script. Thanks for the report!
