Download script for springerlink.com Ebooks

After a long period of silence, I present the following bash script for downloading books from http://springerlink.com. This is not a way to circumvent their login mechanisms; you need proper access rights to download books. Many students in Germany, however, get free access to these ebooks via their universities. I, for example, study at the FU Berlin, so I put the script in my Zedat home folder and start the download via SSH from home. Afterwards I copy the resulting tarball to my home system.


Download the script (attached below), copy it to your Zedat account, make it executable, and run it. You have to pass it a link to a book-detail page as the only argument. Also take a look at the example call at the top of the script.

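Requires bash, wget, iconv, and egrep, along with standard tools such as cut, dirname, and tar (with bzip2 support for the final archive).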
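Before a long download run it can be worth checking that the required tools are actually installed on the remote machine. The following is a minimal sketch, not part of the original script; the function name `missing_tools` is made up for illustration.

```bash
#!/bin/bash
# Print the name of every tool from the argument list that is not on PATH.
# missing_tools is a hypothetical helper, not part of the download script.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check the tools named in the requirements line above.
missing=$(missing_tools bash wget iconv egrep)
if [ -n "$missing" ]; then
    echo "missing tools:" $missing
else
    echo "all required tools found"
fi
```

If anything is reported as missing, install it before starting the script.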

Note: Take a look at the comments. Faro has come up with an updated Bash script that properly handles ebooks spanning multiple pages on SpringerLink and merges the PDF files with pdftk. Thanks, Faro!

Note: For those who’d prefer a Python version over the Bash version, take a look at my second attempt at a download script. The Bash version is abandoned. Long live the Python version!

    #!/bin/bash

    if [[ -z "$1" ]]; then
        echo "Usage: $0 \"http://springerlink.com/content/.../?p=...\""
        exit 1
    fi

    target=$1

    # get whole page
    echo -n "Please wait, link source is being downloaded..."
    page=$(wget -q -O - "$target")
    echo "ok - done"

    echo -n "Validating link source..."

    # get title of page: first find the line number of the heading tag
    title_line=$(echo "$page" 2>/dev/null | grep -n -m 1 '<h2 class="MPReader_Profiles_SpringerLink_Content_PrimitiveHeadingControlName">' | egrep -o "^[[:digit:]]+")
    if [[ -z "$title_line" ]]; then
        echo "invalid URL"
        exit 1
    fi
    # grep -n counts from 1 and l from 0, so the loop stops on the line
    # right after the heading tag, which holds the title text
    l=0
    title=""
    while read -r line; do
        if [[ "$l" == "$title_line" ]]; then
            # trim surrounding whitespace and transliterate to ASCII
            title=$(echo "$line" | egrep -o "[[:alnum:]].+[[:alnum:]]" | iconv -f "UTF-8" -t "ASCII//TRANSLIT")
            break
        fi
        l=$((l + 1))
    done < <(echo "$page")
    if [[ -z "$title" ]]; then
        echo "invalid URL"
        exit 1
    fi
    echo "ok - done"
     
    # check type
    type=$(echo "$page" | grep -o '<span id="ctl00_PageHeadingLabel".*</span>' | grep -o '>.*<' | egrep -o '[^<>]+')

    if [[ "$type" == "Book Chapter" ]]; then
      echo "will download book chapter '$title'"
      echo

      wget -O "$title.pdf" "$(dirname "$target")/fulltext.pdf"
    elif [[ "$type" == "Book" ]]; then
      echo "will download book '$title'"
      echo

      mkdir "$title" 2>/dev/null
      cd "$title" || exit 1

      # get links to the chapter PDFs, in page order
      declare -a links
      key=0
      while read -r link; do
          links[${key}]=$link
          key=$((key + 1))
      done < <(echo "$page" | grep '/fulltext.pdf"><img' | egrep -o 'href="[^"]+' | cut -c 7-)

      # get front + back matter, numbered to sort before/after the chapters
      wget -O "0-front-matter.pdf" "$(dirname "$target")/front-matter.pdf"
      wget -O "$((${#links[@]}+1))-back-matter.pdf" "$(dirname "$target")/back-matter.pdf"

      # get chapters; titles and links are matched by position
      key=0
      while read -r chapter; do
          echo "$((key+1)) - $chapter    ::    ${links[${key}]}"
          chapter=$(echo "$chapter" | iconv -f "UTF-8" -t "ASCII//TRANSLIT")
          wget -O "$((key+1))-$chapter.pdf" "http://springerlink.com/${links[${key}]}"
          key=$((key + 1))
      done < <(echo "$page" | egrep -o '^[[:blank:]]*<a href="/content/[^>]+&amp;pi=[[:digit:]]+">[^>]+</a>' | \
                              egrep -o '>[^<]+' | cut -c 2-)

      # pack everything into a tarball and remove the working directory
      cd ..
      tar -cvjf "$title.tar.bz2" "$title"
      rm "$title"/*.pdf
      rmdir "$title"
    else
      echo "unknown link type '$type'"
      exit 1
    fi

Update 01/09/09:

- The script now includes chapter numbers in the file names
- The script can now handle links to single book chapters
- Minor other cleanup

Update 02/20/09:

- Fixed types

Update 02/24/09:

- Rewrote the script in Python

Attachment	Size
springer_download.sh	2.33 KB

Comments


Comment by Gabriel (not verified) (2012-12-13 16:16:00)

I tried the latest version from Github, but it merges the chapters in the wrong order! I have used the following commands:

./springer_download.py -c 978-3-642-23253-4

and

./springer_download.py -c 978-3-642-02507-5

I am using Debian Testing (Wheezy) with the latest updates (pdftk, not stapler). What is the mistake?

Comment by matze (not verified) (2012-04-07 20:08:00)

Can you rebuild this script so that it works with http://www.oldenbourg-link.com/? Here is an example: http://www.oldenbourg-link.com/isbn/9783486582451

I think both pages are similar, so this shouldn’t be a lot of work, right?

Please help!

Comment by Milian Wolff (2012-04-08 18:04:00)

Sorry but that’s not going to work. Since I have no use for that page, why should I spend time on that?

Comment by macdet (not verified) (2011-12-08 18:52:00)

Good stuff - but I prefer the python method!

thx4all

Comment by Faro (not verified) (2009-02-25 00:47:00)

Hi Milian,

I fixed the script in Bash, for those who like the simplicity of Bash. As for myself, I love your Python approach and will continue using that one. Thanks, Faro

    #!/bin/bash                                                                                                                                                   
     
    if [[ "$1" == "" ]]; then
        echo "Usage: $0 \"http://springerlink.com/content/.../?p=...\""
        exit 1                                                         
    fi                                                                 
     
    target=$1
     
    p_o=0
    # get whole page
    echo -n "Please wait, link source is being downloaded..."
    page=$(wget -q -O - "$target""/?p_o="$p_o"")                           
    echo "ok - done"                                         
    echo -n "Validating link source..."
    # get title of page
    title_line=$(echo "$page" 2>/dev/null | grep -n -m 1 '<h2 class="MPReader_Profiles_SpringerLink_Content_PrimitiveHeadingControlName">' | egrep -o "^[[:digit:]]+")
    if [[ "$title_line" == "" ]]; then                                                                                                                                
        echo "invalid URL"                                                                                                                                            
        exit 1                                                                                                                                                        
    fi
    l=0                                                                                                                                                               
    title=""                                                                                                                                                          
    while read line; do                                                                                                                                               
        if [[ "$l" == "$title_line" ]]; then                                                                                                                          
            title=$(echo "$line" | egrep -o "[[:alnum:]].+[[:alnum:]]" | iconv -f "UTF-8" -t "ASCII//TRANSLIT")                                                       
            break                                                                                                                                                     
        fi;                                                                                                                                                           
        l=$(expr $l + 1)                                                                                                                                              
    done < <(echo "$page")                                                                                                                                            
    if [[ "$title" == "" ]]; then                                                                                                                                     
        echo "invalid URL"                                                                                                                                            
        exit 1                                                                                                                                                        
    fi                                                                                                                                                                
     
    # check type
    type=$(echo "$page" | grep -o '<span id="ctl00_PageHeadingLabel".*</span>' | grep -o '>.*<' | egrep -o '[^<>]+')
     
    if [[ "$type" == "Book Chapter" ]]; then
      echo "will download book chapter '$title'"
      echo                                      
     
      wget -q -O "$title.pdf" "$(dirname $target)/fulltext.pdf"
    elif [[ $type == "Book" ]]; then
      echo "will download book '$title'"
      echo
     
      mkdir "$title" 2>/dev/null
      cd "$title" || exit 1
     
      # front matter sorts first as 0.pdf
      wget -q -O "0.pdf" "$target/front-matter.pdf"
      until [[ "$title_line" == "" ]]; do
          # get links
          declare -a links
          key=0
          while read link; do
              links[${key}]=$link
              key=$(expr $key + 1)
          done < <(echo "$page" | grep '/fulltext.pdf"><img' | egrep -o 'href="[^"]+' | cut -c 7-)

          # get chapters; $p_o is the chapter offset of the current page
          key=0
          while read chapter; do
              echo "$(($key+$p_o+1)) - $chapter :: ${links[${key}]}"
              chapter=$(echo $chapter | iconv -f "UTF-8" -t "ASCII//TRANSLIT")
              wget -q -O "$(($key+$p_o+1)).pdf" "http://springerlink.com/${links[${key}]}"
              key=$(expr $key + 1)
          done < <(echo "$page" | egrep -o '^[[:blank:]]*<a href="/content/[^>]+&amp;pi=[[:digit:]]+">[^>]+</a>' | egrep -o '>[^<]+' | cut -c 2-)

          # get title of page
          title_line=$(echo "$page" 2>/dev/null | grep -n -m 1 '<h2 class="MPReader_Profiles_SpringerLink_Content_PrimitiveHeadingControlName">' | egrep -o "^[[:digit:]]+")
          # stop when the page has no active "Next" link
          next_link=$(echo "$page" 2>/dev/null | grep -n -m 1 "<a href=\"/content/$(basename "$target")/?sortorder=asc&amp;p_o=$(($p_o+10))\">Next</a>" | egrep -o "^[[:digit:]]+")
          if [[ "$next_link" == "" ]]; then
              break
          fi
          p_o=$(( $p_o + 10 ))
          # get whole page
          echo -n "Please wait, next page is being downloaded..."
          page=$(wget -q -O - "$target/?p_o=$p_o")
          echo "ok - done"
      done

      # back matter: $key still holds the chapter count of the last page,
      # so this number follows the last chapter in the sort order
      wget -q -O "$(($key+$p_o+1))-back-matter.pdf" "$target/back-matter.pdf"

      echo "Merging PDF files"
      pdftk $(ls | sort -n) cat output "../$title.pdf"
      cd ..
      #tar -cvjf "$title.tar.bz2" "$title"
      rm "$title"/*.pdf
      rmdir "$title"
      echo "Suck completed"
    else
      echo "unknown link type '$type'"
    fi

Comment by Milian Wolff (2009-02-25 02:37:00)

Great, thanks Faro!

I’ve added a note to the main article and took the liberty to enable syntax-highlighting for your code.

Comment by Faro (not verified) (2009-02-20 17:42:00)

Just a remark: you could search for the chapter count, e.g.

    <td>40 Chapters</td>

and run the script in a loop for each page, following <a href="/content/t64382/?sortorder=asc&p_o=10">Next</a> until the

    <span class="paginationDisabled">Next</span>

“Disabled” tag appears. Hope this helps.
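The loop Faro describes can be sketched as follows. This is only an illustration under assumptions: `has_next_page`, `collect_offsets`, and `fake_fetch` are made-up names, the step of 10 chapters per page is taken from the `p_o=10` link above, and `fake_fetch` stands in for a real download such as `wget -q -O - "$target/?p_o=$offset"`.

```bash
#!/bin/bash
# Succeeds if the page HTML still offers an active "Next" link,
# i.e. the disabled pagination marker is absent.
has_next_page() {
    if printf '%s\n' "$1" | grep -q '<span class="paginationDisabled">Next</span>'; then
        return 1
    fi
    printf '%s\n' "$1" | grep -q '>Next</a>'
}

# Visit every result page; $1 names a function that prints the HTML
# for a given offset. Prints each offset that was visited.
collect_offsets() {
    local fetch=$1 offset=0 page
    while :; do
        page=$("$fetch" "$offset")
        echo "$offset"
        has_next_page "$page" || break
        offset=$((offset + 10))   # assumed page size, from the p_o=10 link
    done
}

# Hypothetical stand-in for the real download: three pages of results.
fake_fetch() {
    if [ "$1" -lt 20 ]; then
        echo "<a href=\"/content/t64382/?sortorder=asc&p_o=$(($1 + 10))\">Next</a>"
    else
        echo '<span class="paginationDisabled">Next</span>'
    fi
}

collect_offsets fake_fetch   # prints the offsets visited, one per line
```

Checking for the disabled marker first means the loop also stops if a page contains neither an active link nor the marker.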

Comment by Milian Wolff (2009-02-23 16:08:00)

Yes, I know that and I will fix it one day. But maybe I’ll rewrite it in another language first, let’s see!

Comment by Milian Wolff (2009-02-24 23:00:00)

The rewrite has started and is usable IMO, take a look at http://milianw.de/code-snippets/take-2-download-script-for-springerlinkc…

Comment by Faro (not verified) (2009-02-20 17:10:00)

Thank you for your fix. It now works, but I’ve discovered a problem with books spread over more than one page, like this one:

http://springerlink.com/content/t64382/?p=a10f1da5c8604081a487cfce67924074

Do you have any idea how to solve this?

By the way, I’ve included the following line to merge the PDF files into one file. The ordering is done by sort -n:

    pdftk $(ls | sort -n) cat output ../"$title".pdf
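The numeric sort matters once a book has ten or more parts, because the file names start with a chapter number. A small sketch of the ordering step on its own, with `ordered_parts` as a made-up helper name and without actually calling pdftk:

```bash
#!/bin/bash
# Print the given file names in the order they should be merged.
# sort -n compares the leading number, so 10.pdf lands after 2.pdf.
ordered_parts() {
    printf '%s\n' "$@" | sort -n
}

ordered_parts 10.pdf 2.pdf 0.pdf 1.pdf
```

A plain `sort` compares character by character instead, so it can put 10.pdf before 2.pdf and scramble the merged book.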

Comment by Milian Wolff (2009-02-23 16:07:00)

Great idea, I’ll add that since that will make “Go to page” work once again.

Comment by Faro (not verified) (2009-02-19 23:03:00)

Hi…

Did Springer change something, or does my script do something wrong?

I always get:

    Please wait, link source is being downloaded...ok - done
    Validating link source...ok - done
    unknown link type ''

Thank you for your help

Comment by Milian Wolff (2009-02-20 01:43:00)

Thanks for the hint, I fixed it. You can find an updated version above.
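For anyone hitting the same message: the type check depends on one specific heading label in the page source, so an empty type usually means that pattern no longer matches. Below is a sketch of the check in isolation, using the same grep chain as the script. The function name `page_type` and the sample markup are assumptions for illustration, and `grep -E` stands in for `egrep`.

```bash
#!/bin/bash
# Extract the page type ("Book", "Book Chapter", ...) from the heading label.
page_type() {
    printf '%s\n' "$1" \
        | grep -o '<span id="ctl00_PageHeadingLabel".*</span>' \
        | grep -o '>.*<' \
        | grep -E -o '[^<>]+'
}

# Hypothetical markup, modeled on the pattern the script searches for.
page_type '<span id="ctl00_PageHeadingLabel">Book Chapter</span>'

# If the label is missing or renamed, nothing matches and the type is empty,
# which leads to the message quoted in the report.
detected=$(page_type '<span id="SomeOtherLabel">Book</span>' || true)
echo "unknown link type '$detected'"
```

Running the chain by hand against a saved copy of the page is a quick way to see which of the three steps stops matching.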

Comment by Andreas (not verified) (2008-12-15 18:20:00)

Hi, thank you very much for your helpful script! However, I did find a small bug: when a book contains several chapters with the same name, only the first chapter is downloaded and the others are omitted. This is an example: http://springerlink.com/content/lp46u2/?p=4315112f571546c79595e6d1dd7552…

I tried to add numbers into the “chapter”-line, but this only gave me the numbers in the filenames. The script insisted on downloading only the first chapter.

Cheers (and thanks again), Andreas

Comment by Milian Wolff (2009-01-09 16:53:00)

OK, the script was updated and should now handle chapter numbers correctly. It also handles downloading single book chapters.
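The idea behind the fix is that the chapter number, not the title, makes each file name unique. A minimal sketch with a made-up helper name, `chapter_filename`. Replacing slashes in the title is an extra precaution added here and is not something the script above does.

```bash
#!/bin/bash
# Build a file name from the chapter position and title, so chapters with
# identical titles no longer overwrite each other.
chapter_filename() {
    local index=$1 title=$2
    # assumption: a "/" in a title would be read as a directory separator
    title=$(printf '%s' "$title" | tr '/' '-')
    printf '%s-%s.pdf\n' "$index" "$title"
}

chapter_filename 1 "Introduction"
chapter_filename 2 "Introduction"
```

The leading number also keeps the files in reading order when they are sorted numerically.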

Comment by Andreas (not verified) (2009-02-27 17:34:00)

Thanks! With the new version I’ll be able to get the rest of the interesting books on math&physics.

Cheers, Andreas

Comment by Milian Wolff (2008-12-20 14:49:00)

I’ll look into it and update the script. Thanks for the report!

Published on December 20, 2008.
