Using the Scopus api with xml output

| categories: xml, python, scopus | tags:

According to http://api.elsevier.com/documentation/AbstractRetrievalAPI.wadl , the native form of the Scopus abstract document is xml, and the full abstract cannot always be represented as json. So… I am going to just bite the bullet and learn to deal with the xml. This is a companion post to http://kitchingroup.cheme.cmu.edu/blog/2015/04/04/Making-highly-linked-bibliographies-from-the-Scopus-API/ . Most of the code in this post gets tangled out to scopus_xml.py. I know it is not totally robust yet, but I have been using it for a lot of analysis, and it works pretty well so far.

This is another long post, with code that probably runs off screen. You can see the end result of what we do in this post here: http://kitchingroup.cheme.cmu.edu/publications.html .

We start with a general function to return an xml elementtree. We build in some caching to avoid downloading things we already have; this is slow, and there are limits on how many times you can download.

import requests
import os
import xml.etree.ElementTree as ET

from my_scopus import MY_API_KEY

def get_abstract_info(EID, refresh=False):
    'Get and save the json data for EID.'
    base = 'scopus-xml/get_abstract_info'
    if not os.path.exists(base):
        os.makedirs(base)

    fname = '{0}/{1}'.format(base, EID)
    if os.path.exists(fname) and not refresh:
        with open(fname) as f:
            return ET.fromstring(f.read())

    # Otherwise retrieve and save results
    url = ("http://api.elsevier.com/content/abstract/eid/" + EID + '?view=META_ABS')
    resp = requests.get(url,
                    headers={'Accept':'application/xml',
                             'X-ELS-APIKey': MY_API_KEY})
    with open(fname, 'w') as f:
        f.write(resp.text.encode('utf-8'))

    results = ET.fromstring(resp.text.encode('utf-8'))

    return results

Next, we do some introspection to see what we have.

from scopus_xml import *
#results = get_abstract_info('2-s2.0-84896759135')
#results = get_abstract_info('2-s2.0-84924911828')
results = get_abstract_info('2-s2.0-84901638552')
for el in results:
    print el.tag
    for el1 in el:
        print '  -->',el1.tag
    print
{http://www.elsevier.com/xml/svapi/abstract/dtd}coredata --> {http://prismstandard.org/namespaces/basic/2.0/}url --> {http://purl.org/dc/elements/1.1/}identifier --> {http://www.elsevier.com/xml/svapi/abstract/dtd}eid --> {http://prismstandard.org/namespaces/basic/2.0/}doi --> {http://purl.org/dc/elements/1.1/}title --> {http://prismstandard.org/namespaces/basic/2.0/}aggregationType --> {http://www.elsevier.com/xml/svapi/abstract/dtd}srctype --> {http://www.elsevier.com/xml/svapi/abstract/dtd}citedby-count --> {http://prismstandard.org/namespaces/basic/2.0/}publicationName --> {http://purl.org/dc/elements/1.1/}publisher --> {http://www.elsevier.com/xml/svapi/abstract/dtd}source-id --> {http://prismstandard.org/namespaces/basic/2.0/}issn --> {http://prismstandard.org/namespaces/basic/2.0/}volume --> {http://prismstandard.org/namespaces/basic/2.0/}startingPage --> {http://prismstandard.org/namespaces/basic/2.0/}endingPage --> {http://prismstandard.org/namespaces/basic/2.0/}pageRange --> {http://prismstandard.org/namespaces/basic/2.0/}coverDate --> {http://purl.org/dc/elements/1.1/}creator --> {http://purl.org/dc/elements/1.1/}description --> {http://www.elsevier.com/xml/svapi/abstract/dtd}link --> {http://www.elsevier.com/xml/svapi/abstract/dtd}link --> {http://www.elsevier.com/xml/svapi/abstract/dtd}link {http://www.elsevier.com/xml/svapi/abstract/dtd}affiliation --> {http://www.elsevier.com/xml/svapi/abstract/dtd}affilname {http://www.elsevier.com/xml/svapi/abstract/dtd}authors --> {http://www.elsevier.com/xml/svapi/abstract/dtd}author --> {http://www.elsevier.com/xml/svapi/abstract/dtd}author

Now, some examples for myself to see how to get things.

from scopus_xml import *

results = get_abstract_info('2-s2.0-84901638552')

coredata = results.find('./{http://www.elsevier.com/xml/svapi/abstract/dtd}coredata')

print coredata.find('{http://www.elsevier.com/xml/svapi/abstract/dtd}srctype').text
print coredata.find('{http://www.elsevier.com/xml/svapi/abstract/dtd}source-id').text

#authors = results.find('./{http://www.elsevier.com/xml/svapi/abstract/dtd}authors')
#for author in results.find('./{http://www.elsevier.com/xml/svapi/abstract/dtd}authors'):
#    print author.find('{http://www.elsevier.com/xml/ani/common}indexed-name').text

for creator in coredata.find('{http://purl.org/dc/elements/1.1/}creator'):
    print creator.attrib

print coredata.find('{http://purl.org/dc/elements/1.1/}title').text
print coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}publicationName').text
print coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}volume').text
print coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}pageRange').text
print coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}coverDate').text
print coredata.find('{http://www.elsevier.com/xml/svapi/abstract/dtd}citedby-count').text
print coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}doi').text

for link in coredata.findall('{http://www.elsevier.com/xml/svapi/abstract/dtd}link'):
    if link.attrib['rel'] == 'scopus':
        print link.attrib['href']
    else:
        print link.attrib['href']

# alternative xpath to get the link
print coredata.find("./{http://www.elsevier.com/xml/svapi/abstract/dtd}link/[@rel='scopus']").attrib['href']
j 22746 {'auid': '55569461200', 'seq': '1'} Relating the electronic structure and reactivity of the 3d transition metal monoxide surfaces Catalysis Communications 52 60-64 2014-07-05 2 10.1016/j.catcom.2013.10.028 http://api.elsevier.com/content/abstract/scopus_id/84901638552 http://www.scopus.com/inward/record.url?partnerID=HzOxMe3b&scp=84901638552&origin=inward http://api.elsevier.com/content/search/scopus?query=refeid%282-s2.0-84901638552%29 http://www.scopus.com/inward/record.url?partnerID=HzOxMe3b&scp=84901638552&origin=inward

That is basically it. In the next sections, we basically recreate the previous functions from scopus.py using the xml data.

1 Authors

def get_author_link(EID):
    results = get_abstract_info(EID)
    authors = results.find('./{http://www.elsevier.com/xml/svapi/abstract/dtd}authors')
    if authors is None:
        return 'No authors found'
    s = []

    for author in authors:
        name = author.find('{http://www.elsevier.com/xml/ani/common}indexed-name').text
        auid = author.attrib['auid']
        s += ['<a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId={0}">{1}</a>'.format(auid, name)]

    return ', '.join(s)
from scopus_xml import *
print get_author_link('2-s2.0-84896759135')
print get_author_link('2-s2.0-84901638552')
<a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=8724572500">Thompson R.L.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=22981503200">Shi W.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=6506329719">Albenze E.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=23004637900">Kusuma V.A.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=55676869000">Hopkinson D.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=7003584159">Damodaran K.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=55005205100">Lee A.S.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=7004212771">Kitchin J.R.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=6701399651">Luebke D.R.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=24081524800">Nulwala H.</a>
<a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=55569461200">Xu Z.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=7004212771">Kitchin J.R.</a>

2 Journal

def get_journal_link(EID):
    results = get_abstract_info(EID)
    coredata = results.find('./{http://www.elsevier.com/xml/svapi/abstract/dtd}coredata')

    journal = coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}publicationName').text
    sid = coredata.find('{http://www.elsevier.com/xml/svapi/abstract/dtd}source-id').text
    s = '<a href="http://www.scopus.com/source/sourceInfo.url?sourceId={sid}">{journal}</a>'

    return s.format(sid=sid, journal=journal)
from scopus_xml import *
print get_journal_link('2-s2.0-84901638552')
<a href="http://www.scopus.com/source/sourceInfo.url?sourceId=22746">Catalysis Communications</a>

3 DOI link

def get_doi_link(EID):
    results = get_abstract_info(EID)
    coredata = results.find('./{http://www.elsevier.com/xml/svapi/abstract/dtd}coredata')
    doi = coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}doi')
    if doi is not None: doi = doi.text
    s = '<a href="https://doi.org/{doi}">doi:{doi}</a>'
    return s.format(doi=doi)
from scopus_xml import *
print get_doi_link('2-s2.0-84901638552')
doi:10.1016/j.catcom.2013.10.028

4 Abstract link

def get_abstract_link(EID):
    results = get_abstract_info(EID)
    coredata = results.find('./{http://www.elsevier.com/xml/svapi/abstract/dtd}coredata')

    data = get_abstract_info(EID)

    title = coredata.find('{http://purl.org/dc/elements/1.1/}title').text.encode('utf-8')
    link = coredata.find("./{http://www.elsevier.com/xml/svapi/abstract/dtd}link/[@rel='scopus']").attrib['href'].encode('utf-8')
    s = '<a href="{link}">{title}</a>'
    return s.format(link=link, title=title)
from scopus_xml import *
print get_abstract_link('2-s2.0-84901638552')
Relating the electronic structure and reactivity of the 3d transition metal monoxide surfaces

5 Citation image

def get_cite_img_link(EID):
    results = get_abstract_info(EID)
    coredata = results.find('./{http://www.elsevier.com/xml/svapi/abstract/dtd}coredata')
    doi = coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}doi')
    if doi is not None: doi = doi.text
    s = '<img src="http://api.elsevier.com/content/abstract/citation-count?doi={doi}&httpAccept=image/jpeg&apiKey={apikey}"></img>'

    return s.format(doi=doi, apikey=MY_API_KEY, cite_link=None)
from scopus_xml import *
print get_cite_img_link('2-s2.0-84901638552')

6 Getting it all together

def get_html_citation(EID):
    results = get_abstract_info(EID)
    coredata = results.find('./{http://www.elsevier.com/xml/svapi/abstract/dtd}coredata')
    s = '{authors}, <i>{title}</i>, {journal}, <b>{volume}{issue}</b>, {pages}, ({year}), {doi}, {cites}.'

    issue = ''
    if coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}issueIdentifier') is not None:
        issue = '({})'.format(    coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}issueIdentifier').text)

    volume = coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}volume')
    if volume is not None:
        volume = coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}volume').text
    else:
        volume = 'None'

    pages = ''
    if coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}pageRange') is not None:
        pages = 'p. ' + coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}pageRange').text
    elif coredata.find('{http://www.elsevier.com/xml/svapi/abstract/dtd}article-number') is not None:
        pages = coredata.find('{http://www.elsevier.com/xml/svapi/abstract/dtd}article-number').text
    else:
        pages = 'no pages found'


    year = coredata.find('{http://prismstandard.org/namespaces/basic/2.0/}coverDate').text

    return s.format(authors=get_author_link(EID),
                    title=get_abstract_link(EID),
                    journal=get_journal_link(EID),
                    volume=volume,
                    issue=issue,
                    pages=pages,
                    year=year,
                    doi=get_doi_link(EID),
                    cites=get_cite_img_link(EID))
from scopus_xml import *
print '<ol>'
print '<li>',get_html_citation('2-s2.0-84896759135'),'</li>'
print
print '<li>',get_html_citation('2-s2.0-84924911828'),'</li>'
print
print '<li>',get_html_citation('2-s2.0-84901638552'),'</li>'
print '</ol>'
  1. Thompson R.L., Shi W., Albenze E., Kusuma V.A., Hopkinson D., Damodaran K., Lee A.S., Kitchin J.R., Luebke D.R., Nulwala H., Probing the effect of electron donation on CO2 absorbing 1,2,3-triazolide ionic liquids, RSC Advances, 4(25), p. 12748-12755, (2014-03-17), doi:10.1039/c3ra47097k, .
  2. Xu Z., Kitchin J.R., Relationships between the surface electronic and chemical properties of doped 4d and 5d late transition metal dioxides, Journal of Chemical Physics, 142(10), 104703, (2015-03-14), doi:10.1063/1.4914093, .
  3. Xu Z., Kitchin J.R., Relating the electronic structure and reactivity of the 3d transition metal monoxide surfaces, Catalysis Communications, 52, p. 60-64, (2014-07-05), doi:10.1016/j.catcom.2013.10.028, .

7 Finally getting my documents

Here we get the EIDs from a search query. We use these in the next section to get a new bibliography.

import requests
import json
from my_scopus import MY_API_KEY
resp = requests.get("http://api.elsevier.com/content/search/scopus?query=AU-ID(7004212771)&field=eid,aggregationType&count=100",
                    headers={'Accept':'application/json',
                             'X-ELS-APIKey': MY_API_KEY})

results = resp.json()

return [[str(r['eid']), str(r['prism:aggregationType'])] for r in results['search-results']["entry"] if str(r['prism:aggregationType']) == 'Journal']
2-s2.0-84924911828 Journal
2-s2.0-84923164062 Journal
2-s2.0-84924778427 Journal
2-s2.0-84924130725 Journal
2-s2.0-84901638552 Journal
2-s2.0-84898934670 Journal
2-s2.0-84896759135 Journal
2-s2.0-84896380535 Journal
2-s2.0-84896585411 Journal
2-s2.0-84916613197 Journal
2-s2.0-84908637059 Journal
2-s2.0-84880986072 Journal
2-s2.0-84881394200 Journal
2-s2.0-84873706643 Journal
2-s2.0-84876703352 Journal
2-s2.0-84867809683 Journal
2-s2.0-84864914806 Journal
2-s2.0-84865730756 Journal
2-s2.0-84864592302 Journal
2-s2.0-84863684845 Journal
2-s2.0-84866142469 Journal
2-s2.0-84861127526 Journal
2-s2.0-80052944171 Journal
2-s2.0-80051809046 Journal
2-s2.0-79953651013 Journal
2-s2.0-79952860396 Journal
2-s2.0-77956568341 Journal
2-s2.0-77954747189 Journal
2-s2.0-77956693843 Journal
2-s2.0-77949916234 Journal
2-s2.0-77955464573 Journal
2-s2.0-72049114200 Journal
2-s2.0-73149124752 Journal
2-s2.0-73149109096 Journal
2-s2.0-67449106405 Journal
2-s2.0-63649114440 Journal
2-s2.0-60849113132 Journal
2-s2.0-58649114498 Journal
2-s2.0-40949100780 Journal
2-s2.0-33750804660 Journal
2-s2.0-20544467859 Journal
2-s2.0-15744396507 Journal
2-s2.0-9744261716 Journal
2-s2.0-13444307808 Journal
2-s2.0-3042820285 Journal
2-s2.0-2942640180 Journal
2-s2.0-0142023762 Journal
2-s2.0-0141924604 Journal
2-s2.0-0037368024 Journal
2-s2.0-0037197884 Journal

8 And my html bibliography

This generates my blog bibliography page..

from scopus_xml import *

import requests
import json
from my_scopus import MY_API_KEY
resp = requests.get("http://api.elsevier.com/content/search/scopus?query=AU-ID(7004212771)&field=eid,aggregationType&count=100",
                    headers={'Accept':'application/json',
                             'X-ELS-APIKey': MY_API_KEY})

results = resp.json()

data = [[str(r['eid']), str(r['prism:aggregationType'])] for r in
        results['search-results']["entry"] if str(r['prism:aggregationType']) == 'Journal']


with open('../publications.html.mako', 'w') as f:
    f.write('''<%inherit file="_templates/site.mako" />
<article class="page_box">
<%self:filter chain="markdown">

<h1>Online collections of our work</h1>
Pick your favorite:
<ul>
<li><a href="http://orcid.org/0000-0003-2625-9232">orcid:0000-0003-2625-9232</a></li>

<li><a href="http://www.researcherid.com/rid/A-2363-2010">researcherid:A-2363-2010</a></li>

<li><a href="http://www.scopus.com/authid/detail.url?origin=AuthorProfile&authorId=7004212771">scopusid:7004212771</a></li>

<li><a href="https://scholar.google.com/citations?user=jD_4h7sAAAAJ">Google Scholar</a></li>

<li><a href="https://www.researchgate.net/profile/John_Kitchin">Research Gate</a></li>

<li><a href="https://www.growkudos.com/profiles/40205">Kudos</a></li>
</ul>

<h1>Publications</h1>
The authors are linked to their Scopus page, the title linked to the Scopus abstract, the journal linked to the Scopus journal page, and the DOI is linked to https://doi.org which normally redirects you to the journal page.

<ol reversed="reversed">
''')

    for eid,type in data:
        f.write('<li>{}</li>'.format(get_html_citation(eid)))
    f.write('''</ol>

</%self:filter>
</article>
''')

9 Summary

The XML format is not that intuitive to me. It takes some practice writing robust code, e.g. sometimes the find command does not find anything, and then there is not text attribute to get, so you should check for success on finding things. Also, some text is unicode, and you have to take care to encode it, which my library does not do uniformly. Finally, not all journals have things like volume or issue. My formatting code is not super flexible, so these bibliography entries show None in them occasionally. Still, it is not too bad, and this enables a lot of analysis of your publications, as well as displaying them in different ways. See the result of this page here: http://kitchingroup.cheme.cmu.edu/publications.html

Copyright (C) 2015 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 8.2.10

Discuss on Twitter