Thursday, October 2, 2014

From Confusion to Expertise - an experimental post on Medium

Trying something new this time. I have started to post on Medium. Read the post "From Confusion to Expertise" there.

Some brief impressions

  • The interface is indeed as clean and well designed as I have heard, allowing writers to knock out simple yet professional looking posts.

  • One of the selling points of Medium, the ability to easily submit to "collections", is gone. In the past you could submit your stories to any collection without a prior invite, and you could also submit one story to more than one collection. This mattered because readers followed collections rather than individual posters.

Some collections on libraries on Medium 

I have 262 followers on Medium. I think they found me mostly via Twitter

  • I wonder whether the interesting feature that allows comments at the paragraph level will encourage more comments.

  • Always wondered how much of my long rambling posts are read. Medium has stats on this.

As you can see, without any extra marketing via my own Twitter or Facebook networks, Medium posts don't seem to draw a lot of views after 1 day.


All in all, while Medium is interesting, I wonder if it really does anything unique enough to be worth abandoning existing platforms like Blogger or even Tumblr.

This holds whether one is thinking of doing it for institutional libraries or for academics thinking of using blogging to spread their work. I haven't gone out of my way to look for academics doing so, but I notice Adeline Koh, Associate Professor of Postcolonial Literature and Director of DH@Stockton at Richard Stockton College, doing so and creating an interesting collection on "Chinese Privilege in Singapore" (Singapore is 70% ethnic Chinese).

A small plug for Internet Librarian International 14

I don't do this often (or at all!) but would like to mention that the Internet Librarian International 2014 will be on at the end of October.

I was at the 2012 event (read about my experiences here) and I can sincerely say it remains my favourite conference so far. While I have gone to bigger conferences, e.g. ALA Annual, ILI 2014 seems to be my type of conference.

As I blogged back then "I suspect if you like most of the things I blog about here, the conference would be a natural fit for you." I have gone to a few more conferences since then, and my thoughts on this haven't changed.

Every year since then, I have received brochures for ILI, and the talks and speakers on display are always excellent with a good blend of speakers from England, Australia, US and Scandinavian countries etc. This year is no different with star speakers such as Jan Holmquist (on Gamification) attending.

It's a pity I can't go to ILI more often, but if you do have the funds, do consider ILI 2014.

Disclosure: I am listed as being on the advisory board of ILI 2014, though I am embarrassed to admit I haven't done much advising.

Wednesday, August 20, 2014

How academic libraries may change when Open Access becomes the norm

Like many academic library bloggers, I occasionally fancy myself as a "trend spotter" and am prone to attempts at predicting the future.

The trend that I am increasingly convinced is going to have a great impact on how academic libraries function is the rise of Open Access. As Open Access takes hold and eventually becomes the norm in the next 10-15 years, it will disrupt many aspects of academic library operations, and libraries will need to rethink the value-add they provide to universities.

The events of the past year have convinced me that the momentum for open access is nearly unstoppable and that the tipping point for open access has occurred or will occur soon.

To be fair, this is a pretty easy call to make. Richard Poynder, an independent journalist who has covered open access for over a decade and is as close to an independent observer on such matters as you can get (he claims not to be an open access advocate, though I find his views quite librarian friendly), says that open access is inevitable; the only question is how it will occur.

I find myself identifying with him as, unlike some librarians, I don't consider myself a really big open access advocate. The fact that I believe that open access will take hold fills me with neither sheer joy nor unhappiness.

That said I know enough to talk about it to most ordinary researchers in a general way after reading blog posts, articles and books on the topic.  I freely admit squabbles between open access advocates on the exact definition of open access, on the best way to provide/reach it etc often threaten to confuse me.

What I think I do have is some knowledge about some aspects of academic libraries. Some (but not all) open access advocates claim that the goals of the open access movement are more about access than affordability, and aren't really about solving the serials crisis (that might or might not occur depending on the route taken) or even about libraries or librarians. As I identify myself as a librarian, I love to think about what it means for academic libraries when open access becomes the norm.

This post is going to assume that sometime during my professional career in the next 10-25 years, 50%-80% or more of the annual output of new papers will be open access in some form. Whether this will be mostly through Gold OA or Green OA I do not know. Some models I have read also predict additional disruptions to the scholarly communication system that are not strictly necessary for open access, e.g. post peer review models.

I am not going to argue why I think open access is inevitable, though I think policy changes by governments are the most obvious reason, but feel free to leave comments if you disagree.

What I want to explore in this blog post is its impact on academic libraries. 

1. Libraries' roles in traditional discovery and even fulfilment/delivery for users will diminish

We've known for a long while that almost no student begins their research from the library homepage, and this is likely to hold even for researchers of the future, as the younger PhD students are showing a preference for non-library "web scale" tools like Google Scholar.

The same report showing that no-one began their search from the library homepage did show that in the end 56% of users did use library materials via cross referencing of information sources (e.g. Library Links in Google Scholar), so in the end the library did play a part in their research, though more in a fulfillment role and less in a discovery role.

This has prompted some to argue that academic libraries of the future should "think the unthinkable", focus on delivery of full-text and books, and give up on the discovery role. This is far from being the majority view. Dissenters say that such a move is defeatist, object that it is risky to rely on for-profit entities like Google for such an important role, or argue that libraries can provide personally tuned discovery layers that serve their communities better than search tools operating at the network level like Google Scholar or Mendeley.

But the rise of open access has the potential to disrupt even the delivery or fulfilment role. In an open access world where most articles or perhaps even books (open access models for books exist, as well as "all you can eat" subscription services like Scribd, Oyster, Amazon Prime) can be gotten for free, academic libraries' role in both discovery and fulfillment will be greatly diminished.

What proportion of articles are free online now? I've seen estimates that vary from 24% free articles (all years) in Google Scholar and Microsoft Academic Search  to as high as 48% for papers published in 2011.

Assuming the higher estimates for the newer articles are true (though I doubt it), we may already be at or near the tipping point of 50% of the annual output of new articles.

As it is, we already know from the  Ithaka S+R US Faculty Survey 2012 that when faculty can't get access from our library collection, they will search for free access online (80%). This option for searching for free copies is even more popular than ILL or document delivery. This strategy is going to become increasingly more effective as open access becomes the norm.

That explains why tools like Lazy Scholar, a Chrome extension that automatically scans every web page you are on to identify articles mentioned and provides a link to the PDF if a free version is available in Google Scholar, seem to be so popular.

You can expect tools like Lazy Scholar to become increasingly effective as the tide for Open Access turns.

Conversely, as argued in "The day library discovery died - 2035", web scale discovery services by libraries are likely to become even more irrelevant.

Lorcan Dempsey has been writing for years now about how researchers prefer gateways at the "network level" as opposed to the institutional level, but institutional discovery services have always had the advantage of showing all the journal articles you have immediate access to and nothing else and this can be helpful.

But in a world where the vast majority of journal articles are open access, we don't need institutional discovery services to make such distinctions.

Unless academic libraries can provide distinct reasons why their search services are better than what the likes of Google Scholar, Mendeley web search etc. can offer, e.g. personally tuned discovery layers, I can't see why we will need such institutional level discovery layers.

Collection development and electronic resource management are also going to be very different.

At extreme levels of open access say 75%, one wonders if there will be much of a team in the library working on traditional librarian duties of subscriptions and electronic resource management (parts relating to managing link resolvers, knowledgebase management etc).

Services relating to document delivery may diminish in importance as well.

2. Libraries might place a greater focus on Special Collections and move into publishing/hosting journals

So does this mean the technical services portion of academic libraries will be less important?

Not necessarily.

Most obviously, if the green route to open access takes off, perhaps along the lines of the "Immediate-Deposit/Optional-Access" model, more and more resources will be channeled towards the management of institutional repositories.

Beyond simply serving as a repository, some libraries are also experimenting with "layered journals", such as what University College London is doing. Essentially this involves libraries moving into the publishing business by converting institutional repositories into publishing platforms. For example, UCL Press is now a department within the institution's Library Services. Using the open source Open Journal Systems (OJS) and the institutional repository as a storage system, the library is publishing open access journals. There are also many open access journals published via Digital Commons.

Whether academic libraries have the skills, knowledge and incentive to play such a role and retake the scholarly communication system is a big question.

Beyond hosting open access journals, academic libraries will also probably put greater focus on their special collections.

As argued by Lorcan Dempsey, libraries will have to focus their energies on items with high uniqueness (held in few collections), in other words special collections. In the future, the prestige of an academic library will lie not in how many journal articles or books it can provide to its community, but in how much unique content the library makes available to the world.

Under such a model, academic libraries would perhaps resemble museums, carefully curating and preserving rare artifacts.

Similarly, in Can't Buy Us Love, Rick Anderson proposes that academic libraries should shift from what he calls "commodity documents" (common things you can purchase on the market place eg published journal articles, published books) towards "non-commodity documents" (rare unique material, grey literature etc).

He proposes we "devote a greater percentage of budget and staff time than we hitherto have to the management and dissemination of those rare and unique documents that each of us owns, that no one but the holder can make available to the world, that have the potential greatly to enrich the world of scholarship, and that can be made available outside of the commercial marketplace without damage to any participant in the scholarly communication system."

There are certain subtleties in the proposal that I suspect I miss, but I would argue that in a world where journal articles are available for free and are already efficiently discoverable by Google etc., we would be forced to follow Rick's proposal and focus on special collections, which will involve what Lorcan Dempsey calls the "Inside out" challenge. This would involve digitization/OCR, text transcription, creating metadata and making our special collections discoverable.

3. Libraries will have greater focus on value add expertise services such as information literacy, data management services, GIS etc to replace the diminishing "buyer" role

The Ithaka S+R US Faculty Survey 2012 shows that of all the roles academic libraries play, it is the role of a buyer that is by far the most important. Interestingly, 2012 is the first year since 2003 in which there is a fall in this area, though it is still by far the most important.

This fall could be insignificant, or it could perhaps point to the fact that increasingly more content is available free online between 2009 and 2012.

What is not in doubt is that if open access rises to become the norm, the role of the buyer by the library will definitely diminish. 

Somewhat discouragingly the other non-collection based roles such as facilitating teaching activities and research activities between 2009 and 2012 fell. But the survey notes that this could be due to a smaller proportion of humanities faculty doing the survey, so might not be a trend.

I am going to state the obvious but perhaps unpleasant truth. If faculty view the buyer role as paramount, open access is going to make it tricky to demonstrate the value of the library, as it will diminish the value that faculty want from us (at least for now).

It is hence critical for the survival of academic libraries in the coming years to provide value to faculty that goes beyond purely buying material.

Librarians should double down on providing expert assistance to faculty across the research cycle, whether it be research data management services, GIS services, bibliometrics or assisting in teaching activities, aka information literacy.

Open Access also creates roles for librarians as guides in the new scholarly communication landscape, helping clarify open access issues and terms to faculty who will need to adjust to the new publishing options. The greater the disruption to the landscape, the more librarians will be needed to guide and advise on, say, changes in the evaluation of research impact (post peer review, altmetrics etc.). Some will be given shiny new titles like "open access librarian", but most academic librarians who do outreach work will need to do the work as well. But will such roles only be short term due to the novelty of the issues?

Of course, some academic librarians reading this will protest and say that their institution is already doing most of this as opposed to purely collection centric roles, and indeed this varies from library to library. I worry, though, that the perception of academic libraries as buyers is going to be hard to shake.

4. Budgets of libraries might shrink

This is quite speculative, but how will library budgets be affected by open access? Looking at the ARL Library Investment Index, we see roughly 30%-50% of ARL library expenditure is on materials (majority will be on journals). How much of this will still be under the control of the library when open access reigns?

If savings do accrue from a revamped open access system, how much of these savings will be channeled to the academic library, or will they simply disappear from the budget?

Of course there is no certainty that in the open access world much savings will accrue. Some open access advocates such as Stevan Harnad fear that an excessive and premature focus on the Gold route to open access without what he calls a "leveraged transition" (achieving close to 100% self-archiving first, hence forcing published versions of PDFs to compete with author post-print PDFs, leading to reduced costs for APCs) might simply mean a transition to an open access environment in which publishers recapture the profits they formerly made from subscription journals, only this time via APCs (article processing charges).

Some models of Gold open access, also simply push the bill to funders and governments, and depending on the type of model, academic libraries may or may not be involved in managing funds for APCs.

I am not a specialist enough to weigh in on these matters, though Harnad's view seems to make sense to me.

In a sense, a smaller total library budget due to losing the need for a materials expenditure budget doesn't quite matter as long as other things remain constant, but would there be a reduction in the prestige of academic libraries?

More worryingly on a very pessimistic view,  if academic libraries are not prepared for the transition and do not make a strong enough case for the value of their operations to replace the role of a buyer, staff cutbacks might occur.

5. Modernising Referencing practices

This is more an intriguing proposal than a prediction, and it comes from "Academic citation practices need to be modernized - References should lead to full texts wherever possible".

The article makes a now fairly standard observation that legacy referencing practices are broken because they do not take into account the shift towards a digital online environment (why shouldn't we simply link to a doi for example) as well as changes in the Scholarly communication system.

There are a lot of fascinating ideas in there, but I find that the most interesting idea relates to open access.

"With open access spreading now we can all do better, far better, if we follow one dominant principle. Referencing should connect readers as far as possible to open access sources, and scholars should in all cases and in every possible way treat the open access versions of texts as the primary source."

He suggests that if a final published version of an article exists behind a paywall and a preprint or postprint exists online, referencing should link to the freely available version.

Here's the order of preference he suggests for referencing articles, depending on where they are available:

  1. Open Access Journal 
  2. Hybrid Journal 
  3. University institutional repository
  4. Other "widely used" open access site - He mentions Researchgate or Subject repositories like SSRN would fit here too. 

Only if none of these is available should one reference the paywalled version as a primary source.
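The preference order above can be sketched as a simple lookup. This is only an illustrative Python sketch: the labels, the function name `pick_primary_source` and the example URLs are hypothetical and do not come from the article being discussed.

```python
# Preference order as listed above; the label names are made up for this sketch.
PREFERENCE = [
    "open_access_journal",
    "hybrid_journal",
    "institutional_repository",
    "other_open_access_site",  # e.g. ResearchGate or a subject repository like SSRN
    "paywalled_version",       # only used when nothing else is available
]


def pick_primary_source(available):
    """Return (label, url) for the most preferred version that is available.

    `available` maps a version label to a URL. Returns None if it is empty
    or contains none of the known labels.
    """
    for label in PREFERENCE:
        if label in available:
            return label, available[label]
    return None


versions = {
    "paywalled_version": "https://example.org/paywalled",
    "institutional_repository": "https://example.org/repository-copy",
}
print(pick_primary_source(versions))
```

Here the repository copy wins over the paywalled version because it appears earlier in the preference list.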


Assuming open access is inevitable, I feel it is only a slight exaggeration to say that the upcoming disruption to academic libraries will potentially be bigger than the shift from print to digital was for librarians. For good or ill, in the last 20-30 years or so providing access to journal articles behind paywalls was the major purpose of academic libraries as seen by faculty and students, and open access will change that.

In a way, I suppose none of the consequences in this blog post is particularly earthshaking assuming open access occurs, but is there sufficient reason to believe that open access is inevitable? I know many librarians who disagree and think it's not so simple.

Even if it does occur, how fast will the transition occur? Will it be gradual, allowing academic libraries to slowly transition operations and competencies, or will it be a dramatic shift catching us off-guard?

What would be some signals or signs that open access is gaining ground and it might be time to scale back on traditional activities? Downloads per FTE for subscribed journals starting to trend downwards? Decreasing library homepage hits? At what percentage of annual output that is open access do you start scaling back?

Much of what this blog post says about open access and its benefits is drawn from the State of Open Access interviews by Richard Poynder.

Sunday, July 27, 2014

Size of Google Scholar vs other indexes, personally tuned discovery layers & other discovery news

Regular readers of my blog know that I am interested in discovery, and the role academic libraries should play in promoting discovery for our patrons.

If you feel the same, here is a mix of links I came across recently on the topic that might be of interest.

The Number of papers in Google Scholar is estimated to be about 100 million

When talking about discovery one can't avoid discussion of Google Scholar. My last blog post, "8 surprising things I learnt about Google Scholar", raced into my top 20 most-read posts of all time in just 3 weeks, showing intense interest in this subject.

As such, "The Number of Scholarly Documents on the Public Web" is a fascinating paper that attempts to estimate the number of scholarly documents on the public web using the capture/recapture method; in particular, it gives a figure for the number of papers in Google Scholar.

This is quite an achievement, since Google refuses to give this information.

It took me a while to wrap my head around the idea, but essentially the paper does the following:

  • It defines the number of scholarly documents on the web as the sum of the papers in Google Scholar (GS) and Microsoft Academic Search (MAS)
  • It takes the stated number of papers in  MAS to be a bit below 50 million.
  • It calculates the amount of overlap in papers found in both GS and MAS. This overlap needs to be calculated via sampling of course.
  • The overlap is calculated using papers that cite 150 selected papers. 
  • Using the Lincoln–Petersen method, the overlap of papers found and the given value of about 50 million papers in MAS, one can estimate the number of papers in Google Scholar and hence the total sum of papers on the public web. (You may have to take some time to understand this last step, it took me a while for sure)
There are other technicalities: for example, the paper estimates only English Language papers, and it is careful to sample papers with fewer than 1,000 cites (because GS allows at most 1,000 results to be shown).
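For readers who find the last step easier to follow as arithmetic, here is a minimal Python sketch of the capture/recapture reasoning. It is one reading of the steps listed above, not the paper's actual procedure or code, and it treats the estimated population as the "public web" total. The function names and the toy numbers are made up purely for illustration.

```python
def lincoln_petersen(n_first, n_second, n_both):
    """Classic capture/recapture estimate of a population size.

    n_first and n_second are the sizes of two independent "catches",
    n_both is the number of items that appear in both.
    """
    if n_both <= 0:
        raise ValueError("need at least one item seen in both catches")
    return n_first * n_second / n_both


def estimate_index_sizes(known_mas_size, seen_in_gs, seen_in_mas, seen_in_both):
    """Estimate (total population, GS size) from a sample of citing papers.

    seen_in_both / seen_in_gs estimates the share of the population that
    MAS covers, so the total is the known MAS size divided by that share.
    seen_in_both / seen_in_mas estimates the share GS covers in the same way.
    """
    if seen_in_both <= 0:
        raise ValueError("need at least one sampled paper found in both")
    total = known_mas_size * seen_in_gs / seen_in_both
    gs_size = total * seen_in_both / seen_in_mas
    return total, gs_size


# Toy numbers only (not taken from the paper): a known index of 50 units,
# 80 sampled papers found in GS, 40 found in MAS, 20 found in both.
total, gs_size = estimate_index_sizes(50, 80, 40, 20)
print(total, gs_size)
```

With these toy numbers the sketch gives a total of 200 and a GS size of 100, in whatever unit the known MAS size is expressed in.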

For more see also How many academic documents are visible and freely available on the Web? which summarises the paper, and assesses the strengths and weaknesses of the methodology employed in the paper.

The major results are 

  1. Google Scholar has estimated 99.3 million English Language papers and in total there are about 114 million papers on the web (where web is defined as Google Scholar + MAS)
  2.  Roughly 24% of papers are free online
The figures here are thought to be a lower bound, but they are still interesting as they provide an estimate of the size of Google Scholar. Is 99.3 million a lot?

Here are some comparable systems and the sizes of indexes I am aware of as of July 2014. Scopes might be slightly different but will focus mostly on comparing scholarly or peer reviewed articles which are the bulk of most indexes anyway. I did not adjust for including English Language articles only though many of them do allow filtering for that. 
  • Pubmed - 20-30 million - the go to source for medical and life sciences area.
  • Scopus - 53 million - mostly articles/conference proceedings but now includes some books and book chapters. This is one of the biggest traditional library A&I databases; its main competitor Web of Science is roughly at the same level but with more historical data and fewer titles indexed.
  • Base - 62 million - drawn from open access institutional repositories. Mostly but not 100% open access items and may include non-article items
  • CrossRef metadata Search - 67 million - Indexed dois - may include book or book chapters. 
So far these are around the level of Microsoft Academic Search at about 50 million.

Are there indexes that are comparable to Google Scholar's roughly 100 million? Basically the library webscale discovery services are the only ones at that level

  • Summon - 108 million - Scholarly material facet on + "Add beyond library collection" + authenticated = including restricted A&I records from Scopus, Web of Science and more. (Your instance of Summon might have more or less depending on A&I subscribed and size of catalogue, Institutional repositories). 
  • Worldcat - 2.1 billion holdings of which 148 million are peer reviewed, 203 million articles [as of Nov 2013]
I am unable to get at figures for the other 2 major library webscale discovery services - Ebsco Discovery Service and Primo Central, but I figure they should be roughly at the same level.

108 million scholarly items in Summon - may vary for your Summon instance

  • Mendeley - 181 million ? This is an interesting case, Mendeley used to list the number of papers in their search but have removed it. The last figure I could get at is 181 million (from wayback machine), which fits with some of the statements made online but looks a bit on the high side to me. 

With the exception of Mendeley, I would think the figures I've given above tend to be pretty accurate (subject to the issues of deduping etc.), at least compared to the estimates given in the paper.

I think the fact that web scale discovery services are producing results on the same scale (>100 million) suggests that the estimated Google Scholar figure is in the right ballpark.

Still my subjective experience is that it seems that Google Scholar tends to have substantially more than our library web scale discovery service, so I suspect the 99.3 million obtained for Google Scholar is an underestimate. 

I wonder if one could use the same methodology as in The Number of Scholarly Documents on the Public Web to estimate the size of Google Scholar but using Summon or one of the other indexes mentioned above to measure overlap instead of Microsoft Academic Search.

There are some advantages

For example, there is some concern that the size of Microsoft Academic Search assumed in the paper (48.7 million) is not accurate, but the figures given for, say, Summon are likely to be more accurate (again, issues with deduping aside).

It would also be interesting to see how Google Scholar fares when compared to an index that is about twice as large as MAS.

Would using a web scale library discovery service to estimate the size of Google Scholar give a similar figure of about 100 million? 

Arguably not, since we are talking about different populations, i.e. MAS + GS vs Summon + GS, though both can be seen as a rough estimate of the size of scholarly material available in the world that can be discovered online. (Also, can the results you find in Summon be considered the "public web" if you need to authenticate before searching to see a subset of results from A&I databases like Scopus?)

The main issue, though, with trying to use Summon or anything similar in place of MAS is, I think, a technical one.

The methodology measures overlap in a way that has been described as "novel and brilliant". Instead of running the same query on the 2 search engines and looking for overlaps, they do it this way:

"If we collect the set of papers citing p from both Google Scholar and MAS, then the overlap between these two is an estimate of the overlap between the two search engines." 

Unfortunately none of the web scale discovery services have a cited by feature (they do draw on and display Scopus and Web of Science cited counts, but that's a different matter).

One can fall back on older methodologies and measure overlap by running the same query on GS and Summon, but this has drawbacks described as "bias and dependence" issues.

Boolean versus ranked retrieval - clarified thoughts

My last blog post, "Why Nested Boolean search statements may not work as well as they did", was pretty popular, but what I didn't realise was that I was implicitly saying that relevance ranking of documents retrieved using Boolean operators did not generally work well.

This was pointed out by Jonas 

I tweeted back asking why we couldn't have good ranked retrieval on documents retrieved using Boolean operators, and he replied that he thinks it's based on two different mindsets and one should either "trust relevance or created limited sets."

On the opposite end, Dave Pattern of Huddersfield reminded me that Summon's relevancy ranking is based on the open source Lucene software with some amount of tweaking. Some details are available, but essentially it is designed to combine Boolean with vector space models etc., i.e. it is designed to do Boolean + ranked retrieval.

After reading through some documentation and the excellent "Boolean versus ranked querying for biomedical systematic reviews", I realized my thinking on this topic was somewhat unclear.

As a librarian, I have always assumed it simply makes sense to (1) pull out possibly relevant articles using Boolean operators and (2) rank them using various techniques, from classic tf-idf factors to more modern techniques like link popularity.

I knew of course, there were 2 paradigms, that the classic Boolean set retrieval assumed every result was "relevant" and did not bother with ranking beyond sorting by date etc. But it still seemed odd to me not to try to at least to add ranking. What's the harm right?

The flip side was, what is ranked retrieval by itself? If one entered SINGAPORE HISTORICAL BUILDINGS ARCHITECTURE, it would still be ranking all documents that had all 4 terms right?(maybe with stemming) or wasn't it really still Boolean with ranking?

The key I was missing, which now seems obvious, is that in ranked retrieval paradigms not every search term in the query has to be matched.

I know those knowledgeable in information retrieval reading this might think this is obvious and that I am dense for not realizing it. I guess I did know this, except that as a librarian I am so trapped in Boolean thinking that I assume implicit AND is the rule.

In fact, we like to talk about how Google and some web searches do "Soft AND", and kick up a fuss when they sometimes drop one or more search terms. But in ranked retrieval that's what you do: you throw in a "bag of words" (could be a whole paragraph of words), and the ranking algorithm tries to do the best it can, but the documents it pulls up may not have all the words in the query.
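The difference between the two paradigms can be shown with a toy example. The Python sketch below is purely illustrative: the three tiny "documents", the function names and the simple tf-idf style weighting are assumptions made up for this example, and it is not how Summon, Lucene or Google actually score results.

```python
import math
from collections import Counter

# Three made-up documents, keyed by a hypothetical id.
DOCS = {
    "d1": "singapore historical buildings architecture",
    "d2": "colonial architecture of historical buildings in malaya",
    "d3": "singapore food guide",
}


def tokenize(text):
    return text.lower().split()


def boolean_and(query, docs):
    """Strict Boolean AND: a document matches only if it has every term."""
    terms = set(tokenize(query))
    return sorted(d for d, text in docs.items() if terms <= set(tokenize(text)))


def ranked(query, docs):
    """Bag-of-words ranking: any matching term adds to the score."""
    n_docs = len(docs)
    counts = {d: Counter(tokenize(text)) for d, text in docs.items()}
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for c in counts.values() for term in c)
    scores = {}
    for d, c in counts.items():
        score = sum(
            c[t] * math.log(1 + n_docs / doc_freq[t])  # rarer terms weigh more
            for t in set(tokenize(query))
            if c[t]
        )
        if score > 0:
            scores[d] = score
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


query = "SINGAPORE HISTORICAL BUILDINGS ARCHITECTURE"
print(boolean_and(query, DOCS))
print(ranked(query, DOCS))
```

The Boolean search returns only the one document containing all four terms, while the ranked search also returns the two documents that match only some of the terms, just lower down the list.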

"Boolean versus ranked querying for biomedical systematic reviews" is a particularly interesting paper. It shows how different search algorithms fare in terms of retrieving clinical studies for systematic reviews, ranging from straight Boolean, to ranked retrieval techniques that involve throwing in titles and abstracts, to hybrid techniques that combine Boolean with ranked retrieval.

It's an amazing paper, with different metrics and a good explanation of systematic reviews if you are unfamiliar with them. Particularly interesting is that they compare Boolean Lucene results, which I think gives you a hint of how Summon might fare.

The best algorithm for ranking might surprise you.... 

Read the full paper to understand the table! 

Large search indexes like Google Scholar and discovery services flatten knowledge, but is that a good thing?

Like many librarians, I have an obsession with the size of databases, but is that really important?

Over at Library Babel Fish, Barbara Fister, in "The Library isn't flat", worries that academic libraries' discovery services "are (once again) putting too high a value on volume of information and too little on curation".

She ends with the following questions:

"Is there some other way that libraries could enable discovery that is less flat, that helps make the communities of inquiry and the connections between ideas easier to follow? Is there a way to help people who want to join those conversations see the patterns and discern which ideas were groundbreaking and significant and which are simply filling in the details? Or is curation and connection too labor-intensive and inefficient for the globalized marketplace of ideas?"

Which makes the next section interesting....

Library Top Trends - Personally tuned discovery layers 

Ken Varnum at the recently concluded LITA Top Technology Trends Sessions certainly thinks that what is missing in current Library discovery services is the ability for librarians to provide personally tuned discovery layers for local use.

He would certainly think that there is value in librarians slicing the collections into customized streams of knowledge to suit local conditions. You can jump to his section on this trend here. Also, Roger Schonfeld's section on anticipatory discovery for current awareness of new publications is interesting as well.

To Barbara Fister's question on whether curation is too labour intensive or inefficient, Ken would probably answer no, and he suggests that in the future librarians can customize collections based on subject as well as appropriateness of use (e.g. an undergraduate vs a scholar).

It sounds like a great idea, since Summon and Ebscohost discovery layers currently provide hardcoded discipline sets and I can imagine eventually being able to create subject sets based on collections at the database and/or at the journal title levels (shades of the old federated search days or librarians creating Google custom search engines, eg one covering NGO Sites or Jurn (open access in humanities)).

At the even more granular level, I suppose one could also pull from reading lists etc.

Unlike Ken, though, I am not 100% convinced that it would take just "a little bit of work" to make this worthwhile, or at least better than the hardcoded discipline sets.

NISO Publishes Recommended Practice on Promoting Transparency in Library Discovery Services

NISO RP-19-2014, Open Discovery Initiative: Promoting Transparency in Discovery [PDF] was just published.

Somewhat related is the older NFAIS Recommended practices on Discovery Services [PDF]

I've gone through it, as well as the "EBSCO supports recommendations of ODI" press release, and I am still digesting the implications, but clearly there is some disagreement about the handling of A&I resources (not that shocking).

Discovery Tools, a Bibliography

Highly recommended resource - this is a bibliography by François Renaville. Very comprehensive, covering papers from 2010 onwards.

It is a duplicate of the Mendeley Group "Libraries & [Web-Scale] Discovery Tools".

Ebsco Discovery Layer related news

Ebsco has launched a blog "Discovery Pulse" with many interesting posts. Some tidbits

Note : I am just highlighting Ebsco items in this post because of their new blog as the blog may be of interest to readers. I would be happy to highlight Primo, Summon, WorldCat discovery service items when and if I become aware of them. 

Summon Integrates Flow research management tool.

It was announced that in July, Summon will integrate with Proquest Flow, their new cloud based reference management tool.

The word Login is extremely misleading in my opinion. 

I have very little information about this and how overt the integration will be. But given that Mendeley was acquired by Elsevier, Papers by Springer, it's no wonder that Proquest wants to get into the game as well.

It's all about trying to get into the researcher's workflow, and since, unfortunately, "discovery happens elsewhere" increasingly, it would be smart to focus on reference management, an area that the likes of Google currently seem to be ignoring (though moves like Scholar Library, where one can add citations found in Google Scholar to your own personal library, may say otherwise).

Mendeley for certain has shown that reference management is a very powerful place to start to get a digital foothold.

While it's still early days, currently Flow seems to have pretty much the standard features one sees in most modern reference managers eg. Free up to 2GB storage, support of Citation Style Language (CSL), capabilities for collaboration etc. I don't see any distinguishing features or unique angles yet.

Here's a comparison in terms of storage space for the major competitors such as Mendeley.

The webinar I attended on it (sorry don't have link to recording) suggests Proquest has big plans for Flow, beyond a reference manager. It will aim to support the whole research cycle, and I think this includes support as a staging ground for publication (submission to PQDT??), as well as support of prepub works (posting to Institutional or Subject repositories?).

It will be interesting to see if Proquest will try to leverage its other assets such as Summon to support Flow. Eg. Would Proquest tie recommender services drawn from Summon usage into it?

Currently you can turn off Flow from Summon without much ill effect, and it seems some libraries have done so because it may take time to evaluate it and prepare staff to support it. It remains to be seen whether, in the long run, Flow might just have too many features and too much value to be turned off.

BTW, if you want to keep up with articles, blog posts, videos etc on web scale discovery, do consider subscribing to my custom magazine on Flipboard (currently over 1,200 readers) or looking at the bibliography on web scale discovery services.

Monday, July 14, 2014

Why Nested Boolean search statements may not work as well as they did

At library school, I was taught the concept of nested boolean. In particular, I was taught a particular search strategy which goes like this.

  • Think of a research topic
  • Break it up into major concepts - typically 3 or more - eg A, B, C
  • Identify synonyms for each concept (A1, A2, A3 ; B1, B2, B3 ; C1, C2, C3)
  • Combine them in the following manner

(A1 OR A2 OR A3) AND (B1 OR B2 OR B3) AND (C1 OR C2 OR C3)
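The pattern is mechanical enough that it can be generated. Here is a minimal Python sketch of it; the function name `nested_boolean` and the example terms are my own illustration, not any database's API, and real platforms differ in how they treat quotes and wildcards.

```python
def nested_boolean(concepts):
    """Build (A1 OR A2 ...) AND (B1 OR B2 ...) from a list of synonym lists."""
    groups = []
    for synonyms in concepts:
        # Quote multi-word phrases so they are searched as a phrase.
        terms = [f'"{t}"' if " " in t else t for t in synonyms]
        groups.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(groups)


# Hypothetical concepts and synonyms, for illustration only.
concepts = [
    ["teen*", "youth", "adolescent"],
    ["depression", "anxiety"],
    ["primary care", "general practice"],
]
print(nested_boolean(concepts))
```

Each inner list is one concept, and every synonym you add to a list widens that concept's OR group.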

We, like many libraries, have created videos on it as well.

If you are an academic librarian who has even taught a bit of information literacy, I am sure this is something you show in classes. You probably jazzed it up by including wildcards (such as teen*) as well.

Databases also encourage this search pattern

I am not sure how old this technique is, but around 2000 or so databases also started to encourage this type of structured search.

Above we see Ebscohost platform and in my institution this "Advanced search" is set to default. You can see a similar UI (whether as default or advanced search) in JSTOR, Engineering Village, Proquest platforms etc.

A lecturer when I was in library school even claimed credit (perhaps jokingly) for encouraging databases into this type of interface.

Recently I noticed a slight variant on this theme where the default search would show only one search box (because "users like the Google one box" according to a webinar I attended), but if you clicked on "add field" or similar you would see a similar interface. Below shows Scopus.

After clicking Add search field, you get the familiar structured/guided search pattern.

You see a similar idea in the latest refresh of Web of Science, a default single search box but with an option to expand it to a structured search pattern. Below we see Web of Science with "Add another field" selected twice.

Lastly, even Summon 2.0, which generally has a philosophy of keeping things simple, got into the act and, from what I understand under pressure from librarians, finally came up with an advanced search that brought tears of joy to power users.

But are such search patterns really necessary or useful?

In the first few years of my librarianship career, I taught such searches in classes without thinking much of it. 

It feels so logical, so elegant, it had to be a good thing right? Then I began studying and working on web scale discovery services, and the first doubts began to appear. I also started noticing when I did my own research I rarely even did such structured searches.

I also admit to being influenced by Dave Pattern's tweets and blog posts, but I doubt I will ever be as strongly in the anti-boolean camp.

But I am going to throw caution to the wind and try to be controversial here and say that I believe increasingly such a search pattern of stringing together synonyms of concepts generally does not improve the search results and can even hurt them.

There is of course value in doing this exercise of thinking through the concepts and figuring out the correct language used by scholars in your discipline, but most of the time doing so does not improve the search results much, especially if you are simply putting in common variants of words, eg different variants of say PREVENT or ECONOMIC, which is what I see many searches do.

That's because many of the search systems we commonly use are increasingly no longer well adapted to such searches, even though they used to be in the past.

Our search tools in the past

Think back to the days of the dawn of the library databases. They were characterized by the following

  1. Metadata (including subject terms) + abstract only - did not include full text
  2. Precise searching - what you enter is what you get
  3. Low levels of aggregation - a "large database" would maybe have 1 million items if you were lucky

In such conditions, most searches you ran had very few results. If you were unlucky you would have zero results.


Firstly, the search matched only over metadata + abstract and not full text. So if you searched for "Youth" and it just happened that the author decided on using "Teenager" in the abstract and title, you were sunk.

Also this was compounded by the fact that in those days, searches were also very precise. There was no autostemming that automatically covered variants of words (including British vs American spelling), so you had to be careful to include all the variants such as plurals, and other related forms. 

Lastly, it is hard to imagine in the days of Google Scholar, with an estimated 100 million documents (and web scale discovery systems that could potentially match that), but in those days databases were much smaller and fragmented, with much smaller indexes, and as such the most common result would be zero hits or at best a few dozen hits.

Summon full index (Scholarly filter on) showing about 100 million results

This is why the (A1 OR A2 OR A3) AND (B1 OR B2 OR B3) AND (C1 OR C2 OR C3) nested boolean technique was critical to ensure you expanded the extremely precise search to increase recall.
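To make the recall argument concrete, here is a toy sketch of an "old school" search: exact word matching over abstract-level text only, with no stemming. The records and the function name `exact_search` are made up for illustration and do not model any real database.

```python
# Hypothetical abstract-only records; the titles are invented.
records = {
    1: "Coping strategies among teenagers in urban schools",
    2: "Youth unemployment and mental health",
    3: "Adolescents and peer pressure: a survey",
    4: "Adult learners in community colleges",
}


def exact_search(any_of):
    """Return ids of records containing at least one term as an exact word."""
    wanted = {t.lower() for t in any_of}
    hits = set()
    for rec_id, text in records.items():
        # No stemming: "teenager" will not match "teenagers".
        words = {w.strip(":,.").lower() for w in text.split()}
        if words & wanted:
            hits.add(rec_id)
    return sorted(hits)


print(exact_search(["youth"]))      # only the record that literally says "youth"
print(exact_search(["teenager"]))   # nothing, because the record uses the plural
print(exact_search(["youth", "teenager", "teenagers", "adolescent", "adolescents"]))
```

With exact matching, each synonym and plural you OR in is the only way to pick up the records that use it, which is the situation the nested boolean technique was built for.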

Add the fact that search systems like Dialog charged per search or by connect time - so it was extremely important to craft the near-perfect search statement in one go to do efficient searching.

I will also pause to note that relevancy ranking of results could be available but when you have so few results that you could reasonably look through say 100 or less, you would just scan all the results, so whether it was ranked by relevancy was moot really.

Today's search environment has changed

Fast forward to today.

Full-text databases are more common. In fact, to many of our users and younger librarians, "databases" would imply full-text databases, and they would look on in dismay when they realized they were using an abstracting and indexing database and wonder why in the world people would use something that might not give them the instant gratification of a full text item. I fully understand some old school librarians would consider this definition to be totally backwards, but......

Also, the fact you are searching full-text rather than just metadata changes a lot. If an article is about TEENAGERS, there are pretty good odds you will find TEENAGER and probably YOUTH, ADOLESCENCE etc in the full text of the book or article as well, so you probably do not need to add such synonyms to pick it up in the result set anyway.

Moreover, as I mentioned before, databases under the influence of Google are increasingly starting to be more "helpful", autostemming by default and maybe even adding related synonyms, so there is no real need to add variants for color vs colour, say, or for plural forms anyway.

Even if you did a basic

A AND B AND C - you would have reasonable recall, thanks to autostemming, full text matching etc.

All this means you get a lot of results now even with a basic search.

Effect of full-text searching + relative size of index + related words

Don't believe this change in search tools makes a difference? Let's try the ebscohost discovery service for a complicated boolean search because unlike Summon it makes it easy to isolate the effect of each factor.


Let's try this search for finding studies for a systematic review

depression treatment placebo (Antidepressant OR "Monoamine Oxidase Inhibitors"  OR "Selective Serotonin Reuptake Inhibitors" OR "Tricyclic Drugs") ("general  practice" OR "primary care") (randomized OR randomised OR random OR trial)

Option 1 : Apply related words + Searched full text of articles - 51k results

Option 2 : Searched full text of articles ONLY -  50K results

Option 3 : Apply related words ONLY - 606 results

Option 4 : Both off - 594 results 

The effect of applying related words is slight in this search example, possibly because of the search terms used, but we can see full text matches make a huge difference.

Option 4 would be what you get for "old school databases". In fact, you would get fewer than 594 results in most databases, because Ebsco Discovery Service has a huge index, far larger than any such database.

To check, I did an equivalent search in one of the largest traditional abstracting and indexing databases, Scopus, and I found 163 results (better than you would expect based on the relative sizes of Scopus vs EDS).
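A quick back-of-the-envelope calculation on the four counts above makes the difference in scale explicit. This is only arithmetic on the rounded figures reported by the interface, so treat the factors as rough.

```python
# Result counts reported above (the 51k and 50k figures are rounded).
counts = {
    ("related", "fulltext"): 51_000,
    ("no_related", "fulltext"): 50_000,
    ("related", "metadata"): 606,
    ("no_related", "metadata"): 594,
}

# Effect of switching on full text matching, with related words off.
fulltext_factor = counts[("no_related", "fulltext")] / counts[("no_related", "metadata")]
# Effect of switching on related words, with full text matching off.
related_factor = counts[("related", "metadata")] / counts[("no_related", "metadata")]

print(f"full text multiplies results by about {fulltext_factor:.0f}x")
print(f"related words multiplies results by about {related_factor:.2f}x")
```

Full text matching multiplies the result set by roughly 84, while related words adds about 2%.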

But 163 is still manageable if you wanted to scan all results, so relevancy ranking can be poor and it doesn't matter as much really.

Web scale discovery services might give poor results with such searches 

I know many librarians will be saying that doing nested Boolean actually improves their search, and even if it doesn't, what's the harm?

First, I am not convinced that people who say nested boolean improves the results of their search have actually done systematic objective comparisons, or whether it is based on the impression that "I did something more complicated so the results must be better". I could be wrong.

But we do know that many librarians and experienced users are saying the more they try to carry out complicated boolean searches the worse the results seem to be in discovery services such as Summon.

Matt Borg of Sheffield Hallam University wrote of his experience implementing Summon.

He found that his colleagues reported "their searches were producing odd and unexpected results."

"My colleagues and I had been using hyper stylised searches, throwing in all the boolean that we could muster. Once I began to move away from the expert approach and treated Summon as I thought our first year undergrads might use it, and spent more time refining my results, then the experience was much more meaningful." - Shoshin

I am going to bet that those "hyper stylised searches" were the nested boolean statements.

Notice that Summon, like Google Scholar, actually fits all 3 characteristics of a modern search I mentioned above that are least suited for such searches
  • Full text search
  • High levels of aggregation (typical libraries implementing Summon at mid-size universities would have easily 300 million entries)
  • Autostemming is on by default - quotes give a boost to results with exact matches.
All this combines to make complicated nested Boolean searches worse, I believe.

Poor choices of synonyms and overliberal use of wildcards can make things worse

I will be the first to say the proper use of keywords is the key to getting good results. So a list of drug names combined by an OR function, or a listing of philosophers, concepts etc - an association of concepts - would possibly give good results.

The problem here is that most novice searchers don't have an idea what are the keywords to list in the language of the field, so often because they are told to list keywords they may overstretch and add ones that make things worse.

Say you did

(A1 OR A2 OR A3) AND (B1 OR B2 OR B3) AND (C1 OR C2 OR C3)

Perhaps you added A3, B3, C3 though they aren't exactly what you are looking for but "just in case".

Or perhaps you decided it wouldn't hurt to be more liberal in the use of wildcards which led to matches of words you didn't intend. 

Or perhaps the keywords A3, B3, C3 might be used in a context you did not expect, one that is less appropriate. Remember, unlike typical databases, Summon is not discipline specific, so a keyword like "migration" could be used in different disciplines.

Because web scale discovery searches through so much content, there is a high chance of getting A3 AND B3 AND C3 entries that are not really that relevant when used in combination.

Even if all the terms you chose were appropriate, the fact that they could be matched in full text could throw off the result.

If A2 AND B2 AND C2 all appeared in the full text in an "incidental" way, they would be a match as well. Hence creating even more noise.
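Here is a toy sketch of that "incidental match" effect. The two documents, the query and the helper names are all invented, and the matching is plain substring matching with no ranking, so this shows only the mechanism and says nothing about how Summon or any other system is implemented.

```python
# Made-up documents: "metadata" stands for title + abstract, "fulltext" for the body.
docs = [
    {"id": "article-1",
     "metadata": "antidepressant trial in primary care",
     "fulltext": "randomised trial of an antidepressant in primary care clinics"},
    {"id": "book-1",
     "metadata": "a history of european migration",
     "fulltext": "... a trial of the accused ... rural primary care was scarce ... "
                 "one character took an antidepressant ..."},
]

# Nested boolean query: every group must match, any term within a group will do.
query = [
    ["antidepressant", "ssri"],
    ["primary care", "general practice"],
    ["trial", "randomised"],
]


def matches(text, query):
    """True if each OR-group has at least one term appearing in the text (substring match)."""
    return all(any(term in text for term in group) for group in query)


def search(docs, query, fields):
    """Return ids of docs matching the query in the concatenation of the given fields."""
    return [d["id"] for d in docs
            if matches(" ".join(d[f] for f in fields), query)]


print(search(docs, query, ["metadata"]))              # only the article
print(search(docs, query, ["metadata", "fulltext"]))  # the book now matches too
```

The book matches only because each OR group happens to be satisfied somewhere in its long text, and the more terms you OR into each group, the easier that becomes.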

And when you think about it, the problems I mention will get even worse, as each of the keywords would be autostemmed (which may lead to results you don't expect, depending on how aggressive autostemming is), exploding the results.

My own personal experience with Summon 2.0 is that often the culprit is the match in full-text. Poorly chosen "synonyms" could often surface and even be pushed up.

The "explosion" issue is worsened by full text matches in books

In Is Summon alone good enough for systematic reviews? Some thoughts, I was studying whether Summon could be used for systematic reviews. A very important paper pointed out that Google Scholar was a poor tool for doing systematic reviews because of its lack of precision features, such as no wildcards, limited character length, inability to nest boolean more than 1 level etc, and I had speculated that Summon, lacking these issues, would be a better tool.

What I found when I actually tried to do so was somewhat surprising to me.

When I ran the exact same search statement in both Google Scholar and Summon, the number of Summon results often exploded, showing more results than Google Scholar!

Please note that when I say "exact same search statement" I mean that precisely.

So for example, one of the searches done in Google Scholar to look for studies was

depression treatment placebo (Antidepressant OR "Monoamine Oxidase Inhibitors" 
OR "Selective Serotonin Reuptake Inhibitors" OR "Tricyclic Drugs") ("general 
practice" OR "primary care") (randomized OR randomised OR random OR trial)

Google Scholar found 17k results, while Summon (with add results beyond library collection to get the full index) shows 35K. 

Why does Summon have more than double the number of results?  

This was extremely unexpected because we generally suspect Google Scholar has a larger index and is more liberal in interpreting search terms, as it may substitute terms with synonyms, while Summon at best includes variant forms of keywords (plurals, British/American spelling etc).

But if you look at the content types of the 35k results, you get a clue.

A full 22k of the 35k results (62%) are books! If you remove those, then the number of results makes more sense.
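The arithmetic behind that, using only the rounded figures quoted above (so the numbers are approximate):

```python
summon_total = 35_000   # Summon results, rounded figure quoted above
summon_books = 22_000   # of which books, rounded
scholar_total = 17_000  # Google Scholar results, rounded

# What is left in Summon once the books are taken out.
non_books = summon_total - summon_books
print(non_books)
print(non_books < scholar_total)
```

Without the books, Summon is left with about 13k results, which is below Google Scholar's 17k and closer to what you would expect given the relative index sizes.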

Essentially, books which can be indexed in full text have a high chance of being discovered since they contain many possible matches, and this gets worse the more ORs you pile on. Beyond a certain point they might overwhelm your results.

It is of course possible some of the 22k books matched can be very relevant, but it is likely a high percentage of them would be glancing hits and if you are unlucky, other factors might push them up high. 

In the search above I did not even use wildcards to "improve" the results, even though they work in Summon. When I did try that, the number of results exploded even more.

As an aside, the Hathitrust people have an interesting series of posts on Practical Relevance Ranking for 11 Million Books, basically showing you can't rank books the same way you rank other materials due to the much longer length of the book.

The key to note is that you are no longer getting 50, 100 or even 200 results like in old traditional databases. You are getting thousands. So you can no longer look through all the results, you are totally at the mercy of the relevancy ranking...

The relevancy ranking is supposed to solve all this... and rank appropriately, but does it? Do you expect it to?

An extremely high recall but low precision (over all results), with a poor relevancy ranking, makes a broken search. Do you expect the relevancy ranking to handle such result sets resulting from long strings of OR?

With so few users actually doing Boolean in web scale discovery (e.g. this library found 0.07% of searches use OR), should you expect discovery vendors to actually tune for such searches?

Final thoughts

I am not going to say these types of searches are always useless in all situations, just that often they don't help particularly in cases like Google, Google Scholar, web scale discovery.

Precise searching using Boolean operators has its place in the right database. Such databases would include Pubmed - which is abstract only and allows powerful field searching, including a very precise MeSH system to exploit. The fact that medical searches, particularly systematic reviews, require comprehensiveness and control is another factor to consider.

I also think if you want to do such searches, you should think really hard before adding just one more OR or making liberal use of wildcards "just in case". With web scale discovery services searching full-text, and autostemming, a very poor choice will lead to an explosion of results, with combinations of keywords found that may not be what you expect.

A strategic use of keywords is the key here, though often for the novice searcher who doesn't know the area, he is as likely to come up with a keyword that might hurt as it might help initially. As such it is extremely important to stress the iterative nature of such searches, so as you figure out more of the subject headings etc you use them in your search.

Too often I find librarians like to give the impression they found the perfect search statement by magic on their first try, which intimidates users. 

I would also highly recommend doing field searches, or metadata only search options if available, if you try such searches and get weird results.

Systems like Ebsco discovery service give you the option to restrict searches to metadata only or not search in full text.

For Summon, if you expect a certain keyword to throw off the search a lot due to full-text matches, doing title/subject term/abstract etc only matches might overcome this.

Try for example


So what do you think? Do you agree that increasingly you find doing a basic search is enough? Or am I understating the value of a nested boolean search? Are there studies showing they increase recall or precision?

Wednesday, June 11, 2014

8 surprising things I learnt about Google Scholar

Google Scholar is increasingly becoming a subject that an academic librarian cannot afford to be ignorant about.

Various surveys have shown usage of Google Scholar is rising among researchers, particularly beginning and intermediate level researchers.  Our own internal statistics such as link resolver statistics and views of Libguides on Google Scholar, tell a similar story. 

Of course, researchers including librarians have taken note of this and there is intense interest in details about Google Scholar.

I noticed for example in April....

More recently there was also the release of a Google Scholar Digest  that is well worth reading.

Sadly Google Scholar is something I've argued that libraries don't have any competitive advantage in, because we are not paying customers, so Google does not owe us any answers, so learning about it is mostly trial and error.

Recently, I've been fortunate to be able to encounter and study Google Scholar from different angles at work including

a) Work on discovery services - led me to study the differences and similarities of Google Scholar and Summon (also on systematic reviews). Also helping test and set up the link resolver for Google Scholar.

b) Work on the bibliometrics team - led me to study the strengths and weaknesses of Google Scholar and related services such as Google Citations and Google Scholar Metrics vs Web of Science/Scopus as a citation tool.

c) Most recently, I was studying a little how entries in our Institutional repositories were indexed and displayed in Google Scholar.

I would like to set out 8 points/features on Google Scholar that surprised me when I learnt about them. I hope they are things you find surprising or interesting as well.

1. Google does not include full text of articles but Google Scholar does

I always held the idea, without considering it too deeply, that Google had everything or mostly everything in Google Scholar but not vice versa.

In the broad sense this is correct, search any journal article by title and chances are you will see the same first entry going to the article on the publisher site in both Google and Google Scholar.

This also reflects the way we present Google Scholar to students. We imply that Google Scholar is a scope limited version of Google, and we say if you want to use a Google service, at least use Google Scholar, which does not include non-scholarly entries like Wikipedia, blog entries unlike in Google.

All this is correct, except the main difference between Google Scholar and Google, is while both allow you to find articles if you search by title, only Google Scholar includes full-text in the index.

Why this happens is that the bots from Google Scholar are given the rights by publishers like Elsevier, Sage to index the full-text of paywalled articles on publisher owned platform domains, while Google bots can only get at whatever is public, basically title and abstracts. (I am unsure if the bots harvesting for Google and Google Scholar are actually different, but the final result is the same).

I suppose some of you will be thinking this is obvious, but it wasn't to me, until librarians started to discuss a partnership Elsevier announced with Google in Nov 2013.

Initially I was confused, didn't Google already index scholarly articles? But reading it carefully, you see it talked about full-text.

The FAQ states this explicitly.

A sidenote, this partnership apparently works such that if you opt-in,  your users using Google within your internal ip range will be able to do full-text article matches (within your subscription). We didn't turn it on here, so this is just speculation.

2.  Google Scholar has a very comprehensive index , great recall but poor precision for systematic reviews.

I am not a medical librarian, but I have had some interest in systematic reviews because part of my portfolio includes Public Policy, which is starting to employ systematic reviews. Add my interest in discovery services, and I do have a small amount of interest in how discovery systems are similar to and different from Google.

In particular "Google Scholar as replacement for systematic literature searches: good relative recall and precision are not enough" was a very enlightening paper that drilled deep into the capabilities of Google Scholar.

Without going into great detail (you can also read Is Summon alone good enough for systematic reviews? Some thoughts), the paper points out that while the index of Google Scholar is generally good enough to include almost all the papers eventually found for systematic reviews (verified by searching for known titles) , the lack of precision searching capabilities means one could never even find the papers in the first place when actually doing a systematic review.

This is further worsened by the fact that Google Scholar, like Google, actually only shows a maximum of 1,000 results anyway, so even if you were willing to spend hours on the task it would be a futile effort if the number of results is above 1,000.

Why lack of precision searching? See next point.

3. Google Scholar has 256 character limit, lacking truncation and nesting of search subexpressions for more than 1 level.

Again Google Scholar as replacement for systematic literature searches: good relative recall and precision are not enough gets credit from me for the most detailed listing of strengths and weaknesses of Google Scholar.

Some librarians seem to sell Google and Google Scholar short.  Years ago, I heard of librarians who, in an effort to discourage Google use, told students Google doesn't do Boolean OR for example, which of course isn't the case.

Google and Google Scholar do "implied AND" and of course you could always add "OR". As far as I can tell, the undocumented Around function doesn't work for Google Scholar though.

The main issue with Google Scholar that makes precision searching hard is

a) Lack of truncation
b) Unable to turn off autostemming - (Verbatim mode available only in Google, not sure if + operator works for Google Scholar, but it is deprecated for Google)

These are well known.

But I think lesser known is that there is a character limit for search queries in Google Scholar of 256. Apparently if you go beyond, it will silently drop the extra terms without warning you. Typical searches of course won't go beyond 256 characters, but ultra precise systematic review queries might of course.
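If you build long queries, it is easy to check them against the limit before pasting them in. A minimal sketch, assuming the 256 character figure reported above and assuming (my guess, not documented behaviour) that anything past the limit is simply cut off at that character:

```python
SCHOLAR_QUERY_LIMIT = 256  # character limit as reported in the text above


def check_query(query, limit=SCHOLAR_QUERY_LIMIT):
    """Return (kept, overflow): the part within the limit and the part beyond it."""
    # Assumption: the cut happens at the character, not at a term boundary.
    return query[:limit], query[limit:]


query = ('depression treatment placebo (Antidepressant OR "Monoamine Oxidase Inhibitors" '
         'OR "Selective Serotonin Reuptake Inhibitors" OR "Tricyclic Drugs") '
         '("general practice" OR "primary care") (randomized OR randomised OR random OR trial)')

kept, overflow = check_query(query)
print(len(query), "characters")
if overflow:
    print("Beyond the limit, likely to be dropped:", overflow)
else:
    print("Fits within the limit")
```

The example query is the systematic review search quoted earlier in this blog, and it fits within the limit, but not by much; a few more ORed synonyms would push it over.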

Another thing that is interesting to me is that the paper, I believe, states that nested boolean operators beyond one level will fail.

4. Google Scholar harvests at the article level, hence it is difficult for them to give coverage lists.

I think many people know that Google Scholar's index is constructed very differently from databases in that it crawls page by page, pulling in content it considers Scholarly at the article level.

This meant that multiple versions of the same article could be pulled into Google Scholar and combined, so for example it could grab copies from

  • the main publisher site (eg Sage)
  • an aggregator site or citation only site
  • an institutional repository
  • even semi-legal copies on author homepage, Researchgate, etc
All these versions are auto-combined.

I knew this, but only fairly recently did it dawn on me that this is the reason why Google Scholar does not have a coverage list of publications with coverage dates.

Essentially they pull items at the article level, so there is no easy way to succinctly summarise their coverage at the journal title level.

Eg. Say there is a journal publisher that for whatever reason bars them from indexing, they could still have some articles with varying coverage and gaps by harvesting individual articles from institutional repositories that may have some of the articles from the journal.

Even if they had the rights to harvest all the content from say Wiley, the harvester might still miss out a few articles because of poor link structure etc.

So they would in theory have coverage that could be extremely spotty, with say an article or 2 in a certain issue, a full run for some years etc.

As a sidenote, I can't help but compare this to Summon's stance that they index at the journal title level rather than database level, except Google Scholar indexes at an even lower level, the article level.

Of course in theory Google Scholar could list the publisher domains that were indexed?

That said, I suspect, based on some interviews in which Anurag Acharya was asked this question, that fundamentally Google doesn't even think the coverage data is useful to most searchers. I believe he notes that even though databases with large indexes have sources listed, this still provides little guidance on what to use, and most guides recommend just searching all of them anyway.

Other semi-interesting things include
  • Google Scholar tries to group different "manifestations" and all cites are to this group
  • Google Scholar uses automated parsers to try to figure out author and title, which may lead to issues of ghost authors, though this problem seems to be mostly resolved [paywall source]

5. You can't use Site:institutionalrepositoryurl in Google Scholar to estimate number of entries in your Institutional repository indexed in Google Scholar

Because of #4, we knew it was unlikely everything in our institutional repository would be in Google Scholar.

I was naive and ignorant, though, to think one could estimate how much of our institutional repository is indexed in Google Scholar by using the site operator.

I planned to do, say, Site: in Google Scholar and look at the number of results. That should work by returning all results from the site, right?

Does Harvard's institutional repository only have 4,000+ results in Google Scholar?

Even leaving aside the weasel word "about", sadly it does not work as you might expect. The official help files state that this won't work, and that the best way to see if there is an issue is to randomly sample entries from your institutional repository.

Why? Invisible institutional repositories: Addressing the low indexing ratios of IRs in Google Scholar has the answer.

First, we already know that when there are multiple versions, Google Scholar will select a primary document to link to. That is usually the one at the publisher site. The remaining versions that are not primary will be under "All X versions".

According to Invisible institutional repositories: Addressing the low indexing ratios of IRs in Google Scholar, the site operator will only show articles where the copy in your institutional repository is the primary document (the one that the title links to in Google Scholar, rather than those under "All X versions").

Perhaps one of the reasons I was misled was that I was reading studies like this calculating and comparing Google Scholar indexing ratios.

These studies calculate the ratio as the number of results found using the site: operator, expressed as a percentage of total entries in the institutional repository.

These studies are useful as a benchmark when studied across institutional repositories of course.

But I think, assuming site:institutionalrepositoryurl shows only primary documents, this also means that the more unique content your institutional repository has (or the more often the main publisher copy for some reason isn't indexed or recognised as the main entry), the higher your Google Scholar indexing ratio will be.

Some institutional repositories contain tons of metadata-only records without full text (obtained from Scopus, etc.), and these will lower the Google Scholar indexing ratio, because they will typically sit under "All X versions" and be invisible to Site:institutionalrepositoryurl.
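Here is a rough Python sketch of the difference between the site: based indexing ratio and what is actually in Google Scholar, plus the random sampling approach the help files suggest. Everything in it is an assumption for illustration: the record fields ("indexed", "primary"), the toy repository of ten records, and the idea that you would check the sampled records by hand with title searches.

```python
import random


def site_operator_ratio(records):
    """Share of records a site: query would surface, assuming it only
    shows records whose repository copy is the primary version."""
    visible = sum(1 for r in records if r["indexed"] and r["primary"])
    return visible / len(records)


def indexed_ratio(records):
    """Share of records present in Google Scholar in any form,
    including copies tucked under "All X versions"."""
    return sum(1 for r in records if r["indexed"]) / len(records)


def sample_for_manual_check(record_ids, k, seed=None):
    """Pick up to k record ids at random to look up by hand."""
    rng = random.Random(seed)
    ids = list(record_ids)
    return rng.sample(ids, min(k, len(ids)))


# Hypothetical repository: 2 unique items, 6 copies of publisher
# articles, 2 records that are not indexed at all.
records = (
    [{"id": i, "indexed": True, "primary": True} for i in range(2)]
    + [{"id": i, "indexed": True, "primary": False} for i in range(2, 8)]
    + [{"id": i, "indexed": False, "primary": False} for i in range(8, 10)]
)

print("site: ratio   :", site_operator_ratio(records))
print("indexed ratio :", indexed_ratio(records))
print("check by hand :", sample_for_manual_check([r["id"] for r in records], 3, seed=1))
```

In this toy case the site: ratio is much lower than the share of records that are really indexed, purely because most records are non-primary copies.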

Other interesting points/implications

  • Visibility of your institutional repository will be low if all you have is post/preprints of "normal" articles where publisher sites are indexed.
  • If the item is not free on the main publisher site and you have the full text uploaded to your institutional repository, Google Scholar will show a [PDF] link from yourdomain on the right.

Seems to me this also implies most users coming from Google Scholar won't see your fancy institutional repository features but will at best be sent to the full-text PDF directly, unless they bother to look under "All X versions".

  • If the item lacks an abstract Google Scholar can identify, it will tend to have a [Citation] tag. 

6. Google Scholar does not support OAI-PMH, and uses Dublin Core tags (e.g., DC.title) as a last resort.

"Google Scholar supports Highwire Press tags (e.g., citation_title), Eprints tags (e.g., eprints.title), BE Press tags (e.g., bepress_citation_title), and PRISM tags (e.g., prism.title). Use Dublin Core tags (e.g., DC.title) as a last resort - they work poorly for journal papers because Dublin Core doesn't have unambiguous fields for journal title, volume, issue, and page numbers." - right from horse's mouth


Other interesting points

  •  "New papers are normally added several times a week; however, updates of papers that are already included usually take 6-9 months. Updates of papers on very large websites may take several years, because to update a site, we need to recrawl it"
  • To be indexed you need to have the full text OR (bibliographic data AND abstract)
  • Files cannot be more than 5 MB, so books and dissertations should be uploaded to Google Books.
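As a rough illustration of the Highwire Press style tags quoted above and the indexing conditions in this list, here is a small Python sketch. The tag names (citation_title, citation_author, citation_publication_date, citation_journal_title and so on) are the Highwire Press style ones; the function names, the record fields and the sample paper are hypothetical, and the exact byte count used for "5 MB" is my assumption.

```python
from html import escape

# "5 MB" from the list above; treating it as 5 * 1024 * 1024 bytes is an assumption.
MAX_FILE_BYTES = 5 * 1024 * 1024


def highwire_meta_tags(paper):
    """Render Highwire Press style <meta> tags for one paper's landing page."""
    tags = [("citation_title", paper["title"])]
    tags += [("citation_author", author) for author in paper["authors"]]
    tags.append(("citation_publication_date", paper["date"]))
    # Optional fields are only emitted when the paper has them.
    for field in ("journal_title", "volume", "issue", "firstpage", "lastpage", "pdf_url"):
        if paper.get(field):
            tags.append(("citation_" + field, paper[field]))
    return "\n".join(
        '<meta name="{}" content="{}">'.format(name, escape(str(value), quote=True))
        for name, value in tags
    )


def indexing_problems(record):
    """List reasons a record might not get indexed, per the points above."""
    problems = []
    has_full_text = record.get("pdf_bytes") is not None
    has_metadata_and_abstract = bool(record.get("bibliographic_data")) and bool(record.get("abstract"))
    if not (has_full_text or has_metadata_and_abstract):
        problems.append("needs full text, or bibliographic data plus abstract")
    if has_full_text and record["pdf_bytes"] > MAX_FILE_BYTES:
        problems.append("file is larger than 5 MB")
    return problems


# Hypothetical paper record.
paper = {
    "title": "Cats & libraries",
    "authors": ["Tan, A.", "Lee, B."],
    "date": "2014/10/02",
    "journal_title": "Journal of Examples",
    "volume": "3",
    "firstpage": "10",
    "lastpage": "20",
}
print(highwire_meta_tags(paper))
print(indexing_problems({"bibliographic_data": True, "abstract": ""}))
```

Most repository software generates these tags for you, so a sketch like this is mainly useful for understanding what to look for when you view the source of a record page.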

7. Despite the meme going around that Google, and especially Google Scholar (an even smaller, almost microscopic team within Google), does not respond to queries, they actually do respond, at least to certain types of queries.

We know the saying: if you are not paying, you are the product. Google and Google Scholar have a reputation for having poor customer service.

But here's the point I missed: when librarians put on their institutional repository manager hats, their position with respect to Google is different, and they can get responses.

In particular, there is a person at Google, Darcy Dapra (Partner Manager, Google Scholar), who is tasked with outreach to library institutional repositories and publishers.

She has given talks to librarians managing institutional repositories as well as publishers in relation to indexing issues in Google Scholar.

In my admittedly limited experience, her response to questions about the presence of institutional repository items in Google Scholar is amazingly fast.

8. Google Scholar Metrics - you can find H-index scores for items not ranked in the top 100 or 20.

Google Scholar Metrics, which ranks publications, is kinda comparable to the Journal Impact Factor or other journal-level metrics like SNIP, SJR, Eigenfactor, etc.

The first time I looked at it, I saw you could only pull out the top 100 ranked publications by language (excluding English).

For English, each main category and subcategory shows the top 20 entries.

Top 20 ranked publications for Development Economics

I used to think that was all that was possible: if the publication was not in the top 20 of an English category or subcategory, you couldn't find a metric for it.

Apparently not, as you can search by journal title. Maybe obvious, but I totally missed this.

As far as I can tell, entries 7 & 8 above are not in the top 20 of any category or subcategory, yet each has an h5-index.

How is this useful? Though we would frown on improper use of journal-level metrics, I have often encountered users who want some metric, any metric, for a certain journal (presumably they published in it), and they are not going to take "this isn't the way you use it for research assessment" for an answer anyway.

When you exhaust all the usual suspects (JCR, SNIP, Scimago rank etc), you might want to give this a try.
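If a user asks what the number actually means: as I understand it, the h5-index is the h-index computed over articles a publication put out in the last five complete years, i.e. the largest h such that h of those articles have at least h citations each. A minimal Python sketch, with made-up article data and hypothetical function names (this is not how Google computes it internally, just the definition written out):

```python
def h_index(citation_counts):
    """Largest h such that h items have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def h5_index(articles, current_year):
    """h-index restricted to articles from the last five complete years."""
    window = range(current_year - 5, current_year)
    return h_index(a["citations"] for a in articles if a["year"] in window)


# Hypothetical journal: the 2008 article falls outside the 2009-2013 window.
articles = [
    {"year": 2013, "citations": 12},
    {"year": 2011, "citations": 7},
    {"year": 2010, "citations": 3},
    {"year": 2009, "citations": 3},
    {"year": 2008, "citations": 50},
]
print(h5_index(articles, current_year=2014))
```

Note how the highly cited older article does not help at all, which is one reason the h5-index can look quite different from other journal-level metrics.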

 Other interesting points

  • There doesn't seem to be a fixed update schedule (e.g. yearly)
  • Some subject repositories like arXiv are ranked, though at more granular levels, e.g. arXiv Materials Science (cond-mat.mtrl-sci)
  • Suggested reading


Given this is Google Scholar, where people figure things out by trial and error and/or things are always in flux, it's almost certain I got some things wrong.

Do let me know of any errors you spot or if you have additional points that might be enlightening.


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.