Sunday, November 15, 2015

Libraries and Trello - How are librarians using it?

I am always trying out new tools to organize and improve my work, and my current workflow involves Google Keep, Dropbox and Google Now reminders (having moved away from Google Tasks).

I haven't had much experience with project management tools, but recently I began playing again with the lightweight tool Trello.

Formally, it is an electronic kanban tool. But if you are not familiar with the concept, you can think of it as a digital version of a post-it notice board, and it is very helpful for collaboration purposes.

The idea is pretty simple. You set up boards (corresponding to projects), which contain lists. Lists (the vertical columns above) are broken up into individual cards. Each card typically corresponds to a task, and you can assign people to each card so that their photos appear on it.

You can customize further by adding colored labels to each card. Cards can have checklists and comments, and can be populated via email. Lastly, you can attach files via Dropbox, Google Drive, OneDrive, etc.

If you are into the GTD (Getting Things Done) methodology, a common idea is to have lists for "Doing", "Done", "To do" etc. and to drag each card/task to the appropriate list as needed.

Like most productivity tools, Trello is covered by a great Lifehack guide on how to use it on a personal basis, from planning a vacation or a wedding to setting up a family status board, or pretty much anything you can think of.

In the past two years, Trello has become very popular in libraries. There are many reasons, but the main one is that it's free with almost no limitations. There are no ads and no restrictions on the number of boards you can create or members you can add.

Here are some of the ways libraries have been using it for tracking workflows and/or project management:

  • Package management & resource vendor negotiation
  • Electronic resource troubleshooting
  • Website redesign projects
  • Strategic planning, department planning
  • Marketing campaigns
  • Information literacy classes + faculty liaison work

For package management & resource negotiation 

I think, given the nature of the tool, it's no surprise that technical services departments in libraries use it quite a bit.

Both NCSU and Duke University are examples of this and they recently held a webinar to talk about how they use it in technical services work.

I particularly like their package management board (see below). They color-code cards by publisher (e.g. Sage, Elsevier), and you can then filter to see, for example, only the cards to do with Wiley.

Another nice board is the one they set up for the license team for negotiating resources. There are as many as seven members on the team, and the negotiation process can get confusing.

There is clever use of checklists for negotiation, all copied from a master template, to help track the process.

For more see the article - Who's on First?: License Team Workflow Tracking With Trello

For Electronic Resources Troubleshooting

At Oakland University, Meghan Finch combines Trello with Zapier to organize tracking of requests involving electronic resources troubleshooting.

In the paper entitled "Using Zapier with Trello for Electronic Resources Troubleshooting Workflow", she explains that her board consists of the following lists:

  • To Do
  • Tier II
  • Waiting
  • Completed
  • Get Done
  • How To
  • Honey Badger Tips

She has her own workflow for how cards are dragged from list to list.

The issue is how to handle troubleshooting reports submitted by users; she certainly doesn't want to create cards for them manually in Trello.

She solved it by using a combination of Zapier and Trello's built-in feature to create cards from emails.

In her library, the link resolver, 360 Link, links to a simple online form for users to submit reports of problems with e-resources.

The form once submitted sends an email to their e-resources mailing list.

She uses Zapier to pull the relevant data out of the email sent to the e-resources mailing list and then send another email to Trello to populate the board, all automatically.

If you are not familiar with Zapier, it's similar to IFTTT: it allows you to automate workflows between a large number of apps by creating triggers and the actions that happen when a trigger occurs. (See my past posts on Zapier and IFTTT.)
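As an aside, if the email-to-board route ever feels too fiddly, Trello also exposes a REST API, so a small script could create the cards directly. Below is a minimal Python sketch of that idea; it is not Finch's actual setup, and the API key, token and list ID are placeholders you would supply yourself.

    import requests

    # Placeholders - substitute your own Trello API key, token and the ID of your "To Do" list.
    TRELLO_KEY = "your-api-key"
    TRELLO_TOKEN = "your-api-token"
    TODO_LIST_ID = "your-to-do-list-id"

    def create_troubleshooting_card(subject, description):
        """Create a card on the To Do list from a user-submitted problem report."""
        response = requests.post(
            "https://api.trello.com/1/cards",
            params={
                "key": TRELLO_KEY,
                "token": TRELLO_TOKEN,
                "idList": TODO_LIST_ID,
                "name": subject,         # becomes the card title
                "desc": description,     # becomes the card description
            },
        )
        response.raise_for_status()
        return response.json()["id"]

    # e.g. create_troubleshooting_card("Broken link: JSTOR article", "User reports a 404 from the link resolver ...")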

For website redesign projects

Amanda L. Goodman, User Experience Librarian at Darien Library, uses it for keeping track of tasks for website redesign (among other things).

For Strategic planning or department planning

Megan Hartline writes "Transparency is one of the more challenging aspects of leadership. Letting people in your group and across your organization know what you’re doing, what your priorities are, and what projects are up next takes a huge amount of conscious communication." 

She then suggests Trello as a way to visualize and draw attention to the major projects each library department/unit or even the whole library is focusing on. 

The idea of an "at a glance" view of the major projects going on in a department, or even across the whole library, is in fact pretty common.

Below is NCSU's Serials Unit Projects board

Here's the University of Minnesota's Trello board

Librarians doing IL and faculty engagement

This book briefly suggests that teaching librarians "create a course board for all course projects and assign students to different groups". It goes on to say that because all actions on each card are recorded, students can see the contributions of their fellow classmates.

More concretely, Robert Heaton of Utah State University suggests that Trello can be used to keep track of work done by prior subject liaisons, so that a new librarian can benefit from a Trello board filled with information such as faculty CVs, prior relationships and more.

For tracking of marketing campaigns

Evelyn C. Shapiro of Champaign Public Library writes

"I ran some experiments through the fall and winter and went full-on with this strategy for the spring and summer seasons. Now I have a visual "board" with every event and promotion, including a place to store—and serve up—all my content organized by event or promotion, with a separate "card" for each event. It's set up by season, with columns organized by month. Each card includes:

  • the marketing copy we're using in getting the word out
  • approved images (separate ones for website, lobby slide, e-news, Facebook, plus extras provided by presenters)
  • associated URLs (event bit.lys, related videos, subject-specific or book-specific "deep" links into our Polaris catalog, presenter websites)
  • collected notes from the presenter or the in-house staff sponsor of the event
  • any special acknowledgments that need to be included in promotions"


Regardless of the type of librarian job we do, we are constantly doing projects that potentially involve large numbers of collaborators. As such, Trello seems to be a useful tool in many situations.

If you have been using Trello, how have you been using it? Is it easy to get buy-in to use the tool?

Friday, October 23, 2015

6 common misconceptions when doing advanced Google Searching

As librarians we are often called upon to teach not just library databases but also Google and Google Scholar.

Teaching Google is often tricky: unlike library databases, where we can get insider access through a friendly product support representative, as librarians we have no more and no less insight into Google than anyone else, and Google is legendary for being secretive.

Still, given that Google has become synonymous with search we should be decently good at teaching it.

I've noticed, though, that when people teach Google, particularly advanced searching of Google, they often fall prey to two main types of errors.

The first type of error involves not keeping up to date: given the rapid pace at which Google changes, we often end up teaching things that no longer work.

The second type of error is perhaps more common among us librarians. We often carry over the usual methods and assumptions from library databases, expecting them to work in Google, when sadly they don't.

Both types of errors are very difficult to detect because Google seems designed to fail gracefully; for example, it may simply and silently ignore symbols that don't work.

Also, a typical Google search brings back an estimated count of results (e.g. "about X million"), so it's hard to tell whether your search worked as expected.

As I write this blog post in October 2015, what follows are some of the common errors and misconceptions about searching in Google that I've seen while doing research on the topic. Some of the misconceptions I knew about; a few surprised me. Of course, by the time you read this post, a lot of it is likely to be obsolete!

The 6 are

  • Using deprecated operators like tilde (~) and plus (+) in search strings
  • Believing that all terms in the search string will definitely be included (in some form)
  • Using AND in search strings
  • Using NOT in search strings
  • Using asterisk (*) as a character wildcard or truncation in search strings
  • Using parentheses in search strings to control the order of operations

1. Using deprecated operators like tilde (~) and plus (+) in search strings

As of writing, this is the list of operators supported by Google; anything else is probably not supported. So if you are teaching people to use the tilde (~) or plus (+) operators, please stop.

About tilde (~)

Karen Blakeman explains here what it used to do.

"Although Google automatically looks for variations on your terms, placing a tilde before a word seemed to look for more variations and related terms. It meant that you didn’t have to think of all the possible permutations of a word. It was also very useful if you wanted Google to run your search exactly as you had typed it in except for one or two words.

The Verbatim option tells Google to run your search without dropping terms or looking for synonyms, but sometimes you might want variations on just one of the words. That was easily fixed by placing a tilde before the word"

However, as of June 2013, the tilde (~) no longer works. (See the official explanation.)

About plus operator (+)

Another discontinued operator that is often still taught is the plus (+) operator.

The plus operator used to force Google to match the exact search term as you typed it. In other words, "it turned off synonymization and spell-correction". So if you searched +library, it would match library exactly and wouldn't substitute libraries or librarians.

However as of Oct 2011, it no longer works. (See official explanation)

According to the Google help page, the plus operator is now used for Google+ pages or blood types! (It can generally also handle a plus at the end of a term, e.g. C++.)

If you want to force exact keywords, you should add quotes around even single words, e.g. "library"

Of course, we librarians know double quotes also have another purpose: they force words to be matched as an exact phrase, say "library systems". This works in Google as per normal.

Interestingly enough, in the latest Google Power Searching course (September 2015), Daniel Russell mentions that you can nest quotes within quotes to combine phrase searching with an exact search on a single word.

For example he recommends searching "daniel "russell" " (note the nested quotes) because "daniel russell" alone gets him results with Daniel Russel (note only one 'L')

Another option, if you want results as near as possible to what you typed in, is to use verbatim mode (which is kind of like the + operator, but applied to everything you typed).


As noted in the video above, even in that mode, the order of operations is not enforced, so you should use double quotes on top of verbatim mode for further control.
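If you want to script this, here is a rough Python sketch that quotes every term and requests verbatim mode via the tbs=li:1 URL parameter. That parameter is widely reported as the verbatim-mode switch but isn't officially documented by Google, so treat it as an assumption.

    from urllib.parse import urlencode

    def verbatim_search_url(*terms):
        """Build a Google search URL with every term quoted, requesting verbatim mode."""
        query = " ".join(f'"{t}"' for t in terms)    # quote each single word
        params = {"q": query, "tbs": "li:1"}         # tbs=li:1 - assumed verbatim-mode parameter
        return "https://www.google.com/search?" + urlencode(params)

    print(verbatim_search_url("library", "systems"))
    # https://www.google.com/search?q=%22library%22+%22systems%22&tbs=li%3A1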

I believe even verbatim mode or quotes around single words doesn't absolutely stop Google from occasionally "helping" by dropping search terms when including them would cause too many results to disappear - sometimes called a "soft AND". More about that next.

2. Believing that all terms in the search string will definitely be included (in some form)

I've mentioned this in the past: Google practices what some call a "soft AND". It will usually include all the terms you searched, but occasionally one of them will be dropped.

In the Power Searching video above, Daniel explains that when you search for term1 term2 term3 you might find some pages with only term1 and term2 but not term3. He states that some pages rank so highly on just term1 and term2 that Google will drop term3.

What's the solution? He recommends using the intext operator, for example term1 term2 intext:term3, where the intext operator forces term3 to appear on the page.

Note you can do phrase search together with intext as well, eg. intext:"library technology"

3.  Using AND in search strings

Believe it or not, Google does not explicitly support AND as an operator in search strings.

For example, neither the official Google help pages nor the official Google Power Searching course mentions the AND operator!

Let me be clear: of course if you do something like library systems, Google will do an implicit AND and combine the terms (subject to the issue stated above).

But what I am saying is that you shouldn't type something like library AND systems (whether AND, and, AnD, aNd, etc.), because at best it is ignored as a stop word (too common), and occasionally it may actually be searched and matched like a normal term!

To avoid such issues just drop the AND and do library systems

As an aside, OR works as per normal, and the Power Searching course states it's the only case-sensitive operator.

4. Using NOT in search strings

Many of us librarians are too used to literally typing NOT to exclude results. So, for example, we will automatically do libraries NOT systems, not knowing this fails.

What you should do instead, of course, is use the minus (-) operator to exclude terms. For example, try libraries -systems

5. Using asterisk (*) as a character wildcard or truncation in search strings

Another thing that doesn't work: using * after a string of letters to find variant forms of a search term.

For example, the following doesn't work: organ*

I believe Google automatically decides on stemming already so you don't need to do this to find words with the root of organ.

What the asterisk actually does is something entirely different, like this

a * saved is a * earned

The official guide says * is used as "a placeholder for any unknown or wildcard terms", so you can match things like a penny saved is a penny earned, where * can stand for one or more words.

But see tip 7 for interaction with site operator. 

6. Using parentheses in search strings to control the order of operations

This one is perhaps most shocking if you are unaware. When we combine AND with OR operators, a common question to ponder is, which operator has precedence?

My testing with various library databases shows that there is no single standard; some databases favour OR first, others favour AND.

So it is a favourite trick of librarians to cut through the complication and use parentheses, to avoid having to memorise how it works in different databases.

So we love to do things like

(library AND technology) OR systems

First off, we already said in #3 that you shouldn't use AND in the search, so let's try

(library technology) OR systems

But I am sorry to inform you that this doesn't work either. In fact, the parentheses are ignored; what Google actually sees is

library technology OR systems

Don't believe me? See here, here and here.

On Quora, a Google software engineer (search quality) says this

So what happens when you do something like library technology OR systems?
In fact it's the equivalent of a library database search for library AND (technology OR systems)

It looks to me like OR has precedence, which makes more sense to me than the other way around.

So what happens if you want (a b) OR (x y)? Typing that out won't work in Google, since it actually gives you a AND (b OR x) AND y, but here's a complicated untested idea.

7. Bonus tips

AROUND operator

There is a semi-official operator known as AROUND. It allows you to match words that are within X words of each other. This seems to be the same as a proximity operator without order.

So for example you can do

"library technology" AROUND(9) "social"

As noted by Dan Russell, AROUND needs to be in caps. For more details, see here.

Combining asterisks with site operator

I guess everyone knows about the useful site: operator. But did you know it works with wildcards, as spotted here?

There's a lot more detail here, which I recommend you read, on the interaction between wildcards and the site operator. Combine it with the minus (-) operator for more fun!


As you can see, while Google does loosely support Boolean searching (though it often does unexpected things, like dropping terms or silently including or ignoring common words), the exact details are very different from what we are used to!

If you want to dig more into the nuts and bolts of Boolean operators in Google, I highly recommend

Thursday, October 15, 2015

Of full text, thick metadata, and discovery search

As my institution recently switched to Primo, nowadays I lurk in the Primo mailing list. I am amused to note that in many ways the conversation on it is very similar to what I experienced when lurking in the Summon mailing list. (One wonders if in time to come this difference might become moot but I digress).

Why don't the number of results make sense?

A common thread that occurs on such mailing lists from time to time, and that often draws tons of responses, is a game I call "Do the numbers of results make sense?".

Typically this would begin with some librarian (or technical person tasked with supporting librarians) bemoaning the fact that they (or their librarians) find that the numbers of results shown are not "logical".

For example, someone would post an email with a subject like "Results don't make sense". The email would look like this (the examples are made up).

a) Happy birthday    4,894
b) Happy birth*    3,623                                      
c) Happy holidays  20,591
d) Happy holid*    8,455
e) Happy OR birthday 4,323                                    

The email would then point out that it made no sense that the numbers of results in b) and d) were lower than in a) and c) respectively, or that e) should have more results than a).

Other variants include using quotes, or finding that after login (which usually produces more results, due to results appearing from mutually licensed content providers) the number of results actually fell, etc.

The "reason" often emerges that the web scale discovery service whether Summon Or Primo is doing something "clever" that isn't transparent to the user that results in a search that isn't strictly boolean logic.

In the past, I've seen cases such as

* Summon doing stemming by default but dropping it when Boolean operators were used (this might have changed by now)
* Primo doing a metadata-only search by default but expanding to match full text if the number of results drops below a certain threshold

I've discussed in the past How is Google different from traditional Library OPACs & databases?, and in this way web scale discovery services are somewhat similar to Google: they don't do strict Boolean, and they make various adjustments to try to "help the user" at the cost of predictability, and often of transparency if the user wasn't given warning.

Matching full text or not?

In the most recent case I encountered on the Primo mailing list, it was announced that there would be an enhancement to display a message indicating that the search had been expanded to match full text.

This led to a discussion on why Primo couldn't simply match on full text all the time, or at least provide an option to do either, like EBSCO Discovery Service does.

MIT Libraries' EBSCO Discovery Service searches full text by default, but you can turn it off.

An argument often made is that metadata-only matching improves relevancy, in particular for known-item searching, which generally makes up about 40-60% of searches.

For sure this makes relevancy ranking much easier, since not considering matches in full text means the balancing act between ranking matches in full text versus metadata can be avoided.

In addition, unlike Google or Google Scholar, the discovery service index is extremely diverse, including some content that is available as metadata only, while other content includes full text or is non-text (e.g. DVDs, videos).

Even where items contain full text, they range in length from a single page or paragraph to thousands of pages (for a book).

Not needing to consider this difference makes relevancy ranking much easier.

Metadata thick vs thin

Still, a metadata-only approach ignores potentially useful information from full text, and it's still not entirely "fair", because content with "thick metadata" still has an advantage over content with "thin metadata".

I was not familiar with either term until EBSCO began to talk about them. See the abstract below.

Of course "other discovery services" here refer mainly to Proquest's Summon (and Exlibris's Primo), which has roughly the same articles in the index but because they obtain the metadata directly from the publisher have limited metadata basically , article title, author, author supplied keywords etc.

Thick metadata, in contrast, would generally include controlled vocabulary, tables of contents, etc.

The 4 types of content in a discovery index

So when we think about it, we can classify content in a discovery service index along two dimensions:

a) Full text vs Non-full text
b) Thick metadata vs Thin metadata

Some examples of the types of content in the four quadrants:

A) Thick Metadata, No Full text - eg. Abstracting & Indexing (A&I) databases like Scopus, Web of Science, APA Psycinfo etc, MARC records

B) Thick Metadata, Full text - eg. Ebsco databases in Ebsco Discovery Service, combined super-records in Summon that include metadata from A&I databases like Scopus and full text from publishers

C) Thin metadata, No Full text - eg Publisher provided metadata with no full text, Online video collections, Institutional repository records?

D) Thin metadata, Full text - eg Many publisher provided content to Summon/Primo etc.
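Just to make the classification concrete, here's a toy way to express it in code; the field names are made up and real records are obviously messier.

    def quadrant(record):
        """Classify a toy discovery record into one of the four quadrants above."""
        thick = bool(record.get("subject_headings") or record.get("toc"))   # stand-ins for value-added metadata
        full = bool(record.get("full_text"))
        labels = {
            (True, False):  "A: thick metadata, no full text",
            (True, True):   "B: thick metadata, full text",
            (False, False): "C: thin metadata, no full text",
            (False, True):  "D: thin metadata, full text",
        }
        return labels[(thick, full)]

    print(quadrant({"title": "Some article", "subject_headings": "Economics"}))   # quadrant A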

What are the different ways the discovery service could do ranking?

Type I - Use metadata only - Primo approach (does expand to full text match if number of results falls below a threshold)

Type II - Use metadata and full text - Summon approach

Type III - Use full text mostly plus limited metadata - Google Scholar approach?

Type IV - User selects either Type I or II as an option - Ebsco Discovery Service approach

The Primo approach of mainly using metadata (and only occasionally matching full text, if the number of results falls below a certain threshold), as I said, privileges content that has thick metadata (Classes A and B) over thin metadata (Classes C and D), but is neutral with regard to whether full text is provided.

Now compare this with an approach like Summon's that uses both metadata and full text. Here full text becomes important: regardless of whether you have thin or thick metadata, it helps to have full text as well.

All things being equal, would a record that has thick metadata but no full text (Class A) rank higher than one that has thin metadata but full text (Class D)?

It's hard to say. Depending on the algorithm used to weight full text versus metadata fields, I could see it going either way.
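To see why it could go either way, here is a deliberately crude scoring sketch - not any vendor's actual algorithm - where a single weight decides whether a thick-metadata-only record (Class A) or a full-text-only record (Class D) wins.

    # Purely illustrative weights - not any vendor's actual algorithm.
    FIELD_WEIGHTS = {
        "title": 5.0,
        "subject_headings": 3.0,   # the kind of field only "thick" metadata has
        "abstract": 2.0,
        "full_text": 0.5,          # lower this enough and Class A beats Class D; raise it and D wins
    }

    def score(record, query_terms):
        """Sum weighted term counts across whatever fields the record actually has."""
        total = 0.0
        for field, weight in FIELD_WEIGHTS.items():
            text = (record.get(field) or "").lower()
            total += weight * sum(text.count(term.lower()) for term in query_terms)
        return total

    class_a = {"title": "Library technology", "subject_headings": "Library science; Technology"}
    class_d = {"title": "Annual report", "full_text": "library technology " * 200}
    print(score(class_a, ["library", "technology"]), score(class_d, ["library", "technology"]))
    # raw term counts with no length normalization let the long full text swamp the metadata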

My own past experience with Summon seems to show that there are times when full text matches dominate metadata. For example, searching for Singapore AND a topic can sometimes yield plenty of generic books on Singapore that barely mention the topic, ranked above more specific items. I always attributed this to the overwhelming number of matches on the word "Singapore" in such items.

The fear that the mass of full text overrides metadata is the reason some A&I providers are generally reluctant to have their content included in discovery services. This is worsened by the fact that there is currently no way to measure the additional benefit A&Is bring to the discovery experience, as their metadata, once contributed, appears alongside other, lower quality metadata in the discovery service results.

If by chance the library has access to the full text via OpenURL resolution, users will simply be sent to the full text provider, while the metadata contributed by the A&I database, which led to the discovery of the item in the first place, is not recognised and the A&I is bypassed. This is one of the points acknowledged in the Open Discovery Initiative reports and may be addressed in the future.

In fact, implementation of discovery services can indeed lead to a fall in usage of A&I databases in their native interfaces, as most users no longer need to go directly to the native UI. Add the threat from Google Scholar and you can understand why A&I providers are so wary.

I would add that this fear that discovery services (except for EBSCO, which already hosts content from A&Is like APA's PsycINFO) will not properly rank metadata from A&Is is not a theoretical one.

EBSCO, in the famous exchange between the Orbis Cascade Alliance and Ex Libris, claims that

As you are likely aware, leading subject indexes such as PsycINFO, CAB Abstracts, Inspec, Proquest indexes, RILM Abstracts of Music Literature, and the overwhelming majority of others, do not provide their metadata for inclusion in Primo Central. Similarly, though we offer most of these databases via EBSCOhost, we do not have the rights to provide their metadata to Ex Libris. Our understanding is that these providers are concerned that the relevancy ranking algorithm in Primo Central does not take advantage of the value added elements of their products and thus would result in lower usage of their databases and a diminished user experience for researchers. They are also concerned that, if end users are led to believe that their database is available via Primo Central, they won't search the database directly and thus the database use will diminish even further.

Interestingly, EBSCO Discovery Service itself splits the difference between Primo and Summon: it lets librarians set the default of whether to match on full text or metadata only, while allowing users to override that default.

From my understanding, defaulting to metadata-only search is pretty popular among EDS libraries, because many librarians feel metadata-only searching provides more relevant results.

I find this curious because EBSCO is on record for stating that their relevancy ranking places the highest priority on their subject headings rather than title, as they are justly proud of the subject headings they have.

One would expect EBSCO, of all discovery services, to weight metadata more heavily than full text, yet librarians still feel relevancy can be improved by ignoring full text!

Content Neutrality?

With the merger of ProQuest and Ex Libris, we are now down to one "content neutral" discovery service.

One of the fears I've often heard is that EBSCO would "push up" its own content in its discovery service, and to some extent people fear the same might occur in Summon (and now Ex Libris) for ProQuest items.

Personally, I am skeptical of this view (though I wouldn't be surprised if I were wrong), but I do note that for discovery vendors that are not content neutral, it's natural that their own content will have at the very least full text, if not thick metadata, while content from other sources is likely to have poorer quality metadata and possibly no full text, unless efforts are taken to obtain them.

This alone would lead to their own content floating to the top, even without any other evildoing.

To be frank, I don't see a way to "equalize" everything, unless one ignores full text and ranks only on the very limited set of thin metadata that all content has.

Ignoring metadata and going full text mostly?

Lastly, while there are discovery services that rank based on metadata but ignore full text, it's possible, if strange, to think of a type of search that is the exact opposite.

Basically, such a system ranks only or mostly on full text and not on metadata (whether thick or thin).

The closest analogy I can think of for this is Google or Google Scholar.

All in all, Google Scholar is, I guess, a mix of mostly full text and thin metadata, which helps make relevancy ranking easier since it is ranking across similar types of content.

Somehow, though, Google Scholar still manages to do okay... though as I mentioned before in 5 things Google Scholar does better than your library discovery service, it has a big advantage:

"Google Scholar serves one particular use case very well - the need to locate recent articles and to provide a comprehensive search." compared to the various roles library discovery services are expected to play including known item search of non-article material.


Honestly, the idea that libraries would want to throw away readily available data such as full text to achieve better relevancy ranking is a very odd one to me.

That said, we librarians also carefully curate the collections that are searchable in our discovery index rather than just adding everything available or free, so this idea of not using everything is not a new concept, I guess.

Saturday, September 19, 2015

[Research question] What percentage of citations made by our researchers is to freely available content?

I recently signed up for a "research methods" class whose aim is to help practitioners like me produce high quality LIS papers. Inspired slightly by open science methods, I will blog my thoughts on the research question I am working on. Writing this helps me clarify my thinking, and of course I am hoping for comments from you if my thoughts have piqued your interest.

The initial motivation

The idea began as a piece of work I was asked to do. Basically I was doing an analysis of citations made by our researchers to aid collection development. The idea was to see whether there was a good fit between our collection and what users are using, and to gauge potential demand for Document Delivery/Inter-Library Loan.

It's a somewhat old-school type of study, but generally the procedure goes as follows:
  • Sample citations made by your researchers to other items
  • Record what was cited - typically you record age of item, item type cited, impact factor of journal title etc.
  • Check if the cited item is in your collection
Papers like Hoffmann & Doucette (2012), "A review of citation analysis methodologies for collection management", give you a taste of such studies if you are unfamiliar with them.
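For what it's worth, here is a minimal sketch of how one might record each sampled citation; the field names are hypothetical and would be adjusted to whatever a particular study actually captures.

    import csv

    # Hypothetical fields - adjust to whatever your study actually records.
    FIELDS = ["citing_paper", "cited_item", "item_type", "pub_year",
              "in_collection", "free_on_google_scholar"]

    with open("citation_sample.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerow({
            "citing_paper": "Faculty paper A (2014)",
            "cited_item": "Journal article B (2009)",
            "item_type": "journal article",
            "pub_year": 2009,
            "in_collection": True,             # checked against the catalogue/link resolver
            "free_on_google_scholar": False,   # checked manually at the time of the study
        })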

The impact of free

But what does "in your collection" mean? This of course would include things you physically hold and subscribed journal articles etc.

But it occurred to me that these days our users can also often obtain what they want by searching for free copies, and as the open access movement starts to take hold, this is becoming more and more effective.

In fact, I did it myself all the time when looking for papers, so I needed to take this into account.

In short, whatever couldn't be obtained through our collection and was not free would arguably be the potential demand for DDS/ILL.

(In theory there are other ways, legal and illegal, to gain access, such as writing to the author, access via co-authors/secondary affiliations or, for the trendy ones, #canihazpdf requests on Twitter.)

How do you define free?

As a librarian with some interest in open access, I am aware that much ink has been spilled over definitions.

There's green/gold/diamond/platinum/hybrid/libre/gratis/delayed etc. open access. But from the point of view of a researcher doing the research, I simply don't care. All I want to know is whether the full text of the article is available for viewing at the time I need it. It could be "delayed open access" (often argued to be a paradoxical term), but if it's accessible when I need it, it's as good as any.

What would an average researcher do to check for any free full text?

Based on various surveys and anecdotes from talking to faculty both in my current and former places of work, I know Google Scholar is very popular with users.

It also just happens that Google Scholar is an excellent tool for finding free full text, and a recent Nature survey shows that when there is no free full text, more users will search Google or Google Scholar for it, and a smaller number will use DDS/ILL.

As such it's not a leap to expect the average researcher would probably use Google Scholar or Google to check for free full-text. 

So one would have to factor in whether the item is available for free, which could be determined by simply checking Google Scholar, and add that to what is in our "collection" (defined as physical copies and subscribed material).

Whatever remained that couldn't be explained by these two sources was the potential demand for DDS/ILL.
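Continuing the hypothetical recording sketch from earlier, the breakdown described here is just a few counts over the sample; something like this, with the same made-up field names.

    def coverage(rows):
        """Split a citation sample into collection / free-only / neither (potential DDS/ILL)."""
        n = len(rows)
        in_collection = sum(1 for r in rows if r["in_collection"])
        free_only = sum(1 for r in rows if not r["in_collection"] and r["free_on_google_scholar"])
        neither = n - in_collection - free_only
        return in_collection / n, free_only / n, neither / n

    # e.g. coverage(list_of_recorded_rows) -> proportions like the ones discussed below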

Preliminary results 

I'll talk about the sampling method later, but essentially, in my first exploratory data collection (with the help of colleagues of mine), I found that of the citations made, 79.7% were to items in our collection (either print or subscribed) and, of the remainder, another 13.4% were freely available by searching Google Scholar.

But the figures above presume a library-centric viewpoint and assume that users check our collection first and turn to free sources only if the item is unavailable there. Is this a valid assumption?

As one faculty member I discussed the results with said, "correlation does not imply causation": just because they cited something that could be found in our collection didn't mean they used our copy. In fact, given the popularity of Google Scholar and the convenience of using it, it might be just as likely they accessed the free copy, especially if they were off campus and using Google.

Librarians who are familiar with open access will immediately say: wait a minute, not all free full text is equal, especially the self-archived kind. Some copies are preprints, some postprints, some final published versions, and they differ in things like pagination.

There could in theory be very big differences between a preprint and the final published version, and if you only had the postprint version you should cite it differently from the final published version.

According to Morris & Thorn (2009), in a survey, researchers claim that when they don't have access to the published version, 14.50% would rarely, and 52.70% would never, access the self-archived versions.

This implies researchers usually don't try to access self-archived versions that aren't the final published version.

Still, this is self-reported usage, and one suspects that, in the service of convenience, many researchers would be happy with any free version of the text and just cite it as if they had read the final published version...

For example, in the Ithaka S+R US Faculty Survey 2012, over 80% say they will search online for a freely available version, more than the proportion who would use ILL/DDS. Are these 80% of faculty only looking for freely available final published versions? That seems unlikely to me.

 Ithaka S+R US Faculty Survey 2012

Let's flip it around for the sake of argument: how do things look if we assume users access free items (whether preprint/postprint/final version) first and consult the library collection only when forced to?

As seen above, for the same sample, a whopping 80.4% of cited items can be found for free in Google Scholar, further supplemented by another 12.7% from the collection. (Either ordering leaves the same residual of roughly 7% as potential DDS/ILL demand; what changes is how the credit for access is split.)

As we will see later, this figure is probably a big overestimate and I don't want to get hung up on it. Still, it is very suggestive (if we can trust it), because it tells you that if our users did not have access to our library collection, they could still find and read the full text of 80% of the items they wanted!

It then dawned on me that this figure is actually of great importance to academic libraries. Why?

Why the amount of cited material that is free is a harbinger of change for academic libraries

One of the areas I've been mulling over in the past year is the impact of open access on academic libraries. It was clear to me, based on the Ithaka S+R US Faculty Surveys, that faculty currently highly value the role of the library as a "wallet", and that this is going to change drastically when (if?) open access becomes more and more dominant.

Still, timing is everything; you don't want to run too far ahead of where your clients are. So there is a need to tread carefully when shifting resources.

I wrote, "how fast will the transition occur? Will it be gradual allowing academic libraries to slowly transition operations and competencies or will be it a dramatic shift catching us off-guard?

What would be some signals are signs that open access is gaining ground and it might be time to scale back on traditional activities? Downloads per FTE for subscribed journals start to trend downloads? Decreasing library homepage hits? At what percentage of annual output that is open access, do you start scaling back?"

It came to me that the figure calculated above, the percentage of cited items that could be found free in Google Scholar, could serve as a benchmark for determining when the academic library's role as a purchaser would be in danger.

In the above example, if indeed 80% of what they wanted to cite was free at the time of citing, the academic library's role as a purchaser would be greatly reduced, such that users would only need you 2 out of 10 times! But is that really true?

Combining citation analysis with open access studies

My research question can be seen as a combination of two different strands of research in LIS.

First there is the classic citation analysis studies for collection development uses that was already mentioned.

Second there is a series of studies in the open access field that focused on determining the amount of open access available throughout the years. 

The latter area has accumulated a pretty daunting set of literature trying to estimate the amount of open access material available.

Of these studies, there's a subset that typically samples from either Scopus or Web of Science and checks whether free full text is available via Google Scholar, Google, or some combination; this subset bears the closest resemblance to my proposed idea.

  • Bjork et al. (2010) - 20.4% free full text found. Sample drawn from Scopus; searched in Google; 2008 articles, searched in Oct 2009.
  • Gargouri et al. (2012) - 23.8%. Sample drawn from Web of Science; a "software robot then trawled the web"; 1998-2006 articles searched in 2009, and 2005-2010 articles searched in 2011.
  • Archambault et al. (2013) - 44% (for 2011 articles). Sample drawn from Scopus; searched in Google and Google Scholar; 2004-2011 articles, searched in April 2013. A "ground truth" of 500 hand-checked articles published in 2008 found 48% freely available as at Dec 2012.
  • Martín-Martín et al. (2014) - 40%. 64 queries in Google Scholar, collecting 1,000 results; searched in Google Scholar; 1950-2013 articles, searched in May and June 2014.
  • Khabsa & Giles (2014) - 24%. Randomly sampled 100 documents from MAS for each field to check for free copies, multiplied by the estimated size of each field (determined by a capture-release method); searched in Google Scholar; coverage all years?, search date not stated.
  • Pitol & De Groote (2014) - 58%. Drawn randomly from Web of Science; for Institution C, 50 items not already in the IR were checked in Google Scholar; 2006-2011 articles. The abstract reports 70% free full text, but that figure covers institutions A, B and C, and for A and B the random sample drawn from WoS included copies already in the IR.
  • Jamali & Nabavi (2015) - 61%. 3 queries in Google Scholar for each Scopus third-level subcategory, checking the top 10 results for free full text; searched in Google Scholar; 2004-2014 articles, searched in April 2014.

I am still mulling over the exact details of each paper and their differences in methodology, but the overall percentages are suggestive, ranging from 20% to 61%, with the later studies generally showing a higher percentage.

Martín-Martín et al. (2014) and Archambault et al. (2013) in particular strike me as very rigorous studies, and both show that around 40% or more of full text is available.

But we can see that the 80% we got is obviously far above the expected upper bound. Why?

Big issues with methodology

Here is where I mention the big problem I have. 

First, the sample I drew from consisted of citations made in papers published in 2000-2015. The field was economics, and the search for free items was done in September 2015.

The first obvious issue is that when I check if something is free in Google Scholar, I am only checking what is free now.

This is okay if all I care about is what percentage is free now.

But from my point of view, I want to know how much was free, at the time the researcher was citing it rather than many years later.

So for example take a paper A written in 2003 that cites a paper B written in 2000.

Today (September 2015, as I write this), I determine that paper B is free and findable via Google Scholar. The obvious issue, of course, is that while it is free now, it might not have been free in 2003 when the author was doing the research!

Whether an article was free at the time the author was writing the paper depends on 

a) When the writer was writing up the paper
b) The age of the article he was citing at the time

The interaction of these two factors makes it very confusing, as there is a host of factors affecting whether something is free at a given time. A short list includes a) journal policies with embargoes on self-archiving, b) uptake of (often illegal) options like ResearchGate, c) the general momentum towards both green and gold open access at the time, etc.

Is there a solution?

Honestly, I am not sure. I can think of many ideas to try to fix it, but they may not work.

First off, I could forget about longitudinal studies and focus only on citations made in papers published within a short window, say within six months of the searching done today in 2015, to reduce such timing effects. But even this isn't perfect, as one can quibble that publication dates tell us little about when the writing was actually done, since publishing can have long lead times.

Another way is to carefully examine the source where the full text was found, and hope that the source has metadata on when the full text was loaded.

For example, some institutional repositories or subject repositories might indicate when the full text was uploaded (e.g. "Number of downloads since DD/MM/YY").

Full text uploaded to DSpace, with a "downloads since" date as an indicator

Based on studies like Jamali & Nabavi (2015) and Martín-Martín et al. (2014), we know that, surprisingly, one of the major sources of free full text is items uploaded to ResearchGate (ranked 1st for the former and 2nd for the latter), so this could be a big sticking point.

That said, looking around ResearchGate, I noticed that, surprisingly, it does list when something was uploaded.

Was this article uploaded to Researchgate on Jun 6, 2016?

Edit: @varnum suggested a great idea: checking using the Internet Archive's Wayback Machine. It works for some domains, like .edu domains, which helps a little when someone puts up a PDF on university web space.

A PDF on the Duke domain existed in 2011, according to the Wayback Machine.
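The Wayback Machine also has a simple availability API that can be scripted. Here is a rough sketch; note that it returns the closest snapshot, which may be before or after the date you ask for, so you would still have to check the returned timestamp.

    import requests

    def closest_snapshot(url, timestamp="20031231"):
        """Ask the Wayback Machine for the snapshot closest to a YYYYMMDD timestamp."""
        resp = requests.get(
            "https://archive.org/wayback/available",
            params={"url": url, "timestamp": timestamp},
        )
        resp.raise_for_status()
        snapshot = resp.json().get("archived_snapshots", {}).get("closest")
        if not snapshot:
            return None                      # never archived
        return snapshot["timestamp"], snapshot["url"]

    # e.g. closest_snapshot("someuniversity.edu/~author/paper.pdf", "20031231")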

Another idea was to apply a blanket rule: if the paper cites something that was less than two years old at the time, and a free full text is found now (and it was not published in a gold OA journal), we assume it wasn't free then, as many journals allow published versions or postprints to be self-archived only two years after publication.

This will undercount for various reasons, of course, not least of which is illegal copies.
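As a sketch, the blanket rule might look something like this; the two-year cutoff is the rough assumption from above, not a measured embargo length.

    EMBARGO_YEARS = 2   # rough assumption: typical delay before postprints/published versions appear

    def likely_free_when_cited(citing_year, cited_year, free_now, gold_oa):
        """Crude blanket rule: recent, non-gold items found free today are assumed not free back then."""
        if not free_now:
            return False
        if gold_oa:
            return True                                   # gold OA is free from publication
        age_when_cited = citing_year - cited_year
        return age_when_cited >= EMBARGO_YEARS            # older items assumed to have been free already

    print(likely_free_when_cited(2003, 2002, free_now=True, gold_oa=False))   # False
    print(likely_free_when_cited(2003, 2000, free_now=True, gold_oa=False))   # True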

A more nuanced approach would be to take into account the policies listed in SHERPA/RoMEO. But journal publishers' policies change over time too.


The more I lay out my thoughts, the more I wonder whether my idea is fatally flawed. It would be great to be able to get the figure (the percentage of cited items that could be freely obtained at the time of citing), but it may just be that such a figure is impossible to get accurately enough, even if one limits the sample to a short window around the time of searching.

What do you think? 

Edit: This post is written from the librarian's point of view of reacting to changes in research behavior, and is neutral on whether academic librarians should be revolutionaries or soldiers in the open access movement.

Sunday, August 23, 2015

Things I learnt at ALA Annual Conference 2015 - or, data is rising

I had the privilege to attend ALA annual conference 2015 in San Francisco this summer. This was my 2nd visit to this conference (see my post in 2011) and as usual I had lots of fun.

Presenting at the "Library Guides in an Era of Discovery Layers" session

My ex-colleague and I were kindly invited to present on our work on a bento-style search we implemented for our LibGuides search.

For technical details, please refer to our joint paper, Implementing a Bento-Style Search in LibGuides v2, in the July issue of Code4Lib.

See the Storify of the event at

Data is rising 

Before I attended ALA 2015, I was of course aware that research data management was increasingly an important service that academic librarians are, or should be, supporting.

To be perfectly frank though, it was a hazy kind of "aware".

I knew that, increasingly, grant-giving organizations like NIH and other funders were requiring researchers to submit data sharing plans, so that was an area where academic librarians could provide support, particularly if open access takes hold, since that would make many traditional tasks obsolete.

Also, I knew there was all this talk about supporting digital humanities and GIS (geographic information systems) services, such that my former institution began appointing digital humanities and GIS librarians just before I left.

Perhaps closer to my wheelhouse, given my interest in library discovery, there was talk about linked data and BIBFRAME, which isn't research data management per se.

All three relate to emerging areas that I knew or strongly suspected would be important, but I was unsure about the timing or even the nature of the change (see later).

Add the "stewardship's duty of libraries" towards the "Evolving Scholarly Record" (what counts as scholarly record is now much expanded beyond just the final published article and libraries need to collect and preserve that), you can see why data is a word librarians are saying a lot more.

Still, attending ALA Annual 2015 made me wonder whether a tipping point has finally been reached and whether I should start looking at it more deeply.

Is Linked data finally on the horizon?

While attending a session by Marshall Breeding, "The future of Library Resource Discovery: Creating new worlds for users (and Librarians)", he asked this question.

Breeding's observation was indeed apt, though one's choice of sessions obviously has an impact; for example, this blogger wonders if the overdose of linked data is simply due to her own interests.

Still, this year there seemed to be quite a lot of talk on linked data and Bibframe. Perhaps a tipping point has been reached?

I think part of it is due to the fact that ILS/LMS/LSP vendors have begun to support linked data.
This breaks the chicken-and-egg problem: there are no tools because no one is interested, and no one is interested because there are no tools.

The biggest announcement was on Intota v2 - ProQuest's cloud-based library services platform

"Intota v2 will also deliver a next generation version of ProQuest's renowned Knowledgebase. Powered by a linked data metadata engine, Intota will allow libraries to participate in the revolutionary move from MARC records to linked data that can be discovered on the web, increasing the visibility of the library." - Press release

I actually was in attendance during the session but left before it was demoed (kicking myself for that). The tweet below is interesting as well.

Of course, we can also expect Summon to start taking advantage of linked data to enhance discovery via Intota.

Besides ProQuest, SirsiDynix announced it would "produce BIBFRAME product in Q4 2015", while Innovative had pledged support for the Libhub Initiative a few months earlier.

OCLC, of course, has always been an early pioneer in linked data.

"Nobody comes to librarians for literature review?"

As part of my attempt to balance going to sessions in areas I was really interested in (and hence where I would likely be well versed in most of the things shown) with sessions in areas I was totally unfamiliar with (and hence where most things would likely go over my head), I decided to go to some GIS sessions.

I accompanied my ex-colleague and co-presenter to a couple of sessions on GIS (Geographic Information Systems), an area he has an interest and passion in and where he is currently tasked with trying to start something up for the library.

I attended various sessions, including a round table session which focused more on what libraries were doing as opposed to the more technical sessions. It was clear from the start that some academic libraries in the US were far more advanced than others, such as Princeton, where I believe a librarian stated that libraries have been managing data for over 50 years and it's not a new thing for them.

Much nodding of heads occurred when someone warned about jumping on the bandwagon simply because their University Librarian thought it was a shiny new thing.

Many talked about staffing models and how to fit liaison librarian versus specialist roles into these new areas, which is a perennial issue whenever a new area emerges (e.g. for many academic libraries it was promoting open access the last time around).

One librarian stated that helping faculty handle research data is important because "nobody comes to us anymore for literature searches".

Of course, this immediately drew a response from (I believe) a social science (or was it medical?) librarian, who said faculty do come to them for both literature reviews and data sets! :)

Why searching for data is the next challenge

Ex Libris has been sharing the following diagram at various conferences recently, listing five things users expect to be able to do.

Of the five tasks above, I would say the greatest challenge right now is to "obtain data for a research project", which can be seen as a different class of problem compared to the other four tasks, which broadly speaking involve finding text-based material.

I think this is because, over the years, improvements in search technology (from the "physical only" days to the early days of online and now to Google Scholar and web scale discovery), coupled with easily over a century of effort and thinking on how to organize and handle text, have made searching for text, particularly scholarly text such as peer-reviewed articles, if not a completely solved problem, at least one that isn't so daunting that most academics recoil in terror and ask for help.

Yet the difficulty of searching for data sets and statistics is, I would say, about the same as the difficulty of searching for articles in the 1980s to 1990s. While the latter has improved by leaps and bounds, the former hasn't moved much.

Lack of competition from Google? 

Having worked in a business/management-oriented university for five months, I am starting to appreciate how much more difficult it is to get datasets from, say, finance, and I know many librarians, including myself, get a sinking feeling in our stomachs when asked to find them.

Firstly, the interfaces to get the data out of them are horrendous. Even the better ones are roughly at the level of the worst article searching interfaces.

This is partly, I suspect, because without Google to put pressure on these databases, there is no incentive to improve. Competition from Google, I believe, has driven the likes of EBSCO, ProQuest, etc. to converge on pretty much the same usable design, or at least a Google-like design that takes little adjusting to.

Today, the UI you see on the Summon, Web of Science, Scopus and EBSCO platforms is pretty much the same, and you can practically use them without any prior familiarity. (See my post on how library databases have evolved, mostly in terms of functionality and interface, to fit into the Google world.)

Google's relentless drive to improve user experience has benefited libraries by forcing vendors to try to keep up. You could say the EBSCOs of the world were practically forced to improve or die of irrelevance as students flocked to Google.

Of the databases that libraries subscribe to, the worst ones typically belong either to the smallest outfits or to ones that primarily serve other, non-library sectors.

So the likes of Bloomberg, Capital IQ, T1 and even many law databases such as LexisNexis have comparatively harder-to-use designs.

They can get away with this because of the lack of competition from Google, and also because these are primarily work tools: professionals are proud of the hard-earned Bloomberg skills, say, that give them a competitive advantage.

When it comes to non-financial data, things become even more challenging, since there aren't many well-known repositories of data (at least to a typical librarian not immersed in data librarianship) that one should look at. Google is of limited help here, surfacing the usual well-known open data sources such as the World Bank and the UN.

How researchers search for public data to use

A recent Nature survey asked researchers how they find data to use.

The article noted that no method predominated, with checking references in articles as common a method as searching databases. Arguably this points to the fact that

a) databases of data are not so well known, and
b) databases of data are hard to use (due to lack of comprehensiveness or poor interfaces).

Of course, this survey question asks only about "public data" to reuse.

Researchers often approach me about using data (for content analysis) from databases we license, such as newspaper and article databases. This seems to be yet another area that academic libraries can work on; leading libraries like NCSU Libraries have taken on the task of negotiating access to data from the likes of Adam Matthew and Gale.

Confusion over what libraries can or should do with data

As with any new area academic libraries are trying to get involved in (thanks to reports like the NMC Horizon Report - Library Edition listing this area as an increased focus), there is a lot of confusion over the skill sets, roles and responsibilities needed.

What a "data librarian" should do is not a simple question, as this can span many areas.

In Hiring and Being Hired. Or, what to know about the everything data librarian, a librarian talked about how his responsibilities ballooned and noted that "everything data librarians don't actually exist".

He points out that many job ads for data librarians actually comprise five separate areas:
  •  Instruction and Liaison Librarian
  •  Data Reference and Outreach Librarian
  •  Campus Data Services Librarian - (this job is most associated with Scholarly communication)
  •  Data Viz Librarian (Learning Technologist)
  • The Quantitative Data Librarian (Methods Prof)

I can smell the beginnings of what the Library Loon dubs "new-hire messianism", where a new hire is expected to possess an impossible number of skill sets, to work in indifferent or even hostile environments, and to almost single-handedly push for change with limited or no resources or authority.

Obviously no single staff member should be "responsible for data". I've been reading about the concept of "tiers of data reference" and thinking about how to improve in this area.


Like most academic librarians, I am watching developments closely and trying to learn more about these areas. Some sites:


This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.