CC BY and data: Not always a good fit.

By Katie Fortney / September 15, 2016

SimbaDansLeCarton-jacme31 — Simba dans le carton, jacme31, CC BY-SA 2.0

Last week I wrote about data ownership, and how focusing on “ownership” might drive you nuts without actually answering important questions about what can be done with data. In that context, I mentioned a couple of times that you (or your funder) might want data to be shared under CC0, but I didn’t clarify what CC0 actually means. This week, I’m back to dig into the topic of Creative Commons (CC) licenses and public domain tools — and how they work with data.

People often refer to everything with a CC in front of it as a “Creative Commons license.” But only six of the Creative Commons tools are licenses. They are:

CC BY (Attribution)
CC BY-NC (Attribution-NonCommercial)
CC BY-ND (Attribution-NoDerivatives)
CC BY-SA (Attribution-ShareAlike)
CC BY-NC-ND (Attribution-NonCommercial-NoDerivatives)
CC BY-NC-SA (Attribution-NonCommercial-ShareAlike)

The licenses permit people to use a work in exchange for agreeing to certain conditions: attribution at a minimum, and others if the author so chooses. Let’s not focus on what those other conditions are today, just the fact that a) there are conditions and b) attribution is always one of them.

The public domain tools are different. There are two of them:

PDM (Public Domain Mark, aka “No Known Copyright”). This one means, “Hey, I found a thing that’s already in the public domain, and I’m marking it because I thought you’d like to know.”
CC0 (Public Domain Dedication, aka “No Rights Reserved”) This one means, “If I hold copyright in this, I waive it. And if we’re dealing with a country where we can’t waive copyright, then here’s a broad public license with no attribution requirement. And just as a back up, I also promise not to sue you for copyright or related stuff if you use this.”

There aren’t conditions placed on the user in these two cases, so we don’t call them licenses.

Sharing data with CC0 is increasingly encouraged (see these pages by Dryad and BioMed Central for a couple of examples) because the easier you make it for people to reuse your data — and the clearer you make it that you aren’t planning on suing people who do — the more likely it is that your data will be reused. The Open Data Commons has a similar tool called the Public Domain Dedication and License (PDDL).

That said, CC BY and the other five Creative Commons licenses are also popular among scholars publishing data, because of the attribution requirement I mentioned above. Specifically, the CC licenses require that those who reuse a work provide attribution to the work’s creator by retaining “identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated).” This makes great sense for photos, articles, and music that creators share with a CC license.

It’s not surprising that researchers producing data also appreciate attribution. Unfortunately, CC licenses can be a poor tool for mandating it. They are neither sufficient, nor necessary.

1. CC licenses are not sufficient for ensuring proper attribution in many cases because their restrictions — including attribution — do not apply to facts. The restrictions only apply to whatever portion of a work is copyrightable. (See last week’s discussion of copyrightability.)

If you read the licenses, they say that “For the avoidance of doubt, this Public License does not, and shall not be interpreted to, reduce, limit, restrict, or impose conditions on any use of the Licensed Material that could lawfully be made without permission under this Public License” (emphasis added, see CC BY for an example). In other words, the conditions don’t apply to uncopyrightable facts.
This is also discussed in CC’s FAQ, “Do I always have to comply with the license terms? If not, what are the exceptions?” Answer: “If your use would not require permission from the rights holder because it falls under an exception or limitation, such as fair use, or because the material has come into the public domain, the license does not apply, and you do not need to comply with its terms and conditions.”

Because of this, sharing uncopyrightable data under a CC license can lead to confusion about what the data producer is trying to achieve. So can sharing data with no reuse terms stated at all: if data is shared publicly, but without a license, copyright law’s defaults apply. That means that if the data isn’t copyrightable, reuse doesn’t require the author’s permission. But if there’s something copyrightable, the author has exclusive rights to control reproduction and distribution of it, subject to certain exceptions like fair use. With or without a CC license, users who want to reproduce or redistribute the data can be left wondering about what’s protected by copyright, and what kinds of reuse (if any) the data producer was hoping to enable.

CC0 is much clearer. It doesn’t matter if a dataset is copyrightable or not. Whatever copyright the author has, they waive. Everywhere. If it’s nothing, they waive nothing, but they’re making a clear statement that they don’t plan on using copyright to restrict reuse. That’s especially handy when you consider that a) reasonable people can argue about whether something is copyrightable or not and b) something can be copyrightable in some countries but not others.

2. CC licenses’ attribution requirements aren’t necessary because scholars have very good reasons to provide attribution that has nothing to do with copyright — see, for instance, the Joint Declaration of Data Citation Principles, which doesn’t mention copyright or intellectual property at all. Data that comes from nowhere has little credibility. If someone wants to use data as persuasive evidence, they need to refer readers and reviewers back to its source: who it came from and how it was produced.

And licenses that dictate exactly how later projects reuse data and provide attribution can hobble those projects. For just one example, see this discussion of licensing issues in Rephetio, a project that’s done a particularly good job documenting the challenges faced in attempting to reuse drug repurposing data. Data reusers should follow best practices for data citation in their field given their particular project. If data sharers can trust them to do just that, rather than trying to shoehorn attribution practices into one-size-fits-all copyright licenses, they can make it much easier for others to reuse their data.

Meanwhile, cheaters will cheat. If professional ethics aren’t enough to dissuade someone from plagiarism or other misconduct, why would the restrictions of a CC license dissuade them, especially in cases where they’re using factual data that those restrictions don’t apply to?

If you’re sharing data publicly:

Understand that sharing data under a CC license doesn’t require users to provide attribution if they’re just pulling out uncopyrightable facts (although scholarly norms very well might).
If you use a CC license to share data and you’re not sure whether your data is copyrightable, potential users probably aren’t sure either. Some may assume little copyrightability and use freely; some others may err on the side of caution and decide they can’t use your data because they can’t meet the attribution requirements. Explanatory notes about your expectations alongside a CC license may help by showing that you’re aware that there’s complexity, or by inviting data users to initiate a conversation about desired uses. But avoid adding language to your notes that contradicts your chosen license; see another discussion from the Rephetio project mentioned above for an example of custom terms of use that conflicted with a CC license, leading to confusion.
If clarity and maximum reusability are important to you, consider CC0.

If you’re using publicly shared data:

Even if failing to meet attribution requirements for noncopyrightable data labeled CC BY isn’t a violation of the license, it might cause other problems. Think about scholarly norms and the working relationships you want to maintain with peers and potential collaborators.
If you’re not sure what someone’s trying to accomplish with their reuse terms — especially if those terms seem like an obstacle to work the data originator might well want to support — you can always try contacting them directly.

Public data sharing on a large scale is still relatively new. Things are going to be confusing for a while. People are going to make mistakes. Don’t be afraid to ask questions. Share your concerns, successes, and roadblocks with others in your field, publicly if you can. As data sharing becomes more common due to funder and publisher requirements, answers will be increasingly easy to find.

This post was cross-posted on DataPub — the blog of the University of California Curation Center at CDL.