Kia ora, I'm Michael Lascarides. I am the user experience lead at the National Library of New Zealand.
I work with the Digital New Zealand team where we craft web, mobile and data interfaces for all kinds of folks to use,
from family researchers, to teachers, to app developers, to students, to the generally curious, to our own staff.
I've been asked today to tell you a little bit about a few of the ways that our institution
creates, uses and shares our collections data.
I'll start with a little overview of what our institution does and what it collects.
The National Library is New Zealand's legal deposit library and is charged by law with the obligation
to "enrich the cultural and economic life of New Zealand, and its interchanges with other nations."
The library has three main parts: the general collections, encompassing the legal deposit services;
the schools collection, including the largest collection of children's books in the Southern Hemisphere;
and the collections of the Alexander Turnbull Library, which consist predominantly of unpublished materials such as manuscripts and photographs.
To give a sense of the size of the collections,
we currently have about 836,000 items in the catalogue of unpublished materials,
about 1.4 million items in the published catalogue
and over 30 million items searchable on our website
and the bulk of that is individually digitised newspaper articles from our Papers Past service.
These collections include books, maps, images, recorded music, music scores,
newspapers, periodicals, manuscripts, letters, paintings, artifacts, manuals, and more.
So, to have a look at what sort of data we're generating and how we're generating it,
this is a very, very simplified structure of where we store our data and where we generate data,
organised roughly from most public to least public.
Here in the middle are the catalogues, where we have things like
the Papers Past service containing full-text digital objects,
various databases, the catalogue of published materials and the catalogue of unpublished materials.
Behind that we have the actual physical collections that the items in the catalogues represent,
and our digital preservation and digitisation services.
We run a metadata service called Digital New Zealand on top of all of this.
There will be more about that in an upcoming bit of this talk.
We use the Digital New Zealand service to generate a searchable archive
and an API data service across all of our collections, as well as those of other institutions.
Up front, we have several public interfaces to the web.
We have the digitalnz.org website, we have natlib.govt.nz
and under the natlib.govt.nz domain we have the
National Library website, as well as the Papers Past website and numerous others.
So what is our approach to generating, sharing and using collections data?
First of all, for the last few years we have had guidance from our government,
starting with the NZGOAL framework for copyright.
NZGOAL is the New Zealand Government Open Access and Licensing framework,
and essentially it gives us guidance for how we can license
data and materials that we generate. So this essentially sets the pathway for us to be open.
We also look to the Open Government Data Programme.
This is a specific programme within the government that we are part of that encourages us to share our data online.
So this is from the data.govt.nz website, to which we contribute
and we also release our own data sets out into the world.
Within the cultural heritage sector as well, we also look to, not just legal guidance,
but professional best practices, and something we've spent a lot of time thinking about is
the 5-Star Open Data framework, which has been championed by Tim Berners-Lee, the inventor of the World Wide Web.
Roughly speaking, the 5-Star framework gives us guidance on
what is the best practice for sharing our collections data, for sharing the data that we want to share.
And these five stars are great for checking on your own data to see how well you're sharing.
One-star data is sharing data, but really just putting stuff on the web.
So if you put a picture of some data on the web, that's one-star.
The end goal is five stars, which is data that is shareable - not only shareable and machine readable,
but interactive - is actively interacting with other institutions outside your own.
So, one star: make your stuff available on the web in any format.
Two stars: make your stuff available as structured data in any format -
so an Excel spreadsheet [is] two stars.
Three stars is specifically sharing it in structured data format in a non-proprietary open format.
So this is things like CSV, XML, JSON and
other formats that have open standards that are not controlled by commercial interests.
Four stars is using URIs, so basically web addresses, to denote things
and give them permanent homes on the web so that people can point at your stuff.
This is permalinks, so that each piece of data, each data set and each item within that data set
has a link that people can refer back to that doesn't change.
And finally, five stars is linking your data to other sources of data to provide context and interconnection.
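The five levels just described can be turned into a small self-check. The following is only a sketch: the field names and the three example datasets are made up for illustration, and it assumes the stars are cumulative, so each star requires all of the earlier ones.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Illustrative checklist; the field names are assumptions, not an official schema."""
    on_web: bool            # 1 star: available on the web in any format
    structured: bool        # 2 stars: structured data, e.g. a spreadsheet
    open_format: bool       # 3 stars: non-proprietary format such as CSV, XML or JSON
    uses_uris: bool         # 4 stars: permanent web addresses for the set and its items
    linked_to_others: bool  # 5 stars: linked to other sources of data


def star_rating(d: Dataset) -> int:
    """Count stars in order, stopping at the first level that is not met."""
    checks = [d.on_web, d.structured, d.open_format, d.uses_uris, d.linked_to_others]
    stars = 0
    for passed in checks:
        if not passed:
            break
        stars += 1
    return stars


# Hypothetical examples matching the levels described in the talk.
picture_of_data = Dataset(True, False, False, False, False)
excel_spreadsheet = Dataset(True, True, False, False, False)
csv_download = Dataset(True, True, True, False, False)

print(star_rating(picture_of_data), star_rating(excel_spreadsheet), star_rating(csv_download))
```

Stopping at the first unmet level keeps a dataset that has permalinks but only a proprietary format from being scored as four stars.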
So the title of this talk is 'Keeping an Eye on the Fifth Star' because
we feel like we've made progress into the three and four and where we are as an institution
is looking towards the five and that's where we want to be.
So let's look specifically a little bit at the National Library,
what we're doing with open data.
If you're interested in just getting right to the source and downloading the open data that we've provided
you can go to this address on our website, our open data page, where you can find
downloadable copies of all sorts of data sets.
And what you'll find in there includes copies of several of the databases that we use,
so this is things like Publications and Index, the Te Puna Web Directory,
Maori subject headings. These are CSV-format things, so these would fall into the 3-star open data format.
We also have some XML data sets coming out of the unpublished catalogue.
So this is things like the Turnbull Library collections metadata itself and the list of Turnbull Library names,
so the authoritative reference for things like organisational names and people's names.
We also run the Digital New Zealand service, which is many things -
digitalnz.org is a website that is searchable as a search engine,
but mainly we think of DigitalNZ - the heart of it, the beating heart - is the API,
the applications programming interface that allows us to bring materials from the outside world
into Digital New Zealand and to
serve it in a searchable downloadable format as JSON open format data.
There's a lot of acronyms going on there, but suffice it to say that what this means is that
you can search our stuff through Digital New Zealand, you can build your own program,
you can build your own website that searches Digital New Zealand
and then you can use what comes back. You can use the items and you can find things, reuse, remix them
through Digital New Zealand, because we're publishing things in an active, real time, open format.
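To make "build your own program" a little more concrete, here is a minimal sketch of what a client might look like. The endpoint, the parameter names (api_key, text, per_page) and the shape of the JSON response are all assumptions made for illustration and should be checked against the DigitalNZ API documentation. The sketch makes no network call; it only builds a search URL and reads titles out of a made-up response body.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint and parameter names; verify against the DigitalNZ API docs.
API_BASE = "https://api.digitalnz.org/records.json"


def build_search_url(text, api_key, per_page=5):
    """Build a search URL for the assumed records endpoint."""
    params = {"api_key": api_key, "text": text, "per_page": per_page}
    return f"{API_BASE}?{urlencode(params)}"


def titles_from_response(body):
    """Pull item titles out of a JSON body shaped the way this sketch assumes."""
    data = json.loads(body)
    results = data.get("search", {}).get("results", [])
    return [record.get("title", "") for record in results]


# A made-up response body, not real API output.
sample_body = json.dumps(
    {"search": {"result_count": 1, "results": [{"id": 1, "title": "Red-crowned parakeet"}]}}
)

url = build_search_url("red crowned parakeet", api_key="YOUR_KEY")
titles = titles_from_response(sample_body)
print(url)
print(titles)
```

In a real program the body would come from an HTTP request to the URL, made with an API key of your own.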
So a little bit more about the Digital New Zealand service itself, because that's where I've been spending quite a bit of my time recently.
As mentioned, Digital New Zealand is a website,
so if you search for Digital New Zealand on Google, you will wind up at digitalnz.org,
which essentially looks like a search engine and lets you search a whole lot of materials related to New Zealand.
But it's really an ecosystem, an infrastructure of many different parts that's doing a lot of different things.
The Digital New Zealand service is a metadata search service.
We collect metadata on cultural heritage
collections across New Zealand and the world related to New Zealand content.
So we have just about 300 partners -
organisations - that we bring in materials from, including the National Library itself.
So the bulk of the materials in Digital New Zealand is retrieved from the National Library,
but we do include Te Papa, we include Auckland Museum,
we include libraries from all over New Zealand, from all over Australia,
from Trove, from the National Library of Australia and in fact institutions from around the world.
If they have New Zealand content relating to cultural heritage materials, we want them.
We bring that in and then through the API we offer a way that you can search the materials,
classify the materials, filter the materials and get to individual items within the collection and then reuse them.
Over here - so this is the ingest process - we collate the metadata, we use the API to publish that stuff to the world.
People can build search tools, people can remix content, people can build mashups,
and then you can build your own stuff, anything you can think of.
We have very open and clear licenses for the use of Digital New Zealand materials
and in fact the National Library is its own best customer for the Digital New Zealand services.
So if you go to the National Library website and you search
the National Library's collections, you're using the Digital New Zealand API behind the scenes.
So what this all means, for a particular record of a particular item -
in this case an etching of a red-crowned parakeet.
This is an item that the National Library holds in its collections.
We have a page that is represented on the National Library for that record.
We have a page on Digital New Zealand through the search service that is returned representing the same item record.
And there is also the data behind the scenes, which anyone can use through the API.
So if you wanted to create your own engine of New Zealand birds
and have your own red-crowned parakeet search engine you can use our API to build that.
But where we're going is we're building a lot of things on top of the Digital New Zealand service.
And we're just about to launch a new version,
so I'll be showing you a mix of screenshots from some old versions and some works in progress,
but probably shortly after this is published
we will have a new version of the Digital New Zealand website that I can share with you.
Stories are an area that we're getting into.
We've already allowed people to collect things together into sets on the Digital New Zealand website,
we allow them to collect together a list of things that go together.
We're going to give people tools to not just add things to a story, but to be able to write
contextual information around those sets of things and to create stories that
tell more of their own story and to publish those on the digitalnz.org website.
The stories themselves will also be freely licensed once people opt in to publish their stories,
and there will be an API for stories as well.
So you can create things through the Digital New Zealand website
and then be able to access the data that goes in those stories.
Some of the other innovations that we're working on for the Digital New Zealand service are aimed at getting to that elusive fifth star.
And, really, the thing that we're most excited about with the five stars is the idea of linked open data.
And, specifically, we're in the process of releasing a new part of our API called Concepts.
Concepts is essentially taking a search one step further and adding
definitive, authoritative hooks into the data for particular entities like people and places.
So my former colleague Chris McDowell wrote a blog post and did a presentation
which had the unwieldy title of all the various spellings of Colin McCahon, the New Zealand artist.
This graph that Chris created represents the number of references to people
that are in common between four institutions:
between the Te Ara online dictionary of New Zealand history, the Auckland Art Gallery's collections,
the Te Papa museum's collections, and the Alexander Turnbull Library.
And, as you can see, there's no one single overlap that predominates.
Auckland, Alexander Turnbull Library, Te Papa and Te Ara all have different lists of people
and different overlaps in those lists of people.
So if we knew that all of our stuff that matched the keyword Colin McCahon
was actually related to the specific person Colin McCahon, we'd have a lot stronger
foundation for making connections to these other institutions with their authoritative lists of people.
So the Concepts API is going to allow us to do that.
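As a rough illustration of the idea, not of how the Concepts API is actually implemented, the sketch below folds a few spellings of a name onto one key and looks that key up in a concept table. The table, the identifier "concept/1" and the normalisation rules are all hypothetical.

```python
import unicodedata


def normalise(name):
    """Fold case, accents and punctuation, and turn 'Surname, Given' into 'given surname'."""
    if "," in name:
        surname, given = name.split(",", 1)
        name = f"{given} {surname}"
    decomposed = unicodedata.normalize("NFKD", name)
    kept = "".join(c for c in decomposed if c.isalnum() or c.isspace())
    return " ".join(kept.lower().split())


# Hypothetical concept table: one authoritative identifier per person.
CONCEPTS = {"colin mccahon": "concept/1"}


def concept_for(name):
    """Return the concept identifier for a name variant, or None if it is unknown."""
    return CONCEPTS.get(normalise(name))


for variant in ["Colin McCahon", "McCahon, Colin", "COLIN McCAHON."]:
    print(variant, "->", concept_for(variant))
```

Simple normalisation like this only catches the easy variants. A spelling such as "Colin Mc Cahon" would still be missed, which is part of why an authoritative list of people matters.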
If you're technically minded and you are interested in seeing how we've done this,
we've actually turned the underlying technology that runs the Digital New Zealand API
into an open source project of its own, and I would encourage you to look us up on GitHub
and look at the Supplejack Project, which is our open source flavour of the metadata harvesting
and application programming interface server to create your own metadata collection and serving service.
We are also including a little demo search site, so you can create your own
site very quickly that looks similar to something that works like Digital New Zealand.
Papers Past is another very large site that we run that we've started to do interesting things
with sharing of data and setting the groundwork for getting to that elusive fifth star.
Papers Past, if you're not familiar with it, is our historical online digitised newspaper searching service,
which we recently - last year - have expanded to not just include newspapers
but to include magazines and journals, letters and diaries, and parliamentary papers.
So, essentially, any digitised document that is full-text can now be added to Papers Past by us.
And currently the scope of the newspapers collection includes over 300 titles of newspapers
between the years 1838 and 1945.
There's about 3 million digitised pages coming - actually, we're coming now closer to 4 million pages.
There's about 30 million individual articles that are searchable
and they are searchable within the full text.
So it's a massive data set and a massive resource,
and if you're interested in history, and you like reading about how life was 100 years ago,
it's a rabbit hole you can fall down and not come up for days. It's a wonderful service.
But recently we took the older version of Papers Past and we needed to modernise it for a number of reasons.
We rebuilt the front end so that it works on multiple devices.
We also had to expand it to support these new formats, and as we did we took the opportunity to modernise
and improve some things to make the sharing of data much easier through the Papers Past service.
For starters, I will bring up a not terribly exciting but fundamental and often overlooked part of data sharing.
We talked about the 4-star data being that each item in data has a URI -
has a web address that is permanent that you can refer back to.
The Papers Past URLs were quite unwieldy and out-of-date.
So they look like this. Some of the information in here
didn't follow as much of a structure as we would like and it made it very hard to keep track of statistics.
It made it very hard to share. There was a lot of duplication within the actual web addresses.
So when we launched the new service we took the opportunity to rewrite the URL.
So we went from an old one that looks like this. This represents all of the records for a particular title of newspaper.
This is what that URL looks like rewritten.
We now have, very clearly it says it's a newspapers URL,
it says what the title of the particular publication is in URL format, and it has the date.
So it's easier for humans to read. It's easier to cut and paste. It's easier to work with.
And it's much clearer. So as you're working it's much more predictable and much more stable.
We followed across all of our formats a structured way of dealing with these URLs.
So all the URLs on Papers Past mainly follow this format.
So, the format - newspapers, manuscripts, parliamentary papers;
the publication, which is the actual name of the publication or collection;
the year, the month, the day and the page within that document.
There are a couple of exceptions for things where this doesn't apply, such as volume,
but largely we've rationalised the URLs.
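The pattern of format, publication, year, month, day and page can be sketched as a small URL builder. The lower-case hyphenated publication slug and the zero-padded month and day are assumptions made for illustration, and the example publication and date are made up.

```python
import re
from datetime import date

BASE = "https://paperspast.natlib.govt.nz"  # assumed host for this sketch


def slugify(title):
    """Turn a publication title into a lower-case, hyphen-separated slug (assumed style)."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def page_url(fmt, publication, day, page):
    """Assemble format/publication/year/month/day/page into one structured URL."""
    return (
        f"{BASE}/{fmt}/{slugify(publication)}"
        f"/{day.year:04d}/{day.month:02d}/{day.day:02d}/{page}"
    )


url = page_url("newspapers", "Evening Post", date(1920, 1, 15), 4)
print(url)
```

Because every part has a fixed position, dropping the trailing segments gives the address of the whole day's issue, the year, or the publication.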
And on day one when we went live we mapped 30 million old URLs to 30 million new URLs.
The benefits of this are subtle and a little underrated.
Not only are they easier to understand, they're easier for Google Analytics to parse.
So if we're looking to see how much traffic our website has had on a particular publication,
we can use our analytics tools and just slice through that set of URLs.
It's easier for web spiders - so Google's crawlers and different bots -
to actually gather all of the materials on Papers Past.
We were realising that some of our articles were getting crawled three, four or five different times,
with three, four, five different URLs, and now there's a single canonical URL that's easier to deal with,
and that adds up to bandwidth savings, because the crawlers are not crawling our site as often for the same benefit.
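Slicing traffic by publication becomes a matter of reading the first two path segments once the URLs are structured. This is a sketch only: the list of page views is made up, and a real analytics tool would do this grouping through its own reports.

```python
from collections import Counter
from urllib.parse import urlsplit


def publication_of(url):
    """Return (format, publication) from a structured URL, or None if the path is too short."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) < 2:
        return None
    return segments[0], segments[1]


# Made-up page views standing in for an analytics export.
page_views = [
    "https://paperspast.natlib.govt.nz/newspapers/evening-post/1920/01/15/4",
    "https://paperspast.natlib.govt.nz/newspapers/evening-post/1921/03/02/1",
    "https://paperspast.natlib.govt.nz/newspapers/otago-daily-times/1900/07/09/2",
    "https://paperspast.natlib.govt.nz/",
]

views_by_publication = Counter(
    key for key in map(publication_of, page_views) if key is not None
)
print(views_by_publication.most_common())
```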
Within the pages on Papers Past there are also things that we've done.
One of the interesting ways that you can use collections data is data doesn't have to be
a downloadable database that is separate, or something. You can turn your website into data.
So behind the scenes, this article page.
This represents a single article within a single newspaper, and there's information that we know about that.
There are actually out there a number of standards for this kind of material that you can mark up in a generic format within your HTML.
So schema.org is an initiative that several of the large search engine companies initiated
to encourage people who owned sites to mark up the actual source code of their pages to
give a bit more information about what is actually being spelled out within those pages.
Instead of just a bunch of HTML code, we can say this is a news article.
It has this author, was published on this date in this publication.
So the schema.org news article markup is what we have actually used, and if you look at the source code on
any Papers Past article page now you will see that there is some schema.org markup indicating that
this is a news article and that the headline of the article is the headline field that schema.org would be representing.
We're not actively using this yet, but if someone wants to come along and use our pages
and understand them, do some research in a programmatic way,
write some programs that download our stuff and look at it, this makes it easier for them to do so.
So data doesn't have to be something that's separate, it can be incorporated into a website redesign.
And this is just a quick look at what one of those lines looks like. The schema.org website will give you
instructions on how to add this markup, and it's actually fairly minimal and you get quite a bit of benefit out of it.
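For someone who wants to read that markup programmatically, a minimal sketch using only the Python standard library might look like the following. The HTML snippet is made up, and it assumes the markup is written as microdata attributes (itemtype, itemprop); the actual Papers Past pages may express it differently.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collect the first itemtype and any itemprop values from microdata-style HTML."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.props = {}
        self._pending = None  # itemprop whose value is the next piece of text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs and self.item_type is None:
            self.item_type = attrs["itemtype"]
        if "itemprop" in attrs:
            if "content" in attrs:
                # Value carried in the attribute, as on a <meta> tag.
                self.props[attrs["itemprop"]] = attrs["content"]
            else:
                self._pending = attrs["itemprop"]

    def handle_data(self, data):
        if self._pending and data.strip():
            self.props[self._pending] = data.strip()
            self._pending = None


# Made-up article markup for illustration only.
SAMPLE = """
<div itemscope itemtype="https://schema.org/NewsArticle">
  <h1 itemprop="headline">SHIPPING NEWS</h1>
  <meta itemprop="datePublished" content="1920-01-15">
</div>
"""

collector = ItempropCollector()
collector.feed(SAMPLE)
print(collector.item_type)
print(collector.props)
```

A dedicated microdata or JSON-LD library would handle nesting and edge cases better. The point here is only that the page itself can be read as data.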
So with all this, just a couple of moments of quick thoughts about what are our interesting problems.
What are we thinking about in the collections data space?
Well, I think that we're definitely thinking about how do we get to that fifth star. How do we connect our stuff to other people's stuff?
We're going to be putting the Concepts API in Digital New Zealand into production.
Once that's there, we're going to think a lot about concordances, which is taking the canonical name of a person in our collection
and adding and relating that to the same name in someone else's collection
and making that leap so that we can have links going through.
That makes liaising with other institutions a very important job,
and then doing the work, going from concept to production.
It's both communication with the rest of the world and being a good citizen ourselves.
How do we scale up? These are big data sets. We've reached a certain point.
We're doing a lot of refactoring under the hood.
We're doing a lot of work on the inside of our tools to make sure that they are future-focused and as open as possible,
from the internal engine that you don't see up to the data service that you do see.
And how do we get people involved?
We want to be dealing with more content partners. We want more people to use our stuff.
We want to do outreach and to get our open source tools in the hands of people around the country and around the world.
We need to make our tools easy
and we need to educate people.
Some of this is general marketing, just getting our name out there and letting people know what we do,
some of it is education, and some of it is actual active collaboration and partnership,
and working with other institutions and other individuals very closely.
So that's what we're thinking about and
if anyone is interested in following up,
Digital New Zealand and myself are pretty easy to find on the web and please get in touch. Thank you.