The 20-Petabyte Idea

A conversation with Brewster Kahle about the Internet Archive, the power of thinking big, and the quest for universal access to knowledge

Brewster Kahle (Photo: Rudy Rucker)

Brewster Kahle is one of those people you want to talk to all night—and for a few decades after that.

He is the Founder and Digital Librarian of the Internet Archive, which he established in 1996 with the ultimate vision of creating universal access to all knowledge. The archive now consists of more than 20 petabytes of data and counting—that’s more than 20 million gigabytes—including books, music, television, software and more than 491 billion web pages. The Rita Allen Foundation is supporting the Internet Archive’s Political TV Ad Archive, which is providing an unprecedented resource for journalists, scholars and others interested in understanding the current campaign.

I recently reached out to Brewster for a conversation to explore the topic of big ideas—and how they have the power to transform.

—Elizabeth Good Christopherson
President and Chief Executive Officer
Rita Allen Foundation

What led you to your big idea—your passion for universal access to knowledge?

That’s easy. It was the confluence of two things. One is that it was clear by 1980 that the technology for storing all the published works of humankind would soon be available. The other was the optimism or idealism of the time, and also of me, to say, “What could we do with a technical background that would have a big impact?” The idea of universal access to all knowledge seemed like a really good life goal.

Who was influential in shaping what would become your life’s work?

Marvin Minsky, one of the founders of AI [artificial intelligence], and Danny Hillis [cofounder of Thinking Machines]. When I was at MIT they encouraged me, and I think everyone around them, to think big. The idea of achieving your goal is overrated. Pick a big, high goal that you can make progress on. Have the gumption to say, “I can make progress with this, even though I may not do it all and I may not succeed in my lifetime.”

Picking a big goal means that lots of people can cooperate without having to all work for each other. You don’t necessarily need one organization. You can say, “I’m going to take my piece and try to move it forward and try to get lots of other people to go in that same direction.”

The term “universal access to all knowledge” is from another inspiring person named Raj Reddy, who is the Dean of Computer Science at Carnegie Mellon. After I had been working on what I think of as universal access to all knowledge for a long time, he said those words, and that seemed like the right way to put it. So all of these—it’s a collage of other people’s ideas and help.

I’m not a subscriber to the idea of a lone genius. It takes a world, and many people who often aren’t visible, to try to get significant things done. I’ve certainly benefited from that help.

The people working in the nonprofit high-tech sphere are a special group, because they’re sitting out the gold rush. They’re not going to make millions of dollars by working on a nonprofit like the Internet Archive or Wikipedia or Mozilla. They’re going to have a fine life, but they’re not going to be driving Teslas.

And maybe that’s an okay thing, right?

It seems to be for a certain set of people, and I find that very inspiring.

Where did the idea for the Internet Archive come from?

By the mid-’90s publishing had largely been put online, and we also had established an open infrastructure, which turned out to be the World Wide Web. Those functions were working pretty well. We had commercial participation in an open system. So then there was the question: What to do now?

I looked around, and we hadn’t really achieved the big goal of having a library. In fact, we could just start now.

The Internet Archive, which now consists of more than 20 petabytes of data, is housed in a Greek Revival building in San Francisco that was once a Christian Science church. (Photo: Rudy Rucker)

So we started in 1996 to try to at least make the published works of humankind permanently available. We started with the World Wide Web, collecting every page from every website every two months. Once that was moving along well, we began recording television in 2000 (Russian, Chinese, Japanese, Iraqi, Al Jazeera, BBC, CNN, ABC, Fox) and then moved to music and movies and books.

We started with the archiving function and then have been trying to make it as accessible as we can.

Where do you see the Internet Archive going in three, five, seven, ten years?

Hopefully we not only have copies of everything required for a K–12 and an undergraduate education, but are making it available to anybody in the world who wants access. By 2020, gosh darn, we shouldn’t be talking about building a digital library, we should have built a digital library.

What are some of your greatest challenges?

As Mike Lesk, whom I think of as the father of digital libraries, put it, the biggest challenges are institutional responsibilities. Different institutions don’t know what role they should play in this new world, because digital materials have different characteristics than physical materials. What should public institutions (libraries, universities, national libraries) do? That’s, at the heart of it, the biggest challenge.

Looking down another level, there are copyright issues, which are really an encoding of business models, some of which are a little antique at this point. Those are problematic. And then funding. The good news is, if you wanted to digitize the books, music, video and software that are required for a K–12 education and undergraduate research, it’s on the order of 100 to 150 million dollars, one time. Then there’s the question of how easily available you can make that. That’s a technological and a rights issue, but the rights issues are going very much in the public’s direction.

People want the Library of Alexandria, especially if it doesn’t hurt rights holders. Now it’s coordinating the efforts, money, access to materials.

We’re also thinking about how the whole concept of knowledge changes in a digital environment. You have talked about the danger of foundational information disappearing. Besides the issue of access, why is this so important?

If the past just disappears or can be disappeared by the powerful, we’re living in an Orwellian world. You may not want to always dwell in the past, but you certainly don’t want it taken from us.

Fortunately, it’s relatively easy to keep materials at this point. There are certain threats to the open environment that we have now, but right now are the good old days. Let’s do what we can to make sure we have a record of our time and demonstrate that it doesn’t have a negative impact on commercial endeavors. In fact, it might really help.

Preservation and access go hand in hand.

How did the Political TV Ad Archive come about, and what are you learning from it?

We made the general web, general television accessible so you could search based on what people said. We thought that was just going to cause everyone to perk up and see the world in a new way. It’s been useful, but…world peace has not broken out because of it.

What do people really want out of television? What understanding? Roger Macdonald, the director of the Television Archive, thought we could dive deeper on a smaller subject of more importance to people. He has led a four-year project to bring to light the veracity and the ubiquity of U.S. political ads.

The first time we did this, based on experiments in midterm elections in Philadelphia, we felt we were able to do the work technically, but we weren’t able to present it in a form that journalists knew what to do with. We had a couple of results that we put out as kind of press releases, and those were picked up, but that wasn’t a success from our perspective.

With this presidential election, we spent a lot more time and effort trying to massage the data into usable form and reaching out to very busy journalists. Now there are data journalists in almost every large news organization, and they’ve been able to run with these data in ways that we hadn’t expected. That is success. That’s been exciting.

Do you have an example?

There are two that were really kind of fun. One was done by an independent researcher, Kalev Leetaru, who analyzed the mentions of the different candidate names on television news. Trump was mentioned so much more than everybody else combined. He was just terrific at basically playing the news. He stayed front and center even though he didn’t have to pay for it.

It’s something that we knew, but to see the stats right there gives you an overview. As my friend Jesse Ausubel puts it, a macroscope.

I like it.

You get a broad idea of what’s going on, which is often very difficult. We’re down in the weeds.

A variety of news outlets have used data from the Political TV Ad Archive in their coverage of the 2016 presidential campaign. The Atlantic magazine used it to create a video game demonstrating the deluge of campaign ads directed at Iowa voters before the caucuses.

Another one was a very fun little video game done by The Atlantic magazine, which showed the ads coming at Iowa voters. You couldn’t flip channels fast enough to dodge the ads, they were coming so hot and heavy. For those of us who live in non-battleground states, it’s hard to understand the tsunami of paid speech that is in the American electoral system at this point.

When you think about your vision of universal access to knowledge, do you also think about how people will understand and interpret the information, or why they will want to search for it in the first place?

We had hoped that people would repurpose our data sets in bulk and come up with a new kind of Google every month because they were using these materials. But that didn’t happen. We found that you really needed to go that extra mile to try to get them in a form and format that other people could use.

I’m still hopeful that people will have tools that are more powerful than the ones we have today for wrestling with big data. We’re starting to see data scientists use our data, and that’s very exciting to help inform the building of artificial intelligences, to help better inform a broadening population around the world that is hungry for information.

So I’m still hopeful. But it’s been harder than we thought to make large data sets useful to large numbers of people.

How do you think this new future of knowledge sharing might change society?

I think we take some of the best of elite universities, which were all built on sharing and building on each other’s works, and allow everyone to have that kind of empowerment. Or we take some of the tools that were only available to the most powerful and make them available to everybody.

When I was growing up, I felt that I was fed textbooks that were really somebody else’s agenda. I think our new world can allow us to be informed by lots of different points of view that allow us to find our own path informed by the good works of others.

If we can pull that off, we’ve done something that is in the realm of the Library of Alexandria or the man on the moon. I think we could look back on our generation and say that we really achieved something new and different, in traditions that have always existed, but made it happen for everyone around the world. That is an exciting kind of spring-out-of-bed motivation for me.
