As part of Industry Insider — California’s ongoing efforts to inform readers about state agencies, their IT plans and initiatives, here’s the latest in our periodic series of interviews with departmental IT leaders.
Mary Ann Bates is executive director of the Office of Cradle-to-Career Data, a role she has had since February 2022. She was previously a senior fellow in the White House Office of Management and Budget from May 2021-January 2022; and was executive director of the North America Abdul Latif Jameel Poverty Action Lab at the Massachusetts Institute of Technology, from July 2017-November 2021, during more than 11 years at the Lab.
Bates has a bachelor’s degree in international studies and English from Denison University, where she was summa cum laude and Phi Beta Kappa as well as class valedictorian; and a master’s of public policy from the University of California, Berkeley, where she was a Richard and Rhoda Goldman Fellow. She spoke to Industry Insider for the One-on-One series in September 2022, the year she was named director; and in November, after announcements from the Office and Gov. Gavin Newsom’s office that the Office had received more than 1 billion data points from state entities and had begun to leverage them to drive equity and access in education.
Industry Insider — California: What’s most important to know about the marking of this milestone — more than 1 billion data points submitted?
Bates: So much. I’ll highlight a couple of things. I think the first and main point is that this data system has been very long-awaited. California leaders, educators, advocates have pushed for this kind of data system, that brings together information in a way that’s useful and actionable for people — have pushed for that for many, many years. And that’s part of what makes what happened in October so encouraging, because we’ve been building this work as an office for the last year and a half. But October marked the moment where, for the first time, so many data providers submitted their first data submissions to us. And that’s really what is needed to now move to the next phase of our work and bring that information together, stitch it together so that that can provide the foundation for the public-facing resources and tools that we have promised to develop. That’s why it was incredibly exciting and important. I think it’s really hard to overstate just how much collaboration, high-trust relationships, coordination that required, across our data providers who’ve been really wonderful in sharing their expertise, their team’s time and working with our team to prepare for that data submission. And then following through and sending the data our way. It’s an incredibly important milestone. Many states have an education-focused state longitudinal data system of some type. California hasn’t had that, so this is long-awaited and we’re excited to be on the way. I think it’s really important to highlight the fact that the point of this data system is to ensure that every Californian has the freedom to succeed. It’s to ensure that we’re putting out information in such a way that it’s useful to folks here in Sacramento, yes. But even more importantly, to students, families, educators who are making decisions every day about their education pathways, about what opportunities they’re pursuing and how. 
That means the data submitted by our partners last month will serve as a foundation for the analytical tools that we have been talking about. So, dashboards that illuminate stories from this data. A query builder where people can zoom in and say, “What does this look like for my community?” Those kinds of tools are really what we will now use this data to build.
IICA: How should readers think of a data point in this instance, i.e., what is a data point here or what are examples of what it can be?
Bates: An over-simplified way to think about it is, think of a data point as a cell in a spreadsheet. For example, imagine our de-identified data set for person No. 1, 2, 3, 4, right? Each might have a row. And then you have different columns. And you think about this as, here’s an indicator that shows whether they applied to this four-year university. Were they admitted? Were they enrolled? Did they earn a degree? Think of each of those cells, if you will, as a data point. Given the size of California, given that for many of these data elements we received up to a decade or more of data, you can see how the numbers start multiplying quite quickly and how we got to the large number of data points submitted. I think it’s very important for us to be clear with the public about the fact that we are not collecting new data. We’re linking together data that has long already been collected and validated at the state level by state-level entities. We have data available in K-12 education at the Department of Education in a number of wonderful data resources. We have the higher education data. You have information about labor market statistics here in the state of California. But when each of those domains exists within a silo, it’s hard to see the picture and trajectory. And that’s the story that I believe will be really powerful.
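Editor’s note: The spreadsheet analogy above can be sketched in a few lines of Python. This is an illustration only: the records, the column names and the `count_data_points` helper are hypothetical and are not drawn from the Office’s actual data set or schema.

```python
# Hypothetical de-identified records: one row per person, one column per indicator.
# The column names are invented for illustration.
records = [
    {"person": 1, "applied": True, "admitted": True, "enrolled": True, "earned_degree": True},
    {"person": 2, "applied": True, "admitted": True, "enrolled": False, "earned_degree": False},
    {"person": 3, "applied": True, "admitted": False, "enrolled": False, "earned_degree": False},
    {"person": 4, "applied": False, "admitted": False, "enrolled": False, "earned_degree": False},
]


def count_data_points(rows, id_column="person"):
    """Count every indicator cell, skipping the column that only labels the row."""
    return sum(1 for row in rows for column in row if column != id_column)


print(count_data_points(records))  # 4 people x 4 indicators = 16
```

Four people and four indicators yield 16 cells. The same multiplication across a state’s population, many indicators and a decade or more of records is how a total can climb quickly.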
IICA: Would you be able to offer a bit of context to this being the largest data processing and integration in state history? One billion data points sounds very large; have any other integrations approached this number?
Bates: I want to be very careful to not make claims that this has never been done before in [history], in terms of the number of data points. There are a lot of data integrations that happen at the state level, as you can imagine, even during the pandemic. I think what’s really important here is not, “Has this number of data points ever been brought together before?” What is unique about this effort is that this is, to my knowledge, the largest and most comprehensive linkage of data for these purposes. Having an education-focused state longitudinal data system here in California that links together not only data from K-12 and higher education, but also some important other indicators that can help us see well-being beyond the education sector. It’s very important to have earnings data so that we have information on living wages. Very important that we have some of the important health and human services indicators that can help illuminate well-being more broadly. This is California’s debut, if you will, in having this kind of coverage as a state longitudinal data system. We’re linking data from a large number of entities from the get-go in a way that is quite ambitious. And the scope of how we’re planning to make that information available is also very ambitious. We’re not linking this data together just so that data analysts within government can do important analysis, to inform our work internally. We really have a focus on making sure this information is available to the public via our website in ways that are meaningful for them. The other answer to this, which I think is really important, is that with the Cradle-to-Career effort we’re building this data infrastructure to be able to link a very large amount of data on an ongoing annual basis, to enable many, many different use cases of people using the data.
And I say that because it’s not that data integration across the state has been impossible in the past or isn’t done. It is done. But historically, sharing data across state entities has been done on a case-by-case basis for a specific purpose. And it takes a long time to navigate a legal framework each time, to navigate the data infrastructure each time. And so that’s part of why historically silos have remained silos, because it takes so much effort to bring the data together across silos. So, part of what we’re doing from the Cradle-to-Career side that I find really exciting, is that we’re really lowering the transaction costs for many parties to integrate their data with each other and free the data from its silos. So, I think of the work that our team is doing as a lot of upfront effort that then over the years will really pay dividends in having the ability to much more seamlessly draw insights from linked data, without having to negotiate a new legal agreement each time. Without having to build new data linkage infrastructure each time.
Editor’s note: Find more on data points here.
IICA: What should readers know about the data providers making these submissions, i.e., how many entities have signed agreements and are providers, and did any one provider or any one category provide substantially more data than the others?
Bates: The legal agreements that are the foundation for the data submission that just happened were signed in May of 2022. There were a total of 16, including Cradle-to-Career, who signed those legal agreements. Now, that included a few coordinating entities who don’t directly submit data like the Association of Independent California Colleges and Universities. They don’t directly collect data from their members that they’ll submit to us, but they have a seat on our governing board. We’re looking for ways to bring in data from non-public universities in the future. That’s one example to give you the background context on what I’m saying more broadly here. We signed legal agreements with 16 total. Nine are providing data in this first phase. A few more are scheduled to provide data in future years, and you can see that schedule in our five-year timeline that’s linked in the press release. And then a few are umbrella entities. For example, CHHS [the California Health and Human Services Agency] signed the legal agreements. The actual entities sending data our way are the Department of Health Care Services, for instance. That’s why the total number of signatories is larger than the number of entities who are initially sharing data. The ninth [entity sharing data] will come next early in 2024. That’s the Employment Development Department. To the second part of your question about one category providing substantially more than the others: Across the number of data points, the data from K-12 education, from the Department of Education, and from our public higher education segments — that’s community colleges, California State University and University of California — those K-12 education and higher education data points anchor the data system by far in terms of the number of data points. That’s the bulk of it. But again, what makes this effort so powerful is that we’re connecting that with non-education data. 
So, for example, a lot of the questions that were prioritized in the initial legislation that launched the planning process for our office, and then were prioritized during that planning process, centered on whether people are earning living wages. And so, even though the Employment Development Department is not sending us a large number of data points, the ones they are sending us, which help us get information on living wages via the unemployment insurance data that shows employment information and wages, will provide an anchor for a lot of really important questions. What’s really powerful is not only this K-12 and higher education data, but also the other kinds of data, like earnings data that gives us a crucial part of understanding holistically how California’s people are doing.
Editor’s note: Find the May 2022 news release from Gov. Newsom’s office detailing signed data agreements here.
IICA: When we last spoke in 2022, the system was in early stages. What stage is it in now and what sort of work is taking place?
Bates: I would say that we’re still in the early stages of building the data system. We just received our very first data submissions, but it’s a very important milestone. And now that we have the data from our partners, the very next step, the work we’re in the middle of right now, is preparing to link the data that had previously been siloed so that we can paint a larger picture and see the larger trends. That data linkage is up next. And doing a lot of work with our data providers on ensuring that we understand all the nuances of the data that they’ve shared with us. So that we can helpfully and accurately share the information in an integrated way. Data integration is the first next step. The second next step is folding in the employment data from the Employment Development Department at the beginning of 2024. And then the next milestones that the public will be able to kind of feel and see will be the early beta versions of dashboards that emerge from this data that we’re bringing together. Early dashboards, we’re hoping to be able to release next year. And we have a lot of work planned to make sure we’re listening to communities and hearing from them what questions they most want to answer; and then how to ensure that these dashboards are maximally useful for all of our many communities here in California. And then after those early dashboards, the next thing on our to-do list is what we’re calling our query builder. This is the choose-your-own-adventure version of accessing the data where people can generate their own tables using our website. Again, aggregate level; we want to emphasize this is about de-identified data where people can see trends for groups of people over time, not individuals. But that will enable people to disaggregate the data. To zoom in and say, what does this look like for my community? What does this look like when we break things down by geography or when we break things down by demographics?
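Editor’s note: As a rough sketch of the two ideas described above, linking previously siloed records and then disaggregating the result by group, consider the Python example below. Everything in it is an assumption made for illustration: the two tiny data sets, the shared de-identified key, the field names and the `link` and `enrollment_counts` helpers do not represent the Office’s data, matching method or query builder.

```python
from collections import defaultdict

# Two hypothetical silos keyed by the same de-identified person key.
k12 = {
    "a1": {"county": "Fresno"},
    "b2": {"county": "Fresno"},
    "c3": {"county": "Alameda"},
    "d4": {"county": "Alameda"},
}
higher_ed = {
    "a1": {"enrolled": True},
    "c3": {"enrolled": True},
    "d4": {"enrolled": True},
}


def link(k12_rows, higher_ed_rows):
    """Join the two silos on the shared key; the key itself is dropped from the output."""
    linked = []
    for key, k12_row in k12_rows.items():
        college_row = higher_ed_rows.get(key, {})
        linked.append({
            "county": k12_row["county"],
            "enrolled": college_row.get("enrolled", False),
        })
    return linked


def enrollment_counts(linked_rows, group_by):
    """Aggregate to group-level counts so that no individual row is reported."""
    counts = defaultdict(lambda: {"enrolled": 0, "total": 0})
    for row in linked_rows:
        group = counts[row[group_by]]
        group["total"] += 1
        group["enrolled"] += int(row["enrolled"])
    return dict(counts)


by_county = enrollment_counts(link(k12, higher_ed), group_by="county")
```

Here `by_county` holds one enrolled count and one total per county, which is the kind of group-level breakdown the “what does this look like for my community?” question calls for.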
Editor’s note: Find more information on planned dashboards here.
IICA: Are there public-facing resources in the system, such as dashboards, that you’d like to highlight?
Bates: There were a number of dashboards that were mapped out and planned for during the planning process. One of the first dashboards that we’re planning to work on has what we’re calling the pathways diagram. It will essentially illuminate the path that students take from K-12 to higher education to entering the workforce and what that looks like across time. So, a flow diagram. And then also having that illuminate which pathways lead to living-wage jobs, employment and opportunity. That’s one example. There was also a dashboard that was planned during the planning process around teacher training. There’s a lot of interest in understanding how teacher training programs are preparing teachers for their work in schools. That’s another one that we’re working on in the early stages here. Those are kind of the first planned dashboards, and there are a number of others on questions like transferring from community colleges to four-year institutions and those kinds of questions, that we hope to build over time. We’ll start with one at a time and then build each of these dashboards into data stories that people can engage with in different ways. One thing I want to highlight that’s separate from this analytical data system that we’ve been talking about today are the resources available to students and parents at CaliforniaColleges.edu. We partner with CaliforniaColleges.edu. It’s a separate effort from the data side. It’s not connected to this analytical data, but it’s the state’s one-stop shop for college and career planning and enables students to navigate applying for financial aid, applying for college, especially, navigating some of the more complicated parts of applying to four-year public universities in California. CaliforniaColleges.edu is live; they’re already working with high schools, serving just over half of the high school students at public high schools in the state.
Editor’s note: Find CaliforniaColleges.edu here; find the System’s college and career planning tool here.
IICA: From an IT perspective, what has been the most challenging aspect of the work so far?
Bates: I think there are many answers to that question. I would start by saying that aligning our work with the work of so many different data partners to accept and manage the data with different unique needs of the different data providers; and aligning all of that work at once to be able to accept their data in the same month is a challenge. And I mention that because there is a technical answer to your question, right? We could talk about our tech stack and some of the really exciting things we’re doing from the IT side. But I think what I want to underscore here is the importance of doing every step of the work collaboratively with our data partners, collaboratively with the public who’s holding us accountable for delivering what we promised. So, starting with the pre-planning of defining the purpose and scope of the system. Defining the actual specifics of our security and privacy protocols and refining those in collaboration with the data providers. All of that speaks to essentially the continued importance of the governance structures that oversee this work. And the importance of the continued efforts on trust-building with our many data partners here within the state and with the many communities and individuals and the public we serve.
IICA: As a result of the work to date, what is one or more takeaways that you would offer to other entities that might be contemplating standing up a data system of their own?
Bates: I think my most important takeaway for others who are contemplating standing this up is that the road bumps these data-sharing efforts face, the things that cause them to fail, often have less to do with IT challenges and more to do with governance and trust-building. And they need to have alignment on the purpose of why the data is being brought together. What are the use cases? Who will be able to use the information for what purposes? Working together on those important questions is such an important foundation. And if that foundation is strong, then we can evolve over time and continue to solve the technological challenges.
IICA: Considering the uniqueness of some of these data streams, are teams continually mindful of the need to handle them with care?
Bates: Absolutely. Being very careful with the information that we’ve been trusted with is top of mind for our entire team. I’m also very grateful that an almost two-year planning process preceded the launch of our office. And so, the list of data points that were submitted to us was not drawn up in a hurry. It was the result of a very thoughtful and deliberative process that started by saying, “Here are the key questions we want to answer. What data points can feasibly be shared that would enable people to answer those questions?” And excluding data points where it didn’t make sense to include them. Yes, we received a lot of data; and also not everything. And those choice points are incredibly important: What data are appropriate for a data system like this? How can we ensure that the products that we put out enable people to see the big-picture trends across groups of individuals while protecting individual privacy?
IICA: Anything you’d care to add on privacy and security?
Bates: I would emphasize again that Cradle-to-Career is not collecting any new data. We are just bringing together data that’s already been collected and validated at the state level, to ensure that that information can be maximally useful to the people here in California. A couple of points. We’re building this data system subject to all of the relevant state and federal laws concerning data privacy and security. And then as I highlighted as well, the foundation for those dashboards, the foundation for the query builder is a de-identified data set where the names, identifying information, all that has been stripped out. It’s a de-identified data set. And there was a lot of thoughtful work that began in the planning process already, to ensure we have detailed plans for the right data suppression and privacy-protecting protocols that are built into how we’re building the system. That was the conversation well before the launch of our office, and it has continued. We have a standing task force specifically on that topic that includes all of our data partners, to ensure that they’re collaboratively working with us to ensure that the data that they steward and that we steward with them is protecting people’s privacy effectively.
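Editor’s note: The two protections mentioned above, stripping identifying information and suppressing data in published results, can be illustrated with a minimal Python sketch. It is not the Office’s protocol. The field names, the threshold of 10 and both helper functions are assumptions chosen for illustration, and real de-identification generally involves more than dropping a few fields.

```python
# Hypothetical list of identifying fields; a real system would define its own.
IDENTIFYING_FIELDS = {"name", "date_of_birth", "address"}


def strip_identifiers(record):
    """Return a copy of the record without the identifying fields."""
    return {field: value for field, value in record.items()
            if field not in IDENTIFYING_FIELDS}


def suppress_small_cells(counts, threshold=10):
    """Replace any group count below the (assumed) threshold with None,
    so that very small groups are not published."""
    return {group: (n if n >= threshold else None) for group, n in counts.items()}


record = {"name": "Example Person", "date_of_birth": "2000-01-01", "earned_degree": True}
print(strip_identifiers(record))                               # {'earned_degree': True}
print(suppress_small_cells({"Group A": 250, "Group B": 3}))    # {'Group A': 250, 'Group B': None}
```

Suppressing small groups is one common way to keep a published table from pointing to a handful of individuals; the appropriate threshold is a policy decision and is not specified in this interview.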
Editor’s note: Find the System’s five-year work plan here. This interview has been lightly edited for style and brevity.