Category Archives: Insights

Welcome to the New Database Era

A new category of cloud database services is emerging

One of the most profound and perhaps non-obvious shifts driving this is the emergence of the cloud database. Services such as Amazon S3, Google BigQuery, Snowflake, Databricks, and others have solved computing on large volumes of data and have made it easy to store data from every available source. Enterprises want to store everything they can in the hopes of being able to deliver improved customer experiences and new market capabilities.

It has been a good time to be a database company.

Database companies have raised over $8.7B over the last 10 years, with almost half of that, $4.1B, just in the last 24 months, up from $849M in 2019 (according to CB Insights).

It’s not surprising, with the sky-high valuations of Snowflake and Databricks and $16B in new revenue up for grabs in 2021 simply from market growth. A market that doubled in the last four years to almost $90B is expected to double again over the next four. Safe to say, there is a huge opportunity to go after.

See here for a solid list of database financings in 2021.

Twenty years ago, you had one option: a relational database.

Today, thanks to the cloud, microservices, distributed applications, global scale, real-time data, deep learning, and more, new database architectures have emerged to hyper-solve new performance requirements: different systems for fast reads and fast writes; systems built specifically to power ad-hoc analytics, or for data that is unstructured, semi-structured, transactional, relational, graph, or time-series; and still others for caching, search, indexes, events, and more.

Each came with different performance needs, including high availability, horizontal scale, distributed consistency, failover protection, partition tolerance, serverless operation, and fully managed offerings.

As a result, enterprises on average store data across seven or more different databases (e.g., Snowflake as your data warehouse, ClickHouse for ad-hoc analytics, Timescale for time-series data, Elastic for search data, S3 for logs, Postgres for transactions, Redis for caching or application data, Cassandra for complex workloads, and Dgraph for relationship data or dynamic schemas). That’s all assuming you are colocated in a single cloud and that you’ve built a modern data stack from scratch.

The level of performance and guarantees from these services and platforms is unparalleled compared to 5–10 years ago. At the same time, the proliferation and fragmentation of the database layer are creating new challenges: syncing across different schemas and systems, writing new ETL jobs to bridge workloads across multiple databases, constant cross-talk and connectivity issues, the overhead of managing active-active clustering across so many different systems, and data transfers when new clusters or systems come online, each with different scaling, branching, propagation, sharding, and resource requirements.

What’s more, new databases emerge monthly to solve the next challenge of enterprise scale.

The New Age Database

So the question is, will the future of the database continue to be defined as what a database is today?

I’d make the case that it shouldn’t.

Instead, I hope the next generation of databases will look very different from the last. They should have the following capabilities:

  • Act primarily as compute, query, and/or infrastructure engines that sit on top of commodity storage layers.
  • Require no migration or restructuring of the underlying data.
  • Require no rewriting or parsing of queries.
  • Work on top of multiple storage engines, whether columnar, non-relational, or graph.
  • Move the complexity of configuration, availability, and scale into code.
  • Allow applications to call into a single interface, regardless of the underlying data infrastructure.
  • Work out of the box as a serverless or managed service.
  • Be built for developer-first experiences, in both single-player and multiplayer modes.
  • Deliver day-0 value for both existing (brownfield) and new (greenfield) projects.

There are many secular trends driving this future:

1. No one wants to migrate to a new database. The cost of every new database introduced into an organization grows roughly as an N² problem with the number of databases you already have. Migrating to a new architecture, schema, and configuration, then re-optimizing for rebalancing, query planning, scaling, resource requirements, and more, often yields a value/(time + cost) ratio close to zero. It may come as a surprise, but there are still billions of dollars’ worth of Oracle instances powering critical apps today, and they likely aren’t going anywhere.

2. The majority of the killer features won’t be in the storage layer. Separating compute and storage has increasingly enabled new levels of performance, allowing for super cheap raw storage costs and finely tuned, elastically scaled compute/query/infra layers. The storage layer can sit at the center of the data infrastructure and be leveraged in various ways, by multiple tools, to solve routing, parsing, availability, scale, translation, and more.

3. The database is slowly unbundling into highly specialized services, moving away from the overly complex, locked-in approaches of the past. No single database can fully solve transactional and analytical use cases, with fast reads and writes, high availability and consistency, all while solving caching at the edge and horizontally scaling as needed. But unbundling into a set of layers sitting on top of the storage engine can introduce new services that deliver new levels of performance and guarantees. For example: a dynamic caching service that optimizes caches based on user, query, and data awareness; sharding managed based on data distribution, query demand, and data change rates; a proxy layer to enable high availability and horizontal scale, with connection pooling and resource management; a data management framework to solve async and sync propagation between schemas; or translation layers between GraphQL and relational databases. These multi-dimensional problems can be built as programmatic solutions, in code, decoupled from the database itself, and perform significantly better.

4. Scale and simplicity have been trade-offs up until now. Postgres, MySQL, and Cassandra are very powerful but difficult to get right. Firebase and Heroku are super easy to use but don’t scale. The former have massive install bases and robust engines, and have withstood the test of time at Facebook- and Netflix-level scales. But tuning them for your needs often requires a Ph.D. and a team of database experts, as teams at Facebook, Netflix, Uber, and Airbnb all have. For the rest of us, we struggle with consistency and isolation, sharding, locking, clock skews, query planning, security, networking, and more. What companies like Supabase and Hydras are doing, leveraging standard Postgres installs but building powerful compute and management layers on top, allows for the power of Postgres with the simplicity of Firebase or Heroku.

5. The database index model hasn’t changed in 30+ years. Today we rely on general-purpose, one-size-fits-all indexes such as B-trees and hash maps, taking a black-box view of our data. Being more data-aware, such as leveraging a cumulative distribution function (CDF) as we’ve seen with learned indexes, can lead to smaller indexes, faster lookups, increased parallelism, and reduced CPU usage (a minimal sketch of the idea follows this list). We’ve barely begun to see next-generation indexes that adapt to both the shape of our data and how it changes.

6. There is little to no machine learning used to improve database performance. Instead, today we define static rule sets and configurations to optimize query performance, cost modeling, and workload forecasting. These combinatorial, multi-dimensional problem sets are too complex for humans to configure and are perfect machine learning problems: resources such as disk, RAM, and CPU are well characterized, query history is well understood, and data distribution can be defined (a second sketch after this list illustrates the idea). We could see 10x step-ups in query performance, cost, and resource utilization, and never see another nested loop join again.

7. Data platform and engineering teams don’t want to be DBAs, DevOps, or SREs. They want their systems and services to just work, out of the box, and not have to think about resources, connection pooling, cache logic, vacuuming, query planning, updating indexes, and more. Teams today want a robust set of endpoints that are easy to deploy, and just work.

8. The need for operational real-time data is driving a need for hybrid systems. Transactional systems can write new records into a table rapidly, with a high level of accuracy, speed, and reliability. Analytical systems can search across a set of tables and data rapidly to find an answer. With streaming data and the need for faster responsiveness in analytical systems, the idea of HTAP (hybrid transactional/analytical processing) systems is emerging, particularly for use cases that are highly operational in nature, meaning a very high volume of new writes/records and more responsive telemetry or analytics on business metrics. This introduces a new architectural paradigm, where transactional and analytical data and systems start to reside much closer to each other, but not together.
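To make point 5 above concrete, here is a minimal, illustrative sketch of the learned-index idea: a one-segment linear model approximates the CDF of a sorted key set, predicts a position, and a short bounded search corrects the prediction. Real learned indexes use piecewise or hierarchical models and tighter error bounds; the class, data, and error handling here are simplified assumptions for illustration.

```python
# A minimal sketch of a learned index: approximate the CDF of sorted keys with a
# linear model, predict a position, then correct with a bounded local search.
import bisect

class LearnedIndex:
    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        n = len(sorted_keys)
        # Fit position ~ slope * key + intercept (a one-segment CDF approximation).
        k_min, k_max = sorted_keys[0], sorted_keys[-1]
        self.slope = (n - 1) / (k_max - k_min) if k_max != k_min else 0.0
        self.intercept = -self.slope * k_min
        # Track the worst prediction error so lookups only search a small window.
        self.max_err = max(abs(self._predict(k) - i) for i, k in enumerate(sorted_keys))

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lookup(self, key):
        guess = self._predict(key)
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)  # search only the error window
        return i if i < len(self.keys) and self.keys[i] == key else None

keys = list(range(0, 1_000_000, 7))  # roughly uniform keys -> tiny error window
idx = LearnedIndex(keys)
assert idx.lookup(700) == 100
assert idx.lookup(701) is None
```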
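And for point 6, a minimal sketch of what a learned approach could look like: a regression model trained on query history predicts latency from simple plan features, and the optimizer-like step simply picks the cheaper predicted plan. This assumes NumPy and scikit-learn are available; the features, training data, and candidate plans are synthetic placeholders, not drawn from any real optimizer.

```python
# Treat plan selection as a learned cost model rather than static rules.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Features per executed query: [rows_scanned, index_used (0/1), joins, cache_hit_ratio]
X = rng.random((500, 4)) * [1e6, 1, 5, 1]
# Synthetic "observed latency": scans and joins hurt, indexes and cache hits help.
y = 0.002 * X[:, 0] * (1 - 0.9 * X[:, 1]) + 50 * X[:, 2] - 30 * X[:, 3] + rng.normal(0, 10, 500)

model = GradientBoostingRegressor().fit(X, y)  # learn latency from query history

# Score two candidate plans for the same query and pick the cheaper prediction.
candidates = np.array([
    [800_000, 0, 2, 0.2],  # sequential-scan plan
    [800_000, 1, 2, 0.2],  # index-scan plan
])
predicted_ms = model.predict(candidates)
best = candidates[int(np.argmin(predicted_ms))]
print(f"predicted latencies (ms): {predicted_ms.round(1)}, chosen plan: {best}")
```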

A New Category of Databases

A new category of cloud database companies is emerging, effectively deconstructing the traditional database monolith into core layered services: storage, compute, optimization, query planning, indexing, functions, and more. Companies like ReadySet, Hasura, Xata, Ottertune, Apollo, Polyscale, and others are examples of this movement and are quickly becoming the new developer standard.

These new unbundled databases are focused on solving the hard problems of caching, indexes, scale, and availability, and are beginning to remove the trade-off between performance and guarantees: fast, always-on, data-aware databases handling massive scale, blurring the traditional divisions between operational and analytical systems. The future looks bright.


Welcome to the New Database Era was originally published on TechCrunch.

Source: https://techcrunch.com/2022/06/01/this-is-the-beginning-of-the-unbundled-database-era/

Running Through Walls: Culture and Connection

Venrock partner Brian Ascher speaks with Nicole Alvino, co-founder and CEO of Firstup, about effective workforce communication both digitally and in person. Alvino highlights fostering employee connection through authentic and transparent communication, using the power of video to drive engagement, and discusses expectations for corporate social responsibility. Plus, Alvino reveals a non-obvious lesson learned – employees want to hear from their peers (not just their leaders).


2022 Healthcare Prognosis

When we published our 2021 survey results, vaccines had yet to be broadly released, Tiger Global was still a growth investor and SPACs were en vogue (at least the raising part…). 2021 respondents overwhelmingly correctly predicted a booster would be available before vaccine protection waned, SPACs would return to being a niche product, and that a return to office would include a hybrid approach. However, the Delta and Omicron variants threw a curveball at us and subsequently forced mask mandates back into effect. Covid continues to be unsettling with BA.2 now taking hold and resulting in Philadelphia reinstating an indoor mask mandate on April 11, 2022.

Three months into 2022, the world seems even more unpredictable. While Covid cases have fallen in most of the world from Omicron peaks, there are worrisome signs in China and a giant reservoir of unvaccinated people which could give rise to more variants. Russian aggression and hostilities understandably are top of mind given the worldwide human, economic and political ramifications. Once again, we asked our network of experts for their opinions on the latest trends in health tech, world issues, and the outlook for Covid-19.

Our commentary on the most interesting findings from this year’s survey, followed by the full results, can be found here. Thank you to the hundreds of healthcare experts who took the time to share their views and opinions with us.

Aligning Incentives for Improving Diagnostic Excellence

This article first appeared in JAMA Network. It is co-authored with Ezekiel J. Emanuel, MD, PhD.

Diagnostic excellence is a priority of both patients and individual clinicians, yet does not seem to be afforded the same attention by health care systems. Autopsy data from 2 Swedish hospitals revealed that 30% of 2410 cases had clinically significant undiagnosed diseases.1 A review of methods used to estimate the rate of diagnostic error suggested that diagnostic errors are more common than medication errors.2 Even for conditions that are considered “easy” to detect, such as hypertension, 1 estimate suggested that 10% of US adults may have high blood pressure that is undiagnosed.3 Failure to make diagnoses expeditiously leads to prolonged uncertainty and may result in costly unnecessary tests and procedures, delayed treatment, and increased risk of morbidity and mortality. If diagnoses are forgone in favor of empirical treatment, patients may receive ineffective or potentially harmful treatments. Given this, a key question is why physicians would take shortcuts on diagnostic workups and rely on guesses and intuition when initiating treatments when it is possible to make accurate diagnoses and deliver evidence-based care.

An important factor that may be contributing to these issues regarding diagnosis is incentives. The US fee-for-service health system pays for tests and treatments, not for diagnostic reasoning or accuracy, and does not make a distinction about whether payment is for the appropriate diagnostic tests or whether treatment selections are based on a correct diagnosis. Virtually all activities related to testing and treatment, including those associated with errors, are reimbursable. Fee-for-service reimbursement rewards empirical treatment; even if the initial treatment is incorrect, the next treatment is reimbursable too. It also rewards nonparsimonious diagnostic workups and consultations. A fee-for-service system does not pay more when decision support tools are used and does not consider timeliness of diagnosis and initiation of efficacious treatment in how it pays. Unlike purchases of other goods and services, there are no refunds for ordering the wrong service or executing it poorly, only additional charges.

Innovation and knowledge translation are slow in health care, and the lack of incentives may impede the uptake and use of potentially beneficial advancements for diagnosis. For the past 20 years, there have been claims that artificial intelligence (AI) will surpass physicians, particularly for diagnosis. Over this time, there have been many attempts to create automated tools to support clinician decision-making, but few have gained widespread adoption. Radiology decision support tools that read mammograms have no discernable effect on radiologist productivity, which makes use of these tools economically irrational.4 A similar pattern has been observed with digital pathology diagnostic tools.5 Several attempts have been made for the integration of diagnostic decision support tools into electronic medical records for emergency physicians and primary care physicians; most of these attempts have been limited by low usage by clinicians because the tools have not improved productivity or quality.6

A major challenge to developing new technologies to improve diagnostic quality is poorly aligned economic incentives. To become a viable business, a company needs to create a diagnostic tool that is highly sensitive and specific and also improves clinical care enough to generate demand at a high enough price to make it attractive to investors. To command a price sufficient to support commercializing, a diagnostic innovation also needs to generate a large and rapid economic return that typically exceeds what a customer is willing to pay.

Tests that predict the chemotherapy responsiveness of various cancers (such as Oncotype DX, MammaPrint, and Prosigna) were created and adopted because they could potentially generate large savings for payers in the costs of chemotherapy. Because clinicians are often not sharing in the economic value created by these innovations, widespread adoption is often costly and slow. This makes it more expensive and slower to commercialize new technologies and ultimately leads to less investment in innovative technologies that could improve the speed and accuracy of diagnoses.

Even technologies that are likely to change or refine diagnoses, like genetic testing for cancer, have not been widely adopted outside academic cancer centers. Genetic testing improves diagnostic precision and prognostication and often should alter treatment decisions. However, in 1 study of patients in California and Georgia, only 25% of the 77 000 predominantly White women diagnosed as having breast cancer and 31% of 6000 predominantly White women diagnosed as having ovarian cancer had undergone genetic testing as part of their diagnostic workups despite guidelines that recommend testing for most patients.7

While there may be promising breakthroughs in diagnostic speed and accuracy through technologies like genetic testing and AI, these tools also may reduce the incentive for clinicians to improve their diagnostic acumen. Just as advances in electronic stethoscopes and mobile echocardiography reduce the need for physicians to hone their auscultation skills, improvement in genetics- and AI-driven decision supports may reduce the incentive for clinicians to maintain some diagnostic skills.8

Creating incentives that improve the quality of diagnosis should be a priority. Most important, rapid and accurate diagnoses should be financially rewarded. One approach to accomplish this objective could involve providing reimbursement for use of clinical decision support, requiring complete diagnostic workups before initiating elective treatments, penalizing nonparsimonious outpatient workups and empirical treatment, and not providing reimbursement for serial trial-and-error approaches in cases for which a definitive diagnosis can be rendered. One policy could be to institute a reimbursement modifier for treatment that is initiated without the diagnostic tests delineated in professional society guidelines.9 For instance, for rheumatological diseases like rheumatoid arthritis, polymyalgia rheumatica, and giant cell arteritis, there might be a reimbursement modifier that leads to higher reimbursement for patients for whom a diagnosis is made before empirical treatment with steroids.

An additional approach for improving the speed and accuracy of diagnoses, advocated by the National Academy of Medicine, would be to track diagnostic success and publicly report performance.10 This approach could work similarly to how readmission rates are reported for surgeons. Clinicians could have their diagnostic performance reported on the dimensions of accuracy and timeliness for the most common diseases they treat. This could be done in primary care for chronic diseases like type 2 diabetes, hypercholesterolemia, and hypertension because these conditions could often be documented but undiagnosed from screening laboratory tests or vital sign data, often are not treated in accordance with clinical guidelines, and often have suboptimal clinical outcomes. For specialists, reporting could be based on completing diagnostic workups efficiently and quickly and instituting guideline consistent care plans, for example, appropriate use of advanced imaging for back pain and guideline-consistent use of interventions like spine surgery, implantable defibrillators, and antibiotics. Similar to measures such as emergency department arrival to balloon time for acute myocardial infarction, measures could be developed to assess timeliness of accurate diagnoses for patients presenting for other situations, like sepsis, substance abuse treatment, HIV infection, and initial cancer diagnosis, for which the speed of initiating the correct treatment positively affects outcomes.

Similarly, it would be informative to report on how well clinicians perform for uncommon diagnoses based on relative complication rates and costs for these patients compared with expert diagnosticians for these diseases. For this to be possible, new investments will be needed to develop diagnostic quality measures. While the Centers for Medicare & Medicaid Services has developed more than 1400 quality metrics, most are process-oriented measures. Process integrity is not a good measure of quality if the diagnosis is incorrect. Fortunately, reporting on metrics like these should be feasible if data exchange standards are enforced because electronic medical records and medical claims offer a longitudinal data set including diagnostic tests and results, when diagnoses are made, how diagnoses change over time, treatments recommended and delivered to patients, and clinical outcomes.

For many aspects of health care, results are directly related to what is paid for. While making the correct diagnosis quickly and cost-effectively is a fundamental tenet of medicine, paradoxically, diagnostic accuracy may not matter economically and data on clinician performance is neither collected nor transparent. To succeed at improving diagnostic performance in the US, a most important first step will require a focus on aligning the economic incentives to reward more accurate and timely diagnoses and substantially improving the ability to assess current diagnostic performance and opportunities for improvement.

Key Points for Diagnostic Excellence

  1. Fee-for-service reimbursement does not reward diagnostic excellence.
  2. Misaligned economic incentives undermine the development and adoption of technologies that can improve diagnostic quality.
  3. Innovations that improve diagnostic quality also need to generate economic value to gain adoption.
  4. Changing economic incentives to reward diagnostic excellence along with developing metrics and reporting performance could lead to improvement in diagnostic quality.

Back to top

Article Information

Corresponding Author: Bob Kocher, MD, USC Schaeffer Center, 635 Downey Way, Verna and Peter Dauterive Hall (VPD), Los Angeles, CA 90089 (bkocher@venrock.com).

Published Online: April 11, 2022. doi:10.1001/jama.2022.4594

Conflict of Interest Disclosures: Dr Kocher reported that he is a partner at the venture capital firm Venrock and invests in health care technology and services businesses; is on the boards of Devoted Health, Lyra Health, Aledade, Need Health, Virta Health, Sitka Health, Accompany Health, and Premera Blue Cross; and is a board observer at SmithRx, Public Health Company, Stride Health, and Suki. Neither he nor Venrock have any current investments in clinical diagnostics or clinical decision support businesses. Dr Emanuel reported personal fees, nonfinancial support, or both from companies, organizations, and professional health care meetings and being a venture partner at Oak HC/FT; a partner at Embedded Healthcare LLC, ReCovery Partners LLC, and COVID-19 Recovery Consulting; and an unpaid board member of Village MD and Oncology Analytics. Dr Emanuel owns no stock in pharmaceutical, medical device companies, or health insurers. No other disclosures were reported.

Additional Contributions: We thank Daniel Yang, MD, Karen Cosby, MD, and Harvey Fineberg, MD, PhD, of the Gordon and Betty Moore Foundation, for their feedback and intellectual contributions. No compensation was received.

References

1. Friberg  N, Ljungberg  O, Berglund  E,  et al.  Cause of death and significant disease found at autopsy.   Virchows Arch. 2019;475(6):781-788. doi:10.1007/s00428-019-02672-z

2. Graber  ML.  The incidence of diagnostic error in medicine.   BMJ Qual Saf. 2013;22(suppl 2):ii21-ii27. doi:10.1136/bmjqs-2012-001615

3. Department of Health and Human Services. Million Hearts: undiagnosed hypertension. Accessed January 19, 2022. https://millionhearts.hhs.gov/tools-protocols/undiagnosed-hypertension.html

4. Lee  CI, Khodyakov  D, Weidmer  BA,  et al.  Radiologists’ perceptions of computerized decision support: a focus group study from the Medicare Imaging Demonstration Project.   AJR Am J Roentgenol. 2015;205(5):947-955.

5. Khanduja  A, Tang  C. Beyond digital: understanding barriers in transforming pathology from digital to computational. The Pathologist. Published November 3, 2021. Accessed February 4, 2022. https://thepathologist.com/inside-the-lab/beyond-digital

6. Khairat  S, Marc  D, Crosby  W, Al Sanousi  A.  Reasons for physicians not adopting clinical decision support systems: critical analysis.   JMIR Med Inform. 2018;6(2):e24. doi:10.2196/medinform.8912

7. Kurian  AW, Ward  KC, Howlader  N,  et al.  Genetic testing and results in a population-based cohort of breast cancer patients and ovarian cancer patients.   J Clin Oncol. 2019;37(15):1305-1315. doi:10.1200/JCO.18.01854

8. Montinari  MR, Minelli  S.  The first 200 years of cardiac auscultation and future perspectives.   J Multidiscip Healthc. 2019;12:183-189. doi:10.2147/JMDH.S193904

9. Berenson  R, Singh  H.  Payment innovations to improve diagnostic accuracy and reduce diagnostic error.   Health Aff (Millwood). 2018;37(11):1828-1835. doi:10.1377/hlthaff.2018.0714

10. McGlynn  EA, McDonald  KM, Cassel  CK.  Measurement is essential for improving diagnosis and reducing diagnostic error: a report from the Institute of Medicine.   JAMA. 2015;314(23):2501-2502. doi:10.1001/jama.2015.13453

Source: http://bobkocher.org

Running Through Walls: High Will Over High Skill 

Company success is predicated on the people, according to Daversa Partners founder and CEO Paul Daversa. Venrock partner Bryan Roberts chats with Daversa about the nuances of recruiting impactful senior-level leadership candidates who are a fit for the needs of the company. They cover hiring high will over high skill, understanding that candidates are stage sensitive, and ways to avoid terminal hires. Daversa also discusses the intentionality of Dreamscape, the challenges, and the impact they’ve made thus far.


Orchestrating the Modern Data Platform, Venrock’s Investment into Astronomer

The backstory of our 2020 Series A lead investment into Astronomer.

— originally posted at ethanjb.com —

Update since we invested (March 2022)

So much has happened since we led the Series A for Astronomer. Here is a brief snippet:


Ten years ago, the modern enterprise stack looked quite a bit different: teams of network admins managing data centers, running one application per server, deploying monolithic services through waterfall processes, with staged releases managed by an entire role labeled “release manager.”

Today, we have multi- and hybrid clouds, serverless services, continuous integration and deployment, and infrastructure-as-code, with DevOps and SRE fighting to keep up with the rapid scale.

Companies are building more dynamic, multi-platform, complex infrastructures than ever. We see the ‘-aaS’ of the application, data, runtime, and virtualization layers. Modern architectures demand the extensibility to work with any number of mixed-and-matched services, and fully managed services, which effectively leave operations and scale to the service providers, are becoming the gold standard.

With limited engineering budgets and resource constraints, CTOs and VP Engs are increasingly looking for ways to free up their teams. They’re moving from manual, time-consuming, repetitive work to programmatic workflows, where infrastructure and services are written as code and abstracted into manageable operators such as SQL or DAGs, owned by developers.

If the 2010s represented a renaissance for what we can build and deliver, the 2020s have begun to represent a shift to how we build and deliver, with a focused intensity on infrastructure, data, and operational productivity.

Every company is now becoming a data company.

The last 12 months, in particular, have been a technology tipping point for businesses in the wake of remote work. Every CIO and CTO has been burdened with the increasing need to leverage data for faster decision making, increased pressure to move workloads to the cloud, and the realization that technology investments are a competitive advantage in a digital world. The digital transformation we’ve seen play out over the last few years just compressed the next five years of progress into one.

With that, data infrastructure has become a focal point in unlocking new velocity and scale. Data engineering teams are now the fastest-growing budget within engineering, and in many organizations, the fastest-growing budget, period.

Every company is now becoming a data company, with the data infrastructure at heart.

Building on a brittle data infrastructure

To understand an enterprise data infrastructure, start with the fact that any company relying on multiple data sources to power its business or make critical business decisions needs some form of data infrastructure and pipelines. These are systems and sequences of tasks that move data from one system to another and transform, process, and store the data for use. Metrics aggregations, instrumentation, experimentation, derived data generation, analytics, machine learning feature computation, business reporting, dashboards, and more all require the automation of data processes and tasks to process and compute data into the required formats.

Traditional approaches to building data infrastructure leveraged heavy ETL (extract, transform, load) tools (Informatica, SAS, Microsoft, Oracle, Talend) built on relational databases that require difficult, time-consuming, and labor-intensive rules and config files. With the introduction of modern application stacks and the massive increase in data and processing required, traditional ETL systems have become overly brittle and slow, unable to keep up with the need for increased scale, modularity, and agility.

The slightly more modern approach to scaling up data pipelines relies on batch processing of jobs through static scripts, with schedulers that kick off specific tasks, e.g., pulling data from a particular source > running a computation > aggregating with another source > populating the updated aggregations to a data warehouse.
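As a rough illustration of that pattern, the sketch below shows the kind of hard-coded, scheduler-triggered job described above, with no retries, queueing, or dependency handling; the SQLite databases and table names are stand-ins for a real source system and warehouse.

```python
# A minimal sketch of the "static script" approach: one hard-coded job wired to a
# scheduler (e.g. cron), with no retries, queueing, timeouts, or dependency handling.
import sqlite3

def run_nightly_job():
    # 1. Pull data from a particular source (in-memory stand-in for the real source).
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    source.executemany("INSERT INTO orders VALUES (?, ?)",
                       [("us-west", 19.99), ("us-west", 5.00), ("us-east", 42.00)])
    rows = source.execute("SELECT region, amount FROM orders").fetchall()

    # 2. Run a computation / aggregation (simplified to a single source here).
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount

    # 3. Populate the updated aggregations to the warehouse (another stand-in).
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE daily_totals (region TEXT, total REAL)")
    warehouse.executemany("INSERT INTO daily_totals VALUES (?, ?)", totals.items())
    warehouse.commit()
    print(totals)

if __name__ == "__main__":
    # Typically triggered by a cron entry such as "0 2 * * *". If any step fails,
    # nothing retries, nothing is queued, and downstream jobs are not notified.
    run_nightly_job()
```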

Because these pipelines are static by design, data engineers must define the relationships between steps in a job, pre-define expected durations based on the worst-case scenario, and define the run-time schedules, hoping the pipelines run as expected.

But static scripts become more brittle as the scale of dependencies increases. If a job stalls or errors for any reason (a server goes down, an exception is found in a data set, a job takes longer than its expected duration, or a dependent task stalls), both the data engineering and DevOps teams have to spend countless hours manually identifying, triaging, and restarting jobs.

There are no systems to programmatically retry, queue, prevent overlapping jobs, enforce timeouts, or report errors and metrics in a machine-readable way. Further exacerbating the problem, the first question a data engineer asks when walking into the office every morning is “did all of my jobs run?”, and there is no single source of truth to answer it.

As the jobs and pipelines grow in number and complexity, data engineering and DevOps spend most of their time manually monitoring, triaging, and reconfiguring their pipelines to keep data flowing and support the business, resulting in more energy spent on the underlying platforms than on actually running the data pipelines.

Meet Apache Airflow

In 2015, a project at Airbnb, aptly named Airflow, focused on solving the brittle data pipeline problem, replacing cron jobs and legacy ETL systems with workflow orchestration as infrastructure-as-code, allowing users to programmatically author, schedule, and monitor data pipelines. The underlying belief: when workflows are defined as code, they become more maintainable, versionable, testable, collaborative, and performant.
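For a sense of what “pipelines as code” looks like in practice, here is a minimal Airflow 2.x DAG sketch; the DAG id, schedule, and task callables are hypothetical placeholders, but the DAG/PythonOperator structure and the declarative retries and dependencies are the core of the model.

```python
# A minimal Airflow 2.x DAG sketch (assumes Airflow 2.x is installed; the DAG id,
# schedule, and extract/transform/load callables are hypothetical placeholders).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pulling records from the source system")

def transform():
    print("aggregating and reshaping records")

def load():
    print("writing aggregates to the warehouse")


with DAG(
    dag_id="example_daily_aggregation",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",   # preset schedule handled by the Airflow scheduler
    catchup=False,
    default_args={"retries": 2},  # retries and alerting are declared, not hand-rolled
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task  # explicit dependency graph
```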

Airflow today is the largest and most popular data infrastructure open source project in the world. Adopted by the Apache Software Foundation in 2016 and promoted to a top-level project in 2019, it has, as of this writing, over 25K GitHub stars, 1,900 contributors, and over 8M downloads per month, along with one of the most energetic and passionate open source communities we have ever seen, having just surpassed Apache Spark in contributors and Apache Kafka in stars and contributors. Airflow is used by thousands of organizations and hundreds of thousands of data engineers.

Airflow has been reinforced by the community as the standard and leader in data pipeline orchestration. With a mass migration off legacy ETL systems and the increasing popularity of the Airflow project, companies of all sizes and segments, from the F500 to the most emerging brands, are migrating to Airflow, removing the need for data engineering and DevOps teams to spend most of their time managing and maintaining their data pipelines instead of building and scaling new ones. It hands control of pipelines back to data engineering from DevOps, removes hours of debugging for failed or slow jobs, and expands teams’ capability to build more complex and performant pipelines and run more jobs, faster, creating real business value.

And with the recent release of Airflow 2.0, the team introduced capabilities that go beyond workflow orchestration into job and task execution, replacing the many, many tasks that teams today still push through heavy ETL processes rather than moving into Airflow, and beginning to power mission-critical operational use cases in addition to core analytics needs.

Meet Astronomer, the modern orchestration platform

Astronomer unifies your distributed data ecosystem through a modern orchestration platform

In most organizations using Airflow today, it has become one of the most important pieces of infrastructure and changes how data engineering teams operate.

But like most popular open-source solutions, Airflow was designed by the community, for the community. As it proliferates across an organization, it lacks the necessary enterprise capabilities, such as cloud-native integrations, flexibility to deploy across varied environment and infrastructure setups, high availability and uptime, performance monitoring, access rights, and security. Teams have to rely on homegrown solutions to solve these challenges as they scale up their installations to offer Airflow-as-a-service to internal customers.

That’s until we met Joe and Ry.

Joe, just finishing his tour of duty as CEO at Alpine Data and previously SVP of Sales at Greenplum/Pivotal, and Ry, a top contributor to Airflow, saw an early but promising open-source project, with little roadmap or project direction, that could become the central data plane for the data infrastructure. They dedicated themselves to reinvigorating the project, bringing life and focus to its original intentions, realizing that data orchestration and workflow weren’t just a part of the data pipeline or infrastructure but the core of it. They were convinced that Airflow needed to be brought to the enterprise, and thus Astronomer was born.

They saw what Airflow could enable for organizations of all sizes: an enterprise-grade solution that is cloud-native, secure, and easy to deploy across any infrastructure or environment, in the cloud or the customer’s own. It solves the key immediate challenges enterprises face when deploying Airflow and needing high availability, testing and robustness, and extensibility into any infrastructure setup.

Taking it one step further, they deployed Astro Runtime, the managed Airflow service, engineered for the cloud as one integrated, managed platform. It gives enterprises complete visibility into their data universe, including lineage and metadata, through a single pane of glass. Start in less than an hour, scale to millions of tasks.

The result has been astro*nomical. Customers of all sizes and walks look to Astronomer to solve the challenges they hit when trying to scale: availability, robustness, extensibility into their infrastructure, security, and support.

Today their customer base spans every industry, segment, and vertical, from the top F500s to the most emerging brands, drawn to the power of the Airflow project and community, and to confidence in the enterprise Astronomer platform.

Leading the Series A investment into Astronomer

We’ve long believed the data pipeline was central to the entire enterprise data stack; that workflow and orchestration were the meta-layer responsible for the speed, resiliency, and capability of the data infrastructure, with the underlying primitives (connectors, storage, transformations, etc.) easily replaced or augmented. As the needs of the enterprise increasingly shift from analytical use cases to powering business-critical operational needs, the orchestration layer will prove to be greater than the parts it executes.

As a result, we were fortunate to have been able to lead Astronomer’s Series A in April 2020, joining the board along with our friends at Sierra Ventures. Since then we’ve had the good fortune of being able to welcome our good friends Scott Yara and Sutter Hill who led their Series B, and Insight Partners, Salesforce Ventures, and Meritech with our Series C.

Exciting announcements are always sweeter with one more thing, so we also welcome Laurent, Julien, and the rest of the Datakin team to Astronomer!

This is just the beginning, and we can’t wait to share what’s next.


Investing in Astronomer was originally published on ethanjb.com.

Source: https://www.ethanjb.com/

Unlocking the Modern Data Infrastructure, Venrock’s investment into Decodable

Announcing our Seed and Series A investments into Decodable

Beginnings of the real-time data enterprise

Data is rapidly transforming every industry. Brick-and-mortar stores are turning into micro-distribution centers through real-time inventory management. Organizations are delivering 10x customer experiences by building up-to-the-second, 360-degree customer views. Complex logistics services with real-time views into operational performance are delivering on time more than ever before. Data is allowing enterprises to respond to changes in their business faster than ever, and the volume of that data is increasing exponentially.

Just having the data stored in your data warehouse is no longer enough.

Real-time data isn’t for the faint of heart

Today, real-time has become the central theme in every data strategy.

Whether you are trying to understand user behavior, update inventory systems as purchases occur across multiple stores, monitor logs to prevent an outage, or connect on-prem services to the cloud, the majority of use cases can be enabled by filtering, restructuring, parsing, aggregating, or enriching data records, then coding these transformations into pipelines that drive microservices, machine learning models, and operational workflows, or that populate datasets.
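For illustration, here is a tiny, generic sketch of those record-level transformations (parse, filter, restructure, enrich) in plain Python; the event shape and the enrichment lookup are made-up examples and are not tied to any particular product or streaming engine.

```python
# A minimal, generic sketch of the record-level transformations named above.
import json

REGION_BY_STORE = {"store-7": "us-west", "store-9": "us-east"}  # enrichment lookup

def transform(raw_events):
    for raw in raw_events:
        event = json.loads(raw)                      # parse
        if event.get("amount", 0) <= 0:              # filter out refunds / zero rows
            continue
        yield {                                      # restructure + enrich
            "order_id": event["id"],
            "amount_usd": round(event["amount"], 2),
            "region": REGION_BY_STORE.get(event["store"], "unknown"),
        }

raw = ['{"id": 1, "store": "store-7", "amount": 19.99}',
       '{"id": 2, "store": "store-9", "amount": -5.0}']
print(list(transform(raw)))  # -> one enriched record; the refund is filtered out
```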

But building the underlying real-time data infrastructure to power them has always been another story.

What originally seemed like a rather straightforward deployment of a few Kafka APIs quickly became an unwieldy hardcore distributed systems problem.

This meant that instead of focusing on building new data pipelines and creating business value, data engineers toil in the time sink of low-level Java and C++, agonizing over frameworks, data serialization, messaging guarantees, distributed state checkpoint algorithms, pipeline recovery semantics, data backpressure, and schema change management. They’re on the hook for managing petabytes of data daily, at unpredictable volumes and flow rates, across a variety of data types, requiring low-level optimizations to handle the compute demands, with ultra-low processing latencies. Systems must run continuously, without any downtime, flexing up and down with volume, without a quiver of latency.

Many Ph.D. theses have been written on this topic, with many more to come.

The underlying infrastructure needed to disappear

We’ve long believed that the only way to unlock real-time data was for the underlying infrastructure to disappear. Application developers, data engineers, and data scientists should be able to build and deploy a production pipeline in minutes, using industry-standard SQL, without worrying about distributed systems theory, message guarantees, or proprietary formats. It had to just work.

Introducing Decodable

Peeking under the hood of the Decodable platform

We knew we’d found it when we met Eric Sammer and Decodable.

We first met Eric while he was VP & Distinguished Engineer at Splunk, overseeing the development of their real-time stream processing and infrastructure platform. He joined Splunk through the acquisition of Rocana, where he was co-founder and CTO, building the ‘real-time data version of Splunk.’ Eric was an early employee at Cloudera and even wrote the O’Reilly book on Hadoop Operations (!). A long way of saying Eric was one of the most sought-after and respected thought leaders in real-time, distributed systems.

His realization ran deep. Enterprises would be able to deploy real-time data systems at scale only after:

  1. The underlying, low-level infrastructure effectively disappeared, along with the challenges of managing availability, reliability, and performance
  2. The real-world complexities of real-time data such as coordinating schema changes, testing pipelines against real data, performing safe deployments, or just knowing their pipelines are alive and healthy, were solved under the hood
  3. It just worked out of the box by writing SQL

Eric started Decodable with the clarity that developers and teams want to focus on building new real-time applications and capabilities, free from the heavy lifting needed to build and manage infrastructure.

Abstracted away from infrastructure, Decodable’s developer experience needed to be simple and fast: create connections to data sources and sinks, stream data from and to those sources and sinks, and write SQL for the transformation. Decodable works with existing tools and processes, within your existing data platform, across clouds and data infrastructure.

Powered by underlying infrastructure that is invisible and fully managed, Decodable has no nodes, clusters, or services to manage. It can run in the same cloud providers and regions as your existing infrastructure, leaving the platform engineering, DevOps, and SRE work to Decodable.

It was simple, easy to build and deploy, within minutes not days or weeks, with no proprietary formats, and with just SQL.

No more low-level code and stitching together complex systems. Build and deploy pipelines in minutes with SQL.

Partnering with Eric & Decodable

Enterprises need to build differentiation by unlocking the value of their data, not by building the underlying infrastructure to power it. That infrastructure is crucial, however, and needs to be powered by a platform that just works, as if you had a 25+ person data engineering team dedicated to its uptime and performance. It needs to abstract away the complexity and allow developers to create new streams in minutes with just SQL.

Eric is building the future of the data infrastructure, and today the platform has moved into general availability!

So we’re thrilled to partner with Eric and announce our Seed and Series A investments into Decodable, co-led with our good friends at Bain Capital. We couldn’t be more excited to support Eric to unlock the power of real-time data.


Unlocking the Modern Data Infrastructure, Venrock’s investment into Decodable was originally published on Medium.

Source: https://ethanjb.medium.com/

Running Through Walls: Course Correcting for Long-Term Success

Venrock partner Andrew Gottesdiener speaks with Jigar Raythatha, former CEO of Constellation Pharmaceuticals, about lessons learned from his time at Constellation. As Constellation has evolved from an early-stage company to being acquired by MorphoSys, Raythatha dives into its many life-cycles, his heavy involvement in the recruitment process, and the decision to conduct a PIPE financing. Raythatha also shares the characteristics to look for in an investor syndicate and the challenges of scaling a public company.


Investment Follow-Up: Announcing Atom Computing’s Series B

Congratulations to Ben, Jonathan, Rob, and the team at Atom Computing on the close of their $60M Series B, and welcome to our new friends at Third Point Ventures and Prime Movers Labs to the team!

Since we announced Atom Computing’s seed round in 2018, we thought it would be fun to follow up on the original thesis for our investment and compare it to where we are today.

Atom Computing’s 100-qubit quantum computer, Phoenix

We’ve long been believers in the near-term reality of quantum computing. Quantum computers promise to unlock a new dimension of computing capabilities that are not possible today with classical computers, regardless of their size, compute power, or parallelization.

We wrote in 2020 about the progress setting the stage for a commercial era of quantum computers and the drivers accelerating the age of commercial-ready quantum computing.

From risk analysis, Monte Carlo simulations, and determining chemical ground states, to dynamic simulations, FeMoCo, image/pattern recognition, and more, fundamental questions with economic impacts in the tens to hundreds of billions of dollars will be unlocked thanks to quantum computing.

Our original hypothesis on QC came from the belief that regardless of Moore’s law, classical computers are fundamentally limited by a sequential processing architecture. Even with breakthroughs in SoCs, 3D integrated memory, optical interconnects, ferroelectrics, ASICs, beyond-CMOS devices, and more, sequential processing means ‘time’ is the uncontrollable constraint.

For complex, polynomial, or combinatorial problem sets, with answers that are probabilistic in nature, sequential processing is simply not feasible due to the sheer amount of time it would take. This is where quantum computing, even with a few hundred qubits, can begin to offer transformational capabilities.

But the challenges in quantum computing have traditionally stemmed from a lack of individual qubit control, sensitivity to environmental noise, limited coherence times, limited total qubit volume, overbearing physical resource requirements, and limited error correction, all of which are required precursors to bringing quantum computing into a real-world domain.

This understanding helped us develop a clear thesis for what a scalable architecture would need to look like when evaluating potential investments (re-pasted from our original seed post):

  1. An architectural approach that could scale to a large number of individually controlled and maintained qubits.
  2. Could demonstrate long coherence times in order to maintain and feedback on the quantum entangled state for long enough to complete calculations.
  3. Designed to have limited sensitivities to environmental noise and thus simplify the problem of maintaining quantum states.
  4. Could scale up the number of qubits without needing to scale up the physical resources (i.e. cold lines, physical wires, isolation methods) required to control every incremental qubit.
  5. Could scale-up qubits in both proximity and volume in a single architecture to eventually support a million qubits.
  6. Could scale-up without requiring new manufacturing techniques, material fabrication, or yet to be invented technologies.
  7. The system could be error corrected at scale with a path to sufficient fault tolerance in order to sustain logical qubits to compute and validate the outputs.

We originally led Atom Computing’s seed in 2018 based on the belief that we were backing the world-class team, and that the architectural approach on neutral atoms would be the right building block for building a scalable system.

Three Years Later…

Atom Computing is the first to build nuclear-spin qubits out of optically trapped atoms. They demonstrated the fastest time to 100 qubits ever, in under two years from founding, which is still a high-water mark for the QC industry in both qubit count and time.

They demonstrated the longest-ever coherence time for a quantum computer with multiple qubits, at over 40 seconds, compared to the next best in only milliseconds or microseconds. This cemented that neutral atoms produce the highest-quality qubits, with higher fidelity, longer coherence times, and the ability to execute independent gates in parallel compared to any other approach.

Long coherence is a critical precursor to error correction and eventual fault tolerance.

They demonstrated scaling up to 100 qubits wirelessly controlled in a free space of less than 40μm. No physical lines or resources for each qubit, no dilution chambers, or isolation requirements. In the coming years they’ll show over 100,000 qubits in the same space as a single qubit in a superconducting approach.

And they proved they could recruit a world-class team of executives to bring the Atom Computing platform to market, as Rob Hays joined as CEO from Lenovo and Intel, Denise Ruffner joined as Chief Business Officer from IonQ and IBM Quantum, and Justin Ging joined as Chief Product Officer from Honeywell Quantum.

The heart of Phoenix, where qubits are made

They proved they could build a machine with a large number of individually controlled and maintained atoms. With the longest coherence time on record. Without the need for complex isolation chambers or resources. Could scale up quickly and densely. And required no new manufacturing, materials, or physics.

As the team begins to bring forward the Atom Computing platform with its second-generation system to run breakthrough commercial use cases, the next three years will be even more exciting*.

We couldn’t be more proud of the team and what they’ve accomplished. The last three years have demonstrated a lot of firsts in the quantum computing industry. We can’t wait to share what happens in the next three.


Investment Follow-Up: Announcing Atom Computing’s Series B was originally published on Medium.

Source: https://ethanjb.medium.com/

Running Through Walls: Be Comfortable Being Uncomfortable

Venrock partner Racquel Bracken speaks with Markus Renschler, President and CEO of Cyteir Therapeutics, to discuss the journey that led him to Cyteir and the company’s path to going public. Renschler dives into an important lesson his parents instilled in him from a young age, “be comfortable with being uncomfortable,” and explains how this lesson has played an important role in his career. As a practicing hematologist-oncologist, Renschler highlights the value of people skills and how developing those skills can positively influence leadership style, recruiting, and even fundraising. Renschler shares some of the highs and lows of the drug development process, provides advice for young clinicians entering the industry, and explains how his experience as a practicing physician influences the decisions he makes as a CEO.
