9 Experts Answer Your Top Data Science & Machine Learning Questions
Recently, I had the honor of speaking with a number of the world’s most influential thought-leaders in the fields of data science, data analytics, machine learning and digital transformation. This group of prominent data technologists was more than happy to answer a wide variety of questions on topics ranging from the fast-evolving area of unified governance and preparing for the General Data Protection Regulation (GDPR) to transformative hybrid data management technologies, and of course, data science and machine learning.
Join this incredible group of experts at Fast Track Your Data – Live from Munich on Thursday, June 22nd, 2017. Attend this unique global event in person or online, all for free! You’ll learn more about all of the topics discussed here during the main event, breakout sessions and demos, plus have the opportunity to join in the conversation and connect with these renowned data pioneers who can help you understand how to build a data-driven strategy to outsmart your competition.
There will be an additional opportunity to chat with a number of these thought-leaders, as well as fellow data enthusiasts, at the Fast Track Your Data CrowdChat on Tuesday, June 20th, 2017 at 1:00 PM (EDT).
Now let’s meet our panel of experts:
Follow him on Twitter @craigbrownphd
Christopher S. Penn, VP Marketing Technology, SHIFT Communications and authority on digital marketing, marketing technology, thought-leader, speaker, and author. His latest book is Leading Innovation: Building a Scalable, Innovative Organization.
Follow him on Twitter @cspenn
Dez Blanchfield, Chief Data Scientist at GaraGuru and strategic leader in business & digital transformation with 25 years of experience in the IT industry developing strategy and implementing business initiatives.
Follow him on Twitter @dez_blanchfield
Dion Hinchcliffe, VP and Principal Analyst at Constellation Research and Commentator at ZDNet, as well as a world-renowned Digital Thought Leader, CXO Advisor, Professional Speaker, and Author specializing in digital transformation.
Follow him on Twitter @dhinchcliffe
Jennifer Shin, Founder of 8 Path Solutions, a data science, analytics, and technology company. She is also on the faculty of the Data Science Graduate Program at UC Berkeley, the Data Analytics MS Advisory Board at CUNY SPS, and the Data Science Committee for the Grace Hopper Conference.
Follow her at @8pathsolutions
Follow him @joe_caserta
Lillian Pierson, P.E., Data Science Trainer & Coach at Data-Mania, as well as Technical Consultant, Engineer, LinkedIn Learning Instructor and Author of 3 technical books by Wiley & Sons Publishers: Data Science for Dummies, Big Data / Hadoop for Dummies (Dell Edition), and Managing Big Data Workflows for Dummies (BMC Edition).
Follow her at @BigDataGal
Ronald Van Loon, Director at Adversitement, where he is helping data-driven companies generate business value as a globally recognized Top 10 Big Data, Data Science, IoT, and BI Influencer.
Follow him at @ronald_vanloon
Follow him at @sardire
Following is a transcript of the interview conducted with our expert data science panel (it has been edited for clarity and brevity):
Aylee Nielsen: Thank you all so much for joining me today, I’d like to start off with questions on a subject matter that I know you are all very familiar with – Data Science and Machine Learning. Could any of you share with us how you believe interactive, collaborative, cloud-based environments, like DSX for example, are transforming the field of data science?
Jennifer Shin: Data science teams can get up and running faster than ever before by leveraging cloud-based platforms, which eliminates the need to set up servers, configure settings, and deploy tools that are required for collaborating in real-time.
Aylee Nielsen: So speed to action has clearly become the most significant competitive advantage. Now can someone describe the key components of a successful data science platform and how organizations can best take advantage of said platform?
Lillian Pierson: In general, data science platforms act as centralized integration pieces, where the work of data scientists, data engineers, and application developers can be streamlined and shared. Successful data science platforms should have lots of hooks that allow companies to plug in and play, combining the specific suite of data technologies that meets the organization's needs. The collaborative nature of these platforms aims to reduce workflow redundancies, so that staff can build on top of each other’s work rather than each reinventing the wheel for every project. An organization that wants to bring its data science onto such a system should start by forming a strategic plan. Before going into implementation of any sort, the first step is always to narrow down what data, systems, technologies, security protocols, and staff should be put in place.
Aylee Nielsen: Yes, collaboration and flexibility are key, and I think it's important that we also mention scalability. So as an organization grows, what are the most important factors in enabling data science teams to collaborate successfully?
Jennifer Shin: Organization, communication, and processes are important for any collaboration, especially among data science teams. Without these factors in place, data scientists run the risk of duplicating work, missing deadlines, and writing unnecessary code.
Aylee Nielsen: Those are great points Jennifer, and they represent challenges that I think we’re all too familiar with whether we’re data scientists or not, and ideas we all need to abide by to be successful in and outside of the workplace. So then what specifically are the biggest challenges organizations are facing in operationalizing data science?
Jennifer Shin: Operationalizing data science is a challenge for any organization that lacks the infrastructure or support to process data efficiently. While larger organizations struggle to overcome internal inefficiencies that can delay deployment, smaller organizations need to be cognizant of the impact of technical limitations. For instance, a data science team whose skill sets do not include expertise in data engineering can run the risk of slowing down processing times by implementing code created during development that has not been optimized for release in production.
Aylee Nielsen: Truly putting your data to good use is almost like mastering an art form then. You need to have the best tools, the latest expertise, and strike a fine balance in which you’re optimizing for both the size and goals of your organization. Can you also explain what organizations need to do to make machine learning accessible and easy to implement across the business?
Jennifer Shin: For machine learning algorithms to be accessible, an organization must standardize the methodology and store the code in a central repository. Establishing a predefined approach produces comparable results across the organization while controlling the code source ensures all implementations are consistent and up-to-date.
Aylee Nielsen: Yes! I think all of the data scientists here would agree that a consistent, standardized methodology is key for machine learning success. Can anyone share some examples they are familiar with on how organizations are successfully using data science as a competitive advantage?
Lillian Pierson: A use case that I’ve been jazzed up about this summer is Transport for London (TfL), the public transportation provider in London, England. TfL built a big data analytics system that predicts future travel demand throughout London. With this system, they’re able to plan and construct infrastructure preemptively so that future demand is smoothly and adequately met. As a by-product of this mission, current TfL users are enjoying improved customer service, real-time response, and personalization.
Aylee Nielsen: That’s fantastic! And something I believe we’re in dire need of here in the US! Most public transportation experiences leave a lot to be desired, we could definitely use a big data analytics upgrade in virtually every US city I’ve lived in or visited. I love that use case Lillian, does anyone have any other examples of how we can use data science and machine learning to re-imagine the customer experience?
Dez Blanchfield: One of the greatest disruptions in the consumer market today is the customer’s desire for a ‘Celebrity Experience’. Data science and machine learning offer new routes to greater insights with which to quantify and transform efforts to re-imagine business capabilities.
Aylee Nielsen: Great example, and while we’re on the topic of disruptive tech, let’s touch on AI. Steve I know this is your area of expertise, would you mind sharing some ways in which enterprise-ready deep learning frameworks simplify the intersection of BI and AI?
Steve Ardire: I’d be happy to. My answer here ties back to both the question about making machine learning accessible and the one about re-imagining the customer experience with data science. Natural language processing (NLP) and machine learning are crucial elements of any intelligent system, but you also need a dynamic knowledge base to manage new data, especially when there's a high degree of data heterogeneity and complexity. This means having Knowledge Representation, which captures how things are related to each other through contextual associations, and Automated Reasoning, which allows a system to “fill in the blanks”. Both are imperative for creating intelligent apps that use data science to re-imagine the customer experience through augmented intelligence, where machines and humans work together. Doing so simplifies the intersection of BI and AI and boosts productivity, where the key benefit is smarter decisions. Knowledge workers can understand complex issues faster, answer difficult questions, and solve problems more effectively. They can be more productive, spending more time on interpreting insights and taking critical action.
Aylee Nielsen: Thank you for such a comprehensive answer Steve! I think this would be a great time to tie in the topic of unified governance. Would anyone like to describe for us the role and significance of machine learning and data science to a successful unified governance strategy?
Dion Hinchcliffe: Getting a handle on the vast flows of data that are exponentially growing in the enterprise right now is one of the top challenges organizations face today. Simply put, it's quite difficult to actually know what your data collectively knows – and leverage it to strategic business benefit – without some way of better organizing our vast and disparate data assets. This is what machine learning, combined with data science, can help us do, by continuously sifting through and making sense of our data repositories and flows into a more comprehensible whole, which can then be wielded far more effectively.
Chris Penn: I’d like to add that governance is entirely about making sure we know who is doing what, when, and why. A successful governance strategy means we are organizationally aligned and we have given thoughtful consideration to what could go wrong and how we operate. When it comes to the role and significance of machine learning and data science in governance, these technologies promise to help us achieve alignment to our strategy faster. Through the use of natural language processing, we can determine – at scale – how well or poorly aligned our actions are to our governance policies. We can find gaps faster and close them more effectively by using machines to analyze what we do, rather than rely on humans alone.
Aylee Nielsen: Very well put! And as you both know, unified governance is evolving rapidly, so how are we actually seeing innovations in this area enable organizations to effectively address new compliance requirements?
Dion Hinchcliffe: The global regulatory and compliance landscape continues to get more complex, making it a challenge to stay within corporate risk parameters. Machine learning is one major area that can help: we can use the unblinking gaze of automated compliance bots, for example, to enforce compliance requirements in near real-time.
Aylee Nielsen: Compliance bots, that’s one way to break through the compliance barriers that can bring business to a halt and seriously slow innovation. How about metadata? Can it play a role in addressing compliance challenges and in what ways does capturing metadata help organizations understand their data better?
Dion Hinchcliffe: Metadata is the story behind our data, about its structure, source, nature and even meaning. Capturing good metadata is a fundamental requirement not just to use data effectively but to protect it, benefit fully from it, and enhance it for others. A superior metadata strategy is instrumental to unleashing a universe of good downstream effects in today's digital organization. Those effects include innovating quickly and efficiently to address new compliance requirements as they arise. A flexible, collaborative, open approach to metadata development can help solve interoperability challenges and enable organizations to more effectively address compliance requirements.
Aylee Nielsen: I’m glad you brought up interoperability challenges, that’s an important topic I was hoping we’d get to address. How significant are the interoperability challenges faced in today’s climate of disparate technologies that all come with their own metadata storage or management strategies? And how are open source governance capabilities enabling data scientists to overcome these challenges?
Dion Hinchcliffe: The good news is that data interoperability is not nearly the problem that it once was. Data conversion, translation, and transformation capabilities have come a long way in recent years and are nearly democratized. What's far more critical is open availability of sourcing and sharing of data across multiple organizations. The growing issue is the excessive hoarding of data into private repositories where it can provide only limited value compared to vital data that has been set free. Organizations must intelligently open their data sets more readily than many are currently doing today so that they – and the rest of the world – are able to reap the most benefit.
Aylee Nielsen: And how important would you say an open, peer-to-peer approach, with a business-friendly interface, is for the successful collaborative development of metadata?
Dion Hinchcliffe: It's perhaps the most important way to unleash the nearly limitless power of data within the enterprise. Open data sharing, whenever and wherever possible, has led to the co-creation of countless breakthroughs in business, industry, and science over the years, many case examples of which I summarize in my book Social Business By Design. We must not lose sight of this reality, as it is one of the more direct routes to value creation we currently know, as we create vast new big data foundations and ecosystems for the future.
Aylee Nielsen: Yes, that is such an imperative idea, there are many here and around the world who share your enthusiasm for the open data movement. I’m very happy to see IBM demonstrate a commitment to open data environments by offering publicly available data sets through IBM Analytics Exchange, an open catalog of data sets that can be used for analysis or integrated into applications. IBM’s continued investment in Apache Spark and Hadoop, as well as many other open source initiatives and communities, is a very exciting part of the business. I would, however, like to take a bit of a step back and make sure we touch on the topic of GDPR. Why is unified governance so essential for gaining better business insights and enabling compliance with many complex regulations and laws, such as GDPR or HIPAA?
Ronald Van Loon: Without unified data governance, businesses are not able to comply with the laws and regulations that are redefining client personal data usage. They risk potential data breaches, penalties, and loss of client trust. Without client consent to access their data, companies cannot use personal client information in order to gain business insights and improve the customer experience.
Aylee Nielsen: And just in case any of our readers are not familiar with the General Data Protection Regulation (GDPR), could someone please explain this new EU regulation?
Chris Penn: GDPR is a game-changing piece of legislation passed into law by the European Union. What GDPR effectively says is that citizens own their personal information, their data, as private property. Organizations must treat that private property with the same respect as they would treat that consumer's house or car. That means that organizations may not simply take someone's data and resell it or share it without express permission to do so, in the same way that someone could not walk on your property and just drive away with your car. This has wide-ranging implications for not only compliance, but also fields like marketing and advertising which rely on the sharing of personal data. Organizations will need to significantly rebuild their governance policies, data storage, and marketing operations to comply with GDPR. The penalties for failure to comply are significant; the first violation is millions of euros or 4% of a company's annual revenue. Subsequent violations reach astronomical financial penalties and may also include criminal charges for key company executives. However, the most important part of GDPR is that it is extraterritorial. Anywhere an EU citizen goes – online or offline – they bring GDPR with them. Here's an easy test to see if GDPR applies to you: Check your web analytics. Do you receive any website traffic from the European Union? If the answer is yes, if even one visitor comes from the European Union, GDPR applies to you.
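Editor's sketch: the web analytics check Chris describes could look roughly like the Python below. It assumes your analytics tool can export one ISO country code per visit. The function names and sample data are hypothetical, and the membership set is an assumption that lists today's 27 EU member states (the United Kingdom was also a member when this interview took place). It illustrates the rule of thumb only and makes no legal determination.

```python
from collections import Counter

# ISO 3166-1 alpha-2 codes of the 27 current EU member states.
# Assumption: this set is maintained by hand; "GB" was also a member in 2017.
EU_MEMBER_CODES = frozenset({
    "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR", "DE", "GR",
    "HU", "IE", "IT", "LV", "LT", "LU", "MT", "NL", "PL", "PT", "RO", "SK",
    "SI", "ES", "SE",
})


def eu_traffic_summary(visit_country_codes):
    """Count visits per EU member state from an iterable of country codes."""
    counts = Counter(code.strip().upper() for code in visit_country_codes)
    return {code: n for code, n in counts.items() if code in EU_MEMBER_CODES}


def has_eu_traffic(visit_country_codes):
    """Return True if at least one visit came from an EU member state."""
    return bool(eu_traffic_summary(visit_country_codes))


# Hypothetical export: one country code per recorded visit.
sample_visits = ["US", "us", "DE", "CA", "fr", "DE"]
print(eu_traffic_summary(sample_visits))
print(has_eu_traffic(sample_visits))
```

Normalizing the codes before counting keeps mixed-case exports from hiding EU visits, and returning the per-country counts rather than a bare yes or no shows where the traffic actually comes from.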
Ronald Van Loon: That’s a great overview Chris. I’d like to add just one more example consumers can relate to well. Let’s say you decide to contact Apple to ask how they’re using your personal data because you frequently shop online on their site and use iTunes. You tell them that they can no longer use your data because you won’t be using their services anymore, and request that they send your personal information to Spotify instead. Now Spotify can use your personal data to start making customized music recommendations for you. You also contact Spotify and limit how they use your data, and for what purpose.
Aylee Nielsen: Thank you Chris and Ronald both! I just learned something new from each of you. I didn’t realize how far reaching and immense the implications of these regulations would be until you put it into context with those tangible examples. Ronald, I would like to hear a bit more from you on the topic seeing as the majority of your business is in the EU, as well as your home. How are the companies you work with preparing for GDPR? Do you feel most organizations in the EU are ready for the 2018 deadline?
Ronald Van Loon: Most organizations aren’t adequately prepared, and should see this as an opportunity to begin managing their data properly. GDPR makes it even more imperative for companies to implement data and analytics solutions that help them effectively analyze, classify, and manage their data. They need to have the technologies, processes, and advanced data and analytics capabilities in place to support proper data governance and management, and better provide a positive customer experience across channels.
Aylee Nielsen: And how is GDPR impacting organizations currently? What do you think the long-term impacts might be? How about the global impact?
Ronald Van Loon: Currently, organizations need to begin preparation measures regarding their data management. In the long term, there’s an opportunity to differentiate your organization from your competition, and secure a competitive advantage by gaining client consent to use personal data and improve the customer experience. GDPR increases awareness of the value of personal data, giving customers more control over their own data, which is becoming a “currency” in this digitally driven era.
Aylee Nielsen: Personally, as a consumer, I’m very excited to see how GDPR plays out on a global scale. At this juncture in time it seems imperative that we safeguard citizen data and empower people to take back control of their personal data. Before we start wrapping up I would like to ask a few more questions related to hybrid data management. Could someone kick it off by letting us know how you define hybrid data management in today's cloud-driven world? As well as what use cases are best suited for a hybrid approach and why?
Dez Blanchfield: Hybrid data management is no longer an oddity, it is the norm – it must be part of your organizational DNA. We now live in a multi-homed cloud, data-driven world. It is unlikely that any modern organization would have a use case which does not require hybrid data management.
Aylee Nielsen: So if hybrid data management must be in your DNA, what challenges are organizations facing as they all inevitably transition to hybrid data management systems?
Dr. Craig Brown: I see two primary challenges for organizations transitioning to hybrid data management systems. The first challenge is related to development and the second is related to the integration of the hybrid solution. The development challenges are most often caused by a skill set deficiency. It is taking more time than projected to get development teams up to speed on the changes needed to convert existing code that connects to traditional data sources into code that connects these applications to the hybrid data management systems. Integrating the capabilities offered by the hybrid data management systems, which is why the transition began, is also challenging, considering that a few unknown factors tend to surface during the transition. The biggest elephant in the room is “Data Architecture”. Moving a data architecture from traditional database management systems to hybrid data management systems is a tough nut to crack for many organizations. Reverse engineering doesn’t always get the engine running. Poor data quality, inefficient data models, missing data, and data formats are just a few of the factors that contribute to these challenges. These are the more common factors behind the challenges organizations face when transitioning to hybrid data management systems.
Aylee Nielsen: Even with some pretty formidable challenges facing most organizations looking to make these transitions, many are moving full speed ahead, so what catalysts are responsible for driving them to adopt hybrid data management technologies?
Joe Caserta: Hybrid data management adoption is fast becoming the standard for analytics-driven companies. Now with big data, IoT, and fierce competition by modern businesses, speed-to-value is the biggest catalyst to abandon old data protocols and instead evolve your data infrastructure to keep a laser-focus on maximizing the value of your data and analytic insights. To handle massive volumes of unstructured data generated at unprecedented throughputs, a hybrid data management ecosystem is mandatory to run any modern enterprise. NoSQL, elastic scalability, in-memory processing and machine learning are now prerequisites to accommodate the relentless demand for low-latency analytic insights to support customer-centric business applications. A trusted SQL engine must still be at the heart of any enterprise data management system to ensure data governance for interactive arbitrary queries.
Aylee Nielsen: Dez, you mentioned earlier that hybrid data management technologies are now the norm and must be a part of your organization's DNA. However, even with the catalysts Joe mentioned quickly propelling these technologies forward, there still seem to be many daunting challenges. Where are we right now in the adoption cycle?
Dez Blanchfield: Despite the Cambrian-explosion-like increase in data types and sources now being generated and made available, most organizations are still at the very earliest stages of the adoption cycle.
Fortunately, as is the case with Cloud Computing, Big Data, Analytics, Cognitive Computing and AI, Hybrid Data Management is, for the most part, being baked right into new technologies and services by vendors and service providers.
Aylee Nielsen: And how are investments in new open-source technologies like Spark transforming database products like DB2?
Joe Caserta: Spark has organically become the de facto standard for data preparation, data analytics and data science. Spark’s open-source framework is ideal for processing high-speed, high-volume data in both structured and unstructured formats. Supporting Python, SQL, Scala and R, Spark allows data engineers, data analysts and data scientists to use their language of choice within the same toolset, to create a seamless data pipeline from raw data ingestion to finished data assets and insights.
Over the last decade, the data community has gone full-circle from single-system vendor lock-in, to a myriad of cobbled-together open-source products, to an integrated data ecosystem that incorporates best-of-breed technologies in a single platform on the cloud. Modern companies need decoupled functional capabilities within a tightly coupled infrastructure. Vendors that can offer Spark, elasticity, and high-speed data ingestion along with a governed interactive SQL engine will be the winners for years to come.
Aylee Nielsen: Let’s wrap it up with one last question: what benefits can organizations achieve through hybrid data management approaches?
Dr. Craig Brown: Organizations stand to achieve a number of key goals through the implementation of hybrid data management systems. First of all, they can expect to see cost reductions with regard to both infrastructure and software license provisioning associated with data storage and data access. Data becomes much easier to access because the location and movement of data are easily managed and controlled, and as data volumes grow exponentially the costs of moving the data are drastically reduced. Faster data performance can also be achieved: as more data is generated and stored, the storage systems must perform well and be able to quickly process this data to keep the pipelines operating at full throttle. Active and archive data accessibility will also become faster and more efficient as the organization is enabled to quickly access data and metadata when needed from archived data sets stored on cheaper cold storage, regardless of where it is and which operating system is requesting the file.
Aylee Nielsen: Thank you so much to each and every one of you for sharing your expertise and incredible insights. I’ve gained a lot to think about today and I hope our readers have too! I can’t wait to see everyone once again in Munich at Fast Track Your Data on June 22nd.
Join the Fast Track Your Data CrowdChat to continue this discussion with fellow data enthusiasts and several of the experts featured in the interview above on Tuesday, June 20th, 2017 at 1:00 PM (EDT). Be the first to hear exciting new announcements from IBM and learn much more about all of the topics discussed here during the main event, breakout sessions and demos at Fast Track Your Data – Live from Munich on Thursday, June 22nd, 2017. You’ll also hear from Hilary Mason, Co-founder of Fast Forward Labs, Data Scientist in Residence at Accel, and former Chief Scientist at Bitly; Rob Thomas, General Manager of the IBM Analytics Platform; and Marc Altshuller, General Manager of IBM Business Analytics. This event will help you understand how to make the most of your data by learning to deploy data where it’s needed, adapting it to your changing needs, and allowing for integration of multiple platforms, languages and workloads. We hope to see you there – in person or online!