Introduction
In this book, we take a practical approach to data analysis with R and Python. With R, we can answer questions about particular datasets, produce models, and export visualizations with relative ease. This makes R an excellent choice for rapid prototyping and analytics: it is a domain-specific language designed for statistical data analysis, and it does that job well.
In this chapter, we look at a different approach to analytics, one geared more towards production environments and applications. The data science pipeline of hypothesis, acquisition, cleaning and munging, analysis, modeling, visualization, and application is by no means a clean, linear process. Moreover, when the analysis is meant to be reproducible at scale in an automated fashion, many new considerations and requirements enter the picture. Many data applications therefore require a broader toolkit. This toolkit should still support rapid prototyping, be available on most systems, and fully support a range of computing operations, including network operations, data operations, and scientific operations. Given these requirements, Python becomes a clear contender as the tool of choice for application-oriented analyses.
Python is an interpreted language (sometimes referred to as a scripting language), much like R. It requires no special IDE or software compilation tools, so it is as quick as R for development and prototyping. Like R, it also makes use of C shared objects to improve computational performance. Additionally, Python ships with, or is readily installed on, Linux, Unix, and macOS machines and is available for Windows too. Python comes with "batteries included", which means that its standard library covers a wide range of modules, from multiprocessing to compression toolsets. Python is a flexible computing powerhouse that can tackle problems in a wide range of domains. If you find yourself in need of libraries that are outside of the standard library, Python also comes with a package manager (like R) that allows you to download and install other code bases.
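To make the "batteries included" idea concrete, the following is a minimal sketch that uses only standard library modules (csv, gzip, io, and statistics). The sample rows and the `roundtrip` helper are hypothetical names invented for this illustration; they do not come from any particular dataset or library.

```python
import csv
import gzip
import io
import statistics

# Hypothetical sample data, invented purely for illustration.
ROWS = [("a", 1.0), ("b", 2.5), ("c", 4.0)]


def roundtrip(rows):
    """Serialize rows as CSV, gzip them in memory, and read them back."""
    text = io.StringIO()
    csv.writer(text).writerows(rows)

    # Compress and decompress the CSV bytes with the standard gzip module.
    compressed = gzip.compress(text.getvalue().encode("utf-8"))
    restored = gzip.decompress(compressed).decode("utf-8")

    # Parse the CSV text back into (name, value) tuples.
    reader = csv.reader(io.StringIO(restored))
    return [(name, float(value)) for name, value in reader]


restored_rows = roundtrip(ROWS)
mean_value = statistics.mean(value for _, value in restored_rows)
print(restored_rows, mean_value)
```

Nothing here needs to be installed separately, which is one reason Python suits scripts that must run unchanged on many machines.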
Python's computational flexibility means that some analytical tasks take more lines of code than their counterparts in R. However, Python does have the tools that allow it to perform the same statistical computing. This leads to an obvious question: when do we use R over Python, and vice versa? This chapter attempts to answer this question by taking an application-oriented approach to statistical analyses.
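As a small illustration of that trade-off, R produces a six-number summary of a numeric vector with a single call to `summary()`. The sketch below assembles a comparable summary using only Python's standard `statistics` module. The `describe` helper and the sample values are hypothetical, and the `inclusive` quantile method is chosen on the assumption that it is the closest match to R's default quantile behavior.

```python
import statistics


def describe(values):
    """Return a six-number summary loosely modeled on R's summary()."""
    # quantiles(n=4) returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
    }


# Hypothetical sample data, invented purely for illustration.
sample = [2, 4, 4, 5, 7, 9]
summary = describe(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The Python version takes more lines, but the result is an ordinary dictionary that can be passed straight into the rest of an application.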