Hands-on data science and Python machine learning: perform data mining and machine learning efficiently using Python and Spark

This book covers the fundamentals of machine learning with Python in a concise and dynamic manner. It covers data mining and large-scale machine learning using Apache Spark. About This Book: Take your first steps in the world of data science by understanding the tools and techniques of data analysis...


Bibliographic Details
Main Author: Kane, Frank
Format: eBook
Language: English
Published: Birmingham, UK: Packt Publishing, 2017
Collection: O'Reilly - Collection details see MPG.ReNa
LEADER 13794nmm a2200613 u 4500
001 EB001923512
003 EBX01000000000000001086414
005 00000000000000.0
007 cr|||||||||||||||||||||
008 210123 ||| eng
020 |a 9781523112227 
020 |a 1787280225 
020 |a 9781787280229 
050 4 |a QA76.73.P98 
100 1 |a Kane, Frank 
245 0 0 |a Hands-on data science and Python machine learning  |b perform data mining and machine learning efficiently using Python and Spark  |c Frank Kane 
260 |a Birmingham, UK  |b Packt Publishing  |c 2017 
300 |a 1 volume  |b illustrations 
505 0 |a Some MLlib Capabilities -- Special MLlib data types -- The vector data type -- LabeledPoint data type -- Rating data type -- Decision Trees in Spark with MLlib -- Exploring decision trees code -- Creating the SparkContext -- Importing and cleaning our data -- Creating a test candidate and building our decision tree -- Running the script -- K-Means Clustering in Spark -- Within set sum of squared errors (WSSSE) -- Running the code -- TF-IDF -- TF-IDF in practice -- Using TF-IDF -- Searching Wikipedia with Spark MLlib -- Import statements -- Creating the initial RDD -- Creating and transforming a HashingTF object -- Computing the TF-IDF score -- Using the Wikipedia search engine algorithm -- Running the algorithm -- Using the Spark 2.0 DataFrame API for MLlib -- How Spark 2.0 MLlib works -- Implementing linear regression -- Summary -- Chapter 10: Testing and Experimental Design -- A/B testing concepts -- A/B tests -- Measuring conversion for A/B testing -- How to attribute conversions -- Variance is your enemy -- T-test and p-value -- The t-statistic or t-test -- The p-value -- Measuring t-statistics and p-values using Python -- Running A/B test on some experimental data -- When there's no real difference between the two groups -- Does the sample size make a difference? -- Sample size increased to six-digits -- Sample size increased seven-digits -- A/A testing -- Determining how long to run an experiment for -- A/B test gotchas -- Novelty effects -- Seasonal effects -- Selection bias -- Auditing selection bias issues -- Data pollution -- Attribution errors -- Summary -- Index 
505 0 |a Intro -- Copyright -- Credits -- About the Author -- www.PacktPub.com -- Customer Feedback -- Table of Contents -- Preface -- Chapter 1: Getting Started -- Installing Enthought Canopy -- Giving the installation a test run -- If you occasionally get problems opening your IPNYB files -- Using and understanding IPython (Jupyter) Notebooks -- Python basics -- Part 1 -- Understanding Python code -- Importing modules -- Data structures -- Experimenting with lists -- Pre colon -- Post colon -- Negative syntax -- Adding list to list -- The append function -- Complex data structures -- Dereferencing a single element -- The sort function -- Reverse sort -- Tuples -- Dereferencing an element -- List of tuples -- Dictionaries -- Iterating through entries -- Python basics -- Part 2 -- Functions in Python -- Lambda functions -- functional programming -- Understanding boolean expressions -- The if statement -- The if-else loop -- Looping -- The while loop -- Exploring activity -- Running Python scripts -- More options than just the IPython/Jupyter Notebook -- Running Python scripts in command prompt -- Using the Canopy IDE -- Summary -- Chapter 2: Statistics and Probability Refresher, and Python Practice -- Types of data -- Numerical data -- Discrete data -- Continuous data -- Categorical data -- Ordinal data -- Mean, median, and mode -- Mean -- Median -- The factor of outliers -- Mode -- Using mean, median, and mode in Python -- Calculating mean using the NumPy package -- Visualizing data using matplotlib -- Calculating median using the NumPy package -- Analyzing the effect of outliers -- Calculating mode using the SciPy package -- Some exercises -- Standard deviation and variance -- Variance -- Measuring variance -- Standard deviation -- Identifying outliers with standard deviation -- Population variance versus sample variance -- The Mathematical explanation 
505 0 |a Chapter 7: More Data Mining and Machine Learning Techniques -- K-nearest neighbors -- concepts -- Using KNN to predict a rating for a movie -- Activity -- Dimensionality reduction and principal component analysis -- Dimensionality reduction -- Principal component analysis -- A PCA example with the Iris dataset -- Activity -- Data warehousing overview -- ETL versus ELT -- Reinforcement learning -- Q-learning -- The exploration problem -- The simple approach -- The better way -- Fancy words -- Markov decision process -- Dynamic programming -- Summary -- Chapter 8: Dealing with Real-World Data -- Bias/variance trade-off -- K-fold cross-validation to avoid overfitting -- Example of k-fold cross-validation using scikit-learn -- Data cleaning and normalisation -- Cleaning web log data -- Applying a regular expression on the web log -- Modification one -- filtering the request field -- Modification two -- filtering post requests -- Modification three -- checking the user agents -- Filtering the activity of spiders/robots -- Modification four -- applying website-specific filters -- Activity for web log data -- Normalizing numerical data -- Detecting outliers -- Dealing with outliers -- Activity for outliers -- Summary -- Chapter 9: Apache Spark -- Machine Learning on Big Data -- Installing Spark -- Installing Spark on Windows -- Installing Spark on other operating systems -- Installing the Java Development Kit -- Installing Spark -- Spark introduction -- It's scalable -- It's fast -- It's young -- It's not difficult -- Components of Spark -- Python versus Scala for Spark -- Spark and Resilient Distributed Datasets (RDD) -- The SparkContext object -- Creating RDDs -- Creating an RDD using a Python list -- Loading an RDD from a text file -- More ways to create RDDs -- RDD operations -- Transformations -- Using map() -- Actions -- Introducing MLlib 
505 0 |a Polynomial regression -- Implementing polynomial regression using NumPy -- Computing the r-squared error -- Activity for polynomial regression -- Multivariate regression and predicting car prices -- Multivariate regression using Python -- Activity for multivariate regression -- Multi-level models -- Summary -- Chapter 5: Machine Learning with Python -- Machine learning and train/test -- Unsupervised learning -- Supervised learning -- Evaluating supervised learning -- K-fold cross validation -- Using train/test to prevent overfitting of a polynomial regression -- Activity -- Bayesian methods -- Concepts -- Implementing a spam classifier with Naïve Bayes -- Activity -- K-Means clustering -- Limitations to k-means clustering -- Clustering people based on income and age -- Activity -- Measuring entropy -- Decision trees -- Concepts -- Decision tree example -- Walking through a decision tree -- Random forests technique -- Decision trees -- Predicting hiring decisions using Python -- Ensemble learning -- Using a random forest -- Activity -- Ensemble learning -- Support vector machine overview -- Using SVM to cluster people by using scikit-learn -- Activity -- Summary -- Chapter 6: Recommender Systems -- What are recommender systems? -- User-based collaborative filtering -- Limitations of user-based collaborative filtering -- Item-based collaborative filtering -- Understanding item-based collaborative filtering -- How item-based collaborative filtering works? -- Collaborative filtering using Python -- Finding movie similarities -- Understanding the code -- The corrwith function -- Improving the results of movie similarities -- Making movie recommendations to people -- Understanding movie recommendations with an example -- Using the groupby command to combine rows -- Removing entries with the drop command -- Improving the recommendation results -- Summary 
505 0 |a Analyzing standard deviation and variance on a histogram -- Using Python to compute standard deviation and variance -- Try it yourself -- Probability density function and probability mass function -- The probability density function and probability mass functions -- Probability density functions -- Probability mass functions -- Types of data distributions -- Uniform distribution -- Normal or Gaussian distribution -- The exponential probability distribution or Power law -- Binomial probability mass function -- Poisson probability mass function -- Percentiles and moments -- Percentiles -- Quartiles -- Computing percentiles in Python -- Moments -- Computing moments in Python -- Summary -- Chapter 3: Matplotlib and Advanced Probability Concepts -- A crash course in Matplotlib -- Generating multiple plots on one graph -- Saving graphs as images -- Adjusting the axes -- Adding a grid -- Changing line types and colors -- Labeling axes and adding a legend -- A fun example -- Generating pie charts -- Generating bar charts -- Generating scatter plots -- Generating histograms -- Generating box-and-whisker plots -- Try it yourself -- Covariance and correlation -- Defining the concepts -- Measuring covariance -- Correlation -- Computing covariance and correlation in Python -- Computing correlation -- The hard way -- Computing correlation -- The NumPy way -- Correlation activity -- Conditional probability -- Conditional probability exercises in Python -- Conditional probability assignment -- My assignment solution -- Bayes' theorem -- Summary -- Chapter 4: Predictive Models -- Linear regression -- The ordinary least squares technique -- The gradient descent technique -- The co-efficient of determination or r-squared -- Computing r-squared -- Interpreting r-squared -- Computing linear regression and r-squared using Python -- Activity for linear regression 
653 |a Data mining / fast 
653 |a Data Mining 
653 |a Spark (Electronic resource : Apache Software Foundation) / fast 
653 |a Machine learning / http://id.loc.gov/authorities/subjects/sh85079324 
653 |a Python (Computer program language) / fast 
653 |a Python (Computer program language) / http://id.loc.gov/authorities/subjects/sh96008834 
653 |a Artificial intelligence / fast 
653 |a Artificial intelligence / http://id.loc.gov/authorities/subjects/sh85008180 
653 |a Artificial Intelligence 
653 |a Spark (Electronic resource : Apache Software Foundation) / http://id.loc.gov/authorities/names/no2015027445 
653 |a COMPUTERS / Programming / General / bisacsh 
653 |a Intelligence artificielle 
653 |a Machine learning / fast 
653 |a Apprentissage automatique 
653 |a artificial intelligence / aat 
653 |a Data mining / http://id.loc.gov/authorities/subjects/sh97002073 
653 |a Python (Langage de programmation) 
653 |a Exploration de données (Informatique) 
653 |a Machine Learning 
041 0 7 |a eng  |2 ISO 639-2 
989 |b OREILLY  |a O'Reilly 
776 |z 1523112220 
776 |z 9781787280229 
776 |z 9781523112227 
776 |z 1787280225 
776 |z 9781787280748 
856 4 0 |u https://learning.oreilly.com/library/view/~/9781787280748/?ar  |x Verlag  |3 Volltext 
082 0 |a 005.133 
520 |a This book covers the fundamentals of machine learning with Python in a concise and dynamic manner. It covers data mining and large-scale machine learning using Apache Spark. About This Book: Take your first steps in the world of data science by understanding the tools and techniques of data analysis; train efficient machine learning models in Python using supervised and unsupervised learning methods; learn how to use Apache Spark for processing Big Data efficiently. Who This Book Is For: If you are a budding data scientist or a data analyst who wants to analyze and gain actionable insights from data using Python, this book is for you. Programmers with some experience in Python who want to enter the lucrative world of Data Science will also find this book to be very useful, but you don't need to be an expert Python coder or mathematician to get the most from this book.
520 |a Based on Frank's successful data science course, Hands-On Data Science and Python Machine Learning empowers you to conduct data analysis and perform efficient machine learning using Python. Let Frank help you unearth the value in your data using the various data mining and data analysis techniques available in Python, and to develop efficient predictive models to predict future results. You will also learn how to perform large-scale machine learning on Big Data using Apache Spark. The book covers preparing your data for analysis, training machine learning models, and visualizing the final data analysis. Style and approach: This comprehen ..
520 |a What You Will Learn: Learn how to clean your data and ready it for analysis; implement the popular clustering and regression methods in Python; train efficient machine learning models using decision trees and random forests; visualize the results of your analysis using Python's Matplotlib library; use Apache Spark's MLlib package to perform machine learning on large datasets. In Detail: Join Frank Kane, who worked on Amazon and IMDb's machine learning algorithms, as he guides you on your first steps into the world of data science. Hands-On Data Science and Python Machine Learning gives you the tools that you need to understand and explore the core topics in the field, and the confidence and practice to build and analyze your own machine learning models. With the help of interesting and easy-to-follow practical examples, Frank Kane explains potentially complex topics such as Bayesian methods and K-means clustering in a way that anybody can understand.