Statistics for data science: leverage the power of statistics for data analysis, classification, regression, machine learning, and neural networks

Get your statistics basics right before diving into the world of data science. About This Book: No need to take a degree in statistics; read this book and get a strong statistics base for data science and real-world programs. Implement statistics in data science tasks such as data cleaning, mining, an...

Bibliographic Details
Main Author: Miller, James D.
Format: eBook
Language: English
Published: Birmingham, UK: Packt Publishing, 2017
Collection: O'Reilly - Collection details see MPG.ReNa
Table of Contents:
  • Cover
  • Copyright
  • Credits
  • About the Author
  • About the Reviewer
  • www.PacktPub.com
  • Customer Feedback
  • Table of Contents
  • Preface
  • Chapter 1: Transitioning from Data Developer to Data Scientist
  • Data developer thinking
  • Objectives of a data developer
  • Querying or mining
  • Data quality or data cleansing
  • Data modeling
  • Issue or insights
  • Thought process
  • Developer versus scientist
  • New data, new source
  • Quality questions
  • Querying and mining
  • Performance
  • Financial reporting
  • Visualizing
  • Tools of the trade
  • Advantages of thinking like a data scientist
  • Developing a better approach to understanding data
  • Using statistical thinking during program or database designing
  • Adding to your personal toolbox
  • Increased marketability
  • Perpetual learning
  • Seeing the future
  • Transitioning to a data scientist
  • Let's move ahead
  • Summary
  • Chapter 2: Declaring the Objectives
  • Key objectives of data science
  • Collecting data
  • Processing data
  • Exploring and visualizing data
  • Analyzing the data and/or applying machine learning to the data
  • Deciding (or planning) based upon acquired insight
  • Thinking like a data scientist
  • Bringing statistics into data science
  • Common terminology
  • Statistical population
  • Probability
  • False positives
  • Statistical inference
  • Regression
  • Fitting
  • Categorical data
  • Classification
  • Clustering
  • Statistical comparison
  • Coding
  • Distributions
  • Data mining
  • Decision trees
  • Machine learning
  • Munging and wrangling
  • Visualization
  • D3
  • Regularization
  • Assessment
  • Cross-validation
  • Neural networks
  • Boosting
  • Lift
  • Mode
  • Outlier
  • Predictive modeling
  • Big Data
  • Confidence interval
  • Writing
  • Summary
  • Chapter 3: A Developer's Approach to Data Cleaning
  • Understanding basic data cleaning
  • Common data issues
  • Contextual data issues
  • Cleaning techniques
  • R and common data issues
  • Outliers
  • Step 1
  • Profiling the data
  • Step 2
  • Addressing the outliers
  • Domain expertise
  • Validity checking
  • Enhancing data
  • Harmonization
  • Standardization
  • Transformations
  • Deductive correction
  • Deterministic imputation
  • Summary
  • Chapter 4: Data Mining and the Database Developer
  • Data mining
  • Common techniques
  • Visualization
  • Cluster analysis
  • Correlation analysis
  • Discriminant analysis
  • Factor analysis
  • Regression analysis
  • Logistic analysis
  • Purpose
  • Mining versus querying
  • Choosing R for data mining
  • Visualizations
  • Current smokers
  • Missing values
  • A cluster analysis
  • Dimensional reduction
  • Calculating statistical significance
  • Frequent patterning
  • Frequent item-setting
  • Sequence mining
  • Summary
  • Chapter 5: Statistical Analysis for the Database Developer
  • Data analysis
  • Looking closer
  • Statistical analysis
  • Summarization
  • Comparing groups
  • Samples
  • Group comparison conclusions
  • Summarization modeling
  • Establishing the nature of data
  • Successful statistical analysis
  • R and statistical analysis
  • Summary
  • Chapter 6: Database Progression to Database Regression
  • Introducing statistical regression
  • Techniques and approaches for regression
  • Choosing your technique
  • Does it fit?
  • Identifying opportunities for statistical regression
  • Summarizing data
  • Exploring relationships
  • Testing significance of differences
  • Project profitability
  • R and statistical regression
  • A working example
  • Establishing the data profile
  • The graphical analysis
  • Predicting with our linear model
  • Step 1: Chunking the data
  • Step 2: Creating the model on the training data
  • Step 3: Predicting the projected profit on test data
  • Step 4: Reviewing the model
  • Step 4: Accuracy and error
  • Summary
  • Chapter 7: Regularization for Database Improvement
  • Statistical regularization
  • Various statistical regularization methods
  • Ridge
  • Lasso
  • Least angles
  • Opportunities for regularization
  • Collinearity
  • Sparse solutions
  • High-dimensional data
  • Classification
  • Using data to understand statistical regularization
  • Improving data or a data model
  • Simplification
  • Relevance
  • Speed
  • Transformation
  • Variation of coefficients
  • Casual inference
  • Back to regularization
  • Reliability
  • Using R for statistical regularization
  • Parameter Setup
  • Summary
  • Chapter 8: Database Development and Assessment
  • Assessment and statistical assessment
  • Objectives
  • Baselines
  • Planning for assessment
  • Evaluation
  • Development versus assessment
  • Planning
  • Data assessment and data quality assurance
  • Categorizing quality
  • Relevance
  • Cross-validation
  • Preparing data
  • R and statistical assessment
  • Questions to ask
  • Learning curves
  • Example of a learning curve
  • Summary
  • Chapter 9: Databases and Neural Networks
  • Ask any data scientist
  • Defining neural network
  • Nodes
  • Layers
  • Training
  • Solution
  • Understanding the concepts
  • Neural network models and database models
  • No single or main node
  • Not serial
  • No memory address to store results
  • R-based neural networks
  • References
  • Data prep and preprocessing
  • Data splitting
  • Model parameters
  • Cross-validation
  • R packages for ANN development
  • ANN
  • ANN2
  • NNET
  • Black boxes
  • A use case
  • Popular use cases
  • Character recognition
  • Image compression
  • Stock market prediction
  • Fraud detection
  • Neuroscience
  • Summary
  • Chapter 10: Boosting your Database
  • Definition and purpose
  • Bias
  • Categorizing bias
  • Causes of bias
  • Bias data collection
  • Bias sample selection
  • Variance
  • ANOVA
  • Noise
  • Noisy data
  • Weak and strong learners
  • Weak to strong
  • Model bias
  • Training and prediction time
  • Complexity
  • Which way?
  • Back to boosting
  • How it started
  • AdaBoost
  • What you can learn from boosting (to help) your database
  • Using R to illustrate boosting methods
  • Prepping the data
  • Training
  • Ready for boosting
  • Example results
  • Summary
  • Chapter 11: Database Classification using Support Vector Machines
  • Database classification
  • Data classification in statistics
  • Guidelines for classifying data
  • Common guidelines
  • Definitions
  • Definition and purpose of an SVM
  • The trick
  • Feature space and cheap computations
  • Drawing the line
  • More than classification
  • Downside
  • Reference resources
  • Predicting credit scores
  • Using R and an SVM to classify data in a database
  • Moving on
  • Summary
  • Chapter 12: Database Structures and Machine Learning
  • Data structures and data models
  • Data structures
  • Data models
  • What's the difference?
  • Relationships
  • Machine learning
  • Overview of machine learning concepts
  • Key elements of machine learning
  • Representation
  • Evaluation
  • Optimization
  • Types of machine learning
  • Supervised learning
  • Unsupervised learning
  • Semi-supervised learning
  • Reinforcement learning
  • Most popular
  • Applications of machine learning
  • Machine learning in practice
  • Understanding
  • Preparation
  • Learning
  • Interpretation
  • Deployment
  • Iteration
  • Using R to apply machine learning techniques to a database
  • Understanding the data
  • Preparing
  • Data developer
  • Understanding the challenge
  • Cross-tabbing and plotting
  • Summary
  • Index