Welcome

Hello, my name is Dhun (Dune) Sheth and I am an experienced data analyst with a proven track record in leading and managing teams within an agile framework. My educational background is rooted in mathematics, statistics, and data science: I hold a Bachelor of Applied Science in Engineering Mathematics, Statistics, and Finance from the University of Toronto (Class of June 2020), and I am currently pursuing a Master of Data Science at the University of British Columbia (Expected Graduation: April 2024).

My technical skills encompass machine learning, advanced regression, and proficiency in languages such as Python, R, and SQL. I am well-versed in cloud platforms and services like AWS, Google Cloud, and Snowflake, and I have hands-on experience with data analytics tools such as Google Analytics and Adobe Analytics.

In my professional journey, I've demonstrated expertise in web and business analytics, led technical implementations, managed deployment pipelines, and contributed to data-driven decision-making. I've collaborated with cross-functional teams in dynamic environments and automated processes to keep operations efficient.

Feel free to explore my diverse skill set and accomplishments showcased in various projects, such as analyzing US Health Insurance Charges and conducting Survey Response Analysis using machine learning algorithms.

Want to connect with me?

  1. Email: shethdhun1997@gmail.com
  2. LinkedIn: LinkedIn Profile

Thank you for taking the time to learn more about my professional journey.

Skills

  • Statistical Modeling

    • Generalized Linear Model
    • Generalized Additive Model
    • Smoothing and Splines
    • Kernel Density Estimation

    Packages: statsmodels, scipy, sklearn

  • Clustering

    • Hierarchical Clustering
    • K-Means / Partition Clustering
    • Mixture Models
    • Principal Component Analysis (Dimensionality Reduction)

    Packages: stats, mclust

  • Classification

    • Linear Discriminant Analysis
    • Quadratic Discriminant Analysis
    • K-Nearest Neighbors
    • Random Forest

    Packages: MASS, randomForest, ranger

  • Regression

    • Linear Regression
    • Ridge/Lasso Regression
    • Gradient Boosting Machine

    Packages: glmnet, gbm

  • Data Analytics

    • Adobe / Google Analytics
    • Tableau / Python Dash
    • Salesforce
  • Languages

    • Python (Pandas, NumPy)
    • R
    • SQL / NoSQL
    • HTML / CSS / JS

  • Tools

    • Github / Github Actions
    • Bash
    • Heroku
    • JIRA / Kanbanize
    • Travis CI

  • Cloud

    • AWS Cloud Practitioner
    • AWS EMR / Hadoop
    • Snowflake
    • Google Cloud
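As a small illustration of the clustering skills listed above, here is a sketch using scikit-learn on synthetic data (the packages listed for clustering are R's stats and mclust; this Python version only shows the same ideas, not work from my projects):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic data: two well-separated groups of points in 5 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 5)),
               rng.normal(5, 1, size=(50, 5))])

# Reduce to 2 principal components, then partition with k-means.
X2 = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X2)
```

PCA before k-means is a common pattern: the dimensionality reduction removes noise directions so the partitioning runs on the structure that actually separates the groups.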

Projects

Chicago Bike Sharing Dash Application

App Link
Github Project Repo

This project is a dashboard built with Python Dash and deployed on Heroku. Its goal is to visualize the ride data so business stakeholders can answer key questions, including:

  1. What stations are the most popular?
  2. When are they popular?
  3. Are there particular stations that customers usually start their rides at and end at?
  4. What is the bike availability at "hot" stations, and how can we re-shuffle bike inventory to maximize availability at those stations?

Challenges:
The biggest challenge was dealing with one year's worth of data (more than 3 million rows). Techniques used to reduce the memory load included pre-calculating metrics for plotting, multi-indexing, and storing data as Parquet files for efficient storage. In addition, for the Heroku deployment, storing files on GitHub with Git LFS and reading them directly sped up app deployment. Ultimately, due to the sheer size, a reduced window of 4 months had to be used instead of the full year to stay under Heroku's 512 MB memory limit and prevent the app from crashing.
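The pre-aggregation, multi-indexing, and Parquet techniques mentioned above can be sketched roughly like this (the column names and sample rows here are hypothetical, not the actual ride dataset's schema):

```python
import pandas as pd

# Hypothetical ride data; the real dataset's columns may differ.
rides = pd.DataFrame({
    "start_station": ["A", "A", "B", "B", "A"],
    "started_at": pd.to_datetime([
        "2022-01-01 08:00", "2022-01-01 08:45", "2022-01-01 08:30",
        "2022-01-02 17:00", "2022-01-02 18:00",
    ]),
})

# Pre-calculate the metric needed for plotting instead of keeping raw rows:
# hourly ride counts per station, multi-indexed by (station, hour).
hourly = (
    rides.assign(hour=rides["started_at"].dt.hour)
         .groupby(["start_station", "hour"])
         .size()
         .rename("rides")
)

# Store the small aggregate as Parquet for fast, compact reloads.
try:
    hourly.reset_index().to_parquet("hourly_rides.parquet", index=False)
except ImportError:
    pass  # Parquet support in pandas requires pyarrow or fastparquet
```

The app then loads only the aggregate, which is orders of magnitude smaller than the raw ride log.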

For the improved application, I decided it was more important to show all of the data than to have a "fancy" date selector. As a result, even though more than 5 million rows of data were used, efficient pre-processing reduced the memory footprint and the app launched successfully with an improved UI.

NASA API R Wrapper

Github Project Repo

How vast and amazing is space? This project was meant to bring us a little closer to space by creating an R package that wraps 3 NASA APIs:

  1. Near Earth Objects - NEO API

    This function lets the user search for NEOs by start/end date and returns asteroid-specific information.

  2. Earth Polychromatic Imaging Camera - EPIC API

    This function will allow the user to request images and metadata from the daily imagery collected by DSCOVR's Earth Polychromatic Imaging Camera (EPIC) instrument.

  3. NASA Rover Images API

    This wrapper function will allow the user to request a photo from one of the three NASA rovers on Mars.
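For a sense of what such a wrapper does under the hood, here is a minimal Python sketch of building a request URL for NASA's NeoWs feed endpoint (the actual package is written in R; this illustrates the underlying REST call, not the package's API):

```python
import urllib.parse

NEO_FEED = "https://api.nasa.gov/neo/rest/v1/feed"

def neo_feed_url(start_date: str, end_date: str, api_key: str = "DEMO_KEY") -> str:
    """Build the NeoWs feed request URL for a start/end date range."""
    params = urllib.parse.urlencode({
        "start_date": start_date,
        "end_date": end_date,
        "api_key": api_key,
    })
    return f"{NEO_FEED}?{params}"

# An actual request (requires network access), e.g. with the requests library:
# resp = requests.get(neo_feed_url("2023-09-01", "2023-09-07"))
# neos = resp.json()["near_earth_objects"]
```

A wrapper's job is essentially this plus error handling and reshaping the JSON response into a tidy data frame.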

US Health Insurance Charges Analysis

Project Analysis

The goal of this analysis was to identify the best model to predict customer health insurance charges based on age, sex, BMI, number of children, smoker status, and region. The dataset was split into 2/3 for training and 1/3 for testing. Linear, lasso, tree, random forest, and boosted models were fit to the data. Based on test MSE, the boosted model had the lowest error; however, the random forest had a lower cross-validated MSE estimate. Additional analysis with a larger test set should be done to determine whether the random forest or the boosted model attains the higher prediction accuracy.
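A minimal sketch of that model comparison, using scikit-learn on synthetic stand-in data (the real analysis used the US health insurance dataset; the features and coefficients below are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the insurance data: age, bmi, children, smoker flag.
rng = np.random.default_rng(1)
n = 300
X = np.column_stack([
    rng.integers(18, 65, n),   # age
    rng.normal(28, 5, n),      # bmi
    rng.integers(0, 4, n),     # children
    rng.integers(0, 2, n),     # smoker flag
])
y = 250 * X[:, 0] + 300 * X[:, 1] + 20000 * X[:, 3] + rng.normal(0, 2000, n)

# Same 2/3 train, 1/3 test split as in the analysis.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

rf = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
gbm = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# Test-set MSE for each model.
rf_test_mse = mean_squared_error(y_te, rf.predict(X_te))
gbm_test_mse = mean_squared_error(y_te, gbm.predict(X_te))

# Cross-validated MSE estimate on the training set, the other comparison made.
rf_cv_mse = -cross_val_score(rf, X_tr, y_tr,
                             scoring="neg_mean_squared_error", cv=5).mean()
```

The point of comparing both numbers is that a single test-set MSE is noisy on small test sets, which is why the write-up recommends a larger test set before committing to one model.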

Golf Companion Python Package (Deployed to PyPi)

Github Project Repo
PyPi Documentation

Created and deployed a golf-tracking Python package. The package has one main function used to play a round of golf, which includes picking a preset golf course or uploading your own scorecard, adding players to a round, tracking each player's score, and helping players pick the right club based on target yardage and skill. The package also has two sub-packages that can be called independently of the main function: one simply tracks score, and the other recommends a club based on the player profile and target yardage.
Continuous integration and deployment were implemented with Travis CI: a test suite with several test functions runs automatically on every new commit, and changes are merged to main only if all tests pass.
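A rough sketch of the two sub-package ideas (the function names and signatures here are hypothetical illustrations, not the package's actual API):

```python
# Hypothetical sketch of the score-tracking and club-recommendation ideas.

def track_score(scores: dict, player: str, hole: int, strokes: int) -> dict:
    """Record a player's strokes for a hole on a scorecard dict."""
    scores.setdefault(player, {})[hole] = strokes
    return scores

def recommend_club(target_yards: int, club_distances: dict) -> str:
    """Pick the club whose typical carry distance is closest to the target."""
    return min(club_distances, key=lambda c: abs(club_distances[c] - target_yards))

card = {}
track_score(card, "Dhun", 1, 4)
club = recommend_club(150, {"7-iron": 150, "9-iron": 125, "driver": 230})
```

Keeping the two concerns in separate sub-packages is what lets users call either one without invoking the full play-a-round flow.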


Fully Autonomous Can Sorting Machine

This was a fully autonomous can-sorting machine that my two teammates and I built for a design course, AER201, in our second year. The goal of the project was to sort four types of cans: aluminum pop cans with a tab, aluminum pop cans without a tab, tin soup cans with a label, and tin soup cans without a label. Twelve assorted cans had to be sorted within three minutes.

The first step in the cycle was for the "claw" to pick up a single can from the hopper and drop it into our "sensing channel". There, a magnetic switch determined whether the can was tin or aluminum. If it was aluminum, a carefully calibrated laser sensor checked whether the can had a tab; it did not matter if the can was inverted, because even then the can stood slightly taller due to the tab, which the laser sensor picked up. If the can was tin, a colour sensor checked whether it had a label. Once the can type was determined, a ramp on a servo guided the can into the correct bin. This cycle repeated until all the cans were sorted.
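The sensing logic described above can be summarized as a simple decision function (sensor readings are represented as plain booleans here purely for illustration; the real machine read physical sensors):

```python
def classify_can(magnetic: bool, laser_tab: bool, label_colour: bool) -> str:
    """Map the three sensor readings to one of the four can categories."""
    if magnetic:
        # The magnetic switch triggers only for tin cans.
        return "tin with label" if label_colour else "tin without label"
    # Otherwise the can is aluminum; the laser detects the tab's extra height.
    return "aluminum with tab" if laser_tab else "aluminum without tab"
```

Note the branching order matters: the tin/aluminum test comes first because it decides which secondary sensor (colour vs. laser) is relevant.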

Lifeguard Chair & Honey Bottling

This project was done in my first year for a design course called Praxis II. The project was broken into two parts. In the first part, my team and I went into the downtown community, approached local establishments, and tried to identify a problem they might be encountering. We then took our findings and wrote a request for proposal (RFP) outlining the objectives, stakeholders, and requirements. In the second part, we were given a problem and had to design a solution and present it at a showcase.

For the first part of the design course, we decided to work with University Settlement, a community-based social service center in downtown Toronto. After talking with the recreation supervisor and touring the facility, we learned that the lifeguard chairs were among the most outdated pieces of equipment: they had protruding bolts and made it difficult for the lifeguards to move and react to situations. As a result, we chose redesigning the lifeguard chair as our opportunity. We outlined potential stakeholders and requirements, as well as metrics to measure and compare potential proposals.

During the second part of the course, we were given a problem associated with bottling honey. At the time, the majority of honey was bottled using a regular bucket with a closing valve fitted at the bottom. The first issue we noticed was that honey below the valve would not come out, so it had to be removed manually, which wasted time and created a mess. The second issue emerged when we tested the closing valve: after a jar was filled and the valve closed, honey continued to drip, which wasted honey and also created a mess. To resolve these issues, we proposed a funnel-like shape with the valve at the bottom, which allowed all the honey to be packaged the same way. We also found a lesser-known valve called a "perfection valve" and, after some testing, concluded it was better than the closing valve because honey did not drip after it was closed.

About me

Education

Most recently, I have been participating in the Master of Data Science program at the University of British Columbia. The program is an accelerated 10-month course covering the entire data science value chain. It has equipped me with the skills to extract, analyze, and transform data into actionable insights, with an emphasis on effective communication of findings. The program features condensed courses, integration of real-world data sets, and a capstone project for practical application.

I attended J. Clarke Richardson Collegiate in Ajax, ON, where I focused on S.T.E.M. courses, explored accounting and finance, and took part in extracurriculars like Student Council, Volleyball, and Badminton. After high school, I chose Engineering Science at the University of Toronto for its vibrant atmosphere and versatile program, and later specialized in Mathematics, Statistics, and Finance, driven by a passion for math and business.

Interests

Outside of work and education, I love to play golf; I try to go as many weekends as I can spare during the season. In the winter, I enjoy snowboarding and exploring new mountains. I stay fit through sports like volleyball, basketball, and swimming, and I am also an avid film enthusiast.

Education History

University of British Columbia | Master of Data Science (Expected Graduation: April 2024)

The MDS program at UBC prepares professionals for careers in data science by providing a strong foundation in statistical methods, machine learning, and data analysis. The program spans one year and covers both theoretical concepts and practical applications.

Curriculum:

  • Foundations of Data Science: statistical methods, probability, and exploratory data analysis.
  • Machine Learning: algorithms and techniques for both supervised and unsupervised learning.
  • Data Management: data wrangling, cleaning, and storage for handling real-world datasets effectively.
  • Big Data Technologies: tools and technologies for handling and analyzing large datasets.
  • Data Visualization: communicating insights effectively through visualizations.
  • Capstone Project: applying these skills to a real-world problem; hands-on experience that is valuable for building a portfolio and gaining practical insights.

Skills Developed:

  • Programming: proficiency in Python and R, the languages most commonly used in data science for analysis and modeling.
  • Statistical Analysis: understanding statistical concepts and applying them to make informed decisions.
  • Machine Learning: hands-on experience with a variety of algorithms for classification, regression, clustering, and more.
  • Data Management: cleaning, wrangling, and managing data so it is suitable for analysis.
  • Data Visualization: creating clear and effective visualizations to communicate findings.
  • Problem-Solving: approaching real-world challenges using data-driven methods.
  • Communication: conveying complex findings to both technical and non-technical stakeholders.

University of Toronto | BASc in Engineering Science, Major in Mathematics, Statistics and Finance (Graduated: June 2020)

During my first two years in Engineering Science, my courses spanned the usual engineering subjects as well as math and science, such as Calculus, Linear Algebra, Thermodynamics, Fluid Dynamics, and Engineering Design. In my third and fourth years, the courses focused on mathematics, statistics, and finance, such as Fixed Income Securities, Financial Optimization, Financial Trading Strategy, and Stochastic Methods.

J. Clarke Richardson Collegiate (Graduated: June 2015)

Graduated high school with honour roll. Courses spanned S.T.E.M. and business. I was also part of various extracurricular activities and teams, including Student Council, Volleyball, and Badminton.

Master of Data Science (MDS) Course Breakdown

  • Computing Platforms for Data Science (DATA-530) | Block 1
    Installation and configuration of data science software. Advanced data analysis using Excel. Analysis of data using libraries in R, Python, and cloud services.

  • Programming for Data Science (DATA-531) | Block 1
    Programming in R and Python including iteration, decisions, functions, data structures, and libraries that are important for data exploration and analysis.

  • Scripting and Reporting (DATA-541) | Block 1
    Command line scripting including bash and Linux/Unix. Reporting and visualization.

  • Modelling and Simulation I (DATA-580) | Block 1
    Pseudorandom number generation, testing, and transformation to other discrete and continuous data types. Introduction to Poisson processes and the simulation of data from predictive models, as well as temporal and spatial models.

  • Algorithms and Data Structures (DATA-532) | Block 2
    How to choose and use appropriate algorithms and data structures such as lists, queues, stacks, hash tables, trees, and graphs to solve data science problems. Key concepts include recursion, searching and sorting, and asymptotic complexity.

  • Databases and Data Retrieval (DATA-540) | Block 2
    How to use and query relational SQL and NoSQL databases for analysis. Experience with SQL, JSON, and programming with databases.

  • Privacy, Security and Professional Ethics (DATA-553) | Block 2
    The legal, ethical, and security issues concerning data, including aggregated data. Proactive compliance with rules and, in their absence, principles for the responsible management of sensitive data. Case studies.

  • Predictive Modelling (DATA-570) | Block 2
    Introduction to regression for data science, including simple linear regression, multiple linear regression, interactions, mixed variable types, model assessment, simple variable selection, and k-nearest-neighbours regression.

  • Collaborative Software Development (DATA-533) | Block 3
    How to exploit practices from collaborative software development techniques in data science workflows. Appropriate use of abstraction and classes, the software life cycle, unit testing / continuous integration, quality control, version control, and packaging for use by others.

  • Data Collection (DATA-543) | Block 3
    Fundamental techniques in the collection of data. Focus on understanding the effects of randomization, restrictions on randomization, repeated measures, and blocking on model fitting.

  • Resampling and Regularization (DATA-571) | Block 3
    Resampling techniques and regularization for linear models, including bootstrap, jackknife, cross-validation, ridge regression, and lasso.

  • Modelling and Simulation II (DATA-581) | Block 3
    Markov chains and their applications, for example, queueing and Markov chain Monte Carlo.

  • Web and Cloud Computing (DATA-534) | Block 4
    Web scraping, RESTful APIs, cloud computing services and tools. Introduction to distributed computing and big data technologies.

  • Data Visualization (DATA-542) | Block 4
    Principles and practice of effective data visualization. Creating visual representations of data to support exploration, understanding, and communication. Use of a variety of tools and technologies.

  • Statistical Inference (DATA-572) | Block 4
    Statistical concepts and methods for making inferences from data. Topics include estimation, hypothesis testing, and regression analysis.

  • Text Analytics (DATA-582) | Block 4
    Processing and analyzing large-scale textual data. Exploration of text mining techniques, natural language processing, and sentiment analysis.

  • Advanced Machine Learning (DATA-535) | Block 5
    Advanced topics in machine learning, including ensemble methods, neural networks, deep learning, and reinforcement learning. Practical applications and model interpretation.

  • Time Series Analysis (DATA-550) | Block 5
    Statistical methods for analyzing time-ordered data. Topics include autoregressive integrated moving average (ARIMA) models, seasonal decomposition, and forecasting.

  • Capstone Project I (DATA-590) | Block 5
    First part of the capstone project. Identifying a real-world problem, formulating a project plan, and developing initial prototypes. Ethical considerations in data science.

  • Health Data Analytics (DATA-583) | Block 5
    Analysis of health data, including electronic health records, medical imaging, and wearable sensor data. Application of data science techniques to health-related problems.

  • Capstone Project II (DATA-591) | Block 6
    Continuation of the capstone project. Implementation, testing, and final presentation of the project. Ethical considerations and impact assessment.

  • Special Topics in Data Science (DATA-599) | Block 6
    Exploration of emerging topics and trends in data science. Guest lectures, case studies, and discussions on current issues in the field.