Data Scientist available for C2C roles
Name: Janaki
Visa: OPT
Location: MA
Rate: $60/hr on C2C
Professional Summary:
· 5+ years of experience as a Data Scientist who undertakes complex assignments, meets tight deadlines, and delivers superior performance. Practical knowledge of data analytics and optimization; applies strong analytical skills to inform senior management of key trends identified in the data.
· Experienced with the Hadoop framework and its ecosystem, including HDFS, MapReduce, YARN, Spark, Hive, Impala, Sqoop, Oozie, and Kafka.
· Experience in developing data pipelines using AWS services including EC2, S3, Redshift, Glue, Lambda, Step Functions, CloudWatch, SNS, DynamoDB, and SQS.
· Experience with relational and non-relational databases such as MySQL, SQL Server, Oracle, MongoDB, Cassandra, and PostgreSQL.
· Familiar with data architecture, including data ingestion pipeline design, Hadoop information architecture, and data modeling.
· Experience as an Azure analytics data engineer with a passion for developing and expanding high-performing data pipelines for use in data analysis.
· Implemented ETL routines using Azure Data Factory to load transformed data into the Data Warehouse.
· Designed, developed, and maintained the Inform Diagnostics Azure Data Lake (ADL) and Enterprise Data Warehouse.
· Experience working with Azure DevOps build and release pipelines for release deployments.
· Composed data storage, processing, and movement services into streamlined, scalable, and reliable data production pipelines using the fully managed Azure Data Factory service.
· Experience with Microsoft Azure or other cloud platforms such as Amazon Web Services.
· Experience working with business intelligence and data warehouse software, including SSAS, Pentaho, Cognos, Amazon Redshift, and Azure Data Warehouse.
· Good experience with the programming languages Python and Scala.
· Strong experience in Extraction, Transformation and Loading (ETL) of data from various sources into Data Warehouses and Data Marts using Informatica PowerCenter (Repository Manager, Designer, Workflow Manager, Workflow Monitor, Metadata Manager), Power Exchange, and Power Connect as the ETL tool on Oracle, DB2, and SQL Server databases.
· Built logical and physical data models for Snowflake as per the required changes.
· Performed virtual warehouse sizing in Snowflake for different types of workloads.
· Experience in developing Python ETL jobs that run on AWS services and integrate with enterprise systems such as enterprise logging and alerting, enterprise configuration management, and enterprise build and versioning infrastructure (a minimal sketch follows this summary).
· Experience with Apache Spark, Spark Streaming, Spark SQL, and NoSQL databases like HBase, Cassandra, and MongoDB.
· Experience in analyzing data using the Hadoop ecosystem, including HDFS, Hive, Spark, Spark Streaming, MLlib, NiFi, Elasticsearch, Kibana, Kafka, HBase, Zookeeper, Pig, Sqoop, and Flume.
· Experience in BI/DW solutions (ETL, OLAP, data marts), Informatica, and BI reporting tools such as Tableau and QlikView; experienced leading teams of application, ETL, and BI developers as well as testing teams.
· Ability to communicate with Business Analysts, Data Modelers, and Solution Architects when converting ETL designs into specific development activities.
· Strong experience with tools such as Microsoft SSIS, SSRS, Python, and similar ETL tooling.
· Managed the integration of key external data sources into enterprise data repositories for use by all analytics teams and report writers across the enterprise.
· Used various transformations like Filter, Expression, Sequence Generator, Update Strategy, Joiner, Stored Procedure, and Union to develop robust mappings in the Informatica Designer.
· Experience in using Terraform for building AWS infrastructure services like EC2, Lambda and S3.
Technical Skills:
Big Data: HDFS, MapReduce, Hive, Pig, Kafka, Sqoop, Flume, Oozie, Zookeeper, Ambari, NiFi, Spark
NoSQL Databases: HBase, Cassandra, MongoDB
Languages: C, Python, Java, J2EE, PL/SQL, Pig Latin, HiveQL, Unix shell scripts, R
Java/J2EE Technologies: Applets, Swing, JDBC, JNDI, JSON, JSTL, RMI, JMS, JavaScript, JSP, Servlets, EJB, JSF, jQuery
Frameworks: MVC, Struts, Spring, Hibernate
Operating Systems: Sun Solaris, HP-UX, Red Hat Linux, Ubuntu Linux, Windows XP/Vista/7/8
Web Technologies: HTML, DHTML, XML, AJAX, WSDL, SOAP
Web/Application Servers: Apache Tomcat, WebLogic, JBoss
Databases: Oracle 9i/10g/11g, DB2, SQL Server, MySQL, Teradata, Snowflake
Tools and IDEs: Eclipse, NetBeans, Toad, Maven, ANT, Hudson, Sonar, JDeveloper, Assent PMD, DB Visualizer, Power BI, Tableau
Version Control: Git, IntelliJ, Eclipse
Cloud: AWS, Azure
Professional Experience:
MassDOT, MA Jan 2021 - Current
Data Scientist
Project: Analyzing chatbot fallouts (text data) based on various criteria using ML/NLP techniques
Responsibilities:
· Performance-tuned Spark applications by setting the right batch interval, the correct level of parallelism, and appropriate memory configuration.
· Demonstrated strength in Unix shell scripting, Python/PySpark scripting, data modeling, ETL development, and data warehousing.
· Hands-on experience with Spark and Spark Streaming, creating RDDs and applying operations (transformations and actions).
· Experience in maintaining data warehouse systems and working on large-scale data transformation using EMR, Hadoop, Hive, and other big data technologies.
· Experience with ETL, data modeling, data integration, and working with large-scale datasets. Extremely proficient in writing efficient SQL and working with large data volumes.
· Interfaced with customers, gathering requirements and delivering complete data solutions.
· Experienced in using Matplotlib and Seaborn libraries in Python for visualization.
· Performed data extraction and feature engineering tasks on a distributed Oracle system.
· Created Hive external tables to stage data and then moved the data from staging to main tables.
· The objective of this project is to build a data lake as a cloud-based solution in AWS using Apache Spark.
· Worked on managing and reviewing Hadoop log files. Tested and reported defects in an Agile Methodology perspective.
· Created data pipelines as per the business requirements and scheduled them using Oozie coordinators.
· Performed multiple aspects of the development lifecycle: design, cloud engineering (infrastructure, network, security, and administration), ingestion, data modeling, testing, CI/CD pipelines, performance tuning, deployments, consumption, BI, and production support.
· Experienced with Kafka custom connectors, creating publishers, consumers, and consumer groups.
· Created solutions to transform data from various sources and loaded it into Snowflake and created a data lake.
· Understanding of federal security control frameworks, i.e., NIST 800-53, CSF, Pub 1075, and FTI requirements.
· Current certification: "Hands on Snowflake WebUI Essentials" – Snowflake.
· Created best practices and standards for data pipelining and integration with Snowflake data warehouses.
· Designed and implemented secure data pipelines into a Snowflake data warehouse from on-premises and cloud data sources.
· Designed and implemented APIs to extract data from various sources.
· Used Git for code base maintenance and JIRA for task tracking and monitoring.
· Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time and persists it into Cassandra (a minimal sketch follows this list).
· Implemented DLP controls.
· Working knowledge of Varonis DatAdvantage, MS Excel and Microsoft O365 suite.
Environment: Hadoop, Cloudera, Talend, Scala, Spark, HDFS, Hive, Pig, Sqoop, DB2, SQL, Linux, Yarn, NDM, Informatica, AWS, MS Visio, Windows & Microsoft Office, Snowflake
Fidelity Investments Jan 2018 – Dec 2020
Data Scientist
Project 1: Analyzing customer purchase behavior across various products
Description: Develop a model to understand customer purchase behavior (specifically, purchase amount) against various products of different categories using purchase summaries and customer demographics (age, gender, marital status, location), which will help create personalized offers for customers across different products.
Responsibilities:
· Implemented data models and algorithms for machine learning solutions as a member of the data science team.
· Experience leading large-scale data warehousing and analytics projects, including using AWS technologies (S3, EC2, Data Pipeline) and other big data technologies.
· Implemented data migration and data engineering solutions using Azure products and services like Azure Data Lake Storages, Azure Data Factory, Azure Databricks etc. and traditional data warehouse tools.
· Developed Reports & Dashboards using Tableau in a Snowflake Data Warehouse.
· Worked with master data management and data quality tools such as Talend.
· Established analytics capabilities using technologies such as Snowflake, Redshift, and other cloud-based analytics platforms.
· Implemented very large-scale data intelligence solutions around the Snowflake data warehouse model over multiple datasets using WhereScape and a data lake; migrated other databases to Snowflake.
· Worked with the Snowflake cloud data warehouse and AWS S3 buckets to integrate data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables; also implemented Snowflake cloning.
· Performed data exploration using data.table and ggplot2
· Performed data manipulation to create new variables, revalue existing variables, and treat missing values to get the data ready for the modeling stage.
· Performed data transformation for rescaling by normalizing the variables.
· Delivered interactive visualizations and dashboards using ggplot2 to present analysis outcomes in terms of patterns, anomalies, and predictions.
· Explained results of complex models simply and understandably to stakeholders; experienced in storytelling using data visualization tools.
· Modeled statistical algorithms against datasets and deployed predictive models using RStudio to provide solutions.
· Explored the power of different machine learning algorithms in H2O, performing multiple regression, GBM, and Random Forest (a minimal sketch follows this list).
· Checked model performance using h2o.performance; the Random Forest model produced the more accurate predictions.
· Experience with similar data discovery and cloud data migration efforts.
· Supported data migration from on-prem file servers to Microsoft OneDrive
Project 2: Detecting spam vs. legitimate emails
Description: The email spam collection is a set of spam-tagged messages selected for email spam research. It contains one set of email messages in English, for which we need to develop a machine learning model to predict whether the emails in the data are spam or ham.
Responsibilities:
· Imported the data using pandas in Python and got a summary of the data using the info function.
· Used the Matplotlib and Seaborn libraries in Python for visualization.
· Imported CountVectorizer from the sklearn package and fitted it on the dataset to transform the text into a binary bag-of-words representation.
· Split the data into training and testing sets using train_test_split.
· Trained a Naive Bayes machine learning model on the training set.
· Used the Naive Bayes classifier to generate predictions on the test dataset and built a confusion matrix to determine the model's accuracy, precision, and recall (a minimal sketch follows this list).
· Created solutions to transform data from various sources and loaded it into Snowflake and created a data lake.
· Worked closely with source data applications teams and product owners and designed, implemented, and supported analytics solutions that provide insights to make better decisions.
· Implemented data migration and data engineering solutions using Azure products and services like Azure Data Lake Storages, Azure Data Factory, Azure Databricks etc. and traditional data warehouse tools.
· Worked with Hadoop infrastructure to store data in HDFS and used Spark/Hive SQL to migrate the underlying SQL codebase to Azure.
· Used PolyBase for the ETL/ELT process with Azure Data Warehouse to keep data in Blob Storage with almost no limitation on data volume.
· Analyzed results, understood business requirements and removed false positives.
· Built processes for a methodical and secure migration.
· Built Azure Data Warehouse Table Data sets for Power BI Reports.
· Built out data lake ingestion and staging from corporate data sources for analytics purposes using Azure Data Factory.
Project 3: Credit Card Fraud Detection
Responsibilities:
· Generated heatmaps and count plots using the seaborn and matplotlib packages; detected and removed outliers.
· Performed dimensionality reduction with t-SNE for visualization; feature scaling and data normalization were performed using StandardScaler.
· Developed credit card risk modeling to enhance existing risk scorecards and marketing analytics modeling.
· Trained different models, including Naive Bayes, Logistic Regression, Random Forest, and Support Vector Machine, on the dataset.
· Among them, the Random Forest model was selected for its better performance and higher degree of comprehensiveness (a minimal sketch follows this list).
· Performed predictive modeling on the test data and generated a classification report to identify fraudulent transactions.
· Responsible for using Spark and Hive for data transformation, with intensive use of Spark SQL to analyze vast data stores and uncover insights.
· Implemented ingestion pipelines to migrate ETL to Hadoop using Spark Streaming and Oozie workflows. Loaded unstructured data into the Hadoop Distributed File System (HDFS).
· Conducted POCs on migrating to Spark and Spark Streaming using Kafka to process live data streams and compared Spark performance with Hive and SQL.
· Experienced in writing SQL queries in performing ETL techniques and strong understanding of data warehousing.
· Highly analytical and process-oriented with in-depth knowledge of database types; research methodologies; and big data capture, mining, manipulation, and visualization.
· Experienced working on the Hadoop framework and its ecosystem, including HDFS, MapReduce, YARN, Spark, Hive, Impala, Sqoop, and Oozie.
· Experience in data ingestion using Spark, Sqoop and Kafka.
· Experienced in Spark Programming with Scala and Python.
· Expertise in Spark Streaming (Lambda architecture), Spark SQL, and tuning and debugging Spark clusters (Mesos).
· Mentored junior developers and kept them updated on current cutting-edge technologies such as Hadoop, Spark, and Spark SQL.
Project 4: Predict whether a bank customer will retire based on 401(k) savings and age
Responsibilities:
· Visualized the data and explored insights using the seaborn package.
· Performed normalization on the dataset using the StandardScaler function.
· Developed different models to identify the best one and found that a Support Vector Machine (SVM) worked best (a minimal sketch follows this list).
· Used the ML model to get the predicted values.
· Created a confusion matrix to look at the FP, TP, FN, and TN counts.
· Generated a classification report to explain the accuracy of the results.
· Formulated predictive models to forecast product-category-wise order volumes and season-wise color and style choices so that departmental buyers can make educated, data-driven decisions.
· Worked with Different data types in R such as vectors, lists, matrices, arrays and data frames
· Read and wrote data from and to various .csv, .xml, and JSON files in both RStudio and a Python IDE (Anaconda).
Environment: Spark, YARN, Hive, Pig, Scala, Mahout, NiFi, Python, Hadoop, Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB), DynamoDB, Kibana, NoSQL, Sqoop.
Zeta Global Jun 2016 – Dec 2017
Jr. Data Scientist
· Developed a time series forecasting model to determine sales demand and observe sales trends over a given period.
· Boosting algorithms usually improve results compared to other models; found GBM to be the best at predicting purchase amount.
· Connected and supported data warehouses that feed business analytics tools such as Tableau.
· Applied various machine learning algorithms and statistical modeling techniques, such as decision trees, Naïve Bayes, Principal Component Analysis, regression models, Artificial Neural Networks, clustering, and SVM, to identify volume using the scikit-learn package in Python and comparable packages in R.
· Used Tableau and Gretl for business intelligence tasks (data cleaning, sanitizing, analyzing, and creating dashboards) and presented results to clients.
· Used different types of charts, including but not limited to pie charts, bar diagrams, square diagrams, and heat maps, to create dashboards that help management understand and decide on local and international distribution of vendors, manufacturing units, warehouses, cost-effective logistics, and more.
· Built collaborative relationships with cross functional team members
· Performed preprocessing, curation and transformations on unstructured text data
· Developed machine learning and deep learning models using state-of-the-art NLP techniques for sentiment analysis and topic modeling (a minimal sketch follows this list).
· Established post-migration controls to limit ongoing risks of unauthorized access and loss of confidential data in OneDrive
Project: Determine the reasons behind an unusual churn rate
Responsibilities:
· Imported the data into R using fread and converted the categorical variables into factors.
· Split the data into training and testing sets using the caTools package.
· Performed feature scaling on the data because training an artificial neural network is highly compute-intensive, involving heavy computation and parallelism considerations.
· Built the ANN model in R using the h2o package; used the init function to initialize the h2o instance and establish a connection to the model.
· Developed the model using the deep learning function, with rectifier activation used to introduce non-linearity into the model (a minimal sketch follows this list).
· Performed forward and backward propagation of the model on the training set.
· Used the predict function to obtain values for the test set from the classifier model.
· Used the caret package to build a confusion matrix to determine the model's accuracy, precision, and recall.
Environment: Spark, YARN, Hive, Pig, Scala, Mahout, NiFi, Python, Hadoop, Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB), DynamoDB, Kibana, NoSQL, Sqoop.
Education Details:
Vignan Lara Institute of technology & Sciences, Vadlamudi, AP, India.
B.Tech – Computer science – Dec 2014.
Warm Regards,
Naveen, Team Lead - Sales
Techsmart Global INC.,
666 Plainsboro Rd, Suite 1116, Plainsboro, NJ 08536.
Phone: 732-798-7574