Flow Careers | Senior Data Scientist Intern (February 2025 Start)

***This is an unpaid internship at this time and is suitable for recent Master's graduates who want to become Senior Data Scientists.***

Company Overview:

 

Flow Global Software Technologies, LLC, operating in the Information Technology (IT) sector, is a cutting-edge, high-tech enterprise AI company that engages in the design, engineering, marketing, sales, and 5-star support of cloud-based SaaS AI sales platforms, built on patent-pending artificial intelligence, deep learning, and other proprietary technologies. The company's first product, Flow Turbo™, is a future-generation SaaS AI sales prospecting platform designed to maximize day-to-day productivity for B2B sales reps within B2B outbound, inbound, and inside sales organizations. The company also provides world-class, award-winning customer support, professional services, and advisory services. The company is headquartered in Austin, Texas, and is registered in Delaware.

 

 

Position Overview:

 

Flow is seeking exceptionally advanced, experienced, dedicated, and committed Senior Data Scientist Interns to engage in the end-to-end development, implementation, delivery, optimization, and large-scale cloud infrastructure deployment of NLP-driven named entity resolution, infinite lead generation pipelines, and high-dimensional knowledge graph construction. This role is designed for individuals who possess an above-PhD-level understanding of context-aware Named Entity Recognition (NER), probabilistic graph-based data fusion, transformer architectures, real-time data extraction, large-scale data associativity resolution, and automated entity validation pipelines. Senior Data Scientist Interns will architect, develop, and optimize multi-tiered NLP and graph-based entity disambiguation frameworks, ensuring precise, scalable, ultra-high-volume data ingestion and contact enrichment for lead generation pipelines. The role involves engineering large-scale, dynamically evolving, self-healing entity relationship graphs with direct applications in high-precision bulk phone number and contact information collection, real-time entity clustering, and probabilistic entity matching across all-source datasets. Candidates must exhibit unparalleled expertise in Java and Python development, leveraging cost-minimized, high-efficiency, multi-threaded, and GPU-accelerated NLP workflows to automate, refine, and scale lead generation pipelines to transformative dimensions, which is vital to the success of all salespeople and of Flow’s AI platforms.


The ideal candidate will be responsible for architecting, training, fine-tuning, and deploying transformer-based NLP models with an emphasis on multi-modal Named Entity Recognition (NER), knowledge graph embeddings, and cross-context entity validation for lead generation pipelines. This includes implementing hierarchical, context-sensitive NER pipelines utilizing pre-trained SpaCy models, custom entity linking algorithms, and transformer architectures such as BERT, RoBERTa, XLNet, DistilBERT, and Sentence-BERT. The candidate will be expected to engineer multi-phase entity resolution frameworks, leveraging fuzzy matching, probabilistic distance functions, and rule-based entity refinement to normalize and enrich structured and unstructured datasets. You will deploy context-aware approximate nearest neighbor (ANN) search algorithms for rapid similarity computation across high-dimensional entity representations, optimizing for zero-error entity classification and deduplication across multi-source lead generation pipelines. Additionally, you will be responsible for constructing domain-specific, dynamically evolving NLP models tailored for high-precision extraction of full names, phone numbers, email addresses, and organization-to-contact relationships from unstructured and semi-structured data sources.
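
By way of illustration, a minimal sketch of this kind of pipeline is shown below: a pre-trained SpaCy model performs NER, and a fuzzy-matching step normalizes extracted organizations against a canonical list. This assumes the en_core_web_sm model and the rapidfuzz library are installed; the canonical-name list and threshold are hypothetical placeholders.

```python
import spacy
from rapidfuzz import process

# Pre-trained SpaCy pipeline (assumes: python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

# Hypothetical canonical organization names to normalize against.
CANONICAL_ORGS = ["Acme Corporation", "Globex Corporation", "Initech"]

def extract_and_normalize(text: str, threshold: float = 85.0):
    """Extract PERSON/ORG entities; fuzzy-match ORGs to canonical names."""
    results = []
    for ent in nlp(text).ents:
        if ent.label_ == "ORG":
            # Default WRatio scorer tolerates partial and reordered matches.
            match = process.extractOne(ent.text, CANONICAL_ORGS)
            if match and match[1] >= threshold:
                results.append((ent.text, "ORG", match[0]))
        elif ent.label_ == "PERSON":
            results.append((ent.text, "PERSON", ent.text))
    return results

print(extract_and_normalize("Jane Doe is a VP at Acme Corp."))
```

In a production pipeline, the small model would give way to a transformer-backed one (e.g., a fine-tuned BERT via spacy-transformers), and the static list would be replaced by knowledge-graph lookups.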


Candidates must possess advanced expertise in distributed web scraping, data extraction, and real-time data ingestion, employing multi-threaded, distributed browser automation frameworks such as Puppeteer, Playwright, and Selenium, together with HTML parsers such as BeautifulSoup4, optimized for high-throughput, multi-region dynamic content extraction from JavaScript-heavy, AJAX-rendered, and obfuscated DOM structures. You will design XPath-based heuristics and regex-powered parsers engineered for real-time DOM navigation, adaptive field extraction, and context-aware entity association. You must implement dynamic proxy rotation strategies, browser fingerprinting mitigation, and stealth-mode scraping methodologies to circumvent detection mechanisms and maximize high-value entity retrieval at massive scale. The candidate will be responsible for developing low-latency, serverless distributed web crawling and scraping architectures with Kafka-backed real-time streaming pipelines to ingest, validate, and enrich petabyte-scale datasets in real time. Additionally, the role demands mastery of multi-source entity disambiguation, applying fuzzy logic-based associative clustering, graph-based entity fusion, and probabilistic deduplication techniques to eliminate low-confidence records and erroneous entity matches.
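
As a hedged illustration of this extraction pattern: Playwright renders a JavaScript-heavy page, BeautifulSoup parses the resulting DOM, and simple regex heuristics pull out contact fields. The URL is a placeholder and the regexes are deliberately naive; real deployments must also respect target sites' terms of service and robots.txt.

```python
import re
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

URL = "https://example.com/team"  # placeholder target

# Naive heuristics for emails and loosely US-style phone numbers.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?1?[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(URL, wait_until="networkidle")  # let AJAX-rendered content settle
    html = page.content()
    browser.close()

text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
print(set(EMAIL_RE.findall(text)), set(PHONE_RE.findall(text)))
```

Proxy rotation and fingerprinting mitigation would layer on top of this, e.g., via browser.new_context(proxy=...) and stealth plugins.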


This role requires deep expertise in large-scale graph-based data structures, knowledge graph embeddings, and Graph Neural Networks (GNNs) for high-precision entity matching and contact enrichment. The candidate will engineer scalable, distributed graph databases using Neo4j, JanusGraph, and ArangoDB, optimized for real-time, high-fanout entity traversal and recursive associative relationship resolution. You will implement heterogeneous knowledge graph construction methodologies, leveraging Graph Attention Networks (GATs), Relational Graph Convolutional Networks (R-GCNs), and GraphSAGE to construct automated, self-healing, and continuously evolving entity relationship models. The candidate must design probabilistic entity fusion scripts to resolve ambiguities across all-source datasets, utilizing advanced contextual similarity measures such as Levenshtein distance, Jaccard similarity, cosine similarity, and MinHash locality-sensitive hashing. You will apply open-source, pre-built, pre-trained GNN-based entity disambiguation frameworks to eliminate duplicate contact records, resolve conflicting entity relationships, and validate high-confidence associations between disparate datasets. Additionally, you will be responsible for optimizing knowledge graph traversal algorithms, ensuring sub-millisecond latency for entity resolution queries at petabyte scale.
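
The contextual similarity measures named here compose naturally into a blended fusion score. Below is a minimal sketch assuming the rapidfuzz and datasketch libraries; the 0.6/0.4 weights are illustrative placeholders, not tuned values.

```python
from rapidfuzz.distance import Levenshtein
from datasketch import MinHash

def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity over token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def minhash_jaccard(a: set, b: set, num_perm: int = 128) -> float:
    """Approximate Jaccard via MinHash; pairs with LSH indexes at scale."""
    m1, m2 = MinHash(num_perm=num_perm), MinHash(num_perm=num_perm)
    for tok in a:
        m1.update(tok.encode("utf8"))
    for tok in b:
        m2.update(tok.encode("utf8"))
    return m1.jaccard(m2)

def fusion_score(name_a: str, name_b: str) -> float:
    """Blend normalized Levenshtein similarity with token-set Jaccard."""
    lev = Levenshtein.normalized_similarity(name_a, name_b)
    jac = jaccard(set(name_a.lower().split()), set(name_b.lower().split()))
    return 0.6 * lev + 0.4 * jac  # illustrative weights

print(fusion_score("Jon A. Smith", "John Smith"))
```

At billion-record scale, exact pairwise scoring is infeasible, so a MinHash-LSH or ANN pre-filter would first shortlist candidate pairs before a fusion score is computed.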


Senior Data Scientist Interns will construct high-performance, distributed ETL pipelines using Apache Spark, Apache Flink, and Apache Kafka Streams, engineered for low-latency, high-volume data ingestion, transformation, and anomaly detection in infinite-scale lead generation workflows. You will develop optimized Approximate Nearest Neighbor (ANN) retrieval mechanisms, leveraging HNSW, FAISS, and ScaNN for real-time, billion-scale entity similarity search and matching. You must implement Luhn’s Algorithm and heuristic-based probabilistic weighting models, ensuring precise textual summarization, deduplication, and high-confidence relevance scoring for extracted lead data. You will also be responsible for developing real-time, event-driven entity validation and enrichment frameworks, enabling multi-region, horizontally scalable lead generation pipelines.
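
For the ANN retrieval piece, the following is a minimal FAISS sketch using an HNSW index over random placeholder vectors; a real pipeline would index learned entity embeddings (e.g., Sentence-BERT outputs) and shard the index across workers.

```python
import numpy as np
import faiss  # assumes the faiss-cpu (or faiss-gpu) package is installed

d = 384     # embedding dimension (e.g., a MiniLM Sentence-BERT model)
n = 10_000  # placeholder corpus size

rng = np.random.default_rng(42)
xb = rng.standard_normal((n, d)).astype("float32")  # stand-in entity embeddings
xq = rng.standard_normal((5, d)).astype("float32")  # stand-in query embeddings

# HNSW graph index with 32 links per node: a common recall/latency trade-off.
index = faiss.IndexHNSWFlat(d, 32)
index.hnsw.efSearch = 64  # search-time beam width; raise for higher recall
index.add(xb)

distances, ids = index.search(xq, 5)  # top-5 nearest entities per query
print(ids)
```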


Candidates must demonstrate expertise in end-to-end cloud-based data pipeline automation, leveraging event-driven serverless architectures built on AWS Lambda, Google Cloud Functions, Azure Functions, and open-source serverless technologies to automate entity ingestion, classification, and validation. Senior Data Scientist Interns must be able to develop high-performance REST, GraphQL, and gRPC-based microservices that interface with distributed knowledge graphs and entity resolution engines to support real-time lead generation workflows. The candidate will be responsible for multi-cloud Kubernetes orchestration with Helm-based package management, ensuring horizontally scalable deployment of NLP and GNN-based entity resolution frameworks. Senior Data Scientist Interns must also implement Terraform- and Ansible-driven Infrastructure as Code (IaC), enforcing fully declarative, reproducible, and auto-scaling infrastructure deployment models across AWS, Azure, GCP, Oracle Cloud, and self-hosted environments.
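
One hedged sketch of the event-driven validation step is shown below: an AWS Lambda handler that screens incoming contact records before enrichment. The event shape and field names ("records", "email", "phone") are hypothetical and would be fixed by the upstream ingestion schema.

```python
import json
import re

# Lightweight validation rules; patterns are deliberately simple here.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
PHONE_RE = re.compile(r"^\+?\d{7,15}$")

def lambda_handler(event, context):
    """Validate a batch of contact records delivered by an event trigger."""
    valid, rejected = [], []
    for rec in event.get("records", []):
        email_ok = bool(EMAIL_RE.match(rec.get("email", "")))
        digits = re.sub(r"[\s().-]", "", rec.get("phone", ""))
        phone_ok = bool(PHONE_RE.match(digits))
        (valid if email_ok and phone_ok else rejected).append(rec)
    # Downstream enrichment (queue, stream, graph write) would be invoked here.
    return {
        "statusCode": 200,
        "body": json.dumps({"valid": len(valid), "rejected": len(rejected)}),
    }
```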


This role requires an ultra-elite level of technical expertise, encompassing full-stack NLP engineering, multi-modal entity resolution, large-scale graph database optimization, and real-time event-driven data extraction. Candidates should already hold a Master’s degree in Computer Science, Data Science, NLP, or Graph Theory, with a proven professional industry track record of developing high-performance data science solutions. Candidates must commit to a minimum of 30 hours per week and to staying with the company for at least 6 full months.

 

 

 

***MUST BE ABLE TO COMMIT TO STAYING AT THE COMPANY FOR A MINIMUM OF 6 MONTHS.***

 

 

 

Key Responsibilities:

 

  • Advanced Expert-Level Data Parsing & Extraction
    • Apply advanced expert-level knowledge of regex to handle intricate data parsing tasks with precision.
    • Develop and implement complex data extraction processes using BeautifulSoup, DOM-based XPath heuristics, and other advanced tools.
    • Utilize headless browsers such as Selenium, Puppeteer, and Playwright for scalable and efficient web scraping.
  • Advanced Expert-Level NLP & Named Entity Recognition
    • Design and implement custom NER pipelines using SpaCy or similar open-source tools, with advanced customization and optimization.
    • Lead the development of Hierarchical Entity Extraction for unstructured datasets at scale.
    • Fine-tune and deploy transformer-based models like BERT and RoBERTa on specialized labeled datasets.
    • Leverage contextual embeddings from models such as Sentence-BERT for high-accuracy NLP deployments.
  • Graph-Based Associativity & Knowledge Representation
    • Construct and optimize probabilistic graphs and distributed knowledge graphs using tools like Neo4j or JanusGraph.
    • Implement Graph Neural Networks (GNNs) for entity disambiguation and advanced relationship modeling.
    • Lead the development of Graph-Based Associativity Models for scalable entity association.
  • Advanced Similarity Matching & Clustering
    • Develop and refine matching algorithms using techniques like Fuzzy Matching, Levenshtein Distance, and BERT embeddings.
    • Implement Approximate Nearest Neighbor (ANN) algorithms for fast and efficient similarity detection.
    • Apply contextual clustering techniques to improve entity recognition systems and enhance model accuracy.
  • Machine Learning & Transformer-Based Models
    • Build, train, and optimize transformer-based NER models for high-fidelity language understanding.
    • Apply Luhn’s algorithm and other heuristic methods to refine large-scale text summarization and keyword extraction tasks (a simplified sketch follows this list).
  • Team Collaboration & Documentation
    • Collaborate with cross-functional teams to ensure seamless integration of advanced data science solutions into business processes.
    • Maintain thorough and precise documentation of methodologies, algorithms, and project findings.
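
As referenced in the responsibilities above, the following is a simplified, from-scratch illustration of Luhn's sentence-scoring idea: sentences are ranked by clusters of frequent, non-stopword terms, scored as hits squared divided by the span they cover. It is a teaching sketch, not a production summarizer; libraries such as sumy ship fuller implementations.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "for"}

def luhn_rank(text: str, top_n: int = 2):
    """Rank sentences by density of significant (frequent, non-stopword) terms."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    # Luhn treats the most frequent content words as "significant".
    significant = {w for w, c in Counter(words).most_common(10) if c > 1}

    def score(sentence: str) -> float:
        toks = re.findall(r"[a-z']+", sentence.lower())
        hits = [i for i, t in enumerate(toks) if t in significant]
        if not hits:
            return 0.0
        span = hits[-1] - hits[0] + 1  # window bounded by significant words
        return len(hits) ** 2 / span   # Luhn's cluster score

    return sorted(sentences, key=score, reverse=True)[:top_n]
```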

 

Qualifications:

 

  • Experience: 4+ years of professional industry experience in advanced data science and deep learning.
  • Education: A recently completed Master’s degree in Computer Science, Data Science, or Distributed Systems.
  • Technical Expertise:
    • Advanced expert-level proficiency in Python for implementing large-scale AI, distributed systems, and data science solutions.
    • Demonstrated expertise in regex for handling complex data parsing scenarios.
    • Proven mastery of BeautifulSoup, XPath-based heuristics, and DOM parsing for data extraction.
    • Advanced skills in headless browsers such as Selenium, Puppeteer, and Playwright.
    • Extensive experience with SpaCy, including custom NER pipelines and transformer-based NLP models.
    • In-depth knowledge of graph databases such as Neo4j or JanusGraph, and advanced graph algorithms.
    • Proficiency in Graph Neural Networks (GNNs) for high-level entity disambiguation and relationship modeling.
    • Expertise in contextual embeddings and similarity algorithms like Levenshtein Distance and Fuzzy Matching.
  • Strong experience in version control using Git, and collaborative development using GitHub.
  • Time Commitment:
    • ***MUST BE ABLE TO DEDICATE AT LEAST 30 HOURS PER WEEK TO THIS POSITION.***
    • ***MUST BE ABLE TO STAY AT THE COMPANY FOR AT LEAST 6 MONTHS.***

 

 

 

Benefits:

 

 

  • Remote-native; location freedom
  • Professional industry experience in the SaaS and AI industry
  • Creative freedom
  • Potential to convert into a full-time position

 

 

Note:

 

This internship offers an exciting opportunity to gain hands-on experience in advanced data science within a high-pressure, innovative environment. Candidates must be self-motivated, proactive, and capable of delivering high-quality results independently. The internship provides valuable exposure to cutting-edge technologies and real-world software development practices, making it an ideal opportunity for aspiring senior data scientists.

 

 

 

***This is an unpaid internship at this time and is suitable for recent Master's graduates who want to become Senior Data Scientists.***

 

 

Please send resumes to services_admin@flowai.tech