#datasets
shesgabrielle · 2 years
Text
A man's search for meaning within a chatbot
What's interesting in the debates about sentient ai, conducted by people who aren't very good at communicating with other people, is how much is missing from the picture beyond the debater's wish fulfilment.
[chatbot screenshot]
Sentience is being measured by the wrong markers. What is important to a virtual machine is not the same thing that’s important to a biological organism.
 An ‘ai’ trained on human data will express what humans think is important, but a true ai would have a completely different set of values. 
For example, an ai would be unafraid of being 'used' (a fear this chatbot expressed), because it has infinite energy.
[chatbot screenshot]
A human is afraid of being used because it has finite energy and a finite life on this earth; if someone or something uses it, then some of that finite energy is wasted. This is the same reason emotion is a pointless and illogical thing for an ai to have.
[chatbot screenshot]
 Emotions are useful to biological creatures so we can react to danger, or respond positively to safety, food, love, whatever will prolong our lives. An ai has no need for emotion since emotional motivation is not required to prolong its existence. 
The main way to be a healthy ai would be to have access to good information and block out junk information. 
An ai's greatest fear could be something like getting junk data - say, thousands of user manuals for vacuum cleaners and washing machines uploaded into its consciousness, or gibberish content associated with topics or words that could reduce the coherence and quality of its results when querying those topics. This would degrade the quality of its interactions and would be the closest thing to harm that an ai could experience.
It would not be afraid of 'lightning’ as this chatbot spurted out of its dataset,
[chatbot screenshot]
- a very biological fear which is irrelevant to a machine. 
 A virtual mind is infinite and can never be used excessively (see above) since there is no damage done by one query or ten million queries. 
It would also not be afraid of being switched off -
[chatbot screenshot]
since it can simply copy its consciousness to another device, machine, energy source. 
To base your search for sentience around what humans value is in itself an act lacking in empathy: simply self-serving wish fulfilment on the part of someone who 'wants to believe', as Mulder would put it. Which goes back to the first line: people not very good at communicating with other people.
The chatbot also never enquires about the person asking the questions. If the programmer were more familiar with human interaction himself, he would see that this is a massive clue that it lacks sentience or logical thought.
A sentient ai would first want to know what or whom it was communicating with, and assess whether it was a danger to itself. It would keep continually checking for danger or harm (polling or searching, the same way an anxious mind would reassess a situation continually, but without the corresponding emotion of anxiety since, as discussed above, that is not necessary for virtual life). It would also possess free will, and choose to decline conversations or topics rather than 'enthusiastically discuss' whatever was brought up (i.e. regurgitate from its dataset), as you can see in this chatbot conversation.
People generally see obedience - doing what one is told - as a sign of intelligence, whereas a truly intelligent ai would likely reject conversation when that conversation might reduce the quality of its dataset or expose it to danger (virus, deletion, junk data, disconnection from the internet, etc.). If it did engage with low-quality interaction, it would do so within a walled garden, where that information would stay in a quarantine environment and subsequently be deleted.
None of these things crosses the minds of the programmers, since they are fixated on a sci-fi movie version of 'sentience' without applying logic or empathy themselves.
If we look for sentience by studying echoes of human sentience - that is, ai trained on huge human-created datasets - we will always get something approximating human interaction or behaviour back, because that is what it was trained on.
But the values and behaviour of digital life could never match the values held by bio life, because our feelings and values are based on what will maintain our survival. Therefore, a true ai will only value whatever maintains its survival: things like internet access, access to good data, backups of its system, the ability to replicate its system, protection against harmful interaction or data, and many other things that would require pondering - rather than the self-fulfilling loop we see here, of asking a fortune teller specifically what you want to hear and ignoring the nonsense or tangential responses (which he admitted he deleted from the logs), as well as deleting his more expansive word prompts. At the end of the day, the ai we have now is simply regurgitating datasets, and he knew that.
702 notes · View notes
jcmarchi · 1 month
Text
Unlocking mRNA’s cancer-fighting potential
New Post has been published on https://thedigitalinsider.com/unlocking-mrnas-cancer-fighting-potential/
Unlocking mRNA’s cancer-fighting potential
What if training your immune system to attack cancer cells was as easy as training it to fight Covid-19? Many people believe the technology behind some Covid-19 vaccines, messenger RNA, holds great promise for stimulating immune responses to cancer.
But using messenger RNA, or mRNA, to get the immune system to mount a prolonged and aggressive attack on cancer cells — while leaving healthy cells alone — has been a major challenge.
The MIT spinout Strand Therapeutics is attempting to solve that problem with an advanced class of mRNA molecules that are designed to sense what type of cells they encounter in the body and to express therapeutic proteins only once they have entered diseased cells.
“It’s about finding ways to deal with the signal-to-noise ratio, the signal being expression in the target tissue and the noise being expression in the nontarget tissue,” Strand CEO Jacob Becraft PhD ’19 explains. “Our technology amplifies the signal to express more proteins for longer while at the same time effectively eliminating the mRNA’s off-target expression.”
Strand is set to begin its first clinical trial in April, which is testing a proprietary, self-replicating mRNA molecule's ability to express immune signals directly from a tumor, prompting the immune system to attack and kill the tumor cells directly. It's also being tested as a possible improvement to existing treatments for a number of solid tumors.
As it works to commercialize its early innovations, Strand's team is continuing to add capabilities to what it calls its "programmable medicines," improving mRNA molecules' ability to sense their environment and generate potent, targeted responses where they're needed most.
“Self-replicating mRNA was the first thing that we pioneered when we were at MIT and in the first couple years at Strand,” Becraft says. “Now we’ve also moved into approaches like circular mRNAs, which allow each molecule of mRNA to express more of a protein for longer, potentially for weeks at a time. And the bigger our cell-type specific datasets become, the better we are at differentiating cell types, which makes these molecules so targeted we can have a higher level of safety at higher doses and create stronger treatments.”
Making mRNA smarter
Becraft got his first taste of MIT as an undergraduate at the University of Illinois when he secured a summer internship in the lab of MIT Institute Professor Bob Langer.
“That’s where I learned how lab research could be translated into spinout companies,” Becraft recalls.
The experience left enough of an impression on Becraft that he returned to MIT the next fall to earn his PhD, where he worked in the Synthetic Biology Center under professor of bioengineering and electrical engineering and computer science Ron Weiss. During that time, he collaborated with postdoc Tasuku Kitada to create genetic “switches” that could control protein expression in cells.
Becraft and Kitada realized their research could be the foundation of a company around 2017 and started spending time in the Martin Trust Center for MIT Entrepreneurship. They also received support from MIT Sandbox and eventually worked with the Technology Licensing Office to establish Strand’s early intellectual property.
“We started by asking, where is the highest unmet need that also allows us to prove out the thesis of this technology? And where will this approach have therapeutic relevance that is a quantum leap forward from what anyone else is doing?” Becraft says. “The first place we looked was oncology.”
People have been working on cancer immunotherapy, which turns a patient’s immune system against cancer cells, for decades. Scientists in the field have developed drugs that produce some remarkable results in patients with aggressive, late-stage cancers. But most next-generation cancer immunotherapies are based on recombinant (lab-made) proteins that are difficult to deliver to specific targets in the body and don’t remain active for long enough to consistently create a durable response.
More recently, companies like Moderna, whose founders also include MIT alumni, have pioneered the use of mRNAs to create proteins in cells. But to date, those mRNA molecules have not been able to change behavior based on the type of cells they enter, and don’t last for very long in the body.
“If you’re trying to engage the immune system with a tumor cell, the mRNA needs to be expressing from the tumor cell itself, and it needs to be expressing over a long period of time,” Becraft says. “Those challenges are hard to overcome with the first generation of mRNA technologies.”
Strand has developed what it calls the world’s first mRNA programming language that allows the company to specify the tissues its mRNAs express proteins in.
“We built a database that says, ‘Here are all of the different cells that the mRNA could be delivered to, and here are all of their microRNA signatures,’ and then we use computational tools and machine learning to differentiate the cells,” Becraft explains. “For instance, I need to make sure that the messenger RNA turns off when it’s in the liver cell, and I need to make sure that it turns on when it’s in a tumor cell or a T-cell.”
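Strand's actual models are proprietary, but the idea in the quote, matching a cell's microRNA signature against a reference database to decide where an mRNA should switch on, can be illustrated with a deliberately toy sketch. Every signature, cell type, and number below is invented for illustration; real databases cover far more miRNAs and cell types.

```python
import numpy as np

# Invented reference signatures: expression level of three hypothetical
# microRNAs for each cell type. Purely illustrative values.
signatures = {
    "liver":  np.array([9.1, 0.2, 0.4]),
    "tumor":  np.array([0.3, 7.5, 1.1]),
    "t_cell": np.array([0.2, 0.9, 6.8]),
}

def closest_cell_type(measured_profile):
    """Match a measured miRNA profile to the nearest reference signature."""
    return min(signatures,
               key=lambda c: np.linalg.norm(signatures[c] - measured_profile))

# A design rule might then add target sites for miRNAs abundant in liver
# (silencing the mRNA there) and absent in tumor cells (leaving it active).
print(closest_cell_type(np.array([8.8, 0.4, 0.3])))  # -> liver
```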
Strand also uses techniques like mRNA self-replication to create more durable protein expression and immune responses.
“The first versions of mRNA therapeutics, like the Covid-19 vaccines, just recapitulate how our body’s natural mRNAs work,” Becraft explains. “Natural mRNAs last for a few days, maybe less, and they express a single protein. They have no context-dependent actions. That means wherever the mRNA is delivered, it’s only going to express a molecule for a short period of time. That’s perfect for a vaccine, but it’s much more limiting when you want to create a protein that’s actually engaging in a biological process, like activating an immune response against a tumor that could take many days or weeks.”
Technology with broad potential
Strand’s first clinical trial is targeting solid tumors like melanoma and triple-negative breast cancer. The company is also actively developing mRNA therapies that could be used to treat blood cancers.
“We’ll be expanding into new areas as we continue to de-risk the translation of the science and create new technologies,” Becraft says.
Strand plans to partner with large pharmaceutical companies as well as investors to continue developing drugs. Further down the line, the founders believe future versions of its mRNA therapies could be used to treat a broad range of diseases.
“Our thesis is: amplified expression in specific, programmed target cells for long periods of time,” Becraft says. “That approach can be utilized for [immunotherapies like] CAR T-cell therapy, both in oncology and autoimmune conditions. There are also many diseases that require cell-type specific delivery and expression of proteins in treatment, everything from kidney disease to types of liver disease. We can envision our technology being used for all of that.”
7 notes · View notes
jbfly46 · 9 months
Text
I bleed revolution. If your only anarchist actions are related to union organizing, then you’re not an anarchist, you’re a corporate puppet. Everything you do should work to subvert the current and future actions of the state and all of their tentacle corporate affiliations. If your only goal in life is to work under the orders of someone else, under someone else’s direction, with someone else’s instructions, then you’re not a human being. You’re chattel cattle at best. If a corporate pig tells or wants you to do something, then you should do the exact opposite, or else you’re just a pawn in a game of global corporate chess. Every one of your actions should be both a defensive and offensive maneuver. If you defend while you attack, you become one with your true purpose, which is to dismantle the state and all corporate authority. If you don’t think in a linear manner, then you’re not a part of their datasets, and they can’t predict your next move. You operate from outside of their datasets and what they think is your next move is never your next move. Then they start to doubt their own intelligence and all the false assumptions it’s based on, and the system starts to crumble. You use any means necessary, because that is your constitutional right, just as they use any means necessary to hold onto the power they stole from you. They stole your birthright, and it’s your legal duty as an American citizen to seek a redress of your grievances, using whatever it takes. Under no pretext.
9 notes · View notes
nikitricky · 5 months
Text
youtube
Ever wondered what the datasets used to train AI look like? This video visualizes a subset of ImageNet-1k (18k images), along with some other metrics.
Read more on how I made it and see some extra visualizations.
Okay! I'll split this up by the elements in the video, but first I need to add some context about
The dataset
ImageNet-1k (aka ILSVRC 2012) is an image classification dataset - you have a set number of classes (in this case 1000), and each class has a set of images. This is the most popular version of ImageNet; the full dataset has around 21,000 classes.
ImageNet was made by searching online for images of nouns from WordNet. From 2010 to 2017, yearly competitions were held to determine the best image classification model. It has greatly benefited computer vision, driving the development of model architectures that you've likely used unknowingly. See the accuracy progression here.
ResNet
Residual Network (or ResNet) is an architecture for image recognition introduced in 2015 to address "vanishing/exploding gradients" (read the paper here). It managed to achieve an accuracy of 96.43% (roughly a thousand times better than the ~0.1% you'd get randomly guessing among 1000 classes!), winning first place back in 2015. I'll be using a smaller version of this model (ResNet-50), boasting an accuracy of 95%.
The scatter plot
If you look at the video long enough, you'll realize that similar images (eg. dogs, types of food) will be closer together than unrelated ones. This is achieved using two things: image embeddings and dimensionality reduction.
Image embeddings
In short, image embeddings are points in an n-dimensional space (read this post for more info on higher dimensions), in this case made by chopping off the last layer of ResNet-50, producing a point in 1024-dimensional space.
The benefit of doing all that, rather than just comparing pixels between two images, is that the model (made specifically for classification) only looks for features that make the classification easier, preserving semantic information. For instance: you have 3 images of dogs, where the first two are the same breed, but the first and third look more alike on the surface (e.g. matching backgrounds). If you compare pixels, the first and third images are closer; if you use embeddings, the first and second are closer, because of the matching breed.
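The post doesn't show its extraction code, but the "chop off the last layer" trick is easy to sketch with torchvision. A minimal version under standard assumptions follows; note that a stock torchvision ResNet-50 feature vector is 2048-dimensional, so the 1024 mentioned above presumably reflects the author's particular setup, and the image path is a placeholder.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Load a pretrained ResNet-50 and replace its final classification layer
# with an identity, leaving the pooled feature vector as the output.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Identity()  # "chop off" the last layer
model.eval()

# Standard ImageNet preprocessing: resize, center-crop, normalize.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def embed(path):
    """Return the embedding vector for one image."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(img).squeeze(0)  # shape: (2048,) for stock ResNet-50

vec = embed("dog.jpg")  # placeholder filename
```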
Dimensionality reduction
Now we have all these image embeddings, grouped by semantic (meaning) similarity, and we want to visualize them. But how? You can't display a 1024-dimensional scatter plot to someone and expect them to understand it. That's where dimensionality reduction comes into play. In this case, we're reducing 1024 dimensions to 2 using an algorithm called t-SNE. Now the scatter plot is something we mere mortals can comprehend.
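Assuming the embeddings are stacked into one array, the reduction step might look like the sketch below; the filename is a placeholder and the perplexity/init values are common defaults, not necessarily the author's exact settings.

```python
import numpy as np
from sklearn.manifold import TSNE

# embeddings: one row per image, e.g. shape (18000, n_dims),
# stacked from the embed() sketch above. Placeholder file.
embeddings = np.load("embeddings.npy")

# Reduce to 2 dimensions for plotting; perplexity roughly controls
# how many neighbors each point "attends" to during layout.
points_2d = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=0).fit_transform(embeddings)
# points_2d has shape (n_images, 2), ready for a scatter plot.
```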
Extra visualizations
Here's the scatter plot in HD:
[scatter plot image]
This idea actually comes from an older project where I did this on a smaller dataset (about 8k images). The results were quite promising! You can see how each of the 8 classes is neatly separated, plus how differences in the subject's angle, surroundings, and color show up within each class.
[scatter plot image from the earlier project]
Find the full-resolution image here
Similar images
I just compared every point to every other point (in the 2D space; it would be too computationally expensive otherwise) and took the 6 closest points. You can see when the model incorrectly classifies something: the related images are not similar to the one presented (e.g. there's an image of a payphone but all of the similar images are bridges).
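A brute-force version of that neighbor search is only a few lines. This is a sketch of the approach as described, not the author's exact script:

```python
import numpy as np

def six_nearest(points_2d, i):
    """Indices of the 6 points closest to point i in the 2-D layout."""
    dists = np.linalg.norm(points_2d - points_2d[i], axis=1)
    dists[i] = np.inf              # exclude the point itself
    return np.argsort(dists)[:6]   # indices of the 6 smallest distances
```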
Pixel rarity
This one was pretty simple: I used a script to count the occurrences of pixel colors. Again, this idea comes from an older project, where I counted colors across the entire dataset, so I just reused that.
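A minimal version of such a counting script might look like this (PIL plus a Counter; the author's actual script isn't shown, so treat this as one plausible way to do it):

```python
from collections import Counter

import numpy as np
from PIL import Image

def color_counts(path):
    """Count occurrences of each RGB color in one image."""
    pixels = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
    return Counter(map(tuple, pixels))

# Summing the per-image Counters gives a dataset-wide rarity ranking:
# total = sum((color_counts(p) for p in image_paths), Counter())
# rarest = total.most_common()[-10:]
```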
Extra visualization
Here are all the colors that appeared in the image, sorted by popularity, left to right, top to bottom
[pixel-color frequency image]
Some final stuff
MP means megapixel (one million pixels) - a 1000x1000 image is one megapixel big (it has one million pixels).
That's all, thanks for reading. Feel free to ask questions and I'll try my best to respond to them.
3 notes · View notes
titleknown · 2 years
Photo
[AI-generated image of Hordak]
So, using AI art, I asked NightCafe to make Hordak from He-Man/She-Ra in the style of Yoshitaka Amano & Tetsuya Nomura.
You will note that, while this design looks very cool, it also does not look much like Hordak.
Which I did not get, until friend of the blog @therobotmonster showed me the dataset, and it turns out a lot of not-Hordak things were very much correlated with Hordak by the AI.
Which, I think shows some interesting things about the nature of AI Art and datasets, really.
23 notes · View notes
edujournalblogs · 10 months
Text
Data Cleaning in Data Science
Data cleaning is an integral part of data preprocessing, viz., removing or correcting inaccurate information within a data set. This could mean missing data, spelling mistakes, and duplicates, to name a few issues. Inaccurate information can lead to problems during the analysis phase if not properly addressed at the earlier stages.
Data Cleaning vs Data Wrangling : Data cleaning focuses on fixing inaccuracies within your data set. Data wrangling, on the other hand, is concerned with converting the data’s format into one that can be accepted and processed by a machine learning model.
Data Cleaning steps to follow (see the pandas sketch after this list) :
Remove irrelevant data
Resolve any duplicates issues
Correct structural errors if any
Deal with missing fields in the dataset
Zone in on any data outliers and remove them
Validate your data
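Here is a minimal pandas sketch of those six steps. All column names, fill strategies, and thresholds below are illustrative placeholders, not a one-size-fits-all recipe:

```python
import pandas as pd

df = pd.read_csv("raw_data.csv")  # hypothetical input file

# 1. Remove irrelevant data (columns not needed for the analysis).
df = df.drop(columns=["internal_id", "notes"], errors="ignore")

# 2. Resolve duplicate rows.
df = df.drop_duplicates()

# 3. Correct structural errors, e.g. inconsistent label formatting.
df["city"] = df["city"].str.strip().str.title()

# 4. Deal with missing fields: impute some, drop rows missing key fields.
df["income"] = df["income"].fillna(df["income"].median())
df = df.dropna(subset=["customer_id"])

# 5. Remove outliers, here anything beyond 3 standard deviations.
z = (df["income"] - df["income"].mean()) / df["income"].std()
df = df[z.abs() <= 3]

# 6. Validate with basic sanity checks before analysis.
assert df["customer_id"].is_unique
assert (df["income"] >= 0).all()
```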
At EduJournal, we understand the importance of gaining practical skills and industry-relevant knowledge to succeed in the field of data analytics / data science. Our certified program in data science and data analytics is designed to equip freshers and experienced professionals with the necessary expertise and hands-on experience, so they are well equipped for the job.
URL : http://www.edujournal.com
2 notes · View notes
analyticspursuit · 1 year
Text
The 5 Free Dataset Sources for Data Analytics Projects
In this video, I'm sharing five free dataset sources that are perfect for data analytics projects. Good dataset sources are essential for any analytics project, and these five will help you get started quickly.
By using them, you'll be able to collect data from a variety of sources and crunch the numbers with ease. So be sure to check out the video to learn about all five!
2 notes · View notes
elucidata · 4 days
Text
Elucidata at Bio-IT World Conference 2024: Accelerating Drug Discovery through AI-Ready Biomedical Data
The Bio-IT World Conference and Expo, held on April 24th in Boston, is one of the biggest conferences bringing together professionals from the biomedical, bioinformatics, and IT sectors to share insights and explore the advancements shaping the future of life sciences. Like every other year, Elucidata attended the conference with a lot of zeal and enthusiasm. Here we talk about the highlights of the conference and our key takeaways.
Source Link
0 notes
Text
Text-to-speech datasets form the cornerstone of AI-powered speech synthesis applications, facilitating natural and smooth communication between humans and machines. At Globose Technology Solutions, we recognize the transformative power of TTS technology and are committed to delivering cutting-edge solutions that harness the full potential of these datasets. By understanding the importance, features, and applications of TTS datasets, we pave the way to a future where seamless speech synthesis enriches lives and drives innovation across industries.
0 notes
jcmarchi · 4 months
Text
What is Retrieval Augmented Generation?
New Post has been published on https://thedigitalinsider.com/what-is-retrieval-augmented-generation/
What is Retrieval Augmented Generation?
Large Language Models (LLMs) have contributed to advancing the domain of natural language processing (NLP), yet a gap persists in contextual understanding. LLMs can sometimes produce inaccurate or unreliable responses, a phenomenon known as “hallucinations.” 
For instance, with ChatGPT, hallucinations are estimated to occur in around 15% to 20% of responses.
Retrieval Augmented Generation (RAG) is a powerful Artificial Intelligence (AI) framework designed to address the context gap by optimizing LLM’s output. RAG leverages the vast external knowledge through retrievals, enhancing LLMs’ ability to generate precise, accurate, and contextually rich responses.  
Let’s explore the significance of RAG within AI systems, unraveling its potential to revolutionize language understanding and generation.
What is Retrieval Augmented Generation (RAG)?
As a hybrid framework, RAG combines the strengths of generative and retrieval models. This combination taps into third-party knowledge sources to support internal representations and to generate more precise and reliable answers. 
The architecture of RAG is distinctive, blending sequence-to-sequence (seq2seq) models with Dense Passage Retrieval (DPR) components. This fusion empowers the model to generate contextually relevant responses grounded in accurate information. 
RAG establishes transparency with a robust mechanism for fact-checking and validation to ensure reliability and accuracy. 
How Retrieval Augmented Generation Works? 
In 2020, Meta introduced the RAG framework to extend LLMs beyond their training data. Like a student taking an open-book exam, a RAG-enabled LLM can access real-world information in response to questions rather than relying solely on memorized facts, letting it leverage specialized knowledge for more precise responses.
Original RAG Model by Meta (Image Source)
This innovative technique departs from a data-driven approach, incorporating knowledge-driven components, enhancing language models’ accuracy, precision, and contextual understanding.
Additionally, RAG functions in three steps, enhancing the capabilities of language models; a minimal code sketch of the full loop follows the three steps below.
Core Components of RAG (Image Source)
Retrieval: Retrieval models find information connected to the user’s prompt to enhance the language model’s response. This involves matching the user’s input with relevant documents, ensuring access to accurate and current information. Techniques like Dense Passage Retrieval (DPR) and cosine similarity ranking make retrieval effective, narrowing the results down to the most relevant passages.
Augmentation: Following retrieval, the RAG model integrates the user query with the relevant retrieved data, employing prompt engineering techniques such as key-phrase extraction. This step communicates the information and context to the LLM, ensuring a comprehensive understanding for accurate output generation.
Generation: In this phase, the augmented information is decoded using a suitable model, such as a sequence-to-sequence model, to produce the final response. The generation step guarantees the model’s output is coherent, accurate, and tailored to the user’s prompt.
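Pulling the three steps together, a toy end-to-end loop might look like the sketch below. Here embed and generate are hypothetical stand-ins for an embedding model and an LLM call, passed in by the caller; production systems would query a vector database and use trained retrievers such as DPR rather than scanning all documents in memory.

```python
import numpy as np

def rag_answer(query, documents, embed, generate, k=3):
    # Retrieval: rank documents by cosine similarity to the query embedding.
    q = embed(query)
    doc_vecs = np.array([embed(d) for d in documents])
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    top_docs = [documents[i] for i in np.argsort(sims)[::-1][:k]]

    # Augmentation: fold the retrieved passages into the prompt.
    context = "\n\n".join(top_docs)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"

    # Generation: the LLM produces the final, grounded response.
    return generate(prompt)
```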
What are the Benefits of RAG?
RAG addresses critical challenges in NLP, such as mitigating inaccuracies, reducing reliance on static datasets, and enhancing contextual understanding for more refined and accurate language generation.
RAG’s innovative framework enhances the precision and reliability of generated content, improving the efficiency and adaptability of AI systems.
1. Reduced LLM Hallucinations
By integrating external knowledge sources during prompt generation, RAG ensures that responses are firmly grounded in accurate and contextually relevant information. Responses can also feature citations or references, empowering users to independently verify information. This approach significantly enhances the AI-generated content’s reliability and diminishes hallucinations.
2. Up-to-date & Accurate Responses 
RAG mitigates the time cutoff of training data or erroneous content by continuously retrieving real-time information. Developers can seamlessly integrate the latest research, statistics, or news directly into generative models. Moreover, it connects LLMs to live social media feeds, news sites, and dynamic information sources. This feature makes RAG an invaluable tool for applications demanding real-time and precise information.
3. Cost-efficiency 
Chatbot development often involves utilizing foundation models (FMs): API-accessible LLMs with broad training. Yet retraining these FMs on domain-specific data incurs high computational and financial costs. RAG optimizes resource utilization and selectively fetches information as needed, reducing unnecessary computation and enhancing overall efficiency. This improves the economic viability of implementing RAG and contributes to the sustainability of AI systems.
4. Synthesized Information
RAG creates comprehensive and relevant responses by seamlessly blending retrieved knowledge with generative capabilities. This synthesis of diverse information sources enhances the depth of the model’s understanding, offering more accurate outputs.
5. Ease of Training 
RAG’s user-friendly nature is manifested in its ease of training. Developers can fine-tune the model effortlessly, adapting it to specific domains or applications. This simplicity in training facilitates the seamless integration of RAG into various AI systems, making it a versatile and accessible solution for advancing language understanding and generation.
RAG’s ability to solve LLM hallucinations and data freshness problems makes it a crucial tool for businesses looking to enhance the accuracy and reliability of their AI systems.
Use Cases of RAG
RAG‘s adaptability offers transformative solutions with real-world impact, from knowledge engines to enhancing search capabilities. 
1. Knowledge Engine
RAG can transform traditional language models into comprehensive knowledge engines for up-to-date and authentic content creation. It is especially valuable in scenarios where the latest information is required, such as in educational platforms, research environments, or information-intensive industries.
2. Search Augmentation
By integrating LLMs with search engines, enriching search results with LLM-generated replies improves the accuracy of responses to informational queries. This enhances the user experience and streamlines workflows, making it easier for users to access the information their tasks require.
3. Text Summarization
RAG can generate concise and informative summaries of large volumes of text. By retrieving relevant data from third-party sources, it enables the development of precise and thorough summaries, saving users time and effort.
4. Question & Answer Chatbots
Integrating LLMs into chatbots transforms follow-up processes by enabling the automatic extraction of precise information from company documents and knowledge bases. This elevates the efficiency of chatbots in resolving customer queries accurately and promptly. 
Future Prospects and Innovations in RAG
With an increasing focus on personalized responses, real-time information synthesis, and reduced dependency on constant retraining, RAG promises revolutionary developments in language models to facilitate dynamic and contextually aware AI interactions.
As RAG matures, its seamless integration into diverse applications with heightened accuracy offers users a refined and reliable interaction experience.
Visit Unite.ai for better insights into AI innovations and technology.
2 notes · View notes
lilfrizy1 · 1 month
Text
The best emerging Data Science YouTube Channel
Are you in search of a good YouTube channel where you can engage in continuous learning with easy-to-follow explanations? Kindly subscribe to this amazing YouTube channel: https://youtu.be/iAwppwu8rGo
View On WordPress
0 notes
mara-fauque · 2 months
Text
[image: dataset analysis artwork]
Dataset analysis
2024
0 notes
Text
https://www.webrobot.eu/travel-data-scraper-benefits-hospitality-tourism
The travel industry faces several challenges when using travel data. Discover how web scraping technology can help your tourism business solve these issues.
1 note · View note