I have a question about some code I wrote to parse a data file.
My code looks like this:
open($FILE, "C:\\data\\NCI sample.txt") or die "oh oh can't open";

open(OUT, ">C:\\data\\parsedNCI.txt");

while ($line = <$FILE>) {
    if ($line =~ /ROtclserve/) {
        print OUT "0";
    }
    if ($line =~ /ROtclserve|DTP_NAMES|E_CAS|E_STEREO_SPECIFIED|E_FORMULA|E_SMILES/) {
        $line= trimwhitespace(getData($FILE));
        print OUT "@@@@[$line]";
    }
    if ($line =~ /\$\$\$\$/) {
        print OUT "@@@@@@";
    }
}

sub getData() {
    my $FH = shift;
    my $line ="";
    my $record = "";

    while( ($line = <$FH>) && $line !~ /\>/ ) {
        $record .= $line;
    }

    return $record;
}

sub trimwhitespace($) {
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    $string =~ s/\.$//;
    $string =~ s/,$//g;
    return $string;
}

An entry of my data looks like:
NSC 1
ROtclserve09080314563D 0 0.00000 0.00000 1

15 15 0 0 0 0 0 0 0 0999 V2000
-2.2423 1.0418 0.0018 O 0 0 0 0 0 0 0 0 0 0 0 0
2.7534 -0.5594 0.0011 O 0 0 0 0 0 0 0 0 0 0 0 0
-1.0858 0.6711 0.0019 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.7730 -0.7725 -0.0002 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0079 1.6639 -0.0018 C 0 0 0 0 0 0 0 0 0 0 0 0
0.5032 -1.1815 -0.0005 C 0 0 0 0 0 0 0 0 0 0 0 0
1.2841 1.2548 -0.0021 C 0 0 0 0 0 0 0 0 0 0 0 0
1.5970 -0.1888 0.0014 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.8888 -1.7853 -0.0016 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.2208 2.7193 -0.0047 H 0 0 0 0 0 0 0 0 0 0 0 0
0.7319 -2.2370 0.0020 H 0 0 0 0 0 0 0 0 0 0 0 0
2.0838 1.9807 -0.0008 H 0 0 0 0 0 0 0 0 0 0 0 0
-2.1581 -2.0307 1.0257 H 0 0 0 0 0 0 0 0 0 0 0 0
-2.7559 -1.3699 -0.5151 H 0 0 0 0 0 0 0 0 0 0 0 0
-1.5597 -2.6879 -0.5165 H 0 0 0 0 0 0 0 0 0 0 0 0
1 3 2 0 0 0 0
2 8 2 0 0 0 0
3 5 1 0 0 0 0
6 8 1 0 0 0 0
7 8 1 0 0 0 0
4 6 2 0 0 0 0
3 4 1 0 0 0 0
4 9 1 0 0 0 0
5 7 2 0 0 0 0
5 10 1 0 0 0 0
6 11 1 0 0 0 0
7 12 1 0 0 0 0
9 13 1 0 0 0 0
9 14 1 0 0 0 0
9 15 1 0 0 0 0
M END
> <NSC>
1

> <DTP_NAMES>
p-Benzoquinone, 2-methyl- (8CI)
p-Toluquinone
Methyl-p-benzoquinone
Methyl-1,4-benzoquinone
Methylbenzoquinone
Methylquinone
Toluquinone (VAN)
Tolylquinone
WLN: L6V DVJ B1
1,4-Toluquinone
2-Methyl-p-benzoquinone
2-Methyl-1,4-benzoquinone
2-Methyl-1, 4-quinone
2-Methylbenzoquinone
2-Methylbenzoquinone-1,4
2-Methylquinone
2,5-Cyclohexadiene-1,4-dione, 2-methyl- (9CI)

> <CAS_RN>
553-97-9

> <ORIGIN>
DTP FEB 2003

> <E_UNIQUE_ID>
NCI-Open_09-03_NSC_1

> <E_NSC>
1

> <E_CAS>
553-97-9

> <E_NAME>
NSC 1

> <E_STEREO_SPECIFIED>
no_stereocenter

> <E_COMPOUND_TYPE>
normal

> <E_THREED_SOURCE>
CORINA 2.6

> <E_SMILES>
O=C1C=CC(=O)C=C1C

> <E_FORMULA>
C7H6O2

> <E_HASHY>
6CAA42E0D61B68CC

$$$$

However, the parsed output for that entry looks like this:
0@@@@[15 15 0 0 0 0 0 0 0 0999 V2000
-2.2423 1.0418 0.0018 O 0 0 0 0 0 0 0 0 0 0 0 0
2.7534 -0.5594 0.0011 O 0 0 0 0 0 0 0 0 0 0 0 0
-1.0858 0.6711 0.0019 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.7730 -0.7725 -0.0002 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0079 1.6639 -0.0018 C 0 0 0 0 0 0 0 0 0 0 0 0
0.5032 -1.1815 -0.0005 C 0 0 0 0 0 0 0 0 0 0 0 0
1.2841 1.2548 -0.0021 C 0 0 0 0 0 0 0 0 0 0 0 0
1.5970 -0.1888 0.0014 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.8888 -1.7853 -0.0016 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.2208 2.7193 -0.0047 H 0 0 0 0 0 0 0 0 0 0 0 0
0.7319 -2.2370 0.0020 H 0 0 0 0 0 0 0 0 0 0 0 0
2.0838 1.9807 -0.0008 H 0 0 0 0 0 0 0 0 0 0 0 0
-2.1581 -2.0307 1.0257 H 0 0 0 0 0 0 0 0 0 0 0 0
-2.7559 -1.3699 -0.5151 H 0 0 0 0 0 0 0 0 0 0 0 0
-1.5597 -2.6879 -0.5165 H 0 0 0 0 0 0 0 0 0 0 0 0
1 3 2 0 0 0 0
2 8 2 0 0 0 0
3 5 1 0 0 0 0
6 8 1 0 0 0 0
7 8 1 0 0 0 0
4 6 2 0 0 0 0
3 4 1 0 0 0 0
4 9 1 0 0 0 0
5 7 2 0 0 0 0
5 10 1 0 0 0 0
6 11 1 0 0 0 0
7 12 1 0 0 0 0
9 13 1 0 0 0 0
9 14 1 0 0 0 0
9 15 1 0 0 0 0
M END]@@@@[p-Benzoquinone, 2-methyl- (8CI)
p-Toluquinone
Methyl-p-benzoquinone
Methyl-1,4-benzoquinone
Methylbenzoquinone
Methylquinone
Toluquinone (VAN)
Tolylquinone
WLN: L6V DVJ B1
1,4-Toluquinone
2-Methyl-p-benzoquinone
2-Methyl-1,4-benzoquinone
2-Methyl-1, 4-quinone
2-Methylbenzoquinone
2-Methylbenzoquinone-1,4
2-Methylquinone
2,5-Cyclohexadiene-1,4-dione, 2-methyl- (9CI)]@@@@[553-97-9]@@@@[no_stereocenter]@@@@[O=C1C=CC(=O)C=C1C]@@@@@@

Everything is good, BUT the "E_FORMULA" value, C7H6O2, is not printed out.
I really hope someone can advise me; I have been trying for an agonizingly long time.
This might be a copy-and-paste problem, but in the regexp there is a space between the O and the R:
|E_FO RMULA|
If the space is really there, remove it.
It isn't a copy-and-paste problem; there is no space between the letters.
I tried replacing E_FORMULA with E_HASHY, and it doesn't print out the contents after E_HASHY either.
If I remove E_SMILES, I can print out E_FORMULA, but somehow when I add E_SMILES back in, it doesn't print out E_FORMULA.
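That observation is consistent with getData() consuming the next header line: it reads until it meets a line containing ">", and that line is the following "> <...>" header, which the outer loop then never gets to test. Below is a minimal sketch of the effect. The in-memory sample is cut down from the posted entry, and the helper name get_data_until_header is made up for illustration; only the stopping rule is taken from the original code.

```perl
use strict;
use warnings;

# Cut-down sample (an assumption): two wanted fields back to back.
my $sample = join "\n",
    '> <E_SMILES>', 'O=C1C=CC(=O)C=C1C', '',
    '> <E_FORMULA>', 'C7H6O2', '',
    '$$$$', '';

# Same stopping rule as the original getData(): read until a line contains ">".
sub get_data_until_header {
    my ($fh) = @_;
    my $record = '';
    while ( defined( my $line = <$fh> ) ) {
        last if $line =~ />/;    # this is the NEXT header, and it is now consumed
        $record .= $line;
    }
    return $record;
}

open my $fh, '<', \$sample or die "cannot open in-memory sample: $!";

my @seen;    # headers that the outer loop actually gets to test
while ( defined( my $line = <$fh> ) ) {
    if ( $line =~ /<(E_SMILES|E_FORMULA)>/ ) {
        push @seen, $1;
        get_data_until_header($fh);
    }
}
close $fh;

print "headers seen by the outer loop: @seen\n";
```

With this sample only E_SMILES is reported, because the "> <E_FORMULA>" line is read inside the helper. In the full entry the other wanted fields are each followed by a field that is not in the regex, so losing that header goes unnoticed.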
If I get a chance later I will look over your code.
ahh.... thought I was losing it there for a minute. You split the thread while I was reading it Miller.
Anyways.... the biggest problem appeared to be in the getData() sub. Here is the modified code (change the file paths back to yours):
use strict;
use warnings;

open(my $FILE, "C:/NCIsample.txt") or die "oh oh can't open";
open(OUT, ">C:/parsedNCI.txt");

while (my $line = <$FILE>) {
    if ($line =~ /ROtclserve/){
        print OUT "0";
    }
    elsif ($line =~ /DTP_NAMES|E_CAS|E_STEREO_SPECIFIED|E_FORMULA|E_SMILES/){
        $line = trimwhitespace(getData($FILE));
        print OUT "@@@@[$line]";
    }
    elsif ($line =~ /\$\$\$\$/) {
        print OUT "@@@@@@";
    }
}
close($FILE);
close(OUT);

sub getData {
    my $FH = shift;
    my $record = "";
    while ( my $line = <$FH> ){
        next if $line =~ /\>/o;
        return ($record) if $line =~ /^\s*$/o or (eof);
        $record .= $line;
    }
}

sub trimwhitespace {
    my $string = shift;
    $string =~ s/^\s+//mo;
    $string =~ s/\s+$//mo;
    $string =~ s/\.$//mo;
    $string =~ s/,$//mo;
    return $string;
}

output:
0@@@@[p-Benzoquinone, 2-methyl- (8CI)
p-Toluquinone
Methyl-p-benzoquinone
Methyl-1,4-benzoquinone
Methylbenzoquinone
Methylquinone
Toluquinone (VAN)
Tolylquinone
WLN: L6V DVJ B1
1,4-Toluquinone
2-Methyl-p-benzoquinone
2-Methyl-1,4-benzoquinone
2-Methyl-1, 4-quinone
2-Methylbenzoquinone
2-Methylbenzoquinone-1,4
2-Methylquinone
2,5-Cyclohexadiene-1,4-dione, 2-methyl- (9CI)]@@@@[553-97-9]@@@@[no_stereocenter]@@@@[O=C1C=CC(=O)C=C1C]@@@@[C7H6O2]@@@@@@
The trimwhitespace() sub appears to not really be doing anything, but maybe with other input it does.
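On the trimwhitespace() point, one thing that may be worth checking is the /m modifier. With /m, "$" can match before any embedded newline, and without /g only the first match is replaced, so on a multi-line record the substitution may be spent somewhere other than the end of the string. A small sketch, using a made-up two-line string rather than real data from the file:

```perl
use strict;
use warnings;

# Made-up record: trailing spaces on the first line, newline at the very end.
my $record = "first line   \nsecond line\n";

# With /m, "$" may match before the first embedded newline, so the single
# substitution removes the spaces after "first line" and then stops.
( my $with_m = $record ) =~ s/\s+$//m;

# Without /m, "$" only matches at the end of the string (or just before a
# final newline), so the trailing newline is what gets removed.
( my $without_m = $record ) =~ s/\s+$//;

print "with /m:    [$with_m]\n";
print "without /m: [$without_m]\n";
```

For records whose lines carry no trailing spaces, both forms end up removing just the final newline, which would fit with the sub looking as if it does very little on this input.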
WOW
Thanks!.. It worked. Really happy. =)
The trimwhitespace helps me get rid of \n
Thanks again
Ooops.. but I just realised something:
the first part of the data, the block that comes after ROtclserve, is now left out of the output.
oops, you're right, see if this works better:
use strict;
use warnings;

open(my $FILE, "C:/NCIsample.txt") or die "oh oh can't open";
open(OUT, ">C:/parsedNCI.txt");
while (my $line = <$FILE>) {
    if ($line =~ /ROtclserve/){
        my $undef = <$FILE>; # gets rid of blank line after "ROtclserve"
        print OUT "0";
    }
    if ($line =~ /ROtclserve|DTP_NAMES|E_CAS|E_STEREO_SPECIFIED|E_FORMULA|E_SMILES/){
        $line = trimwhitespace(getData($FILE));
        print OUT "@@@@[$line]";
    }
    elsif ($line =~ /\$\$\$\$/) {
        print OUT "@@@@@@";
    }
}
close($FILE);
close(OUT);

sub getData {
    my $FH = shift;
    my $record = "";
    while ( my $line = <$FH> ){
        return ($record) if $line =~ /^>/o or (eof);
        $record .= $line;
    }
}

sub trimwhitespace {
    my $string = shift;
    $string =~ s/^\s+//mo;
    $string =~ s/\s+$//mo;
    $string =~ s/\.$//mo;
    $string =~ s/,$//mo;
    return $string;
}
But now the E_FORMULA doesn't print.
hehehe... tricky little file to parse. I was trying to avoid this, but I made a separate function to get the data for "ROtclserve". Not knowing your input very well, I do not know how well this will work with different data, but it does now appear to return all the data:
use strict;
use warnings;

open(my $FILE, "C:/NCIsample.txt") or die "oh oh can't open";
open(OUT, ">C:/parsedNCI.txt");
LOOP: while (my $line = <$FILE>) {
    if ($line =~ /ROtclserve/){
        my $undef = <$FILE>;
        print OUT "0";
        $line = trimwhitespace(getDataROt($FILE));
        print OUT "@@@@[$line]";
    }
    if ($line =~ /DTP_NAMES|E_CAS|E_STEREO_SPECIFIED|E_FORMULA|E_SMILES/){
        $line = trimwhitespace(getData($FILE));
        print OUT "@@@@[$line]";
    }
    elsif ($line =~ /\$\$\$\$/) {
        print OUT "@@@@@@";
    }
}
close($FILE);
close(OUT);

sub getData {
    my $FH = shift;
    my $record = "";
    while ( my $line = <$FH> ){
        return ($record) if $line =~ /^\s*$/o or (eof);
        $record .= $line;
    }
}

sub getDataROt {
    my $FH = shift;
    my $record = "";
    while ( my $line = <$FH> ){
        return ($record) if $line =~ /^>/o;
        $record .= $line;
    }
}

sub trimwhitespace {
    my $string = shift;
    $string =~ s/^\s+//mo;
    $string =~ s/\s+$//mo;
    $string =~ s/\.$//mo;
    $string =~ s/,$//mo;
    return $string;
}
output:
0@@@@[15 15 0 0 0 0 0 0 0 0999 V2000
-2.2423 1.0418 0.0018 O 0 0 0 0 0 0 0 0 0 0 0 0
2.7534 -0.5594 0.0011 O 0 0 0 0 0 0 0 0 0 0 0 0
-1.0858 0.6711 0.0019 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.7730 -0.7725 -0.0002 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0079 1.6639 -0.0018 C 0 0 0 0 0 0 0 0 0 0 0 0
0.5032 -1.1815 -0.0005 C 0 0 0 0 0 0 0 0 0 0 0 0
1.2841 1.2548 -0.0021 C 0 0 0 0 0 0 0 0 0 0 0 0
1.5970 -0.1888 0.0014 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.8888 -1.7853 -0.0016 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.2208 2.7193 -0.0047 H 0 0 0 0 0 0 0 0 0 0 0 0
0.7319 -2.2370 0.0020 H 0 0 0 0 0 0 0 0 0 0 0 0
2.0838 1.9807 -0.0008 H 0 0 0 0 0 0 0 0 0 0 0 0
-2.1581 -2.0307 1.0257 H 0 0 0 0 0 0 0 0 0 0 0 0
-2.7559 -1.3699 -0.5151 H 0 0 0 0 0 0 0 0 0 0 0 0
-1.5597 -2.6879 -0.5165 H 0 0 0 0 0 0 0 0 0 0 0 0
1 3 2 0 0 0 0
2 8 2 0 0 0 0
3 5 1 0 0 0 0
6 8 1 0 0 0 0
7 8 1 0 0 0 0
4 6 2 0 0 0 0
3 4 1 0 0 0 0
4 9 1 0 0 0 0
5 7 2 0 0 0 0
5 10 1 0 0 0 0
6 11 1 0 0 0 0
7 12 1 0 0 0 0
9 13 1 0 0 0 0
9 14 1 0 0 0 0
9 15 1 0 0 0 0
M END]@@@@[p-Benzoquinone, 2-methyl- (8CI)
p-Toluquinone
Methyl-p-benzoquinone
Methyl-1,4-benzoquinone
Methylbenzoquinone
Methylquinone
Toluquinone (VAN)
Tolylquinone
WLN: L6V DVJ B1
1,4-Toluquinone
2-Methyl-p-benzoquinone
2-Methyl-1,4-benzoquinone
2-Methyl-1, 4-quinone
2-Methylbenzoquinone
2-Methylbenzoquinone-1,4
2-Methylquinone
2,5-Cyclohexadiene-1,4-dione, 2-methyl- (9CI)]@@@@[553-97-9]@@@@[no_stereocenter]@@@@[O=C1C=CC(=O)C=C1C]@@@@[C7H6O2]@@@@@@
The problem with this code is that it is tailored very closely to the input data you posted. It may not work if the sections you are looking for change position in the file or if blank lines are missing. I am sure a more robust script could be written, but not without a complete rewrite of your code.
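For what such a rewrite might look like, here is a rough sketch rather than a finished script. It reads one whole record at a time by setting $/ to the "$$$$" separator, stores every data item in a hash keyed by its name, and only then picks out the wanted fields, so field order and adjacent wanted fields no longer matter. It assumes each record starts with three header lines, that the structure block ends at an "M END" line, and that every data item ends with a blank line. The names parse_record, format_record and the "_MOL" key are made up for this example, and the sample record is a shortened placeholder, not real data.

```perl
use strict;
use warnings;

# Fields to emit, in the order used by the output earlier in the thread.
my @wanted = qw(DTP_NAMES E_CAS E_STEREO_SPECIFIED E_SMILES E_FORMULA);

# Turn one record (the text before a "$$$$" line) into a hash of array refs,
# keyed by field name. "_MOL" is a made-up key for the structure block.
sub parse_record {
    my ($text) = @_;
    my @lines = split /\n/, $text;
    my ( %rec, @mol, $current );

    splice @lines, 0, 3;    # assumption: three header lines come first

    # Structure block: everything up to and including the "M END" line.
    while ( defined( my $line = shift @lines ) ) {
        push @mol, $line;
        last if $line =~ /^M\s+END/;
    }
    $rec{_MOL} = \@mol;

    # Data items: "> <NAME>", one or more value lines, then a blank line.
    for my $line (@lines) {
        if ( $line =~ /^>.*<([^>]+)>/ ) {
            $current = $1;
            $rec{$current} = [];
        }
        elsif ( $line !~ /\S/ ) {
            undef $current;    # blank line closes the current item
        }
        elsif ( defined $current ) {
            push @{ $rec{$current} }, $line;
        }
    }
    return \%rec;
}

# Render a record in the same "0@@@@[...]...@@@@@@" shape as the thread.
# A missing field is written as empty brackets.
sub format_record {
    my ($rec) = @_;
    my $out = '0';
    for my $name ( '_MOL', @wanted ) {
        my $value = join "\n", @{ $rec->{$name} || [] };
        $out .= '@@@@[' . $value . ']';
    }
    return $out . '@@@@@@';
}

# Shortened placeholder record in the same layout as the posted entry.
my $sample = <<'END';
NSC 1
ROtclserve (program line, shortened)

(counts, atom and bond lines would go here)
M  END
> <NSC>
1

> <E_SMILES>
O=C1C=CC(=O)C=C1C

> <E_FORMULA>
C7H6O2

$$$$
END

open my $in, '<', \$sample or die "cannot open in-memory sample: $!";
my @out;
{
    local $/ = "\$\$\$\$\n";    # one whole record per read
    while ( defined( my $text = <$in> ) ) {
        chomp $text;            # drops the trailing "$$$$" separator line
        next unless $text =~ /\S/;
        push @out, format_record( parse_record($text) );
    }
}
close $in;

print "$_\n" for @out;
```

To try it on the real file, the in-memory open would be replaced by an open on the data file. Writing empty brackets for a missing field is a choice made here so that every record keeps the same number of slots; skipping missing fields would work just as well.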
You may need to remove the space that this forum added to the code I posted:
E_SMI LES
There is a space between "I" and "L" above, in line 13 of the code.